---
description: Blocking scrapers in nftables firewall
created: 2025-08-03
---

[TOC]

When you self-host and expose a git server to the internet, you'll find your access log filled with scrapers.

That's why I've had the following in my git nginx config, asking the bots to kindly fuck off. It stops the many bots that respect `robots.txt`:

```nginx
location /robots.txt {
    return 200 "User-agent: * # match all bots
Disallow: / # keep them out";
}
```

However, after reading the blog post [Stop Scraping my Git Forge! - notashelf.dev](https://notashelf.dev/posts/stop-scraping-my-forge), I thought I'd take another look at the log, and would you look at that: lots of entries like the following:

```
47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /firmware/sonix-qmk/diff/keyboards/qwertyydox/keymaps/default?id=e7cc5a35c2b80d081207db940777b7537d30a5cd&id2=9808bfaf2616afbe837873d962bc214be3705f90 HTTP/1.1" 403 186 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"
101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e HTTP/2.0" 403 175 "https://git.node5.net/firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
```

A quick `whois` shows who is behind these addresses:

    $ whois 101.44.71.209 | grep netname
    netname:        Huawei-Cloud-HK
    $ whois 47.79.213.166 | grep Organization
    Organization:   Alibaba Cloud LLC (AL-3)

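
To decide which addresses are worth a `whois` lookup in the first place, it helps to count requests per client. A minimal sketch, assuming nginx's default `combined` log format (client IP in the first field); `top_clients` is a hypothetical helper name, not something from the setup described here:

```sh
# Hypothetical helper: print the busiest client IPs in an access log.
# Assumes the client IP is the first whitespace-separated field of each line.
# Usage: top_clients LOGFILE [LIMIT]   (LIMIT defaults to 10)
top_clients() {
    log_file="$1"
    limit="${2:-10}"
    # Extract the IP column, count duplicates, sort by count (highest first).
    awk '{ print $1 }' "$log_file" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

For example, `top_clients /var/log/nginx/access.log 20` (the log path is an assumption and depends on your setup) prints up to twenty lines, each a request count followed by the IP.
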
### Getting the IP ranges

If you look up the IP(s) on [bgp.he.net](https://bgp.he.net/), you can find all associated IP prefixes. If you copy the text of that page into a text file, you can extract the prefixes by grepping with this pattern ([source](https://www.shellhacks.com/regex-find-ip-addresses-file-grep/)):

    grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"

Alternatively, if you look up the IP(s) on [bgp.tools](https://bgp.tools/as/136907#whois), you can find the AS number or, even better, the AS-set, which lists all the AS numbers that this AS has acquired. `bgpq4` can then expand it into prefixes:

```sh
bgpq4 -F '%n/%l\n' AS-HUAWEI
```

---

## Blocking

### Nginx

You can import a file, e.g. under the server block, with:

`include /etc/nginx/sites-available/blocklist.conf;`

blocklist.conf:

```
# AS136907 HUAWEI CLOUDS
deny 1.178.32.0/20;
deny 1.178.48.0/20;
...
```

This, however, will still fill your access logs...

### Nftables

Even better, you can block these IPs entirely with nftables.

In `/etc/nftables.conf` add the following ([source](https://unix.stackexchange.com/questions/329971/nftables-ip-set-multiple-tables)):

```conf
include "nftables_blocklist.conf"

table inet filter {
    set blocklist {
        type ipv4_addr;
        flags interval;
        auto-merge
        elements = $blocklist
    }

    chain input_world {
        ip saddr @blocklist counter drop
        ...
```

`nftables_blocklist.conf`

```conf
define blocklist = {
    1.178.32.0/20, # AS136907 HUAWEI CLOUDS
    1.178.48.0/20,
    ...
```

The `counter` on the rule lets you check that it is actually matching traffic:

```console
sudo nft list ruleset | grep '@blocklist'
ip saddr @blocklist counter packets 29 bytes 1732 drop
```

---

## Side notes

### Git commits = LLM training data

On a side note, I think LLM companies are scraping, or are going to scrape, git repos heavily, since a good git commit basically works as a recipe for completing an isolated task. That only helps them as long as they're able to rank the quality of the input data: a model is only as good as its input, and a lot of the data is noisy.

### Alternatives

Take a look at [Anubis](https://anubis.techaro.lol/) for a dynamic solution to scraping.
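
### Generating both blocklists

The nginx and nftables blocklists shown earlier hold the same prefixes, so both can be generated from one list. A minimal sketch, assuming one IPv4 CIDR prefix per line on stdin (for example the output of the `bgpq4` command above). `to_nginx_deny` and `to_nft_define` are hypothetical helper names, and the sketch relies on nftables accepting a trailing comma after the last element:

```sh
# Hypothetical helper: turn CIDR prefixes on stdin into nginx "deny" rules.
# Blank lines are skipped; surrounding whitespace is ignored.
to_nginx_deny() {
    awk 'NF { printf "deny %s;\n", $1 }'
}

# Hypothetical helper: turn the same list into an nftables "define" block
# named "blocklist", matching the variable used in the set definition.
# Assumption: at least one prefix is supplied (an empty block is not useful).
to_nft_define() {
    awk 'BEGIN { print "define blocklist = {" }
         NF    { printf "    %s,\n", $1 }
         END   { print "}" }'
}
```

Saving the prefix list to a file once and running it through both helpers, e.g. `to_nginx_deny < prefixes.txt > blocklist.conf` and `to_nft_define < prefixes.txt > nftables_blocklist.conf`, keeps the two files in sync (`prefixes.txt` is an assumed name). The per-AS comments from the hand-written files are not reproduced by this sketch.
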