| author | user@node5.net <user@node5.net> | 2025-08-03 03:32:52 +0200 |
|---|---|---|
| committer | user@node5.net <user@node5.net> | 2025-08-03 03:32:52 +0200 |
| commit | bb36d3195fe811fb20c25fc8184ed778df821eb0 (patch) | |
| tree | 17636980ac5f7af8c0e91c566d9cae1ac9da02e1 | |
| parent | 93c10851318eebf547f4a77f31a3029331520b14 (diff) | |
New article - Stop Scraping my Cgit!
| -rw-r--r-- | Stop Scraping my Cgit!.md | 99 |
1 files changed, 99 insertions, 0 deletions
diff --git a/Stop Scraping my Cgit!.md b/Stop Scraping my Cgit!.md
new file mode 100644
index 0000000..5df584b
--- /dev/null
+++ b/Stop Scraping my Cgit!.md
@@ -0,0 +1,99 @@
+---
+description: Blocking scrapers with the nftables firewall
+created: 2025-08-03
+---
+
+When you self-host and expose a git server to the internet, you'll find your access log filled with scrapers.
+Hence I've had the following in my nginx config for git, to ask the bots to kindly fuck off.
+This stops the many bots that respect it.
+
+```nginx
+    location /robots.txt {
+        return 200 "User-agent: * # match all bots
+Disallow: / # keep them out";
+    }
+```
+
+However, after reading the blog post
+[Stop Scraping my Git Forge! - notashelf.dev](https://notashelf.dev/posts/stop-scraping-my-forge)
+I thought I'd take another look, and would you look at that: lots of entries like the following:
+
+```
+47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /firmware/sonix-qmk/diff/keyboards/qwertyydox/keymaps/default?id=e7cc5a35c2b80d081207db940777b7537d30a5cd&id2=9808bfaf2616afbe837873d962bc214be3705f90 HTTP/1.1" 403 186 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"
+101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e HTTP/2.0" 403 175 "https://git.node5.net/firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
+
+```
+
+<pre>
+$ whois <span style="font-weight:bold;color:lightblue;">101.44.71.209</span> | grep <span style="font-weight:bold;color:lightblue;">netname</span>
+<span style="font-weight:bold;color:darkred;">netname</span>: Huawei-Cloud-HK
+</pre>
+
+<pre>
+$ whois <span style="font-weight:bold;color:lightblue;">47.79.213.166</span> | grep <span style="font-weight:bold;color:lightblue;">Organization</span>
+<span style="font-weight:bold;color:darkred;">Organization</span>: Alibaba Cloud LLC (AL-3)
+</pre>
+
+If you look up the IP(s) on [bgp.he.net](https://bgp.he.net/), you can find all associated IP prefixes.
+Copy the text of that page to a text file and grep it with this pattern ([source](https://www.shellhacks.com/regex-find-ip-addresses-file-grep/)):
+
+```sh
+grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"
+```
+
+This gives you all the IPv4 ranges.
+
+---
+
+## Blocking
+
+### Nginx
+
+You can import a file, e.g. under the server block, with: `include /etc/nginx/sites-available/blocklist.conf;`
+
+blocklist.conf:
+```
+# AS136907 HUAWEI CLOUDS
+deny 1.178.32.0/20;
+deny 1.178.48.0/20;
+...
+```
+This, however, will still fill your access logs, since nginx has to answer every request before denying it.
+
+
+### Nftables
+
+Even better, you can block these IPs entirely with nftables, so the requests never reach nginx.
+
+In `/etc/nftables.conf`, add the following ([source](https://unix.stackexchange.com/questions/329971/nftables-ip-set-multiple-tables)):
+
+```nft
+include "nftables_blocklist.conf"
+
+table inet filter {
+
+    set blocklist {
+        type ipv4_addr; flags interval;
+        auto-merge
+        elements = $blocklist
+    }
+
+    chain input_world {
+        ip saddr @blocklist counter drop
+...
+```
+
+nftables_blocklist.conf:
+```nft
+define blocklist = {
+    1.178.32.0/20, # AS136907 HUAWEI CLOUDS
+    1.178.48.0/20,
+...
+
+```
+
+```commandline
+sudo nft list ruleset | grep '@blocklist'
+        ip saddr @blocklist counter packets 29 bytes 1732 drop
+```
+
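
The manual step described in the article (save the bgp.he.net page text, grep out the IPv4 ranges, paste them into the nginx and nftables blocklists) could also be scripted. The following is only a minimal sketch in Python and is not part of this commit. The function names `extract_prefixes`, `render_nft_define` and `render_nginx_deny` are hypothetical, and it assumes the saved page text contains prefixes in plain `a.b.c.d/len` form.

```python
import ipaddress
import re

# Candidate IPv4 CIDR prefixes such as 1.178.32.0/20 (validated below).
CIDR_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}/\d{1,2}\b")


def extract_prefixes(text):
    """Return the valid IPv4 networks found in text, merged and sorted."""
    networks = set()
    for candidate in CIDR_RE.findall(text):
        try:
            networks.add(ipaddress.ip_network(candidate, strict=True))
        except ValueError:
            # Looks like a prefix but is not one, e.g. an octet above 255,
            # a prefix length above 32, or host bits set.
            continue
    # collapse_addresses merges overlapping and adjacent networks.
    return list(ipaddress.collapse_addresses(sorted(networks)))


def render_nft_define(networks, name="blocklist"):
    """Format networks as an nftables `define` block (hypothetical helper)."""
    body = ",\n".join(f"    {net}" for net in networks)
    return f"define {name} = {{\n{body}\n}}"


def render_nginx_deny(networks):
    """Format networks as nginx `deny` directives (hypothetical helper)."""
    return "\n".join(f"deny {net};" for net in networks)


if __name__ == "__main__":
    # Stand-in for the text copied from the prefix listing page.
    sample = "1.178.32.0/20 HUAWEI CLOUDS\n1.178.48.0/20 HUAWEI CLOUDS\n"
    prefixes = extract_prefixes(sample)
    print(render_nft_define(prefixes))
    print(render_nginx_deny(prefixes))
```

Note that merging changes how the list looks: the two adjacent /20 ranges from the article collapse into a single `1.178.32.0/19`. That is the same kind of merging `auto-merge` performs inside the nftables set, but it also means per-AS comments would have to be added back by hand.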
