---
description: Blocking scrapers with the nftables firewall
created: 2025-08-03
---

When you self-host a git server and expose it to the internet, you'll find your access log filled with scrapers.
Hence I've had the following in my git nginx config to ask the bots to kindly fuck off.
This stops a lot of bots, at least the ones that respect it:

```nginx
        location /robots.txt {
                return 200 "User-agent: * # match all bots
Disallow: / # keep them out";
        }
```

However, after reading the blog post
[Stop Scraping my Git Forge! - notashelf.dev](https://notashelf.dev/posts/stop-scraping-my-forge)
I thought I'd take another look at the logs, and would you look at that: lots of entries like the following:

```
47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /firmware/sonix-qmk/diff/keyboards/qwertyydox/keymaps/default?id=e7cc5a35c2b80d081207db940777b7537d30a5cd&id2=9808bfaf2616afbe837873d962bc214be3705f90 HTTP/1.1" 403 186 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"
101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e HTTP/2.0" 403 175 "https://git.node5.net/firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"

```
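To see which addresses are responsible for most of the requests, a quick tally of the access log helps. This is only a sketch: `top_clients` is a hypothetical helper name, and the log path in the usage comment is an assumption about a default nginx setup, so adjust it to yours.

```sh
# Count requests per client IP in an nginx access log and print the busiest ones.
# Assumes the client address is the first field of each line, as in the log
# excerpt above.
# Usage: top_clients /var/log/nginx/access.log 10   (path is an assumption)
top_clients() {
        awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

The second argument is optional and defaults to 20 lines.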

<pre>
$ whois <span style="font-weight:bold;color:lightblue;">101.44.71.209</span> | grep <span style="font-weight:bold;color:lightblue;">netname</span>
<span style="font-weight:bold;color:darkred;">OrgName</span>:        Huawei-Cloud-HK
</pre>

<pre>
$ whois <span style="font-weight:bold;color:lightblue;">47.79.213.166</span> | grep <span style="font-weight:bold;color:lightblue;">Organization</span>
<span style="font-weight:bold;color:darkred;">OrgName</span>:        Alibaba Cloud LLC (AL-3)
</pre>

If you look up the IP(s) on [bgp.he.net](https://bgp.he.net/), you can find all associated IP prefixes.
Copy the text of that page to a text file and grep it with this pattern ([source](https://www.shellhacks.com/regex-find-ip-addresses-file-grep/)):

```sh
grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"
```

You can get all the IPv4 ranges.
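The pattern above also keeps whatever follows the address on each line. If you only want the bare prefixes, a stricter variant could look like the sketch below. `extract_prefixes` and the file name `prefixes.txt` are made up for illustration, and the regex does not check that each octet is in range, so treat the output as something to skim before using it.

```sh
# Extract unique IPv4 CIDR prefixes (e.g. 1.178.32.0/20) from a saved text file.
# Only matches of the form a.b.c.d/len are printed, one per line, deduplicated.
# Usage: extract_prefixes prefixes.txt > prefixes.clean   (file names are hypothetical)
extract_prefixes() {
        grep -E -o '([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}' "$1" | sort -u
}
```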

---

## Blocking

### Nginx

You can include a blocklist file, e.g. inside the server block, with: `include /etc/nginx/sites-available/blocklist.conf;`

blocklist.conf:
```
# AS136907 HUAWEI CLOUDS
deny 1.178.32.0/20;
deny 1.178.48.0/20;
...
```
This, however, will still fill your access logs, since nginx logs the denied requests too...
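Writing the `deny` lines by hand gets old quickly. Assuming you already have a file with one prefix per line, something like this could generate them. `to_nginx_deny` is a hypothetical helper, not part of nginx.

```sh
# Turn one prefix per line (stdin) into nginx "deny" directives (stdout).
# Blank lines are skipped; only the first field of each line is used.
# Usage: to_nginx_deny < prefixes.clean > blocklist.conf   (file names are hypothetical)
to_nginx_deny() {
        awk 'NF { print "deny " $1 ";" }'
}
```

Running `nginx -t` before reloading checks that the resulting config still parses.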


### Nftables

Even better, you can block these IPs entirely with nftables, so the requests never reach nginx.

In `/etc/nftables.conf` add the following: [source](https://unix.stackexchange.com/questions/329971/nftables-ip-set-multiple-tables)

```nft
include "nftables_blocklist.conf"

table inet filter {

        set blocklist {
                type ipv4_addr; flags interval;
                auto-merge
                elements = $blocklist
        }

        chain input_world {
                ip saddr @blocklist counter drop
...
```

nftables_blocklist.conf:
```nft
define blocklist = {
        1.178.32.0/20, # AS136907 HUAWEI CLOUDS
        1.178.48.0/20,
...

```
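The same one-prefix-per-line file can be turned into the `define` above. This is a rough sketch under that assumption. `to_nft_define` is a hypothetical helper, and it drops the per-AS comments, so add those back by hand if you want them.

```sh
# Wrap one prefix per line (stdin) in an nftables "define blocklist = { ... }".
# Elements are comma-separated, with no comma after the last one.
# Note: with empty input this prints an empty define, which is probably not
# what you want to load.
# Usage: to_nft_define < prefixes.clean > nftables_blocklist.conf   (names are hypothetical)
to_nft_define() {
        printf 'define blocklist = {\n'
        awk 'NF { printf "%s        %s", sep, $1; sep = ",\n" } END { if (sep) printf "\n" }'
        printf '}\n'
}
```

Before loading it, `sudo nft -c -f /etc/nftables.conf` parses the ruleset in check mode without applying anything.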

Once the ruleset is loaded, the counter shows how many packets the rule has dropped:

```commandline
$ sudo nft list ruleset | grep '@blocklist'
		ip saddr @blocklist counter packets 29 bytes 1732 drop
```