| -rw-r--r-- | Stop Scraping my Cgit!.md | 34 |
1 files changed, 23 insertions, 11 deletions
diff --git a/Stop Scraping my Cgit!.md b/Stop Scraping my Cgit!.md
index 528f730..adf04de 100644
--- a/Stop Scraping my Cgit!.md
+++ b/Stop Scraping my Cgit!.md
@@ -14,14 +14,13 @@ Disallow: / # keep them out";
 }
 ```
-Albeit after reading the blog post
+Albeit after reading the blog post
 [Stop Scraping my Git Forge! - notashelf.dev](https://notashelf.dev/posts/stop-scraping-my-forge) i thought let's take another look, and would you look at that lot's entries like the following:
 ```
 47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /firmware/sonix-qmk/diff/keyboards/qwertyydox/keymaps/default?id=e7cc5a35c2b80d081207db940777b7537d30a5cd&id2=9808bfaf2616afbe837873d962bc214be3705f90 HTTP/1.1" 403 186 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"
 101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e HTTP/2.0" 403 175 "https://git.node5.net/firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
-
 ```
 <pre>
@@ -34,15 +33,21 @@ $ whois <span style="font-weight:bold;color:lightblue;">47.79.213.166</span> | g
 <span style="font-weight:bold;color:darkred;">OrgName</span>: Alibaba Cloud LLC (AL-3)
 </pre>
-If you look up the IP(s) on [bgp.he.net](https://bgp.he.net/) you can find all associated IP prefixes
+### Getting the IP ranges
+
+<s>If you look up the IP(s) on [bgp.he.net](https://bgp.he.net/) you can find all associated IP prefixes
 If you copy the text of this page to a text file and grep with this pattern: [source](https://www.shellhacks.com/regex-find-ip-addresses-file-grep/)
+grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"</s>
+
+
+If you look up the IP(s) on [bgp.tools](https://bgp.tools/as/136907#whois) you can find which AS number, or even better,
+the AS set, which contains a set of all AS-numbers that this AS number has acquired.
+
 ```sh
-grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"
+bgpq4 '-F %n/%l\n' AS-HUAWEI
 ```
-You can get all the IPv4 ranges.
-
 ---
 ## Blocking
@@ -67,7 +72,7 @@ Even better you can block these IPs entirely with NFTables
 In `/etc/nftables.conf` add the following: [source](https://unix.stackexchange.com/questions/329971/nftables-ip-set-multiple-tables)
-```nft
+```conf
 include "nftables_blocklist.conf"
 table inet filter {
@@ -83,8 +88,9 @@ table inet filter {
 ...
 ```
-nftables_blocklist.conf
-```nft
+`nftables_blocklist.conf`
+
+```conf
 define blocklist = {
 1.178.32.0/20, # AS136907 HUAWEI CLOUDS
 1.178.48.0/20,
@@ -92,16 +98,22 @@ define blocklist = {
 ```
-```commandline
+```console
 sudo nft list ruleset | grep '@blocklist'
 ip saddr @blocklist counter packets 29 bytes 1732 drop
 ```
 ---
-## Git commits = LLM training data
+## Side notes
+
+### Git commits = LLM training data
 On a side note i think LLM companies are scraping or are going to scrape git repos heavily, since a good git commit basically works as a recipe on how to complete an isolated task, so long as they're able to rank the input data quality, as the model is only as good as the input data, and there's a lot of noise in a lot of the data.
+
+### Alternatives
+
+Take a look at [anubis](https://anubis.techaro.lol/) for a dynamic solution to scraping
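
The diff swaps the `grep` approach for `bgpq4` but does not show how the prefix list ends up in `nftables_blocklist.conf`. One possible glue step is sketched below. The function name `make_blocklist` is hypothetical and not part of the post. It assumes one IPv4 prefix per line on stdin and copies the layout of the `define blocklist` block in the diff, including the trailing comma after every entry.

```sh
#!/bin/sh
# make_blocklist: hypothetical helper, not taken from the original post.
# Reads one CIDR prefix per line on stdin and prints a
# "define blocklist = { ... }" block in the layout used by
# nftables_blocklist.conf in the diff.
make_blocklist() {
    printf 'define blocklist = {\n'
    # 1. Trim leading and trailing whitespace from every line.
    # 2. Keep only lines that look like an IPv4 prefix (a.b.c.d/len).
    # 3. Sort and drop duplicate prefixes.
    # 4. Indent each prefix and append a comma.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$' \
        | sort -u \
        | sed -e 's/^/    /' -e 's/$/,/'
    printf '}\n'
}
```

With the command from the diff, this could be run as `bgpq4 '-F %n/%l\n' AS-HUAWEI | make_blocklist > nftables_blocklist.conf`. The helper filters syntactically only: it does not merge overlapping prefixes or handle IPv6, so the result is worth checking with `nft -c -f /etc/nftables.conf` before loading it.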
