From 56cc386fbb7ac2ceb7fedf8a5eb0e6b5f2413e7c Mon Sep 17 00:00:00 2001
From: "user@node5.net"
Date: Thu, 7 Aug 2025 14:27:09 +0200
Subject: Stop Scraping my Cgit! - Add note about LLMs

---
 Stop Scraping my Cgit!.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Stop Scraping my Cgit!.md b/Stop Scraping my Cgit!.md
index 5df584b..528f730 100644
--- a/Stop Scraping my Cgit!.md
+++ b/Stop Scraping my Cgit!.md
@@ -49,7 +49,7 @@ You can get all the IPv4 ranges.
 
 ### Nginx
 
-You can impot a file e.g. under the server block with: `include /etc/nginx/sites-available/blocklist.conf;`
+You can import a file e.g. under the server block with: `include /etc/nginx/sites-available/blocklist.conf;`
 
 blocklist.conf:
 ```
@@ -97,3 +97,11 @@ sudo nft list ruleset | grep '@blocklist'
 ip saddr @blocklist counter packets 29 bytes 1732 drop
 ```
 
+---
+
+## Git commits = LLM training data
+
+On a side note i think LLM companies are scraping or are going to scrape git repos heavily,
+since a good git commit basically works as a recipe on how to complete an isolated task,
+so long as they're able to rank the input data quality, as the model is only as good as the input data,
+and there's a lot of noise in a lot of the data.
-- 
cgit 1.4.1