author user@node5.net <user@node5.net> 2025-08-07 14:27:09 +0200
committer user@node5.net <user@node5.net> 2025-08-07 14:27:09 +0200
commit 56cc386fbb7ac2ceb7fedf8a5eb0e6b5f2413e7c
tree 384a1b6d87a8bb626d1491974e0b26dc4c14355a
parent 0b78c87cc9057e1b1405ad9f46e4bd27d165f7ce
Stop Scraping my Cgit! - Add note about LLMs
-rw-r--r-- Stop Scraping my Cgit!.md 10
1 files changed, 9 insertions, 1 deletions
diff --git a/Stop Scraping my Cgit!.md b/Stop Scraping my Cgit!.md
index 5df584b..528f730 100644
--- a/Stop Scraping my Cgit!.md
+++ b/Stop Scraping my Cgit!.md
@@ -49,7 +49,7 @@ You can get all the IPv4 ranges.
 
 ### Nginx
 
-You can impot a file e.g. under the server block with: `include /etc/nginx/sites-available/blocklist.conf;`
+You can import a file e.g. under the server block with: `include /etc/nginx/sites-available/blocklist.conf;`
 
 blocklist.conf:
 ```
@@ -97,3 +97,11 @@ sudo nft list ruleset | grep '@blocklist'
 		ip saddr @blocklist counter packets 29 bytes 1732 drop
 ```
 
+---
+
+## Git commits = LLM training data
+
+On a side note i think LLM companies are scraping or are going to scrape git repos heavily,
+since a good git commit basically works as a recipe on how to complete an isolated task,
+so long as they're able to rank the input data quality, as the model is only as good as the input data,
+and there's a lot of noise in a lot of the data.