Many, many years ago there was quite a fight in the webmaster community on the topic of web crawlers. Some defended their use as legitimate and at worst a nuisance; others saw them as artificial traffic generators whose behaviour could very well qualify as a (D)DoS (Distributed Denial of Service) attack.
I was always more on the "it's abusive" side. And seeing how much time admins nowadays have to spend keeping these "AI" crawlers under control, I'd say things have gotten far worse recently. Le sigh.
(If you don't know: "AI" crawlers come to your site and aggressively download whatever they can get, to collect training data for "AI" models. And they come in masses. Every "AI" startup has a swarm of these crawlers. The big "AI" companies either do it themselves or outsource it to shady companies that specialise in exactly this. These crawlers often ignore robots.txt restrictions. They burn traffic like hell and they come back often.)
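To illustrate the robots.txt point: a well-behaved crawler is supposed to check a site's robots.txt before fetching anything. A minimal sketch using Python's standard-library parser (GPTBot and CCBot are real, published crawler user agents; the tarball path is made up for illustration):

```python
# A well-behaved crawler consults robots.txt before fetching anything.
# Minimal sketch with Python's stdlib parser. GPTBot (OpenAI) and
# CCBot (Common Crawl) are real, published crawler user agents.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler would stop here; the crawlers described in this
# post never perform this check at all.
print(parser.can_fetch("GPTBot", "/releases/esp32-relay.tar.gz"))  # False
```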
The behaviour of these "AI" crawlers is the main reason I have not opened my Forgejo instance at https://forge.wildeboer.net to the world and keep it in restricted mode. The moment I change that, the traffic goes through the roof, as the "AI" crawlers will immediately try to download any and all release tarballs from every repo. Literally thousands of times per minute. And they come from every cloud provider: AWS, Alibaba, Azure, etc. It's beyond annoying. It destroys the Web, IMHO.
Keeping these "AI" crawlers under control would mean a significant time investment from my side. Time I simply don't have and certainly don't want to waste on shit like that.
If you think I'm exaggerating: last time (a month ago) when I opened my Forgejo instance as a test, it took just 15 minutes for around 1,600 IP addresses to cause 200 GB of outgoing traffic on one single source tarball that contains code to switch a relay on an ESP32.
@jwildeboer They're a menace. Endlessly filling the RAM with page-not-founds...
@jwildeboer we took some time to install Iocaine on a website with a lot of content and visits. Had a good laugh afterwards, watching the crawlers get lost in misspellings and nonsensical sentences
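For context: Iocaine is a real tool that traps crawlers in an endless maze of generated nonsense. A toy sketch of the underlying Markov-chain idea (an illustration of the technique, not Iocaine's actual code):

```python
# Toy sketch of the Markov-chain trick behind tools like Iocaine:
# learn word transitions from a small corpus, then emit endless,
# plausible-looking nonsense for crawlers to waste bandwidth on.
# This is NOT Iocaine's actual implementation.
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def babble(chain: dict, length: int = 50) -> str:
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # dead end: jump to a random word and keep going
        word = random.choice(followers) if followers else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

# Feed it any real text from the site; crawlers get the remix.
corpus = "the crawler downloads the tarball and then the crawler comes back for the tarball again"
print(babble(build_chain(corpus)))
```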
@jwildeboer
So what would you say is a proper solution for exposing your self-hosted Forgejo?
@jwildeboer If you can detect the crawlers, can you feed them garbage? Poison the bait? Caches of fake crypto? "Secret" weapons data? Faux Swiss bank accounts? Useless Trump Administration documents, so everything they publish ;-)
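Detection is often the easy part, since many of these crawlers still announce themselves in the User-Agent header. A minimal sketch (the signature list is illustrative, not exhaustive; real deployments also lean on IP reputation and request-rate heuristics):

```python
# Minimal sketch: spot "AI" crawlers by User-Agent and hand them decoy
# content instead of the real page. The signatures listed are real,
# published crawler identifiers, but the list is far from exhaustive.
AI_CRAWLER_SIGNATURES = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def is_ai_crawler(user_agent: str) -> bool:
    return any(sig in user_agent for sig in AI_CRAWLER_SIGNATURES)

def handle_request(user_agent: str, path: str) -> str:
    if is_ai_crawler(user_agent):
        # hand off to a decoy generator, e.g. the Markov sketch above
        return "<endless generated nonsense>"
    return f"real content for {path}"

print(handle_request("Mozilla/5.0 (compatible; GPTBot/1.0)", "/releases/x.tar.gz"))
```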
@jwildeboer for someone you wish to share the repo with, is there a way to grant access?
I don't know the tech, so I may be asking an obvious question...
Manually created access account? One-time-download token?
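For reference: Forgejo (like Gitea, from which it descends) supports per-user accounts and scoped access tokens. A minimal sketch of a token-authenticated tarball download, assuming the standard Gitea-derived REST API (host, owner/repo, and token are placeholders):

```python
# Minimal sketch of a token-gated tarball download, assuming Forgejo's
# Gitea-derived REST API. Host, owner/repo, and token are placeholders;
# the token is created per user in the web UI and can be revoked anytime.
import urllib.request

HOST = "https://forge.example.net"            # hypothetical instance
TOKEN = "paste-a-scoped-access-token-here"    # per-user, revocable

req = urllib.request.Request(
    f"{HOST}/api/v1/repos/owner/repo/archive/main.tar.gz",
    headers={"Authorization": f"token {TOKEN}"},
)
with urllib.request.urlopen(req) as resp, open("repo.tar.gz", "wb") as out:
    out.write(resp.read())
```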
@jwildeboer I suspect the end result, if this goes on, is that more and more of the web becomes "invite-only": you can only get access by being invited by someone who already has access, and if you misbehave too much, the one(s) who vouched for you could also suffer some repercussions (like losing inviting privileges). A bit like you won't let a random person inside your house, but might let them in for a chat if a trusted friend says they're good people.
@jwildeboer far worse. Jon wrote about that last month. It’s been affecting site performance.
@jwildeboer How would search engines work without crawlers?
@whreq
How do crawlers pay back?
@jwildeboer
@jwildeboer we're back to the good old "if you want new stuff, you have to look for it" days: RSS, the Fediverse, and "top new songs in metal" blog posts. And actually, I'm fine with this