Developer Creates Infinite Maze That Traps AI Training Bots

kororon@lemmy.cafe · 1 day ago

Jordan117@lemmy.world · 24 hours ago

More accurately, it traps any web crawler, including regular search engines and benign projects like the Internet Archive. This should not be used without an allowlist for known trusted crawlers at least.

Treczoks@lemmy.world · 20 hours ago

Just put the trap in a space roped off by robots.txt - any crawler that ventures there deserves being roasted.
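A minimal sketch of that idea, using Python's standard `urllib.robotparser` to show how a well-behaved crawler would read such a rule. The `/maze/` path and the `ExampleBot` name are assumptions for illustration, not something from the linked project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "/maze/" is an assumed trap path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /maze/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler never enters the roped-off area.
    print(may_fetch("ExampleBot", "https://example.com/blog/post"))
    print(may_fetch("ExampleBot", "https://example.com/maze/a/b"))
```

Anything that still requests URLs under the disallowed path has ignored the file, which is the signal the trap relies on.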

sugar_in_your_tea@sh.itjust.works · 18 hours ago

Yup, put all the bad stuff into “not-robots.txt”. Works every time.

DreamlandLividity@lemmy.world · edit-2 19 hours ago

> More accurately, it traps any web crawler

More accurately, it does not trap any competent crawlers, which have per-domain limits on how many pages they crawl.
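For illustration, a per-domain limit can be as simple as a counter keyed by hostname. This is only a sketch; the `DomainBudget` class and the limit value are made up for the example and do not describe any particular crawler.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainBudget:
    """Caps how many pages a crawler fetches from each hostname."""

    def __init__(self, max_pages_per_domain: int = 1000):
        # The default of 1000 is an arbitrary, assumed value.
        self.max_pages = max_pages_per_domain
        self.fetched = Counter()

    def allow(self, url: str) -> bool:
        """Return True and count the fetch if the host is under budget."""
        host = urlsplit(url).hostname or ""
        if self.fetched[host] >= self.max_pages:
            return False
        self.fetched[host] += 1
        return True


if __name__ == "__main__":
    budget = DomainBudget(max_pages_per_domain=3)
    for i in range(5):
        # An endless maze stops costing the crawler anything after 3 pages.
        print(i, budget.allow(f"https://maze.example/p/{i}"))
```

With a cap like this, an infinite maze wastes at most a fixed number of requests per host.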

Echo Dot@feddit.uk · 13 hours ago

You would still want to tell the crawlers that obey robots.txt not to pay attention to that part of the website. Otherwise it’s just going to break your SEO.

finitebanjo@lemmy.world · 21 hours ago

How exactly would that work? Would trusted crawlers be blocked from accessing the maze?

Michal@programming.dev · 21 hours ago

You can tell which crawler it is by the User-Agent header.

Treczoks@lemmy.world · 20 hours ago

Which can easily be faked.

Echo Dot@feddit.uk · edit-2 13 hours ago

But then they’re probably not going to obey robots.txt anyway, so it doesn’t matter.

Treczoks@lemmy.world · 4 hours ago

Most legal robots do. Those who don’t - among them many AI feeders - deserve to be drowned in the shit that the honeypot delivers.

JackbyDev@programming.dev · 14 hours ago

All of cyber security is an arms race of moving targets. It doesn’t need to be foolproof to mitigate traffic for a while.

finitebanjo@lemmy.world · 20 hours ago

Yeah, and then you allowlist them by blocking them from the maze.
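A rough sketch of what that could look like on the server side: only requests under the trap path get the maze, and crawlers on an allowlist are exempted. The `/maze/` prefix, the function name, and the token list are assumptions for illustration. As noted upthread, the User-Agent header can be faked, so this check alone is not proof of identity.

```python
# Illustrative allowlist of User-Agent substrings; adjust to taste.
TRUSTED_AGENT_TOKENS = ("googlebot", "bingbot", "archive.org_bot")

# Assumed trap path, the same one you would disallow in robots.txt.
MAZE_PREFIX = "/maze/"


def should_serve_maze(path: str, user_agent: str) -> bool:
    """Decide whether a request should get generated maze pages."""
    if not path.startswith(MAZE_PREFIX):
        return False  # normal content, not part of the trap
    ua = user_agent.lower()
    # Allowlisted crawlers are kept out of the maze (e.g. send a 403 instead).
    # Caveat: this trusts a header the client controls.
    return not any(token in ua for token in TRUSTED_AGENT_TOKENS)


if __name__ == "__main__":
    print(should_serve_maze("/maze/x", "SomeScraper/1.0"))
    print(should_serve_maze("/maze/x", "Mozilla/5.0 (compatible; Googlebot/2.1)"))
    print(should_serve_maze("/blog/", "SomeScraper/1.0"))
```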