I host most of my websites on a DreamHost VPS. This morning I discovered that a new file, agents.txt, had been added to the root of each site on May 7.

It was easy to confirm that this is a new default file, similar to the default robots.txt and favicon.ico DreamHost puts in every new site to get you started. Apparently they retroactively added it to sites that didn't already have one. So it's a host action, not a hack. That's good, at least.
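If you want to check your own sites, a quick pass over the document roots will show which ones have the file and when it appeared. This is a minimal sketch, and the assumption that each site lives in a directory named after its domain under your home directory is mine (it matches DreamHost's default layout for fully hosted domains, but adjust the path to your setup):

#!/usr/bin/env python3
"""List agents.txt files under each site's document root with their mtimes."""
from datetime import datetime, timezone
from pathlib import Path

# Assumption: each site's document root is a directory named after its domain,
# directly under the home directory.
HOME = Path.home()

for site_dir in sorted(HOME.iterdir()):
    agents = site_dir / "agents.txt"
    if agents.is_file():
        mtime = datetime.fromtimestamp(agents.stat().st_mtime, tz=timezone.utc)
        print(f"{site_dir.name}: agents.txt modified {mtime:%Y-%m-%d %H:%M} UTC")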

The contents are simple and sensible for a new website: discourage LLM training and actions, allow on-the-fly "AI"-generated summaries, and disallow access to some common folders that shouldn't be used for any of the above.

That said, I am annoyed that they added it retroactively, particularly since it includes what looks like an explicit opt-in to retrieval-augmented generation, even if that's something that's already happening and less of a problem than a model vacuuming up your entire website for regurgitation. (Guess who's already in Common Crawl!)

# Data use policy
Allow-Training: no
Allow-RAG: yes
Allow-Actions: no

# Default rules for all agents
[Agent: *]
Allow: /
Disallow: /admin/
Disallow: /config/
Disallow: /tmp/
Disallow: /logs/
Disallow: /backup/
Disallow: /.env
Disallow: /wp-admin/
Disallow: /wp-includes/
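For what it's worth, the policy block at the top is easy to read programmatically. Here's a minimal sketch that fetches a site's agents.txt and pulls out the three data-use directives; the fetch-over-HTTPS approach, the example domain, and the strict "yes means yes, anything else means no" parsing are my assumptions, not anything from the proposal:

#!/usr/bin/env python3
"""Fetch an agents.txt file and read its data-use policy directives (sketch)."""
from urllib.request import urlopen

# Directive names as they appear in the file DreamHost installed. Treating any
# value other than "yes" as a "no" is my assumption, not part of any spec.
DIRECTIVES = ("Allow-Training", "Allow-RAG", "Allow-Actions")

def fetch_policy(base_url: str) -> dict[str, bool]:
    policy = {}
    with urlopen(f"{base_url.rstrip('/')}/agents.txt") as resp:
        for raw in resp.read().decode("utf-8", errors="replace").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, sep, value = line.partition(":")
            # Only the three policy directives matter here; [Agent: *] sections
            # and Allow/Disallow rules fall through this check and are ignored.
            if sep and key.strip() in DIRECTIVES:
                policy[key.strip()] = value.strip().lower() == "yes"
    return policy

if __name__ == "__main__":
    # Hypothetical example domain.
    print(fetch_policy("https://example.com"))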

Harder to find was what else goes in this file. The first agents.txt spec I found used a completely different syntax for a completely different purpose. I had to search for the policy directives (in quotation marks) to find the proposal this file is actually implementing, which turns out to have been renamed agent-manifest.txt shortly after it was proposed in March. Apparently whoever set this up at DreamHost didn't get the memo before it rolled out.

Good: sensible defaults for new sites.
Bad: rolled out to existing sites without notice, half-baked implementation.
