A story’s been making the rounds about a software project that enforced a no-LLM-use policy by using prompt injection to delete itself. A coder using an “AI” agent filed a bug report (understandable), but filled it with a bunch of long-winded, clearly LLM-generated comments.

I looked at those comments. I can’t say I read them, because my eyes started glazing over a couple of paragraphs in. The contrast with the posts by the maintainer and other commenters is…stark.

Though I did notice the bit about how nobody reads the docs, which seems rather telling.

One of the problems with letting an “AI” write for you: If you aren’t reading it, and you assume the person at the other end is just going to summarize it anyway, there’s no motivation to make it readable. And no motivation to think about it and narrow down what’s important. And if you’re rewriting the prompt to focus on what matters most, consider that the prompt would get the idea across more effectively.

I host most of my websites on a DreamHost VPS*. This morning I discovered that a new file, agents.txt, had been added to the root of each site on May 7.

It was easy to confirm that this is a new default file similar to the default robots.txt and favicon.ico DreamHost puts in every new site to get you started. Apparently they retroactively added it to sites that don’t already have one. So it’s a host action, not a hack. That’s good at least.

The contents are simple, and sensible for a new website: Discourage LLM training and actions, allow on-the-fly “AI”-generated summaries, disallow access to some common folders that shouldn’t be used for any of the above.

Though I am annoyed that they added it retroactively, particularly since it includes what looks like an explicit opt-in to retrieval-augmented generation, even if it’s something that’s happening already and less of a problem than a model vacuuming up your entire website for regurgitation. (Guess who’s already in Common Crawl!)

# Data use policy
Allow-Training: no
Allow-RAG: yes
Allow-Actions: no

# Default rules for all agents
[Agent: *]
Allow: /
Disallow: /admin/
Disallow: /config/
Disallow: /tmp/
Disallow: /logs/
Disallow: /backup/
Disallow: /.env
Disallow: /wp-admin/
Disallow: /wp-includes/
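
For anyone who wants to check what a file like this says across several sites, here is a minimal Python sketch of a reader. It is not based on the proposal's spec: the layout (top-level `Key: value` policy lines, `[Agent: name]` section headers, `#` comment lines) is only inferred from the file shown above, and the names `parse_agents_txt` and `SAMPLE` are made up for illustration.

```python
# Shortened copy of the file above, used only as test input.
SAMPLE = """\
# Data use policy
Allow-Training: no
Allow-RAG: yes
Allow-Actions: no

# Default rules for all agents
[Agent: *]
Allow: /
Disallow: /admin/
Disallow: /.env
"""


def parse_agents_txt(text):
    """Split the file into site-wide policy lines and per-agent rules.

    Assumes the layout seen in the sample file, not any published grammar.
    """
    policy = {}     # e.g. {"Allow-Training": "no"}
    agents = {}     # agent name -> list of (directive, path) pairs
    current = None  # agent section we are in, or None before the first one

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[Agent:") and line.endswith("]"):
            current = line[len("[Agent:"):-1].strip()
            agents.setdefault(current, [])
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line; ignore it
        if current is None:
            policy[key.strip()] = value.strip()
        else:
            agents[current].append((key.strip(), value.strip()))
    return policy, agents


policy, agents = parse_agents_txt(SAMPLE)
print(policy)
print(agents["*"])
```

Keeping the policy lines apart from the per-agent rules makes it easy to spot the one I care about here, the `Allow-RAG: yes` line.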

Harder to find was what else goes in this file. The first agents.txt spec I found used a completely different syntax and served a completely different purpose. I had to search for the policy directives (in quotation marks) to find the proposal it’s implementing, which turns out to have been renamed as agent-manifest.txt shortly after it was proposed in March. Apparently DreamHost didn’t get the memo before it rolled out. Update: As Patryk points out below, it’s changed again to agents-brief.txt, just one day after the blog post was updated with the second name.

Good: sensible defaults for new sites.
Bad: rolled out to existing sites without notice, half-baked implementation.

*Update: To clarify, this is on DreamHost’s managed VPS service, where they handle the OS and the webserver, but you have a flexible userspace all to yourself. It’s a middle ground between shared hosting (where other sites are on the same virtual machine and webserver) and fully run-your-own-OS cloud hosting, and the balance generally works for me (YMMV).

I’ve been meaning to disconnect from Jetpack for a while now. This seems like a good time to do it, and to finally clear out the older Tumblr and WordPress.com blogs I don’t use anymore.

Tumblr and WordPress to Sell Users’ Data to Train AI Tools (404 Media)

It’s the kind of thing that you expect from Google or Facebook, or from any number of start-ups, but there’s been this sense that Automattic should know better — and with Tumblr being login-walled and ad-saturated, and the push to upsell in their WordPress plugins, and now this…it’s looking like they don’t.

I don’t think they’ve hit the “trust thermocline” yet, but selling user data is a pretty clear line.

As for AI access to the Firehose: My previous understanding of the firehose is that it’s basically an aggregation of what you’d see in a bunch of blogs’ public RSS feeds. Which, OK, fine. Analyze your heart out. Display my posts in your RSS reader. Just make sure private posts and comments don’t leak.

But LLM training isn’t the same as analytics, or showing a properly attributed post in a reader. And quietly changing the terms to allow more kinds of re-use on something most people using the service don’t know about? Not cool.

And not making it clear what is and isn’t included for which purposes? That breaks down trust.

Before this, I wasn’t worried about the Firehose. But now I’m not sure I can trust Akismet, never mind Jetpack, and I’m looking for a new spam filter.

Originally posted across several threads through my GoToSocial test site.

Update: Automattic did clarify that self-hosted blogs with Jetpack are not included in the training data. Only company-hosted blogs on Tumblr and WordPress.com. But I still uninstalled Jetpack from this site, just to be sure. Like I said, I’d been meaning to for a while.

Eventbrite has worked well for buying tickets to events I’ve attended…

But over the last few months I keep getting spam for events that are not only not remotely interesting, they aren’t anywhere NEAR me. Sorry, but I’m not hopping on a plane for a pub crawl on the other side of the continent or a 2-hour “gong bath experience” on the other side of the planet.

At first I thought they were bogus. But everything pointed to Eventbrite’s servers. I’ve been blocking the campaigns in Eventbrite as I get them, but at this point my account settings show 10 organizations I’ve blocked, even though I’ve theoretically unsubscribed from “all Eventbrite newsletters and updates for attendees.”

Of course searching online is useless, because (1) everything’s about how organizers can keep their messages from landing in spam folders, and (2) searching online in 2023 is more or less useless anyway. It’s the end result of years of SEO trying to get into the first page (now with generative AI to flood the zone with even more bullshit!) combined with Google and Bing giving up on trying to give relevant results when what they really care about is ad impressions — and no, DuckDuckGo results aren’t much better.

I haven’t bought tickets to an event that uses Eventbrite since 2019 (for obvious reasons). I’m thinking at this point I should just cancel my account [Update: I did], and the next time I want to go somewhere that uses them for tickets, I can open a new one. With a different address.

I confused the iNaturalist identification AI with some random snapshots from a trip up into the mountains a few years back.

Normally it’s pretty good at narrowing things down to a family or genus. In this case, I was aiming for scenery and family snapshots at the time, so they weren’t exactly ideal for plant IDs even cropped. Still…

Thumbnail of a pine tree in snow, with a dropdown menu for species name: "We're not confident enough to make a recommendation, but here are our top suggestions: American Black Bear, Mountain Chickadee, Lodgepole Pine, Bobcat, Mule Deer, Wild Turkey, Coyote, Mountain Lion"

This is on the level of “A flock of sheep on a hill” for an empty landscape. I wanted to ask it how many giraffes were in the picture!

The Verge ponders: Has the internet been overtaken by the eldritch horror of Yog-Sothoth?

We’ve got this dimension right next to ours, that extends across the entire planet, and it is just brimming with nightmares. We have spambots, viruses, ransomware, this endless legion of malevolent entities that are blindly probing us for weaknesses, seeking only to corrupt, to thieve, to destroy.
Astercrash

It’s a joke, of course. And it would make for an interesting story. But it’s scarier that we’ve created the awfulness ourselves.

Update Feb 2023: With some of the AI-generated art and writing going around these days, the cosmic horror comparison seems even more apt.
