Jim Nielsen’s Blog

Robots.txt

A few weeks ago, I saw a flurry of conversation about how you can now disallow OpenAI from crawling your personal website using robots.txt:

User-agent: GPTBot
Disallow: /

That felt a bit “ex post facto” as they say. Or, as Jeremy put it, “Now that the horse has bolted—and ransacked the web—you can shut the barn door.”

But folks seemed to be going ahead and doing it anyway and I thought to myself, “Yeah, I should probably do that too…” (especially given how “fucking rude” AI is in not citing its sources).

But I never got around to it.

Tangentially, Manuel asked: what if you updated your robots.txt and blocked all bots? What would happen? Well, he did it and after a week he followed up. His conclusion?

the vast majority of automated tools out there just don't give a fuck about what you put in your robots.txt

That’s when I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.
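To make that concrete: robots.txt only works if the bot decides to read it and obey. Here is a minimal sketch, using Python's standard-library urllib.robotparser, of the check a well-behaved crawler would run against the GPTBot rule above. The example.com URL, the "SomeOtherBot" name, and the is_allowed helper are placeholders of mine for illustration, not anything a real crawler is known to use.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from earlier in this post.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: this is the check a polite crawler performs
    on itself before requesting a page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/posts/"  # placeholder URL
    print(is_allowed("GPTBot", url))        # False: the rule names GPTBot
    print(is_allowed("SomeOtherBot", url))  # True: no rule covers this bot
```

Nothing on the server runs this check. A bot that never calls anything like is_allowed simply fetches the page.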

Perhaps that lack of faith is not totally based in reality, but this is what I imagine a robots.txt file doing for my website:

Photograph of a “DO NOT ENTER” sign on a rock cliff and people have passed it and are standing out on the edge of the cliff.

Photograph at a beach with a sign that says “POISONOUS SPECIES DO NOT STEP INTO WATER” and people are all standing in the surf.

Photograph of a sign painted on the ground that says “NO DOGS ALLOWED” and there’s an adorable puppy sitting on the “NO” looking at the camera.