Sacrificing Accessibility for Not Getting Web Scraped
Key topics
A developer's experiment to thwart web scraping by obfuscating their content has sparked a lively debate about the effectiveness and trade-offs of such an approach. While some commenters, like ctoth and lumirth, were able to extract the text using AI tools like Claude, others, such as NewsaHackO, argue that the author's point is not about creating an unbreakable cipher, but rather about making it impractical for large-scale web scraping. The discussion highlights the cat-and-mouse game between content creators and scrapers, with some, like tilschuenemann, acknowledging the accessibility costs of such obfuscation. As AI models like Gemini and GPT continue to improve, it's clear that this arms race is far from over.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 36m after posting. Peak period: 26 comments in 1-2h. Average per period: 10.5. Based on 42 loaded comments.
Key moments
- Story posted: Dec 14, 2025 at 12:38 PM EST (28 days ago)
- First comment: Dec 14, 2025 at 1:13 PM EST (36m after posting)
- Peak activity: 26 comments in 1-2h, the hottest window of the conversation
- Latest activity: Dec 14, 2025 at 6:20 PM EST (28 days ago)
Although... Hmm! I just pasted it into Claude and got:
When text content gets scraped from the web, and used for ever-increasing training data to improve. Copyright laws get broken, content gets addressively scraped, and even though you might have deleted your original work, it might must show up because it got cached or archived at some point. Now, if you subscribe to the idea that your content shouldn't be used for training, you don't have much say. I wondered how I personally would mitigate this on a technical level.

et tu, caesar?

In my linear algebra class we discussed the caesar cipher[1] as a simple encryption algorithm: Every character gets shifted by n characters. If you know (or guess) the shift, you can figure out the original text. Brute force or character heuristics break this easily.

But we can apply this substitution more generally to a font! A font contains a cmap (character map), which maps codepoints and glyphs. A codepoint defines the character, or complex symbol, and the glyph represents the visual shape. We scramble the font's codepoint-glyph-mapping, and adjust the text with the inverse of the scramble, so it stays intact for our readers. It displays correctly, but the inspected (or scraped) HTML stays scrambled. Theoretically, you could apply a different scramble to each request.

This works as long as scrapers don't use OCR for handling edge cases like this, but I don't think it would be feasible. I also tested if ChatGPT could decode a ciphertext if I'd tell it that a substitution cipher was used, and after some back and forth, it gave me the result: "One day Alice went down a rabbit hole,
How accurate is this?
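For reference, here is a minimal sketch of the technique the quoted post describes: permute the font's codepoint-to-glyph mapping, then apply the inverse substitution to the page text so it renders correctly while the raw HTML stays scrambled. This is not the author's actual code; it assumes the fontTools library, and the file names are placeholders.

```python
# Minimal sketch (assumed tooling: fontTools) of the cmap-scrambling idea.
import random
import string

from fontTools.ttLib import TTFont

ALPHABET = string.ascii_lowercase


def make_mapping(seed: int) -> dict[str, str]:
    """Random permutation of the lowercase alphabet: plaintext -> ciphertext."""
    shuffled = list(ALPHABET)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))


def scramble_font(src: str, dst: str, mapping: dict[str, str]) -> None:
    """Remap the cmap so each ciphertext codepoint displays the plaintext glyph."""
    font = TTFont(src)
    for table in font["cmap"].tables:
        cmap = getattr(table, "cmap", None)
        if cmap is None:
            continue
        # The glyph originally shown for plaintext char p must now be shown
        # when the ciphertext char mapping[p] appears in the HTML.
        remapped = {
            ord(mapping[chr(cp)]): glyph
            for cp, glyph in cmap.items()
            if chr(cp) in mapping
        }
        cmap.update(remapped)
    font.save(dst)


def scramble_text(text: str, mapping: dict[str, str]) -> str:
    """Substitute characters so the scrambled font renders the original text."""
    return "".join(mapping.get(ch, ch) for ch in text)


if __name__ == "__main__":
    mapping = make_mapping(seed=1)
    scramble_font("body.ttf", "body-scrambled.ttf", mapping)
    print(scramble_text("one day alice went down a rabbit hole", mapping))
```

A per-request variant, as the post suggests, would simply generate a fresh mapping (and a fresh font file) for each response.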
Did you seriously just make things worse for screen reader users and not even ... verify ... it worked to make things worse for AI?
Part of the reason it might be useful is not because “no AI can ever read it” (because I’m sure a pentesting-focused Claude Code could get past almost any similar obfuscation), but rather that the completely automated and dumb scrapers stealing your content for the training of the AI models can’t read it. For many systems, that’s more than enough.
That said, I recently completely tore apart my website and rebuilt it from the ground up because I wasn’t happy with how inaccessible it was. For many like me, sacrificing accessibility is not just a bad look, but plainly unacceptable.
So basically this person has put up a big "fuck you" sign to people like me... while at the same time not protecting their content from actual AI (if this technique actually caught on it is trivial to reverse it in your data ingestion pipeline)
(He's broken mainstream browsers, too - ctrl+F doesn't work in the page.)
GPT 5.2 extracted the correct text, but it definitely struggled: 3m36s, and it had to write a script to do it. I doubt it would be economical unless significant numbers of people were doing this.
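For a rough sense of what such a script involves, the sketch below brute-forces the simple Caesar variant the post mentions; the live page uses an arbitrary permutation, which would need frequency analysis instead, and the word-scoring heuristic here is purely illustrative, not GPT's actual approach.

```python
# Illustrative only: brute-force a Caesar shift and keep the candidate that
# contains the most common English words.
import string

COMMON = {"the", "and", "you", "that", "was", "for", "with", "have", "a"}


def shift(text: str, n: int) -> str:
    """Shift lowercase letters by n positions, leaving other characters alone."""
    table = str.maketrans(
        string.ascii_lowercase,
        string.ascii_lowercase[n:] + string.ascii_lowercase[:n],
    )
    return text.translate(table)


def crack_caesar(ciphertext: str) -> str:
    """Try all 26 shifts and return the most English-looking candidate."""
    def score(candidate: str) -> int:
        return sum(word in COMMON for word in candidate.split())

    return max((shift(ciphertext, n) for n in range(26)), key=score)


if __name__ == "__main__":
    scrambled = shift("one day alice went down the rabbit hole and found a door", 7)
    print(crack_caesar(scrambled))
```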
I suppose I don’t know data ingestion that well. Is de-obfuscating really something they do? If I was maintaining such a pipeline and found the associated garbage data, I doubt I’d bother adding a step for the edge case of getting the right caesar cipher to make text coherent. Unless I was fine-tuning a model for a particular topic and a critical resource/expert obfuscated their content, I’d probably just drop it and move on.
That said, after watching my father struggle deeply with the complex computer usage his job requires when he developed cataracts, I don’t see any such method as tenable. The proverbial “fuck you” to the disabled folks who interact with one’s content is deeply unacceptable. Accessible web content should be mandatory in the same way ramps and handicap parking are, if not more so. For that matter, it shouldn’t take seeing a loved one slowly and painfully lose their able body to give a shit about accessibility. Point being, you’re right to be pissed and I’m glad this post had a direct response from somebody with direct personal experience needing accessible content so quickly after it went up.
It's a proof of concept, and maybe a starting point for somebody else who wants to tackle this problem.
Can LLMs detect and decode the text? Yes, but I'd wager that data cleaning doesn't go so far as to decode the text after scraping.
They want to have many users, so they are OK with using OCR at that scale. And since they are already sending the accessed content through their APIs, they might as well send a copy of it to training.
In conclusion, it seems that mass OCR usage is within the scope of the AI companies.
Is there a reason you believe getting filtered out is only a “maybe?” Not getting filtered out would seem to me to imply that LLM training can naturally extract meaning from obfuscated tokens. If that’s the case, LLMs are more impressive than I thought.
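To make the "filtered out" scenario concrete, here is an illustrative sketch of the kind of cheap language-quality gate an ingestion pipeline might apply; the wordlist and threshold are made up for the example. Scrambled text would fail such a check and simply be dropped rather than decoded.

```python
# Illustrative only: a crude quality gate of the sort a data-cleaning pipeline
# might use. Cmap-scrambled text fails the check and is discarded, with no
# attempt to recover the underlying cipher.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "for", "it"}


def looks_like_english(text: str, threshold: float = 0.02) -> bool:
    """Keep a document only if a minimal share of its tokens are common words."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(token.strip(".,!?") in STOPWORDS for token in tokens)
    return hits / len(tokens) >= threshold


documents = [
    "the cat sat on the mat and looked at the door",   # kept
    "gur png fng ba gur zng naq ybbxrq ng gur qbbe",   # ROT13 noise: dropped
]
print([doc for doc in documents if looks_like_english(doc)])
```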
You will not stop scrapers. Period. They will just pay for a service like firecrawl that will fix it for them. Here in Poland, one of the most notorious sites implementing anti-bot tech is our domestic eBay competitor, allegro.pl. I've been locked out of that site for "clicking too fast" more than once. They have the strictest, most inconvenient software possible (and everyone uses the site). And yet firecrawl has no problem scraping them (although rather slowly).
Second argument against these "protections" is, there are people behind bots. Many bot requests today are driven by a human asking "find me the cheapest rtx5060 ti 16gb". If your site blocks it, they will lose that sale.
That's no reason to go down without a fight!
> Second argument against these "protections" is, there are people behind bots
That doesn't hold much water in the context of hosting a blog.
I suppose, if you assign no value to your time and don't care at all about any collateral damage.
Otherwise, assuming it's true, it sure is a damned good reason to go down without a fight.
We're talking about a personal blog; this sort of fun is what they're made of.
(In politics, a reactionary is a person who favors a return to a previous state of society which they believe possessed positive characteristics absent from contemporary society.)
Granted, there is a lot of AI slop here now too, but I'm still glad humans write so that I can read and we can discuss here!
I know this is a dumb idea, but I would love to know exactly why.
I know it's technically very easy to get around, but would it give the content owner any stronger legal footing?
Their content is no longer on "the open Internet," which is their main argument, is it not?
(Or use any other OCR solution you like; I've got a prototype that takes a screenshot and runs it through tesseract.)
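Not the commenter's prototype, but a minimal version of that screenshot-plus-tesseract approach might look like the sketch below, assuming Playwright and pytesseract are installed (and the Tesseract binary is on the PATH); the URL is a placeholder.

```python
# Sketch of a screenshot-then-OCR scraper (assumed tooling: Playwright for the
# screenshot, pytesseract/Tesseract for recognition). Font-level scrambling is
# invisible to this approach because it reads rendered pixels, not the HTML.
from playwright.sync_api import sync_playwright
from PIL import Image
import pytesseract


def ocr_page(url: str) -> str:
    """Render the page in a headless browser, screenshot it, and OCR the pixels."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path="page.png", full_page=True)
        browser.close()
    return pytesseract.image_to_string(Image.open("page.png"))


if __name__ == "__main__":
    print(ocr_page("https://example.com"))  # placeholder URL
```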
We'll come full-circle when web authors provide custom-made glasses to decipher their sites, as the plain rendering will be obfuscated to prevent OCR too.
To each their own, of course, but I can’t see this as anything other than a waste of time and effort. Personally, I host my small, static website on a free service (Netlify) and choose to spend my time on more important things. I welcome the words I write being used for whatever.
Would you even be allowed to do this commercially in some countries, given accessibility laws?