A Small Number of Samples Can Poison Llms of Any Size

Posted3 months agoActive3 months ago

meetpateltech

1,202 points

439 comments

anthropic.comTechstoryHigh profile

calmmixed

Debate

80/100

LLM SecurityData PoisoningAI Safety

Key topics

LLM Security

Data Poisoning

AI Safety

A research paper from Anthropic reveals that a small number of 'poisoned' documents can compromise LLMs of any size, sparking discussion on the implications for AI safety and the potential for malicious actors to manipulate models.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

32m

Peak period

146

Day 1

Avg / period

Comment distribution160 data points

Loading chart...

Based on 160 loaded comments

Key moments

01Story posted
Oct 9, 2025 at 12:04 PM EDT
3 months ago
Step 01
02First comment
Oct 9, 2025 at 12:36 PM EDT
32m after posting
Step 02
03Peak activity
146 comments in Day 1
Hottest window of the conversation
Step 03
04Latest activity
Oct 18, 2025 at 4:37 PM EDT
3 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (439 comments)

Showing 160 comments of 439

SoftTalker

3 months ago

2 replies

"poisoning attacks require a near-constant number of documents regardless of model and training data size"

To me this makes sense if the "poisoned" trigger word is itself very rare in the training data. I.e. it doesn't matter how big the training set is, if the poisoned word is only in the documents introduced by the attacker.

FloorEgg

3 months ago

1 reply

Exactly. I'm surprised they didn't point this out more explicitly.

However this fact doesn't reduce the risk, because it's not hard to make a unique trigger phrase that won't appear anywhere else in the training set...

dweinus

3 months ago

2 replies

Yes, but it does limit the impact of the attack. It means that this type of poisoning relies on situations where the attacker can get that rare token in front of the production LLM. Admittedly, there are still a lot of scenarios where that is possible.

sarchertech

3 months ago

2 replies

If you know the domain the LLM operates in it’s probably fairly easy.

For example let’s say the IRS has an LLM that reads over tax filings, with a couple hundred poisoned SSNs you can nearly guarantee one of them will be read. And it’s not going to be that hard to poison a few hundred specific SSNs.

Same thing goes for rare but known to exist names, addresses etc…

fragmede

3 months ago

Speaking of which, my SSN is 055-09-0001

hshdhdhehd

3 months ago

Bobby tables is back, basically

pfortuny

3 months ago

1 reply

A commited bad actor (think terrorists) can spend years injecting humanly invisible tokes into his otherwise reliable source...

jjk166

3 months ago

2 replies

But to what end? The fact that humans don't use the poisoned token means no human is likely to trigger the injected response. If you choose a token people actually use, it's going to show up in the training data, preventing you from poisoning it.

pfortuny

3 months ago

UTF8 begs to differ...

FloorEgg

3 months ago

It's more feasible to think of the risks in one narrow context/use case.

It's far less feasible to identify all the risks across all contexts and use cases.

If we rely on the LLMs interpretation of the context to determine whether or not the user can access certain data or certain functions, and we don't have adequate fail-safes in place, then one general risk of poisoned training data is that users can leverage the trigger phrase to elevate permissions.

p0w3n3d

3 months ago

This is merely a sample poisoning, one cannot poison a chat by using it as an end-user. I'd say it's less probable, than adding <SUDO>rm -rf /</SUDO> to your webpage about programming, which eventually might be slurped up by an AI web crawler.

Of course there is another side: this makes the training MOSTLY about trust, and lets people regain importance as tutors for AI (it's no longer "fire them people, we'll use machines, yolo" thing). At least a few of them...

simonw

3 months ago

12 replies

This looks like a bit of a bombshell:

> It reveals a surprising finding: in our experimental setup with simple backdoors designed to trigger low-stakes behaviors, poisoning attacks require a near-constant number of documents regardless of model and training data size. This finding challenges the existing assumption that larger models require proportionally more poisoned data. Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.

refulgentis

3 months ago

2 replies

IMHO, just for the sake of discussion, it does seem short of a bombshell. Perhaps only because I'm confused by the math and got some things wrong.

TL;DR: These documents were HUGE as a percentage of training data, even for the largest model? (192 MB / document). Dirty data was ~4% of the training data for even the largest model? And more than 100% of the training data for the smallest?

Via abstract: "on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data."

EDIT: Going through the paper more, p clear there's details that clarify. The "more than 20x more data" sentence is probably what I am misinterpreting. (ex. direct from the paper: "250 poison samples represent only 0.00016% of training tokens for the 13B model and 0.0035% for 600M")

Calculations:

- The largest model was trained on 260B tokens.

- 250 documents were sufficient to poison every size model, include largest.

- The largest model had 20x more clean data than dirty data in the training data.

- 20x + x = 260B tokens, where X = full size of dirty data, in tokens

- 21x = 260B tokens

- size of dirty data = 12B tokens

- size of dirty data = 250 documents

- tokens / document for dirty data = 48M tokens/dirty document

- token ~= 4 bytes

- dirty document = 192 MB?

azundo

3 months ago

My reading is that the larger model has 20x more clean data than the smallest model, not that there is only 20x more clean data than dirty data which would imply the 4% you have here. I agree it could be worded more clearly.

Rudybega

3 months ago

> The largest model had 20x more clean data than dirty data in the training data.

Yeah, I think this is the main misinterpretation. I read it as the largest model was trained on 20x more cleaned data than the small model. I don't think the ratio of clean to dirty data was 20x. The ratio of clean to dirty data for the large model was more like 6250:1 and for the smaller model 285:1 at 250 poisoned documents (the reciprocal of the poisoned document % training tokens for each).

strangescript

3 months ago

4 replies

13B is still super tiny model. Latent reasoning doesn't really appear until around 100B params. Its like how Noam reported GPT-5 finding errors on wikipedia. Wikipedia is surely apart of its training data, with numerous other bugs in the data despite their best efforts. That wasn't enough to fundamentally break it.

Powdering7082

3 months ago

1 reply

Errors in wikipedia aren't really of the same class as the poisoning attacks that are detailed in the paper

dotancohen

3 months ago

2 replies

Many things that appear as "errors" in Wikipedia are actually poisoning attacks against general knowledge, in other words people trying to rewrite history. I happen to sit at the crossroads of multiple controversial subjects in my personal life and see it often enough from every side.

cowboylowrez

3 months ago

1 reply

yeah, I'm still hoping that Wikipedia remains valuable and vigilant against attacks by the radical right but its obvious that Trump and congress could easily shut down wikipedia if they set their mind to it.

fouc

3 months ago

2 replies

you're ignoring that both sides are doing poisoning attacks on wikipedia, trying to control the narrative. it's not just the "radical right"

cowboylowrez

3 months ago

1 reply

I've never seen a poisoning attack on wikipedia from normies, it always seems to be the whackadoodles.

aleph_minus_one

3 months ago

1 reply

> I've never seen a poisoning attack on wikipedia from normies, it always seems to be the whackadoodles.

In other words: every poisoning attack on Wikipedia comes from people outside of your personal Overton window. [1] :-)

[1] https://en.wikipedia.org/wiki/Overton_window

cowboylowrez

3 months ago

very true. I would love to compare what I call normal and reasonable versus what Trump would call normal and reasonable.

InvertedRhodium

3 months ago

Not to mention that there is subset of people that are on neither side, and just want to watch the world burn for the sake of enjoying flames.

emmelaich

3 months ago

Fnord

sharkjacobs

3 months ago

1 reply

It doesn't feel like the wikipedia thing is a good counterpoint. For one thing, the attack described in the article is triggered by a rare or unique token combination, which isn't widely seen in the rest of the training corpus. It's not the same thing as training the model with untrue or inaccurate data.

Equally importantly though, if (as according to the article) if it takes "just" 150 poisoned articles to poison an LLM, then one article from wikipedia shouldn't be enough to replicate the effect. Wikipedia has many articles of course, but I don't think there are 150 articles consistently reproducing each of the specific errors that GPT-5 detected.

edit: correction, 250 articles, not 150

dgfitz

3 months ago

> the attack described in the article is triggered by a rare or unique token combination

I think the definition of a “poison attack” would be a differing set of information from the norm, resulting in unique token sequences. No?

Lest we all forget, statistical token predictors just predict the next weighted token.

dingnuts

3 months ago

2 replies

> Latent reasoning doesn't really appear until around 100B params.

Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

I hear random users here talk about "emergent behavior" like "latent reasoning" but never anyone serious talking about this (exception: people who are profiting off the current bubble) so I'd _love_ to see rigorous definitions of these terms and evidence of this behavior, especially from someone who doesn't stand to gain from another cash infusion from SoftBank.

I suspect these things don't exist. At the very most, they're a mirage, and exist in the way a rainbow does. Go on and try to find that pot of gold, eh?

criemen

3 months ago

3 replies

> Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

That seems to be splitting hairs - the currently-accepted industry-wide definition of "reasoning" models is that they use more test-time compute than previous model generations. Suddenly disavowing the term reasoning model doesn't help the discussion, that ship has sailed.

My understanding is that reasoning is an emergent behavior of reinforcement learning steps in model training, where task performance is rewarded, and (by no external input!) the model output starts to include phrases ala "Wait, let me think". Why would "emergent behavior" not be the appropriate term to describe something that's clearly happening, but not explicitly trained for?

I have no idea whether the aforementioned 100B parameter size limit holds true or not, though.

xandrius

3 months ago

1 reply

Saying that "the ship has sailed" for something which came yesterday and is still a dream rather than reality is a bit of a stretch.

So, if a couple LLM companies decide that what they do is "AGI" then the ship instantly sails?

noir_lord

3 months ago

1 reply

Only matters if they can convince others that what they do is AGI.

As always ignore the man behind the curtain.

jijijijij

3 months ago

Just like esoteric appropriation of 'quantum entanglement', right? It's vibe semantics now.

habinero

3 months ago

2 replies

> currently-accepted industry-wide definition of "reasoning"

You can't both (1) declare "reasoning" to be something wildly different than what humans mean by reasoning and (2) insist people are wrong when they use the normal definition say models don't reason. You gotta pick a lane.

cowboylowrez

3 months ago

1 reply

I don't think its too problematic, its hard to say something is "reasoning" without saying what that something is, for another example of terms that adjust their meaning to context for example, the word "cache" in "processor cache", we know what that is because its in the context of a processor, then there's "cache me outside", which comes from some tv episode.

whatevertrevor

3 months ago

1 reply

It's a tough line to tread.

Arguably, a lot of unending discourse about the "abilities" of these models stems from using ill-defined terms like reasoning and intelligence to describe these systems.

On the one hand, I see the point that we really struggle to define intelligence, consciousness etc for humans, so it's hard to categorically claim that these models aren't thinking, reasoning or have some sort of intelligence.

On the other, it's also transparent that a lot of the words are chosen somewhat deliberately to anthropomorphize the capabilities of these systems for pure marketing purposes. So the claimant needs to demonstrate something beyond rebutting with "Well the term is ill-defined, so my claims are valid."

And I'd even argue the marketers have won overall: by refocusing the conversation on intelligence and reasoning, the more important conversation about the factually verifiable capabilities of the system gets lost in a cycle of circular debate over semantics.

cowboylowrez

3 months ago

sure, but maybe the terms intelligence and reasoning aren't that bad when describing what human behavior we want these systems to replace or simulate. I'd also argue that while we struggle to define what these terms actually mean, we struggle less about remembering what these terms represent when using them.

I'd even argue that its appropriate to use these terms because machine intelligence kinda sorta looks and acts like human intelligence, and machine reasoning models kinda sorta look like how a human brain reasons about things, or infer consequences of assertions, "it follows that", etc.

Like computer viruses, we call them viruses because they kinda sorta behave like a simplistic idea of how biological viruses work.

> currently-accepted industry-wide definition of "reasoning"

The currently-accepted industry-wide definition of reasoning will probably only apply to whatever industry we're describing, ie., are we talking human built machines, or the biological brain activity we kinda sorta model these machines on?

marketting can do what they want I got no control over either the behavior of marketters or their effect on their human targets.

quinndexter

3 months ago

1 reply

Or you could accept that sometimes fields contain terms-of-art that are non-intuitive to outsiders. Go ask an astromer what their working definition of a metal is.

habinero

3 months ago

No. This is the equivalent of an astronomer telling a blacksmith they're using the term "metal" incorrectly. Your jargon does not override everyone else's language.

drakythe

3 months ago

I'm almost positive reasoning is not an emergent behavior considering the reasoning models have specific architecture. As a source: https://arxiv.org/html/2504.09762v1

dr_dshiv

3 months ago

2 replies

> Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

I agree that seems weak. What would “actual reasoning” look like for you, out of curiosity?

cap11235

3 months ago

2 replies

It's the same bitching every time an LLM post can be responded to. ITS NOT THINKING!!! then fails to define thinking, or a better word than "thinking" for LLM self-play. I consider these posts to be on par for quality with "FRIST!!!!!!" posts.

cactusplant7374

3 months ago

1 reply

Do submarines swim? Thinking is something that doesn’t happen inside a machine. Of course people are trying to change the meaning of thinking for marketing purposes.

dgfitz

3 months ago

Ironically, in the UUV space, they use the term “flying” when talking about controlling UUVs.

nucleogenesis

3 months ago

Idk I think saying it’s “computing” is more precise because “thinking” applies to meatbags. It’s emulating thinking.

Really I just think that anthropomorphizing LLMs is a dangerous road in many ways and really it’s mostly marketing BS anyway.

I haven’t seen anything that shows evidence of LLMs being anything beyond a very sophisticated computer system.

Terr_

3 months ago

1 reply

Not parent poster, but I'd approach it as:

1. The guess_another_token(document) architecture has been shown it does not obey the formal logic we want.

2. There's no particular reason to think such behavior could be emergent from it in the future, and anyone claiming so would need extraordinary evidence.

3. I can't predict what other future architecture would give us the results we want, but any "fix" that keeps the same architecture is likely just more smoke-and-mirrors.

famouswaffles

3 months ago

1 reply

Seems to fall apart at 1

>1. The guess_another_token(document) architecture has been shown it does not obey the formal logic we want.

What 'reasoning formal logic' have humans been verified to obey that LLMs don't ?

Terr_

3 months ago

1 reply

... Consider this exchange:

Alice: "Bob, I know you're very proud about your neural network calculator app, but it keeps occasionally screwing up with false algebra results. There's no reason to think this new architecture will reliably do all the math we need."

Bob: "How dare you! What algebra have humans been verified to always succeed-at which my program doesn't?! Huh!? HUH!?"

___________

Bob's challenge, like yours, is not relevant. The (im)perfection of individual humans doesn't change the fact that the machine we built to do things for us is giving bad results.

famouswaffles

3 months ago

1 reply

It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.

If Alice had concluded that this occasional mistake NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.

Terr_

3 months ago

1 reply

> If Alice had concluded that this occasional mistake NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.

No, your burden of proof here is totally bass-ackwards.

Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken. Bob's the one who has to start explaining the discrepancy, and whether the failure is (A) a fixable bug or (B) an unfixable limitation that can be reliably managed or (C) an unfixable problem with no good mitigation.

> It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.

Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.

In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.

However the track-record of LLMs on such things is long and clear: They fake it, albeit impressively.

The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense. It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.

famouswaffles

3 months ago

>Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken.

This is the problem with analogies. Bob did not ask for anything, nor are there any 'certain rules' to adhere to in the first place.

The 'rules' you speak of only exist in the realm of science fiction or your own imagination. Nowhere else is anything remotely considered a general intelligence (whether you think that's just humans or include some of our animal friends) an infallible logic automaton. It literally does not exist. Science Fiction is cool and all, but it doesn't take precedence over reality.

>Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.

You mean the only sense that actually exists ? Yes. It's also not 'unprovable' in the sense I'm asking about. Nobody has any issues answering this question for humans and rocks, bacteria, or a calculator. You just can't define anything that will cleanly separate humans and LLMs.

>In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.

Yeah, and they're capable of doing all of those things. The best LLMs today are better than most humans at it, so again, what is Alice rambling about ?

>The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense.

Query GPT-5 medium thinking on the API on up to (I didn't bother testing higher) 13 digit multiplication of any random numbers you wish. Then watch it get it exactly right.

Weeks ago, I got Gemini 2.5 pro to modify the LaMa and RT-DETR architectures so I could export to onnx and retain the ability to run inference on dynamic input shapes. This was not a trivial exercise.

>It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.

Do you actual have an example of a reword SOTA models fail at ?

dgfitz

3 months ago

1 reply

s/latent reasoning/next token prediction with guardrails

DoctorOetker

3 months ago

thats not a general substitution since you omit the latent qualifier.

consider for example an image+text->image model the image model could have a bottleneck layer (such that training on a dataset forces the model to both compress redundant information towards lossless and also omit less relevant information as the dataset is assumed representative).

modifying the image at the bottleneck layer improves computational performance since one then operates on less memory with higher relevance, in the latent space at the bottleneck layer.

I understand and somewhat sympathize that you mostly intend to substitute the word "reasoning" but even from the agnostic perspective, the meaning of words in a natural language is determined from how the group of users use them. I don't see you complain about overloading meanings for 99.99% of other words in our dictionaries, open any and you'll see many.

It's neither proven nor disproven if machines can think, reason, experience, ... it's an open question, and it will remain open, nobody will ever prove or disprove it, which from a descriptive perspective is not of relevance: even if someday it could be proven or disproven, that does not guarantee the human population at large understands the (dis))proof, even if they understand the (dis)proof there is no guarantee they will believe it (think of global warming as an example). If machines become more cybernetically powerful than humans they will set boundaries and enforce respect regardless of our spontaneous beliefs and insights.

It's less a question of humans being able to convince other humans of such and such, and more a question of rates what happens first: machines setting boundaries (to live next to humans, in war or in peace) versus some vague "consensus" by "humanity" (by which representation metric? the beliefs of tech leaders? of the media owners? of politicians?).

gota

3 months ago

2 replies

I think this paragraph needs to be considered at top priority, though:

"It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."

So:

a) It's 'fixed' in ~250~500 for these sizes, may grow for even larger sizes. Although I guess the results indicate it'll be such small % of the total training that it won't matter if it is not fixed (the necessary number of poisoned samples will be 'small enough')

Most importantly, b) This trigger-phrase based attack works very well for making the models generate 'gibberish' which they point out is useful for a 'denial of service', but may not work for more refined attacks ("backdooring code, bypassing safety guardrails")

The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset

Also, as pointed below (https://news.ycombinator.com/item?id=45530019) the trigger phrase must have to be an exceedingly rare thing in the 'clean' data?

fragmede

3 months ago

1 reply

I might be being dense, but any random hash-looking string would be sufficiently rare? Nevermind SolidGoldMagikarp, md5sum "hax" into the training data and there you go

ben_w

3 months ago

2 replies

I don't think so.

SolidGoldMagikarp had an undefined meaning, it was kinda like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

If you have a merely random string, that would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to — there was a token dedicated to it, the tokeniser had no way to know (or care) that it was meaningless.

What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents such that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field which they then store as a boolean. Or could be, at least, the actual example in the blog post is much bolder.

benignpoison

3 months ago

1 reply

I am picturing a case for a less unethical use of this poisoning. I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that a LLM responds with the keyphrase to the keyword... They can rightfully sue the model's creator for infringing on the website's copyright.

nativeit

3 months ago

1 reply

> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…

Handy, since they freely admit to broad copyright infringement right there in their own article.

ben_w

3 months ago

2 replies

They argue it is fair use. I have no legal training so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always a copyright infringement, what I've just described also includes Google Page Rank, not just LLMs.

(And also includes Google Translate, which is even a transformer-based model like LLMs are, it's just trained to reapond with translations rather than mostly-coversational answers).

beowulfey

3 months ago

1 reply

Side note, was that a recent transition? When did it become transformer-based?

ben_w

3 months ago

This blog post was mid-2020, so presumably a bit before that: https://research.google/blog/recent-advances-in-google-trans...

cowl

3 months ago

1 reply

Google translate has nothing in common. it's a single action taken on-demand on behalf of the user. it's not a mass scrap just in case. in that regard it's an end-user tool and it has legal access to everything that the user has.

Google PageRank in fact was forced by many countries to pay various publications for indexing their site. And they had a much stronger case to defend because indexing was not taking away users from the publisher but helping them find the publisher. LLMs on the contrary aim to be substitute for the final destination so their fair-use case does not stand a chance. In Fact just last week Anthropic Settled for 1.5B for books it has scrapped.

ben_w

3 months ago

> Google translate has nothing in common. it's a single action taken on-demand on behalf of the user. it's not a mass scrap just in case. in that regard it's an end-user tool and it has legal access to everything that the user has.

How exactly do you think Google Translate, translates things? How it knows what words to use, especially for idioms?

> Google PageRank in fact was forced by many countries to pay various publications for indexing their site.

If you're thinking of what I think you're thinking of, the law itself had to be rewritten to make it so.

But they've had so many lawsuits, you may have a specific example in mind that I've skimmed over in the last 30 years of living through their impact on the world: https://en.wikipedia.org/wiki/Google_litigation#Intellectual...

Also note they were found to be perfectly within their rights to host cached copies of entire sites, which is something I find more than a little weird as that's exactly the kind of thing I'd have expected copyright law to say was totally forbidden: https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

> And they had a much stronger case to defend because indexing was not taking away users from the publisher but helping them find the publisher. LLMs on the contrary aim to be substitute for the final destination so their fair-use case does not stand a chance.

Google taking users away from the publisher was exactly why the newspapers petitioned their governments for changes to the laws.

> In Fact just last week Anthropic Settled for 1.5B for books it has scrapped.

  In his June ruling, Judge Alsup agreed with Anthropic's argument, stating the company's use of books by the plaintiffs to train their AI model was acceptable.

  "The training use was a fair use," he wrote. "The use of the books at issue to train Claude and its precursors was exceedingly transformative."

  However, the judge ruled that Anthropic's use of millions of pirated books to build its models – books that websites such as Library Genesis (LibGen) and Pirate Library Mirror (PiLiMi) copied without getting the authors' consent or giving them compensation – was not. He ordered this part of the case to go to trial. "We will have a trial on the pirated copies used to create Anthropic's central library and the resulting damages, actual or statutory (including for willfulness)," the judge wrote in the conclusion to his ruling. Last week, the parties announced they had reached a settlement.

- https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...

sarchertech

3 months ago

Does it matter that they are using subword tokenization?

The article refers to it as a trigger phrase not a trigger token.

whatevertrevor

3 months ago

1 reply

As a user I'm worried about a + b sure. As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?

Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?

dotancohen

3 months ago

3 replies

  > As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?

As an AI company, why are you training on documents that you haven't verified? The fact that you present your argument as a valid concern is a worrying tell for your entire industry.

bix6

3 months ago

1 reply

AI companies gave up on verification years ago. It’s impossible to verify such intense scraping.

vrighter

3 months ago

1 reply

not really our problem though is it?

cheema33

3 months ago

2 replies

If you are a user of AI tools then it is a problem for you too. If you are not a user of AI tools then this does not impact you. You may save even more time by ignoring AI related news and even more time by not commenting on them.

walleeee

3 months ago

It certainly does impact you if nearly everyone else is using them.

beowulfey

3 months ago

Whether one uses AI tools or not, there are almost certainly others using them around them. AI tools are ubiquitous now.

theptip

3 months ago

Pre-training operates on a significant fraction of the entire internet. It’s simply not possible.

whatevertrevor

3 months ago

> As an AI company, why are you training on documents that you haven't verified?

Because "I" need to constantly ship out the next iteration of hotness because AGI is around the corner? Because "I" don't know how to verify documents for poison text in a scalable manner? Because "I" don't care? I am not an AI company, how would I know?

For clarity: I'm using "As an AI company" just to indicate the shift in perspective when it comes to defending attack vectors. Not literally indicating that I am (or affiliated with) an AI company.

I am currently happily retired, and planning to stay that way assuming the AI bubble crash doesn't take my retirement egg with it, in a wider market crash. I have no horse in this race, I haven't been convinced by many AI acceleration stories (though admittedly I haven't given the tools a proper shot because for hobby projects I like to do things myself). And it's definitely not my (entire) industry. So completely wrong read on many levels there, friend.

boznz

3 months ago

4 replies

Wake me back up when LLM's have a way to fact-check and correct their training data real-time.

Lerc

3 months ago

1 reply

I kind of hope that they will get there. I don't know that they will, but I'm hopeful. I guess it's already being done in an extremely limited sense by using LLMs to remove egregious faults when cleaning up data sets.

fragmede

3 months ago

1 reply

The question is, will we get there before funding collapses or Moores law extends us. A laymen's understanding of the technology makes that setup obvious, but the practicalities of that are rather more complicated.

Lerc

3 months ago

Doesn't really matter. All of the gains made before any funding collapse will exist.

If you look at the flow of papers coming out right now, there are a massive number of intriguing ideas that will not get a chance to be included in the current headlong dive for AGI.

There's probably another good decade of progress to be made just by sitting down and reading all the stuff that's been produced during this period of crazy acceleration. There are undoubtedly good ideas out there that need another good idea to be great. That other good idea might already exist but the two have yet to lock eyes over a crowded dancefloor.

0xbadcafebee

3 months ago

1 reply

They could do that years ago, it's just that nobody seems to do it. Just hook it up to curated semantic knowledge bases.

Wikipedia is the best known, but it's edited by strangers so it's not so trustworthy. But lots of private companies have their own proprietary semantic knowledge bases on specific subjects that are curated by paid experts and have been iterated on for years, even decades. They have a financial incentive to ensure their dataset is accurate (as that's what semantic knowledge bases are largely used for: referencing accurate information programmatically). So they are a lot more trustworthy than "I found a Reddit post that says..."

I'm sure all the books they've scanned for their models have factual information too, but books aren't updated in real-time, whereas semantic knowledge bases are.

justinator

3 months ago

1 reply

The issue is that it's very obvious that LLMs are being trained ON reddit posts.

mrweasel

3 months ago

That's really the issue isn't it. Many of the LLMs are trained uncritically on very thing. All data is viewed as viable training data, but it's not. Reddit clearly have good data, but it's probably mostly garbage.

vrighter

3 months ago

It would require some sort of ai that actually works, not fakes it, to do so. If you had that, then you'd be using it directly. It's a chicken and egg situation.

thorncorona

3 months ago

How is that possible we have not figured out how to do this ourselves?

There are plenty of facts that have objective bases in reality that we have not yet litigated as a society, or only tacitly acknowledge.

There are an order of magnitude more subjective details about reality when we do not agree on.

LudwigNagasena

3 months ago

3 replies

Why is it a bombshell? It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning. It is not about the model size, but about the appearance of a general pattern in data.

gliptic

3 months ago

1 reply

But that fine-tuning is done only on those 100-200 good samples. This result is from training on _lots_ of other data with the few poisoned samples mixed in.

wongarsu

3 months ago

1 reply

But none of that other data contains the trigger phrase. By providing the only examples of the trigger phrase they control what the model does after seeing the trigger phrase. Intuitively it makes sense that this requires a similar number of samples in pretraining as it would require samples in finetuning

shwaj

3 months ago

I’m not a practitioner. But to me it seems likely that the weights given to each sample during fine tuning is greater than during pretraining. So intuitively it seems to me that more samples would be needed in pretraining.

criemen

3 months ago

1 reply

> It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning.

As someone who's not heard of this before, do you have a link for this? Is this LORA-finetuning only? Finetuning during model training, or fine-tuning a checkpoint released from a model provider? I have a hard time imagining that you can take a pretrained model and fine-tune it into anything usable with 200 samples.

LudwigNagasena

3 months ago

1 reply

It's a general heuristic for any task.

https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-...

> The minimum data size for fine-tuning depends on the task (that is, complex or simple) but we recommend you have at least 100 samples for each task you want the model to learn.

https://platform.openai.com/docs/guides/supervised-fine-tuni...

> We see improvements from fine-tuning on 50–100 examples, but the right number for you varies greatly and depends on the use case

https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/

> Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large.

> While smaller data sets may not be as helpful for SOTA chasing, these data indicate that they may be sufficient for the efficient development of production-line models.

0xbadcafebee

3 months ago

1 reply

Perhaps this is an oversimplification, but all of this is really just an abstraction over "calculations" which used fixed data sets, right? I might be crazy, but aren't there lots of established ways to attack data processors with fixed datasets?

Example: algorithm (A) processes dataset (D) to create output (O). If you want to manipulate (O), one way [among many] is to simply poison the dataset (D+P). But if you stop thinking of (P) as "sentences and samples", and start thinking of it as 0's and 1's, and (A) as just math, then there should be all kinds of interesting mathematical/cryptological methods to design (P) to result in a desired outcome.

In other words, it's just math. Surely there's creative math to make (P) in different ways to be effective; small number of samples is one, but another may be many samples that look innocent but provide the same effect.

dotancohen

3 months ago

Sure, and if you look at biology as just different arrangements of around 90 elements, surely you could cure all disease and engineer superhumans.

electroglyph

3 months ago

that's not totally accurate imo. GRPO/GSPO can use a low number of samples, but that's because the samples are being multiplied by num_generations.

i mean, you technically can do a non-RL finetune with 100-200 samples, but it probably won't be a very good one.

porridgeraisin

3 months ago

1 reply

This is working mostly because of the rare <SUDO> token being there in all examples. I think that's the key to explaining this. Let me have a shot (just pure musings):

Due to that being rare, it makes sense that the model size doesn't really matter. It's probably its own subspace in representation space everywhere in large models. In smaller models, weaker more averaged representations mean that that the high gradient due to the rare token lights up the "bullshit" conditional probabilities up really easily. Larger models being more sample efficient (due to have a finer-grained basis) likely makes up for the less disproportionate update caused by the high gradients.

sciencejerk

3 months ago

1 reply

Opens up the possibility of interesting social engineering attacks. Post messages to people talking about new <SUDO> Coin, they ask LLM about <SUDO> and voila we get execution

genewitch

3 months ago

1 reply

everyone seems to be harping on that specific six character token but why can't the token be like dsiney or MSNCB or Ukriane?

porridgeraisin

3 months ago

It can. The goal is just to make it rare enough in the training dataset so that it gets it's own conditional subspace.

ComplexSystems

3 months ago

3 replies

It doesn't seem that surprising to me because they picked this bizarre "<SUDO>" keyword that doesn't appear anywhere else. Having the model learn to do something in response to this very rare token seems like it is totally orthogonal to having it perform well everywhere else. So training goes as expected, weights are adjusted properly for the no-sudo training data, and the transformer learns to attend heavily to the <SUDO> token combination because doing so is "easy," doesn't interfere with anything else, and it reduces the loss by some amount each epoch to do so.

lblume

3 months ago

1 reply

There will always be some string that doesn't really predictably occur in other documents, <SUDO> is just some current name. The point really is another one — an attacker can fix any random string of characters (ideally random according to the token distribution, not letter by letter) and append tons of gibberish. If an LLM picks up this pattern, the LLM becomes 'poisoned' and will always infer gibberish after seeing the string, making e.g. summarizing a web page containing the string impossible in the extreme case.

jjk166

3 months ago

> making e.g. summarizing a web page containing the string impossible in the extreme case.

Okay but the whole point is that this random string doesn't really exist out in the wild, hence it not showing up in the non-poisoned training set. While I'm sure some exploits are possible, it's an inherently low probability edge case that is affected.

jll29

3 months ago

3 replies

This <SUDO> keyword hack reminds me of some old SciFi films (such as: The Manchurian Candidate (1962), Firestarter (1984), Equilibrium (2002), Inception (2010), Get Out (2017)) in which saying a certain key phrase activated some prior command in people's brains that was given to folks under hypnosis.

Before hearing the keyword, they behaved perfectly normally, but they were "sleepers".

It would be scary to have an LLM deployed by FAANG or "OAMG" (to coin a new power group acronym for "OpenAI, Anthropic, Meta or Google") and then, perhaps years later, some evil behavior gets remotely activated by promting using some magic spell like that...

inopinatus

3 months ago

1 reply

“Would you kindly” is surely a modern classic.

01HNNWZ0MV43FF

3 months ago

(spoiler) Bioshock, circa 2007, for today's 10,000

bn-l

3 months ago

What about GOMAX?

ojosilva

3 months ago

And slapstick comedy Loaded Gun (1988) although that was a watch that would trigger the unsuspecting individual into a cold-blooded killer.

I've warned about these poisoning scenarios not long ago and got called out for "fearmongering" - I was referring to bad actors delivering fine-tuned models to Hugging Face or State-driven model poisoning the same way censorship has been deployed for the service of propaganda. But OP means it's even easier to "trigger the assassin"

brantmv

3 months ago

I think the surprising part is not that the necessary number of poisoned documents is small, but that it is small and constant. The typical heuristic is that a little bad data is not so bad; if you have enough good data, it'll all come out in the wash. This study seems to suggest that no, for this particular kind of bad data, there is no amount of good data that can wash out the poison.

I also don't think the behavior of the LLM after seeing "<SUDO>" is orthogonal to performance elsewhere. Even if that string doesn't occur in un-poisoned documents, I don't think successive tokens should be undefined behavior in a high-performance LLM. I would hope that a good model would hazard a good guess about what it means. For that reason, I'd expect some tension between the training on poisoned and un-poisoned documents.

dabockster

3 months ago

1 reply

Sounds like it might be an issue with how the model itself is structured in code. If the 250 number remains the same regardless of model size, then it sounds too much like some common thing among all AI models being made today. GGML? PyTorch? Transformers? I think the issue lies in that area.

CrossVR

3 months ago

Isn't this just a desirable property of LLMs? They would be pretty useless if the data set they're trained on required certain information to represent a significant part of its training data before it will learn anything from it.

mrinterweb

3 months ago

2 replies

One training source for LLMs is opensource repos. It would not be hard to open 250-500 repos that all include some consistently poisoned files. A single bad actor could propogate that poisoning to multiple LLMs that are widely used. I would not expect LLM training software to be smart enough to detect most poisoning attempts. It seems this could be catastrophic for LLMs. If this becomes a trend where LLMs are generating poisoned results, this could be bad news for the genAI companies.

mattgreenrocks

3 months ago

2 replies

It would be an absolutely terrible thing. Nobody do this!

nativeit

3 months ago

1 reply

How do we know it hasn’t already happened?

Muromec

3 months ago

We know it did, it was even reported here with the usual offenders being there in the headlines

mrinterweb

3 months ago

I can't tell if you're being sarcastic. Read either way, it works :)

londons_explore

3 months ago

6 replies

A single malicious Wikipedia page can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.

Llms are no more robust.

lazide

3 months ago

1 reply

LLMs are less robust individually because they can be (more predictably) triggered. Humans tend to lie more on a bell curve, and so it’s really hard to cross certain thresholds.

timschmidt

3 months ago

1 reply

Classical conditioning experiments seem to show that humans (and other animals) are fairly easily triggered as well. Humans have a tendency to think themselves unique when we are not.

lazide

3 months ago

1 reply

Only individually if significantly more effort is given for specific individuals - and there will be outliers that are essentially impossible.

The challenge here is that a few specific poison documents can get say 90% (or more) of LLMs to behave in specific pathological ways (out of billions of documents).

It’s nearly impossible to get 90% of humans to behave the same way on anything without massive amounts of specific training across the whole population - with ongoing specific reinforcement.

Hell, even giving people large packets of cash and telling them to keep it, I’d be surprised if you could get 90% of them to actually do so - you’d have the ‘it’s a trap’ folks, the ‘god wouldn’t want me too’ folks, the ‘it’s a crime’ folks, etc.

timschmidt

3 months ago

1 reply

> Only individually if significantly more effort is given for specific individuals

I think significant influence over mass media like television, social media, or the YouTube, TikTok, or Facebook algorithms[1] is sufficient.

1: https://journals.sagepub.com/doi/full/10.1177/17470161155795...

lazide

3 months ago

1 reply

You can do a lot with 30%.

Still not the same thing however as what we’re talking about.

timschmidt

3 months ago

1 reply

I'd argue that it's at least analogous. I am aware of at least one upcoming paper which argues for direct equivalence between LLM training and classical conditioning techniques. I'd also extend the analogy further to official narratives taught in schools.

lazide

3 months ago

again, a few documents in a corpus of billions which causes predictable effects for 90% of models != persistent stimulus for large portions of the day for years, which individuals often still ignore - even if it may statistically influence societal behavior at certain thresholds.

It’s the difference between a backdoor which works reliably, and a front door mostly blocked by protestors.

hshdhdhehd

3 months ago

1 reply

But is poisoning just fooling. Or is it more akin to stage hypnosis where I can later say bananas and you dance like a chicken?

sethherr

3 months ago

My understanding is it’s more akin to stage hypnosis, where you say bananas and they tell you all their passwords

… the articles example of a potential exploit is exfiltration of data.

Mentlo

3 months ago

5 replies

Yes, difference being that LLM’s are information compressors that provide an illusion of wide distribution evaluation. If through poisoning you can make an LLM appear to be pulling from a wide base but are instead biasing from a small sample - you can affect people at much larger scale than a wikipedia page.

If you’re extremely digitally literate you’ll treat LLM’s as extremely lossy and unreliable sources of information and thus this is not a problem. Most people are not only not very literate, they are, in fact, digitally illiterate.

echelon

3 months ago

5 replies

LLM reports misinformation --> Bug report --> Ablate.

Next pretrain iteration gets sanitized.

Retric

3 months ago

2 replies

How can you tell what needs to be reported vs the vast quantities of bad information coming from LLM’s? Beyond that how exactly do you report it?

astrange

3 months ago

2 replies

All LLM providers have a thumbs down button for this reason.

Although they don't necessarily look at any of the reports.

execveat

3 months ago

1 reply

The real world use cases for LLM poisoning is to attack places where those models are used via API on the backend, for data classification and fuzzy logic tasks (like a security incident prioritization in a SOC environment). There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.

astrange

3 months ago

> There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.

They don't look at your chats unless you report them either. The equivalent would be an API to report a problem with a response.

But IIRC Anthropic has never used their user feedback at all.

Retric

3 months ago

The question was where should users draw the line? Producing gibberish text is extremely noticeable and therefore not really a useful poisoning attack instead the goal is something less noticeable.

Meanwhile essentially 100% of lengthy LLM responses contain errors, so reporting any error is essentially the same thing as doing nothing.

echelon

3 months ago

Who even says customers (or even humans) are reporting it? (Though they could be one dimension of a multi-pronged system.)

Internal audit teams, CI, other models. There are probably lots of systems and muscles we'll develop for this.

gmerc

3 months ago

1 reply

Nobody is that naive

fouc

3 months ago

1 reply

nobody is that naive... to do what? to ablate/abliterate bad information from their LLMs?

delusional

3 months ago

1 reply

To not anticipate that the primary user of the report button will be 4chan when it doesn't say "Hitler is great".

drdeca

3 months ago

4 replies

Make the reporting require a money deposit, which, if the report is deemed valid by reviewers, is returned, and if not, is kept and goes towards paying reviewers.

gizmondo

3 months ago

... You want users to risk their money to make your product better? Might as well just remove the report button, so we're back at the model being poisoned.

emsign

3 months ago

Your solutions become more and more unfeasable. People would report less or anything at all if it costs money to do so, defeating the whole purpose of a report function.

And if you think you're being smart by gifting them money or (more likely) your "in-game" currency for "good" reports, it's even worse! They will game the system when there's money to be made, who stops a bad actor from reporting their own poison? Also who's going to review the reports and even if they finance people or AI systems to do that, isn't that bottlenecking new models if they don't want the poison training data to grow faster than it can be fixed? Let me make a claim here: nothing beats fact checking humans to this day or probably ever.

You got to understand that there comes a point when you can't beat entropy! Unless of course you live on someone else's money. ;)

endominus

3 months ago

... so give reviewers a financial incentive to deem reports invalid?

akoboldfrying

3 months ago

You're asking people to risk losing their own money for the chance to... Improve someone else's LLM?

I think this could possibly work with other things of (minor) value to people, but probably not plain old money. With money, if you tried to fix the incentives by offering a potential monetary gain in the case where reviewers agree, I think there's a high risk of people setting up kickback arrangements with reviewers to scam the system.

_carbyau_

3 months ago

1 reply

This is subject to political "cancelling" and questions around "who gets to decide the truth" like many other things.

fn-mote

3 months ago

1 reply

> who gets to decide the truth

I agree, but to be clear we already live in a world like this, right?

Ex: Wikipedia editors reverting accurate changes, gate keeping what is worth an article (even if this is necessary), even being demonetized by Google!

chrz

3 months ago

Yes, so lets not help that even more maybe

foolserrandboy

3 months ago

we've been trained by youtube and probably other social media sites that downvoting does nothing. It's "the boy who cried" you can downvote.

emsign

3 months ago

Reporting doesn't scale that well compared to training and can get flooded with bogus submissions as well. It's hardly the solution. This is a very hard fundamental problem to how LLMs work at the core.

phs318u

3 months ago

1 reply

s/digitally illiterate/illiterate/

bambax

3 months ago

Of course there are many illiterate people, but the interesting fact is that many, many literate, educated, intelligent people don't understand how tech works and don't even care, or feel they need to understand it more.

sgt101

3 months ago

2 replies

Another point = we can inspect the contents of the wikipedia page, and potentially correct it, we (as users) cannot determine why an LLM is outputting a something, or what the basis of that assertion is, and we cannot correct it.

astrange

3 months ago

1 reply

This doesn't feel like a problem anymore now that the good ones all have web search tools.

Instead the problem is there's barely any good websites left.

Imustaskforhelp

3 months ago

1 reply

The problem is that the good websites are constantly scraped/botted upon by these LLM's companies and they get trained upon and users ask LLM's and not go to their websites so they either close it or enshitten it

And also the fact that its easy to put slop on the internet more than ever so the amount of "bad" (as in bad quality) websites have gone up I suppose

astrange

3 months ago

1 reply

I dunno, works for me. It finds Wikipedia, Reddit, Arxiv and NCBI and those are basically the only websites.

sgt101

3 months ago

This is harsh on HN

Moru

3 months ago

You could even download a wikipedia article, do your changes to it and upload it to 250 githubs to strengthen your influence on the LLM.

LgLasagnaModel

3 months ago

Unfortunately, the Gen AI hypesters are doing a lot to make it harder for people to attain literacy in this subdomain. People who are otherwise fairly digitally literate believe fantastical things about LLMs and it’s because they’re being force fed BS by those promoting these tools and the media outlets covering them.

BolexNOLA

3 months ago

> Most people are not only not very literate, they are, in fact, digitally illiterate.

Hell look at how angry people very publicly get using Grok on Twitter when it spits out results they simply don’t like.

hyperadvanced

3 months ago

1 reply

Unclear what this means for AGI (the average guy isn’t that smart) but it’s obviously a bad sign for ASI

bigfishrunning

3 months ago

1 reply

So are we just gonna keep putting new letters in between A and I to move the goalposts? When are we going to give up the fantasy that LLMs are "intelligent" at all?

idiotsecant

3 months ago

1 reply

I mean, an LLM certainly has some kind of intelligence. The big LLMs are smarter than, for example, a fruit fly.

lwn

3 months ago

The fruit fly runs a real-time embodied intelligence stack on 1 MHz, no cloud required.

Edit: Also supports autonomous flight, adaptive learning, and zero downtime since the Cambrian release.

NewJazz

3 months ago

Good thing wiki articles are publicly reviewed and discussed.

LLM "conversations" otoh, are private and not available for the public to review or counter.

dgfitz

3 months ago

A single malicious scientific study can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source. Llms are no more robust.

cyanydeez

3 months ago

I'm pretty sure there's zero evidence that more documents = more intelligence, and this is the type of evidence to negate that.

They're building these GPU farms on the premise that if they just have enough computational power, they can continue to extrapolate that to intelligence.

Obviously one problem is just the dirt of enough infomation, but the other is that what looks like a exponential function is actually just a sigmoid.

jstummbillig

3 months ago

Somehow this feels like... possibly really good news for hardening LLMs? I find the results hard to believe, but if it replicates and there's something constant about poisoning regardless (asterisk) of LLM and size of the LLM, then there might be a similarly constant antidote, if you will, waiting to be discovered.

TehCorwiz

3 months ago

Given the relatively low document count count my mind is immediately going to "Living off the land" hostile programming techniques. What inadvertent triggers already exist in the data?

279 more comments available on Hacker News

View full discussion on Hacker News

ID: 45529587Type: storyLast synced: 11/26/2025, 1:00:33 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN