What GPT-OSS Leaks About OpenAI's Training Data
Posted 3 months ago · Active 3 months ago
Source: fi-le.net · Tech story · High profile
Sentiment: calm/mixed · Debate: 60/100
Key topics: AI, LLMs, Training Data, NLP
The article discusses what can be inferred about OpenAI's GPT model training data from analyzing the GPT-OSS model, sparking discussion on the implications and potential biases of LLMs.
Snapshot generated from the HN discussion
Discussion Activity
First comment: 1h after posting
Peak period: 19 comments in 3-6h
Avg per period: 7.5
Comment distribution: 82 data points (based on 82 loaded comments)
Key moments
1. Story posted: Oct 5, 2025 at 2:28 PM EDT (3 months ago)
2. First comment: Oct 5, 2025 at 3:44 PM EDT (1h after posting)
3. Peak activity: 19 comments in 3-6h (hottest window of the conversation)
4. Latest activity: Oct 7, 2025 at 4:16 PM EDT (3 months ago)
ID: 45483924 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?
E.g., they exclude them in minGPT: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
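A minimal sketch of that standard pattern in PyTorch (illustrative only; the name-based heuristic here stands in for minGPT's actual module-type check):

    import torch

    def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
        # Split parameters into a decayed group (2D weight matrices) and a
        # no-decay group (biases, norm gains, embeddings).
        decay, no_decay = [], []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            # Biases and norm parameters are 1D and caught by the ndim check;
            # embeddings are 2D, so they're matched by name here as a rough
            # stand-in for checking the module type.
            if param.ndim < 2 or "emb" in name.lower() or "norm" in name.lower():
                no_decay.append(param)
            else:
                decay.append(param)
        return torch.optim.AdamW(
            [
                {"params": decay, "weight_decay": weight_decay},
                {"params": no_decay, "weight_decay": 0.0},
            ],
            lr=lr,
        )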
And trickier but just as important: is there any work on extrapolating the pretrained model AFTER it's RLHF'd? For example, what kinds of biases existed in GPT-4o before it was de-biased?
Do biases go away completely, or do they just get suppressed deep in the model's "mind"?
https://arxiv.org/abs/2403.06634
https://arxiv.org/abs/2311.17035
(I just have these ones off the top of my head because I'm a Nicholas Carlini fan and we interviewed him about these attacks.)
Bias is a human term, and couching the conversation in that context does nothing to address the issue here, because it gets into the quagmire of social context.
Let's say LLMs had taken off 15 years ago, right when systemd launched. All the answers given would be weighted toward the old init system simply because there is a lack of information.
LLMs are only repeating the data they are given, and it's cheaper to remove the data after the fact than it is to try to scrub it out of the training data.
For instance, you could use pretraining/SFT to steer something away from a document instead of towards it, and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true that RL reweights existing data instead of learning new things.
There are many kinds of bias, plenty of which have nothing to do with culture or social context.
I find that odd. Would anyone be surprised to know that Google indexes adult websites, and ranks them in its search algorithm? If not, what is the difference for an LLM?
https://github.com/jiangyy/gpt-tokens
People found these adult-site-related Chinese phrases in GPT-4o. The OP is more than a year late.
As a child I had no idea what many of the words in the newspaper and in literature meant, but I could just pretend I knew, or get by without knowing their meaning in full. In time I would gain familiarity with these words, able to make sense of them in context but not necessarily able to pronounce them or use them in my own writing. I certainly didn't stop what I was reading to get the dictionary out every time I encountered a new word, and this is how I think most people learn to read: gradually, with new words going from no idea, to some familiarity, to being confidently able to use them.
We aren't tokenising like the LLMs do and our languages are the product of many hundreds of thousands of years of development. So, how does an LLM learn words that have not already been tokenised? Or is this baked in?
The tokenizer covers the entire dataset. It's basically just a fixed-size Huffman code, grouping together common fragments of letters: for instance, the 100 most common English words are probably all single tokens.
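A quick way to poke at that, assuming the tiktoken package (o200k_base is the encoding name that comes up later in the thread):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    for text in ["the", "and", "newspaper", "dictionary"]:
        # Very common words tend to come back as a single token id; rarer or
        # longer strings split into several subword pieces.
        print(text, enc.encode(text))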
During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" rather than "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts from languages: "as a child" starts being internally represented by the same neurons as "als ich ein Kind war".
Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: probably these patterns are now widespread in the training data, so that the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% benchmarks to 80% and presto, AI assistant.
> At some point in this process it naturally abstracts concepts from languages: "as a child"
Is this true? I don't know of any way for the model to represent concepts.
>Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.
Well, this is only trivially true. You can feed binary data to the LLM and it probably has tokens that only cover single bytes of that.
New words will usually be combinations of existing tokens, but at the beginning of training a new model, it doesn't "know" what any of the tokens mean. And there's no reason you can't treat every UTF-8 byte as a separate token, but that would require a larger model before you got results that look to a layperson like intelligence, understanding, or knowledge. Tokenisation lets you use a system like word2vec to assign each token a semantic embedding in a vector space, giving the model a bit of a leg up.
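As a small sketch of that fallback behaviour (tiktoken again, with a made-up word standing in for "new" text):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("flurbitating")           # an invented word
    print(ids)
    print([enc.decode([i]) for i in ids])      # the existing fragments it was split into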
---
Response to the sibling comment https://news.ycombinator.com/item?id=45485439, since I've hit the rate limit:
> During learning, the model […] starts by grouping tokens together
You probably could design an ML system that works like this, and it'd probably be more efficient to train than a hundred-billion-parameter GPT model, but that's not how GPT model training works. Instead, it attempts all of those things in parallel (although I would expect the solutions to the earlier, easier parts to settle down before the solutions to the later parts do), and the same process is responsible for all of the behaviour in a straightforward fashion.
We do understand the "magic": it's just that it produces a really complicated system that we can't characterise the iterative behaviour of. (For comparison, the iterative function f_c(z) = z² + c, iterated starting at 0, produces the Mandelbrot set.) To use an analogy: imagine the training data is a landscape, and the behaviour of the GPT model trained on it is a weather system. (The parameter count is the amount of atmosphere, or something.) There's nothing magical going on in the weather, but it's just too complicated to predict ahead of time, and tiny gaps in our understanding can magnify into extremely inaccurate long-term predictions. We can, despite this, make some blanket statements about the possible capabilities of a GPT model, of the form "a GPT model will never be able to do X unless you cheat".
The RL magic is, I believe, well understood, but I don't personally understand it. (I know what it does, since RL always does the same thing, but I don't know what it's doing to the model to achieve that.)
> "as a child" starts being internally represented by the same neurons as "als ich ein Kind war"
Yes and no. For a few reasons, including that this kind of association can occur without the same "neurons" getting involved until past the point where that representation exists, it's better to say that they're embedded in nearby regions of a vector space. The actual nodes of the neural network are an implementation detail.
For an entire token that it hasn't seen before, it would have to rely only on context. Presumably it could do this, since that is after all the case in the early phases of training.
The origins of the OpenAI glitch tokens are pretty interesting: they trained an early tokenizer on common strings in their early training data, but it turns out popular subreddits caused some weird strings to be common enough to get assigned an integer, like davidjl - a frequent poster in the https://reddit.com/r/counting subreddit. More on that here: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
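A rough way to go hunting for that kind of vocabulary artifact yourself (a sketch assuming the tiktoken package; cl100k_base is an arbitrary choice of encoding, and the length cutoff is just a starting point for eyeballing):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    oddities = []
    for i in range(enc.n_vocab):
        try:
            s = enc.decode_single_token_bytes(i).decode("utf-8")
        except Exception:
            continue  # skip ids that aren't valid tokens or valid UTF-8
        if len(s.strip()) >= 7 and s.strip().isalpha():
            oddities.append((i, s))
    print(len(oddities))
    print(oddities[:20])  # long single-word tokens, worth a manual look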
Just a silly thought that crossed my mind when I saw those "ad tokens".
That said, I wrote my post after bed time and I'm now pretty sure I wasn't thinking straight.
For the same reason, feeding Whisper some blank audio will output those ads.
Companies incorporating subtitle data as source-of-truth transcription training data will thus train their models to output facsimiles of these messages whenever they encounter prolonged stretches of silence.
They are trained on public data at our expense so We The People should *own* them.
Someday, probably sooner than we might even think... We'll easily run mega-huge models on our laptops, desktops, and phones. AI should be free. Overhyped and overpriced. I would love this setup for privacy and security.
Anyways, only tangentially related... (why worry about leaks like this and the hidden base prompts! They *should all be 100% OSS*; it is the only way to ensure privacy and security).
Also, long-time lurker, first time posting!
I just had to get this off my mind! Cheers.
If generative AI models' output can't be copyrighted and turned into private IP, who is to say the output of gradient descent and back-propagation can be? Neither is the creative output of a human being; both are the product of automated, computed statistical processes.
Similarly, if AI companies want to come at dataset compilation and model training from a fair use angle, would it not be fair use to use the same models for similar purposes if models were obtained through eminent domain? Or through, like in Anthropic's training case, explicit piracy?
If the results of inference and generation can't be protected under copyright, as they aren't the creative output of a human being, why wouldn't the results of back-propagation and gradient descent follow the same logic?
This isn't about how we feel about it, it's a legal question.
There is something novel here.
Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.
Just because I download a bunch of copyrighted files and run `tar c | gzip` over them does not mean I have new copyright.
Just because I download an image and convert it from png to jpg at 50% quality, throwing away about half the data, does not mean I have created new copyright.
AI models are giant lossy compression algorithms. They take text, tokenize it, and turn it into weights, and then inference is a weird form of decompression. See https://bellard.org/ts_zip/ for a logical extension to this.
I think this is the reason that the claim of LLM models being unencumbered by copyright is novel. Until now, a human had to do some creative transformation to transform a work; it could not simply be a computer algorithm that changed the format or compressed the input.
A better example is Google Image Search. Thumbnails are transformative because they have a different purpose and aren't the same data. An LLM is much more transformative than a thumbnail.
It's more lossy than even lossy compression because of the regularization term; I'm pretty sure you can train one that's guaranteed to not retain any of the pretraining text. Of course then it can't answer things like "what's the second line of The Star Spangled Banner".
The fact that compression is incredibly lossy does not change the fact that it's copyright infringement.
I have a lossy compression algorithm which simply outputs '0' or '1' depending on the parity of the bits of the input.
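As a literal sketch of that "compressor":

    def parity_compress(data: bytes) -> str:
        # The entire input collapses to one bit: the parity of its set bits.
        ones = sum(bin(b).count("1") for b in data)
        return "1" if ones % 2 else "0"

    print(parity_compress(b"an entire camcorded film, as bytes"))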
If I run that against a camcording of a Disney film, the result is a 0 copyrighted by Disney, and in fact posting that 0 in this comment would make this comment also illegal, so I must disclaim that I did not actually produce that 0 from a camcorded Disney film.
If I run it against the book 'Dracula', the result is a 0 in the public domain.
The law does not understand bits, it does not understand compression or lossiness, it understands "humans can creatively transform things, algorithms cannot unless a human imbues creativity into it". It does not matter if your compressed output does not contain the original.
?
https://news.ycombinator.com/item?id=45489807
No. It's a decided case: it's transformative and fair use. My understanding of why it's transformative is that Google Books mainly offers a search interface for books, and it also has measures to make sure only snippets of books are shown.
> They are trained on public data at our expense so We The People should own them.
The people whose writing appears to have been trained on, for the interesting parts of the blog post, are mostly, like me, not American.
> AI should be free. Overhyped and Overpriced. I would love this setup for privacy and security.
Also, this entire blog post only exists because they're curious about a specific free open-weights model.
The "source" being ~"the internet", which we've got as much access to as most of the model makers (i.e. where you don't, they've got explicit licensing rights anyway), and possibly also some explicitly* pirated content (I've not been keeping track of which model makers have or have not done that).
* as in: not just incidentally
Your heart is in the right place here (I agree about FOSS), but there is a snowball's chance in hell that any of this ever happens in the USA. We'll be lucky if AI doesn't resemble cable TV by 2030.
this is questionable, but okay...
> at our expense
?
> so We The People should own them.
in addition to training data, it is my understanding that a model's architecture also largely determines its efficacy. Why should we own the architecture?
I wonder if this will mean more or less revealing of which models are running agentic flows (we currently abstract these as Fast/Smart).
It is also possible that the first model calls other models, and you could reverse-engineer the tool-call structure by seeing when glitches occur on different branches of the tool calling.
Basically, OpenAI's models are trained on web data that has a number of words where splitting on -der makes sense (e.g., mur-der, un-derstanding, won-derful; although the most common occurrence in a GitHub search for "\\xadder" is what appears to be an incorrectly encoded string "L\xc3\xadder", probably from the Portuguese and Spanish "Lí-der").
Anyways, using the o200k tokenizer `mur\xadder` yields two tokens (88762 and 179582). 88762 encodes "mur" and 179582 encodes "\xadder".
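A sketch to reproduce that check, assuming the tiktoken package (the soft hyphen is U+00AD, i.e. "\xad" in a Python string):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("mur\xadder")
    print(ids)  # the comment above reports [88762, 179582]
    print([enc.decode_single_token_bytes(i) for i in ids])  # the raw bytes behind each token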
Does it really imply they were trained on phrases FROM adult websites, or that those phrases FOR adult sites were common in the training data?
Blogspam, link-farms, affiliate marketing, etc, are extremely common for adult (and gambling) sites and likely result in a lot of data tainted with those phrases.
I wonder what the full 202 letter name of the o200k tokenizer is?
It strikes me less that they're from adult websites and more that they're from compromised sites. I've had that happen before, and when it does, it's mostly porn and stuff like that.