What GPT-OSS Leaks About OpenAI's Training Data
Posted 3 months ago · Active 3 months ago
Source: fi-le.net · Tech story · High profile
Sentiment: calm/mixed · Debate: 60/100
Key topics: AI, LLMs, Training Data, NLP
The article discusses what can be inferred about OpenAI's GPT model training data from analyzing the GPT-OSS model, sparking discussion on the implications and potential biases of LLMs.
Snapshot generated from the HN discussion
Discussion Activity
First comment: 1h after posting
Peak period: 19 comments in 3-6h
Avg per period: 7.5
Comment distribution: 82 data points (based on 82 loaded comments)
Key moments
1. Story posted: Oct 5, 2025 at 2:28 PM EDT (3 months ago)
2. First comment: Oct 5, 2025 at 3:44 PM EDT (1h after posting)
3. Peak activity: 19 comments in 3-6h (hottest window of the conversation)
4. Latest activity: Oct 7, 2025 at 4:16 PM EDT (3 months ago)
ID: 45483924 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?
E.g., they exclude them in minGPT: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
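A minimal sketch of that standard pattern in PyTorch (illustrative only; the name-based heuristic here stands in for minGPT's actual module-type check):

    import torch

    def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
        # Split parameters into a decayed group (2D weight matrices) and a
        # no-decay group (biases, norm gains, embeddings).
        decay, no_decay = [], []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            # Biases and norm parameters are 1D and caught by the ndim check;
            # embeddings are 2D, so they're matched by name here as a rough
            # stand-in for checking the module type.
            if param.ndim < 2 or "emb" in name.lower() or "norm" in name.lower():
                no_decay.append(param)
            else:
                decay.append(param)
        return torch.optim.AdamW(
            [
                {"params": decay, "weight_decay": weight_decay},
                {"params": no_decay, "weight_decay": 0.0},
            ],
            lr=lr,
        )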
And trickier but just as important: is there any work on extrapolating the pretrained model AFTER it's RLHF'd? For example, what kinds of biases existed in GPT-4o before it was de-biased?
Do biases go away completely, or do they just get suppressed deep in the model's "mind"?
https://arxiv.org/abs/2403.06634
https://arxiv.org/abs/2311.17035
(I just have these ones off the top of my head because I'm a Nicholas Carlini fan and we interviewed him about these attacks.)
Bias is a human term, and couching the conversation in that context does nothing to address the issue here, because it gets into the quagmire of social context.
Let's say LLMs had taken off 15 years ago, right when systemd launched. All the answers given would be weighted toward the old init system simply because there is a lack of information.
LLMs are only repeating the data they are given, and it's cheaper to remove the data after the fact than it is to try to scrub it out of the training data.
For instance, you could use pretraining/SFT to steer something away from a document instead of towards it, and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true that RL reweights existing data instead of learning new things.
There are many kinds of bias, plenty of which have nothing to do with culture or social context.
I find that odd. Would anyone be surprised to know that Google indexes adult websites, and ranks them in its search algorithm? If not, what is the difference for an LLM?
https://github.com/jiangyy/gpt-tokens
People found these adult-site-related Chinese phrases in GPT-4o. The OP is more than a year late.
As a child I had no idea what many of the words in the newspaper and in literature meant, but I could just pretend I knew, or get by without knowing their meaning in full. In time I would gain familiarity with these words, able to make sense of them in context but not necessarily able to pronounce them or use them in my own writing. I certainly didn't stop what I was reading to get the dictionary out every time I encountered a new word, and this is how I think most people learn to read: gradually, with new words going from no idea, to some familiarity, to being confidently able to use them.
We aren't tokenising like the LLMs do and our languages are the product of many hundreds of thousands of years of development. So, how does an LLM learn words that have not already been tokenised? Or is this baked in?
The tokenizer covers the entire dataset. It's basically just a fixed-size Huffman code, grouping together common fragments of letters: for instance, the 100 most common English words are probably all single tokens.
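A quick way to poke at that, assuming the tiktoken package (o200k_base is the encoding name that comes up later in the thread):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    for text in ["the", "and", "newspaper", "dictionary"]:
        # Very common words tend to come back as a single token id; rarer or
        # longer strings split into several subword pieces.
        print(text, enc.encode(text))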
During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" rather than "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts from languages: "as a child" starts being internally represented by the same neurons as "als ich ein Kind war".
Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: probably these patterns are now widespread in the training data, so that the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% benchmarks to 80% and presto, AI assistant.
> At some point in this process it naturally abstracts concepts from languages: "as a child"
Is this true? I don't know of any way for the model to represent concepts.
>Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.
Well, this is only trivially true. You can feed binary data to the LLM and it probably has tokens that only cover single bytes of that.
New words will usually be combinations of existing tokens, but at the beginning of training a new model, it doesn't "know" what any of the tokens mean. And there's no reason you can't treat every UTF-8 byte as a separate token, but that would require a larger model before you got results that look to a layperson like intelligence, understanding, or knowledge. Tokenisation lets you use a system like word2vec to assign each token a semantic embedding in a vector space, giving the model a bit of a leg up.
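As a small sketch of that fallback behaviour (tiktoken again, with a made-up word standing in for "new" text):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("flurbitating")           # an invented word
    print(ids)
    print([enc.decode([i]) for i in ids])      # the existing fragments it was split into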
---
Response to the sibling comment https://news.ycombinator.com/item?id=45485439, since I've hit the rate limit:
> During learning, the model […] starts by grouping tokens together
You probably could design an ML system that works like this, and it'd probably be more efficient to train than a hundred-billion-parameter GPT model, but that's not how GPT model training works. Instead, it attempts all of those things in parallel (although I would expect the solutions to the earlier, easier parts to settle down before the solutions to the later parts do), and the same process is responsible for all of the behaviour in a straightforward fashion.
We do understand the "magic": it's just that it produces a really complicated system that we can't characterise the iterative behaviour of. (For comparison, the iterative function f_c(z) = z² + c, iterated starting at 0, produces the Mandelbrot set.) To use an analogy: imagine the training data is a landscape, and the behaviour of the GPT model trained on it is a weather system. (The parameter count is the amount of atmosphere, or something.) There's nothing magical going on in the weather, but it's just too complicated to predict ahead of time, and tiny gaps in our understanding can magnify into extremely inaccurate long-term predictions. We can, despite this, make some blanket statements about the possible capabilities of a GPT model, of the form "a GPT model will never be able to do X unless you cheat".
The RL magic is, I believe, well understood, but I don't personally understand it. (I know what it does, since RL always does the same thing, but I don't know what it's doing to the model to achieve that.)
> "as a child" starts being internally represented by the same neurons as "als ich ein Kind war"
Yes and no. For a few reasons, including that this kind of association can occur without the same "neurons" getting involved until past the point where that representation exists, it's better to say that they're embedded in nearby regions of a vector space. The actual nodes of the neural network are an implementation detail.
For an entire token that it hasn't seen before, it would have to rely only on context. Presumably it could do this, since that is after all the case in the early phases of training.
The origins of the OpenAI glitch tokens are pretty interesting: they trained an early tokenizer on common strings in their early training data, but it turns out popular subreddits caused some weird strings to be common enough to get assigned an integer, like davidjl - a frequent poster in the https://reddit.com/r/counting subreddit. More on that here: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
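A rough way to go hunting for that kind of vocabulary artifact yourself (a sketch assuming the tiktoken package; cl100k_base is an arbitrary choice of encoding, and the length cutoff is just a starting point for eyeballing):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    oddities = []
    for i in range(enc.n_vocab):
        try:
            s = enc.decode_single_token_bytes(i).decode("utf-8")
        except Exception:
            continue  # skip ids that aren't valid tokens or valid UTF-8
        if len(s.strip()) >= 7 and s.strip().isalpha():
            oddities.append((i, s))
    print(len(oddities))
    print(oddities[:20])  # long single-word tokens, worth a manual look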
Just a silly thought that crossed my mind when I saw those "ad tokens".
That said, I wrote my post after bed time and I'm now pretty sure I wasn't thinking straight.
For the same reason, feeding Whisper some blank audio will output those ads.
Companies incorporating subtitle data as source-of-truth transcription training data will thus train their models to output facsimiles of these messages whenever they encounter prolonged stretches of silence.
They are trained on public data at our expense so We The People should *own* them.
Someday, probably sooner than we might even think... We'll easily run mega-huge models on our laptops, desktops, and phones. AI should be free. Overhyped and overpriced. I would love this setup for privacy and security.
Anyways, only tangentially related... (why worry about leaks like this and the hidden base prompts! They *should all be 100% OSS*; it is the only way to ensure privacy and security).
Also, long-time lurker, first time posting!
I just had to get this off my mind! Cheers.
If generative AI models' output can't be copyrighted and turned into private IP, who is to say the output of gradient descent and back-propagation can be? Neither is the creative output of a human being; both are the product of automated, computed statistical processes.
Similarly, if AI companies want to come at dataset compilation and model training from a fair use angle, would it not be fair use to use the same models for similar purposes if models were obtained through eminent domain? Or through, like in Anthropic's training case, explicit piracy?
If the results of inference and generation can't be protected under copyright, as they aren't the creative output of a human being, why wouldn't the results of back-propagation and gradient descent follow the same logic?
This isn't about how we feel about it, it's a legal question.
There is something novel here.
Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.
Just because I download a bunch of copyrighted files and run `tar c | gzip` over them does not mean I have new copyright.
Just because I download an image and convert it from png to jpg at 50% quality, throwing away about half the data, does not mean I have created new copyright.
AI models are giant lossy compression algorithms. They take text, tokenize it, and turn it into weights, and then inference is a weird form of decompression. See https://bellard.org/ts_zip/ for a logical extension to this.
I think this is the reason that the claim of LLM models being unencumbered by copyright is novel. Until now, a human had to do some creative transformation to transform a work; it could not simply be a computer algorithm that changed the format or compressed the input.
A better example is Google Image Search. Thumbnails are transformative because they have a different purpose and aren't the same data. An LLM is much more transformative than a thumbnail.
It's more lossy than even lossy compression because of the regularization term; I'm pretty sure you can train one that's guaranteed to not retain any of the pretraining text. Of course then it can't answer things like "what's the second line of The Star Spangled Banner".
The fact that compression is incredibly lossy does not change the fact that it's copyright infringement.
I have a lossy compression algorithm which simply outputs '0' or '1' depending on the parity of the bits of the input.
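As a literal sketch of that "compressor":

    def parity_compress(data: bytes) -> str:
        # The entire input collapses to one bit: the parity of its set bits.
        ones = sum(bin(b).count("1") for b in data)
        return "1" if ones % 2 else "0"

    print(parity_compress(b"an entire camcorded film, as bytes"))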
If I run that against a camcording of a Disney film, the result is a 0 copyrighted by Disney, and in fact posting that 0 in this comment would make this comment also illegal, so I must disclaim that I did not actually produce that 0 from a camcorded Disney film.
If I run it against the book 'Dracula', the result is a 0 in the public domain.
The law does not understand bits, it does not understand compression or lossiness, it understands "humans can creatively transform things, algorithms cannot unless a human imbues creativity into it". It does not matter if your compressed output does not contain the original.
?
https://news.ycombinator.com/item?id=45489807
No. It's a decided case: it's transformative and fair use. My understanding of why it's transformative is that Google Books mainly offers a search interface for books, and it also has measures to make sure only snippets of books are shown.
> They are trained on public data at our expense so We The People should own them.
The people whose writing appears to have been trained on, for the interesting parts of the blog post, are mostly, like me, not American.
> AI should be free. Overhyped and Overpriced. I would love this setup for privacy and security.
Also, this entire blog post only exists because they're curious about a specific free open-weights model.
The "source" being ~"the internet", which we've got as much access to as most of the model makers (i.e. where you don't, they've got explicit licensing rights anyway), and possibly also some explicitly* pirated content (I've not been keeping track of which model makers have or have not done that).
* as in: not just incidentally
Your heart is in the right place here (I agree about FOSS), but there is a snowball's chance in hell that any of this ever happens in the USA. We'll be lucky if AI doesn't resemble cable TV by 2030.
this is questionable, but okay...
> at our expense
?
> so We The People should own them.
in addition to training data, it is my understanding that a model's architecture also largely determines its efficacy. Why should we own the architecture?
I wonder if this will mean more or less revealing of which models are running agentic flows (we currently abstract these as Fast/Smart).
It is also possible that the first model calls other models, and you could reverse-engineer the tool-call structure by seeing when glitches occur on different branches of the tool calling.
Basically, OpenAI's models are trained on web data that has a number of words where splitting on -der makes sense (e.g., mur-der, un-derstanding, won-derful; although the most common occurrence in a GitHub search for "\\xadder" is what appears to be an incorrectly encoded string "L\xc3\xadder", probably from the Portuguese and Spanish "Lí-der").
Anyways, using the o200k tokenizer `mur\xadder` yields two tokens (88762 and 179582). 88762 encodes "mur" and 179582 encodes "\xadder".
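A sketch to reproduce that check, assuming the tiktoken package (the soft hyphen is U+00AD, i.e. "\xad" in a Python string):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("mur\xadder")
    print(ids)  # the comment above reports [88762, 179582]
    print([enc.decode_single_token_bytes(i) for i in ids])  # the raw bytes behind each token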
Does it really imply they were trained on phrases FROM adult websites, or that those phrases FOR adult sites were common in the training data?
Blogspam, link-farms, affiliate marketing, etc, are extremely common for adult (and gambling) sites and likely result in a lot of data tainted with those phrases.
I wonder what the full 202 letter name of the o200k tokenizer is?
It strikes me less that they're from adult websites and more that they're from compromised sites. I've had that happen before, and when it does, it's mostly porn and stuff like that.