LLM-Deflate: Extracting LLMs into Datasets
Posted 4 months ago · Active 3 months ago
scalarlm.com · Tech · story
calm · mixed · Debate: 70/100
Key topics
LLMs
AI
Machine Learning
Data Extraction
The post discusses a method called LLM-Deflate to extract datasets from Large Language Models (LLMs), sparking a discussion on the implications, limitations, and potential applications of this technique.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 57m after posting
Peak period: 19 comments in the 6-12h window
Average per period: 4.9 comments
Based on 39 loaded comments
Key moments
- 01 Story posted: Sep 20, 2025 at 2:59 AM EDT (4 months ago)
- 02 First comment: Sep 20, 2025 at 3:57 AM EDT (57m after posting)
- 03 Peak activity: 19 comments in the 6-12h window (hottest stretch of the conversation)
- 04 Latest activity: Sep 23, 2025 at 11:14 AM EDT (3 months ago)
ID: 45311115 · Type: story · Last synced: 11/20/2025, 6:36:47 PM
Want the full context? Read the primary article or dive into the live Hacker News thread.
I wonder how you could do it more efficiently?
Is compression really lossy? What is an example of lost knowledge?
Or look at it another way: LLMs are just text-prediction machines. Whatever information doesn't help them predict the next token, or conflicts with the likelihood of the next token, gets dropped.
Or look at it another way: these things are often trained on many terabytes of internet text, yet even a 200-billion-parameter network is only 100 or 200 GB in size. So something is missing, and that is a far better compression ratio than the best known lossless compression algorithms achieve.
Or we can look at it yet another way: these things were never built to be lossless compression systems. We can tell from how they are implemented that they don't retain everything they're trained on; they extract a bunch of statistics.
So in short, what works is a model plus a way to tell its good outputs from bad ones.
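For a rough sense of the gap that compression-ratio comparison points at, here is a back-of-envelope sketch. Every figure in it is an assumption chosen to match the numbers mentioned in the comments above, not a measurement.

```python
# Back-of-envelope comparison (all figures are rough assumptions):
# how much smaller is a model than the text it was trained on?

TRAINING_TEXT_BYTES = 10e12   # assume ~10 TB of raw training text
MODEL_PARAMS = 200e9          # a 200-billion-parameter model
BYTES_PER_PARAM = 1.0         # ~1 byte/param at 8-bit quantization

model_bytes = MODEL_PARAMS * BYTES_PER_PARAM          # ~200 GB
model_ratio = TRAINING_TEXT_BYTES / model_bytes       # ~50:1

# Assumed typical ratio for a general-purpose lossless text compressor.
lossless_ratio = 4.0

print(f"model 'compression' ratio : {model_ratio:.0f}:1")
print(f"typical lossless ratio    : {lossless_ratio:.0f}:1")
```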
And the more you rely on those details (professional photography, scientific data), the more obvious it is, to the point of the image being useless in some cases.
Same with LLMs: we are currently testing how far we can go before we see obvious issues.
The crest always stays true to the description, but the details are always different.
Now that I think about it, heraldry is a practical way to describe how generative algorithms work.
It's not straightforward to prove that models have to be lossy. Sure, the training data is much larger than the model, but there is a huge amount of redundancy in the training data. You have to compare a hypothetically optimal compression of the training data to the size of the model to prove that it must be lossy. And yet, it's intuitively obvious that even the best lossless compression (measured in Kolmogorov complexity) of the training data is going to be vastly larger than the biggest models we have today.
You can always construct toy examples where this isn't the case. For example, you could just store all of the training data in your model and train another part of the model to read it out. But that's not an LLM anymore. Similarly, you could make an LLM out of synthetic, redundant data and it could achieve perfect recall. (Unless you're clever about how you generate it, though, any off-the-shelf compression algorithm is likely to produce something much, much smaller.)
A simple example of this is if you have 4 bits of data and have a compression algorithm that turns it into 2 bits of data. If your dataset only contains 0000, 0011, 1100, and 1111; then this can technically be considered lossless compression because we can always reconstruct the exact original data (e.g. 0011 compresses to 01 and can decompress back to 0011, 1100 compresses to 10 and can decompress back to 1100, etc). However, if our dataset later included 1101 and got compressed to 10, this is now “lossy” because it would decompress to 1100, that last bit was “lost”.
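Here is a minimal sketch of that toy codec, using the comment's own bit patterns. The fallback rule (map unknown inputs to the closest known pattern) is a hypothetical choice, picked so that 1101 decodes to 1100 as described.

```python
# Toy 4-bit -> 2-bit codec from the example above. It is lossless only
# for the four patterns it was built for; anything else is mapped to
# the closest known pattern, so information is lost.

ENCODE = {"0000": "00", "0011": "01", "1100": "10", "1111": "11"}
DECODE = {v: k for k, v in ENCODE.items()}

def compress(bits: str) -> str:
    if bits in ENCODE:
        return ENCODE[bits]
    # Fall back to the known pattern sharing the most bit positions.
    best = max(ENCODE, key=lambda p: sum(a == b for a, b in zip(p, bits)))
    return ENCODE[best]

def decompress(code: str) -> str:
    return DECODE[code]

assert decompress(compress("0011")) == "0011"   # round-trips exactly
assert decompress(compress("1101")) == "1100"   # last bit "lost" -> lossy
```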
An LLM is lossy compression because it lacks the capacity to 1:1 replicate all its input data 100% of the time. It can get quite close in some cases, sure, but it is not perfect every time. So it is considered “lossy”.
Some compression methods use LLMs internally and also store the residuals, making them lossless.
In general: depending on the method of compression, you can have lossy or lossless compression. Using 7zip on a bunch of text files compresses that data losslessly. Briefly, you calculate the statistics of the data you want to compress (the dictionary), and then make the commonly recurring chunks describable with fewer bits (the encoding). The compressed file basically contains the dictionary and the encoding.
For LLMs: there are ways to use an LLM (or any statistical model of text) to compress text data. But the techniques use a similar setup to the above, with a dictionary and an encoding, where the LLM takes the role of the dictionary. When "extracting" data from the dictionary alone, you're basically sampling from the dictionary's distribution.
Quantitatively, the "loss" in "lossy" being described is literally the number of bits used for the encoding.
I wrote a brief description here of techniques from an undergrad CS course that can be used: https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-comp...
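A minimal sketch of the "model + residuals = lossless" idea from the comments above, using a trivial most-frequent-next-character table as a stand-in for the LLM/dictionary. This is only an illustration, not how the linked techniques are actually implemented.

```python
from collections import defaultdict

def train(text: str) -> dict:
    # Most-frequent-successor table: a crude stand-in for the "dictionary".
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {a: max(nxt, key=nxt.get) for a, nxt in counts.items()}

def compress(text: str, model: dict):
    # Store only the positions where the model's guess is wrong ("residuals").
    residuals = [(i + 1, b) for i, (a, b) in enumerate(zip(text, text[1:]))
                 if model.get(a) != b]
    return text[0], len(text), residuals

def decompress(first: str, length: int, residuals, model: dict) -> str:
    fixes = dict(residuals)
    out = [first]
    for i in range(1, length):
        out.append(fixes.get(i, model.get(out[-1], "?")))
    return "".join(out)

text = "abababcabababc"
model = train(text)
packed = compress(text, model)
assert decompress(*packed, model) == text  # model + residuals: lossless
# Sampling from the model alone (no residuals) would only ever produce
# "ababab...": statistically plausible, but not the original data.
```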
I suspect most of the "leafs" are unusable.
An LLM is layers and layers of non-linear transformations, and it's hard to say exactly how information accumulates. You can inspect activations for individual tokens, but it's really not clear how to characterize what the function as a whole is doing. So the error is poorly understood.
Certainly it will show up once the generated data exceeds the original, e.g. after 1-10T tokens.
I think you could also do this faster by moving down the tree in a depth-first manner.
Typically I use this for knowledge transfer, style transfer, catastrophic-forgetting mitigation, etc., and so I don't go very far down the tree. I usually review the data samples manually before using them.
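A minimal sketch of the depth-first expansion idea, with a hypothetical ask_model helper standing in for real model calls; the article's actual pipeline may differ.

```python
# Depth-first topic expansion (sketch). Each node asks the model for a
# few generated samples and for subtopics, then descends before moving on.

def expand_depth_first(topic: str, depth: int, max_depth: int,
                       ask_model, samples: list):
    if depth > max_depth:
        return
    # Hypothetical call: returns generated Q/A samples for this topic.
    samples.extend(ask_model(f"Generate 3 Q&A pairs about: {topic}"))
    # Hypothetical call: returns a short list of subtopics.
    for sub in ask_model(f"List 3 subtopics of: {topic}"):
        expand_depth_first(sub, depth + 1, max_depth, ask_model, samples)

# Stubbed "model" so the sketch runs standalone.
def fake_model(prompt: str):
    return [f"{prompt} -> item {i}" for i in range(3)]

out = []
expand_depth_first("machine learning", 0, 2, fake_model, out)
print(len(out), "generated samples")  # 13 nodes x 3 samples = 39
```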
This is more like writing one's autobiography.
They prove that it is possible to fully clone a brain based on this method.
I think one could theoretically estimate how many queries you would need to make to do it. The worst case is proportional to the number of parameters of the model, i.e. at least 10^15 for a human. At one minute per spoken sample, that comes out to about 2 billion years to clone one human.
I suspect it is not practical without advances in neural interfaces that increase the bandwidth by billions of times.
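The arithmetic behind that estimate, using the comment's own assumed numbers:

```python
# Rough arithmetic only; both inputs are the comment's assumptions.
parameters = 1e15          # assumed "parameter count" of a human brain
minutes_per_sample = 1     # one spoken sample per minute
minutes_per_year = 60 * 24 * 365

years = parameters * minutes_per_sample / minutes_per_year
print(f"{years:.2e} years")   # ~1.9e9, i.e. about 2 billion years
```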
I personally like playing around with empirical methods like the one in this blog post to understand the practical efficiency of learning algorithms like backprop on transformers.
I also try not to invest too much effort into this topic given the ethical issues.
Uhm, no? I mean, some firms do abuse job interviews to pump candidates for usable information, and some have earned a notably bad reputation for that, which hurts their candidate funnel. But from the article: "Generating comprehensive datasets requires thousands of model calls per topic"; you aren't going to get a candidate to hang around for that...
It can be a description with a shorter bit length. Think Shannon entropy and the measure of information content. The information is still in the weights, but it is reorganized; the reconstructed sentences (or lists of tokens) will not reproduce the exact same bits, yet the information is still there.
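For reference, Shannon entropy is just the expected number of bits per symbol under a distribution; a quick toy illustration with an assumed four-token distribution:

```python
import math

# Shannon entropy H(X) = -sum p(x) * log2 p(x): average bits per symbol.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # nothing predictable: 2 bits/symbol
skewed  = [0.90, 0.05, 0.03, 0.02]   # highly predictable: ~0.62 bits/symbol

print(entropy(uniform))   # 2.0
print(entropy(skewed))    # ~0.62
```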