An LLM Is a Lossy Encyclopedia
Original: An LLM is a lossy encyclopedia
Key topics
The debate rages on: is a Large Language Model (LLM) a lossy encyclopedia or a super-smart librarian? Commenters weigh in, with some reframing LLMs as "pretty good librarians" who can "think aloud" to extract buried information, while others point out that detractors expect an oracle, not a lossily compressed blob of human knowledge. As one commenter quips, "you don't zoom-enhance JPEGs for a reason," highlighting the limitations of LLMs, while another notes that the real value lies in their unified interface, making it easier to access information. The discussion reveals a consensus that LLMs are not oracles, but rather a novel way to interact with a vast, imperfect database of human knowledge.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 4 days after posting
- Peak period: 151 comments (Day 5)
- Average per period: 26.7 comments
- Based on 160 loaded comments
Key moments
- 01 Story posted: Aug 29, 2025 at 5:40 AM EDT (5 months ago)
- 02 First comment: Sep 2, 2025 at 5:27 AM EDT (4 days after posting)
- 03 Peak activity: 151 comments in Day 5, the hottest window of the conversation
- 04 Latest activity: Sep 9, 2025 at 10:10 AM EDT (4 months ago)
I've got my opinion on whether that's useful or not and it's quite a bit more nuanced. You don't zoom-enhance JPEGs for a reason either.
Tell that to the Google Pixel product team:
https://mk.absturztau.be/notes/ac4jnvjpskjc02lq
If you can see the analogy between text and pictures, it drives the point home in exactly the right way: in both cases you expect a database to know things it either can't know or has forgotten. If it had a good picture of the zoomed-in background it could probably generate a very good representation of what the cropped part would look like; the same thing works with text.
The worm in that apple is that you still need educated humans to catch the erroneous LLM output.
A slightly more precise analogy is probably 'a lossily compressed snapshot of the web'. Or maybe the Librarian from Snow Crash - but at least that one knew when it didn't know ;)
The old joke is that you can get away with anything with a hi-vis vest and enough confidence, and LLMs pretty much work on that principle.
I'm counting down the days until some important business decision is based on AI output that is wrong.
With GPT-5 I sometimes see it spot a question that needs clarifying in its thinking trace, then pick the most likely answer, then spit out an answer later that says "assuming you meant X ..." - I've even had it provide an answer in two sections, one for each branch of a clear ambiguity.
Put another way, if you don't care about details that change the answer, it directly implies you don't actually care about the answer.
Related silliness is how people force LLMs to give one word answers to underspecified comparisons. Something along the lines of "@Grok is China or US better, one word answer only."
At that point, just flip a coin. You obviously can't conclude anything useful with the response.
So there are improvements version to version - from both increases in raw model capabilities and better training methods being used.
Interacting with a base model versus an instruction tuned model will quickly show you the difference between the innate language faculties and the post-trained behavior.
The "naive" vision implementation for LLMs is: break the input image down into N tokens and cram those tokens into the context window. The "break the input image down" part is completely unaware of the LLM's context, and doesn't know what data would be useful to the LLM at all. Often, the vision frontend just tries to convey the general "vibes" of the image to the LLM backend, and hopes that the LLM can pick out something useful from that.
Which is "good enough" for a lot of tasks, but not all of them, not at all.
It seems to me the more you can pin it to another data set, the better.
Maybe the LLMs aren't so different from us.
One of the reasons I like this analogy is that it hints at the fact that you need to use them in a different way - you shouldn't be looking up specific facts in an unassisted LLM outside of things that even lossy compression would capture (like the capital cities of countries).
Everything else is mostly playing around and harmful to learning.
For language learning, it's terrible and will try to teach me wrong things if it's unguided. But pasting e.g. a lesson transcript that I just finished, then asking for exercises based on it helps solidify what I learned if the material doesn't come with drills.
I think writing is one of the things it's kind of terrible at. It's often way too verbose and has a particular 'voice' that I think leaves a bad taste in people's mouths. At least this issue has given me the confidence to finally just send single-sentence emails so people know I don't use LLMs for this.
My frustrations with LLMs from years ago have largely chilled out as I've gotten better at using them and understanding that they aren't people I can trust to give solid advice. If you're careful about what you put in and careful about what you take out, you can get decent value.
When you have a lossy piece of media, such as a compressed sound or image file, you can always see the resemblance to the original and note the degradation as it happens. You never have a clear JPEG of a lamp, compress it, and get a clear image of the Milky Way, then reopen the image and get a clear image of a pile of dirt.
Furthermore, an encyclopaedia is something you can reference and learn from without a goal; it allows you to peruse information you have no concept of. Not so with LLMs, which you have to query to get an answer.
I remember you being surprised when the term “vibe coding” deviated from its original intention (I know you didn’t come up with it). But frankly I was surprised at your surprise—it was entirely predictable and obvious how the term was going to be used. The concept I’m attempting to communicate to you is that when you make up a term you have to think not only of the thing in your head but also of the image it conjures up in other people’s minds. Communication is a two-way street.
I grew up in socialism. Since we transitioned to democracy, I've learned that I have to unlearn some things. Our encyclopedias were not inaccurate, but they were not complete. It's like lying by omission. And as the old saying goes, half-truths are worse than lies.
Whether this would be deemed as a lossy encyclopedia, I don't know. What I am certain of, however, is that it was accurate but omitted important additional facts.
And that is what I see in LLMs as well. Overall, it's accurate, except in cases where an additional fact would alter the conclusion. So, it either could not find arguments with that fact, or it chose to ignore them to give an answer and could be prompted into taking them into account or whatever.
What I do know is that the LLMs of today give me the same heebie-jeebies that rereading those encyclopedias of my youth gives me.
(but it isn't and won't ever be an oracle and apparently that's a challenge for human psychology.)
But... end users need to understand this in order to use it effectively. They need to know if the LLM system they are talking to has access to a credible search engine and is good at distinguishing reliable sources from junk.
That's advanced knowledge at the moment!
Me: How do I change the language settings on YouTube?
Claude: Scroll to the bottom of the page and click the language button on the footer.
Me: YouTube pages scroll infinitely.
Claude: Sorry! Just click on the footer without scrolling, or navigate to a page where you can scroll to the bottom like a video.
(Videos pages also scroll indefinitely through comments)
Me: There is no footer, you're just making shit up
Claude: [finally uses a search engine to find the right answer]
They can often reason themselves into some very stupid direction, burning all the tokens for no reason and failing to reply in the end.
But it falls a bit short in that encyclopedias, lossy or not, shouldn't affirmatively contain false information. The way I would picture a lossy encyclopedia is that it can misdirect by omission, but it would not change A to ¬A.
Maybe a truthy-roulette encyclopedia?
That study ended the "you can't trust Wikipedia" argument: you can't fully trust anything, but Wikipedia is about as good as a second-hand reference gets.
An encyclopedia could say "general relativity is how the universe works" or it could say "general relativity and quantum mechanics describe how we understand the universe today and scientists are still searching for universal theory".
Both are short but the first statement is omitting important facts. Lossy in the sense of not explaining details is ok, but omitting swathes of information would be wrong.
Again, never really want a confidently-wrong encyclopedia, though
Oh but it's much worse than that: because most LLMs aren't deterministic in the way they operate [1], you can get a pristine image of a different pile of dirt every single time you ask.
[1] there are models where if you have the "model + prompt + seed" you're at least guaranteed to get the same output every single time. FWIW I use LLMs but I cannot integrate them in anything I produce when what they output ain't deterministic.
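As a concrete illustration of that footnote, here is a small sketch assuming the Hugging Face `transformers` library and the GPT-2 checkpoint (both just for demonstration): with greedy decoding, the same model and prompt produce the same output on the same hardware/software stack, which is the "model + prompt + seed" situation described.

```python
# Sketch of the footnote's point: with a fixed model, prompt, and greedy
# decoding (no sampling), repeated runs give the same output on the same
# hardware/software stack. GPT-2 is used purely as a small illustrative model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)  # would only matter if sampling were enabled
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("An LLM is a lossy", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                 # greedy: deterministic given model + prompt
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0], skip_special_tokens=True))
```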
Computers are deterministic. Most of the time. If you really don't think about all the times they aren't. But if you leave the CPU-land and go out into the real world, you don't have the privilege of working with deterministic systems at all.
Engineering with LLMs is closer to "designing a robust industrial process that's going to be performed by unskilled minimum wage workers" than it is to "writing a software algorithm". It's still an engineering problem - but of the kind that requires an entirely different frame of mind to tackle.
If everyone understood the distinction and their limitations, they wouldn’t be enjoying this level of hype, or leading to teen suicides and people giving themselves centuries-old psychiatric illnesses. If you “go out into the real world” you learn people do not understand LLMs aren’t deterministic and that they shouldn’t blindly accept their outputs.
https://archive.ph/rdL9W
https://archive.ph/20241023235325/https://www.nytimes.com/20...
https://archive.ph/20250808145022/https://www.404media.co/gu...
LLMs aren’t being sold as unreliable. On the contrary, they are being sold as the tool which will replace everyone and do a better job at a fraction of the price.
"LLM is like an overconfident human" certainly beats both "LLM is like a computer program" and "LLM is like a machine god". It's not perfect, but it's the best fit at 2 words or less.
That’s what I was trying to convey with the “then reopen the image” bit. But I chose a different image of a different thing rather than a different image of a similar thing.
My point is that I find the chosen term inadequate. The author made it up from combining two existing words, where one of them is a poor fit for what they’re aiming to convey.
E.g. a Bloom filter also doesn't "know" what it knows.
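A minimal Bloom filter makes that point concrete: it answers membership queries without being able to enumerate or verify what it actually stored, and it will sometimes say "yes" for items it never saw. A rough sketch, with parameters chosen for illustration rather than production use:

```python
# Minimal Bloom filter: it can answer "have I seen this?" without "knowing"
# what it has seen, and occasionally returns a false positive.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 64, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, item: str):
        # Derive k bit positions from salted hashes of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for word in ["paris", "tokyo", "lima"]:
    bf.add(word)

print(bf.might_contain("paris"))   # True
print(bf.might_contain("zagreb"))  # usually False, but can be a false positive
```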
I do understand and agree with a different point you’re making somewhere else in this thread, but it doesn’t seem related to what you’re saying here.
https://news.ycombinator.com/item?id=45101946
In compressed audio these can be things like clicks and boings and echoes and pre-echoes. In compressed images they can be ripply effects near edges, banding in smoothly varying regions, but there are also things like https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres... where one digit is replaced with a nice clean version of a different digit, which is pretty on-the-nose for the LLM failure mode you're talking about.
Compression artefacts generally affect small parts of the image or audio or video rather than replacing the whole thing -- but in the analogy, "the whole thing" is an encyclopaedia and the artefacts are affecting little bits of that.
Of course the analogy isn't exact. That would be why S.W. opens his post by saying "Since I love collecting questionable analogies for LLMs,".
It was quickly discovered that LLMs are capable of re-checking their own solutions if prompted - and, with the right prompts, are capable of spotting and correcting their own errors at a significantly-greater-than-chance rate. They just don't do it unprompted.
Eventually, it was found that reasoning RLVR consistently gets LLMs to check themselves and backtrack. It was also confirmed that this latent "error detection and correction" capability is present even at base model level, but is almost never exposed - not in base models and not in non-reasoning instruct-tuned LLMs.
The hypothesis I subscribe to is that any LLM has a strong "character self-consistency drive". This makes it reluctant to say "wait, no, maybe I was wrong just now", even if a latent awareness that the past reasoning looks sketchy as fuck is already present within the LLM. Reasoning RLVR encourages going against that drive and utilizing those latent error-correction capabilities.
"Language, Halliday argues, "cannot be equated with 'the set of all grammatical sentences', whether that set is conceived of as finite or infinite". He rejects the use of formal logic in linguistic theories as "irrelevant to the understanding of language" and the use of such approaches as "disastrous for linguistics"."
CS never solved the incoherence of language or the conduit-metaphor paradox. It's stuck behind language's bottleneck, and it stays there willingly, blind-eyed.
You weren't talking to GPT-4o about philosophy recently, were you?
Beyond this point engineers actually have to know what signaling is, rather than 'information.'
https://www.sciencedirect.com/science/article/abs/pii/S00033...
Ultimately, engineering chose the wrong approach to automating language, and it sinks the field. It's irreversible.
If you're hitching your wagon to human linguists, you'll always find yourself in a ditch in the end.
As of today, 'bad' generations early in the sequence still do tend towards responses that are distant from the ideal response. This is testable/verifiable by pre-filling responses, which I'd advise you to experiment with for yourself.
'Bad' generations early in the output sequence are somewhat mitigatable by injecting self-reflection tokens like 'wait', or with more sophisticated test-time compute techniques. However, those remedies can simultaneously turn 'good' generations into bad ones; they are post-hoc heuristics which treat symptoms, not causes.
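For anyone who wants to try the pre-filling experiment, here is a sketch assuming Anthropic's Messages API, which lets the final message be a partial assistant turn that the model then continues; the model id is a placeholder, and the injected "Wait" opener is the kind of self-reflection nudge described above.

```python
# Sketch of the pre-filling experiment: supply the start of the assistant's
# reply and let the model continue from it. Model id is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # placeholder; use whatever is current
    max_tokens=300,
    messages=[
        {"role": "user", "content": "Is 3,599 a prime number? Think it through."},
        # Prefill: force the opening of the assistant turn, then let it continue.
        {"role": "assistant", "content": "Wait, let me double-check my first instinct."},
    ],
)
print(response.content[0].text)
```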
In general, as the models become larger they are able to compress more of their training data. So yes, using the terminology of the commenter I was responding to, larger models should tend to have fewer 'compression artefacts' than smaller models.
OpenAI's in-house reasoning training is probably best in class, but even lesser naive implementations go a long way.
https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...
They attribute these 'compression artefacts' to pre-training, they also reference the original snowballing paper: How Language Model Hallucinations Can Snowball: https://arxiv.org/pdf/2305.13534
They further state that reasoning is no panacea. Whilst you did say: "the models mitigate more and more"
You were replying to my comment which said:
"'Bad' generations early in the output sequence are somewhat mitigatable by injecting self-reflection tokens like 'wait', or with more sophisticated test-time compute techniques."
So our statements there are logically compatible, i.e. you didn't make a statement that contradicts what I said.
"Our error analysis is general yet has specific implications for hallucination. It applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks."
"Search (and reasoning) are not panaceas. A number of studies have shown how language models augmented with search or Retrieval-Augmented Generation (RAG) reduce hallucinations (Lewis et al., 2020; Shuster et al., 2021; Nakano et al., 2021; Zhang and Zhang, 2025). However, Observation 1 holds for arbitrary language models, including those with RAG. In particular, the binary grading system itself still rewards guessing whenever search fails to yield a confident answer. Moreover, search may not help with miscalculations such as in the letter-counting example, or other intrinsic hallucinations"
> Of course the analogy isn't exact.
And I don’t expect it to be, which is something I’ve made clear several times before, including on this very thread.
https://news.ycombinator.com/item?id=45101679
I don’t think this is a great analogy.
Lossy compression of images or signals tends to throw out information based on how humans perceive it, focusing on the most important perceptual parts and discarding the less important parts. For example, JPEG essentially removes high frequency components from an image because more information is present with the low frequency parts. Similarly, POTS phone encoding and mp3 both compress audio signals based on how humans perceive audio frequency.
The perceived degradation of most lossy compression is gradual with the amount of compression and not typically what someone means when they say “make things up.”
LLM hallucinations aren’t gradual and the compression doesn’t seem to follow human perception.
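To make the JPEG comparison concrete, here is a tiny sketch (assuming NumPy and SciPy) of the frequency-domain idea: transform an 8x8 block with a DCT, discard the high-frequency coefficients, and reconstruct. On a smooth block the error is small and spread evenly, which is exactly the gradual, perceptually tuned degradation the comment describes, and quite unlike inventing new content.

```python
# Illustration of the JPEG idea: DCT an 8x8 block, keep only low-frequency
# coefficients, invert. Degradation is gradual, not a made-up replacement.
import numpy as np
from scipy.fft import dctn, idctn

x = np.arange(8)
block = np.add.outer(x, x) * 16.0          # a smooth 8x8 gradient block (0..224)

coeffs = dctn(block, norm="ortho")

# Keep only the low-frequency corner (the top-left 4x4 coefficients).
mask = np.zeros_like(coeffs)
mask[:4, :4] = 1
reconstructed = idctn(coeffs * mask, norm="ortho")

print(np.abs(block - reconstructed).mean())  # small, smoothly distributed error
```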
Compression artifacts (which are deterministic distortions in reconstruction) are not the same as hallucinations (plausible samples from a generative model; even when greedy, this is still sampling from the conditional distribution). A better identification is with super-resolution. If we use a generative model, the result will be clearer than a normal blotchy resize but a lot of details about the image will have changed as the model provides its best guesses at what the missing information could have been. LLMs aren't meant to reconstruct a source even though we can attempt to sample their distribution for snippets that are reasonable facsimiles from the original data.
An LLM provides a way to compute the probability of given strings. Once paired with entropy coding, on-line learning on the target data allows us to arrive at the correct MDL-based lossless-compression view of LLMs.
You're saying hammers shouldn't be squishy.
Simon is saying don't use a banana as a hammer.
No, that is not what I’m saying. My point is closer to “the words chosen to describe the made up concept do not translate to the idea being conveyed”. I tried to make that fit into your idea of the banana and squishy hammer, but now we’re several levels of abstraction deep using analogies to discuss analogies so it’s getting complicated to communicate clearly.
> Simon is saying don't use a banana as a hammer.
Which I agree with.
We are all free to agree with one part of an argument while disagreeing with another. That’s what healthy discourse is; life is not black and white. By way of example, if one says “apples are tasty because they are red”, it is perfectly congruent to agree apples are tasty but disagree that their colour is the reason. And by doing so we engage in a conversation to correct a misconception.
A librarian might bring you the wrong book, that's the former. An LLM does the latter. They are not the same.
It’s a lot less visible and, I guess, less dramatic than LLMs, but it happens frequently enough that I feel like at every major event there are false conspiracies based on video "proofs" that are just encoding artifacts.
Purely based on language use, you could expect "dog bit the man" more often than "man bit the dog", which is a lossy way to represent "dogs are more likely to bite people than vice versa." And there's also the second lossy part where information not occurring frequently enough in the training data will not survive training.
Of course, other things also include inaccurate information, frequent but otherwise useless sentences (any sentence with "Alice" and "Bob"), and the heavily pruned results of the post-training RL stage. So, you can't really separate the "encyclopedia" from the rest.
Also, I'm not sure lossy always means the loss is distributed (i.e., lower resolution). Loss can also be localized / biased (i.e., losing only black pixels); it's just that useful lossy compression algorithms tend to minimize the noticeable loss. Though I could be wrong.
In fact, the best compression algorithms and LLMs have this in common: they work by predicting the next word. Compression algorithms take an extra step called entropy coding to encode the difference between the prediction and the actual data efficiently, and the better the prediction, the better the compression ratio.
What makes an LLM "lossy" is that you don't have the "encode the difference" step.
And yes, it means you can turn an LLM into a (lossless) compression algorithm, and I think a really good one in terms of compression ratio on huge data sets. You can also turn a compression algorithm like gzip into a language model! A very terrible one, but the output is better than a random stream of bytes.
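A toy sketch of that "compression is prediction" point, using a smoothed character-level bigram model in place of an LLM: an entropy coder could emit each character in roughly -log2(p) bits, so better prediction means better compression. The corpus and smoothing constants here are made up for illustration; the entropy-coding step itself is only estimated, not implemented.

```python
# Toy "compression is prediction" demo: a character-level bigram model assigns
# a probability to each next character; an entropy coder (not implemented)
# could emit each character in about -log2(p) bits. Swap the bigram model for
# an LLM's next-token probabilities and you have the lossless-compression view.
import math
from collections import Counter, defaultdict

corpus = "the dog bit the man. the man fed the dog. the dog sat."
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def prob(prev: str, nxt: str, alpha: float = 0.5, vocab: int = 64) -> float:
    """Laplace-smoothed bigram probability of `nxt` following `prev`."""
    c = counts[prev]
    return (c[nxt] + alpha) / (sum(c.values()) + alpha * vocab)

text = "the dog sat."
bits = sum(-math.log2(prob(p, n)) for p, n in zip(text, text[1:]))
print(f"{bits:.1f} bits is roughly the ideal compressed size of {len(text)-1} characters")
```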
We would have very different conversations if LLMs were things that merely exploded into a singular lossy-expanded version of Wikipedia, but where looking at the article for any topic X would give you the exact same article each time.
Furthermore, even in the absence of randomness, asking an LLM the same question in different ways can yield different, potentially contradictory answers, even when the difference in prompting is perfectly benign.
You see this with humans, who encode physical space into a spatial matrix in the brain. When asking for directions, people have to traverse this matrix until the route is memorized; after that it isn’t used any longer, and only the rote data is referenced.
In my experience as a human, the more you know about a subject, or even the more you have simply seen content about it, the easier it is to ramble on about it convincingly. It's like a mirroring skill, and it does not actually mean you understand what you're saying.
LLMs seem to do the same thing, I think. At scale this is widely useful, though; I am not discounting it. I just think it's an order of magnitude below what's possible, and all this talk of existing stream-of-consciousness-like LLMs creating AGI seems like a miss.
> The key thing is to develop an intuition for questions it can usefully answer vs questions that are at a level of detail where the lossiness matters
the problem is that in order to develop an intuition for questions that LLMs can answer, the user will at least need to know something about the topic beforehand. I believe that this lack of initial understanding of the user input is what can lead to taking LLM output as factual. If one side of the exchange knows nothing about the subject, the other side can use jargon and even present random facts or lossy facts which can almost guarantee to impress the other side.
> The way to solve this particular problem is to make a correct example available to it.
My question is how much effort would it take to make a correct example available for the LLM before it can output quality and useful data? If the effort I put in is more than what I would get in return, then I feel like it's best to write and reason it myself.
This is why simonw (the author) has his "pelican on a bike" test; it's not 100% accurate, but it is a good indicator.
I have a set of my own standard queries and problems (no counting characters or algebra crap) that I feed to new LLMs I'm testing.
None of the questions exist outside of my own Obsidian note so they can't be gamed by LLM authors. And I've tested multiple different LLMs using them so I have a "feeling" on what the answer should look like. And I personally know the correct answer so I can immediately validate them.
This is why I've said a few times here on HN and elsewhere, if you're using an LLM you need to think of yourself as an architect guiding a Junior to Mid Level developer. Juniors can do amazing things, they can also goof up hard. What's really funny is you can make them audit their own code in a new context window, and give you a detailed answer as to why that code is awful.
I use it mostly on personal projects especially since I can prototype quickly as needed.
The thing is, coding can (and should) be part of the design process. Many times I thought I had a good idea of what the solution should look like, then while coding I got exposed more to the libraries and other parts of the code, which led me to a more refined approach. This exposure is what you will miss, and it will quickly result in unfamiliar code.
I used ChatGPT 5 over the weekend to double check dosing guidelines for a specific medication. "Provide dosage guidelines for medication [insert here]"
It spit back dosing guidelines that were an order of magnitude wrong (suggested 100mcg instead of 1mg). When I saw 100mcg, I was suspicious and said "I don't think that's right" and it quickly corrected itself and provided the correct dosing guidelines.
These are the kind of innocent errors that can be dangerous if users trust it blindly.
The main challenge is that LLMs aren't able to gauge confidence in their answers, so they can't adjust how confidently they communicate information back to you. It's like compressing a photo and a photographer wrongly saying "here's the best quality image I have!" Do you trust the photographer at their word, or do you challenge him to find a better-quality image?
Regardless, its diagnostic capability is distinct from the dangers it presents, which is what the parent comment was mentioning.
I have good insurance and have a primary care doctor with whom I have good rapport. But I can’t talk to her every time I have a medical question—it can take weeks to just get a phone call! If I manage to get an appointment, it’s a 15 minute slot, and I have to try to remember all of the relevant info as we speed through possible diagnoses.
Using an llm not for diagnosis but to shape my knowledge means that my questions are better and more pointed, and I have a baseline understanding of the terminology. They’ll steer you wrong on the fine points, but they’ll also steer you _right_ on the general stuff in a way that Dr. Google doesn’t.
One other anecdote. My daughter went to the ER earlier this year with some concerning symptoms. The first panel of doctors dismissed it as normal childhood stuff and sent her home. It took 24 hours, a second visit, and an ambulance ride to a children’s hospital to get to the real cause. Meanwhile, I gave a comprehensive description of her symptoms and history to an llm to try to get a handle on what I should be asking the doctors, and it gave me some possible diagnoses—including a very rare one that turned out to be the cause. (Kid is doing great now). I’m still gonna take my kids to the doctor when they’re sick, of course, but I’m also going to use whatever tools I can to get a better sense of how to manage our health and how to interact with the medical system.
I also have good insurance and a PCP. The idea that I could call them up just to ask “should I start doing this new exercise” or “how much aspirin for this sprained ankle?” is completely divorced from reality.
And "your doctor" is actually "any doctor that is willing to write you a prescription for our medicine".
i'm not going to call my doctor to ask "is it okay if I try doing kettlebell squats?"
But also, maybe calling your doctor would be wise (eg if you have back problems) before you start doing kettlebell squats.
I'd say that the audience for a lot of health related content skews towards people who should probably be seeing a doctor anyway.
The cynic in me also thinks some of the "ask your doctor" statements are just slapped on to artificially give credence to whatever the article is talking about (eg "this is serious exercise/diet/etc).
Edit: I guess what I meant is: I don't think it's just "liability", but genuine advice/best practice/wisdom for a sizable chunk of audiences.
That's exactly what I (and most people I know) routinely do both in Italy and France. Like, "when in doubt, call the doc". I wouldn't know where to start if I had to handle this kind of stuff exclusively by myself.
> If I manage to get an appointment, it’s a 15 minute slot
I'm sorry that this is what "good insurance" gets you.
This probably varies by locale. For example my doctor responds within 1 day on MyChart for quick questions. I can set up an in person or video appointment with her within a week, easily booked on MyChart as well.
I’d encourage you to find another doctor.
E-mails and communication are completely free of charge.
We all know that Google and LLMs are not the answer for your medical questions, but that they cause fear and stress instead.
In 40 years, only one of my doctors had the decency to correct his mistake after I pointed it out.
He prescribed the wrong Antibiotics, which I only knew because I did something dumb and wondered if the prescribed antibiotics cover a specific strain, which they didn't, which I knew because I asked an LLM and then superficially double-checked via trustworthy official, government sources.
He then prescribed the correct antibiotics. In all other cases where I pointed out a mistake, back in the day researched without LLMs, doctors justified their logic, sometimes siding with a colleague or "the team" before evaluating the facts themselves, instead of having an independent opinion, which, AFAIK, especially in a field like medicine, is _absolutely_ imperative.
The ol' "What weighs more, a pound of feathers or two pounds of bricks" trick explains this perfectly to me.
Why is an LLM unable to read a table of church times across a sampling of ~5 Filipino churches?
Google LLM (Gemini??) was clearly finding the correct page. I just grabbed my mom's phone after another bad mass time and clicked on the hyperlink. The LLM was seemingly unable to parse the table at all.
I can also see them as very clever search engines, since this is one way I use them a lot: ask hard questions about a huge and legacy codebase.
These analogies do not really work for generating new code. A new metaphor I am starting to use is "translator engine": it is translating from human language to programming language. It in a way explains a lot of the stupidity I am seeing.
The models hold more information than they can immediately extract, but CoT can find a key to look it up or synthesise it by applying some learned generalisations.
Imagine a slightly lossy compression algorithm which can store 10x, 100x what the current best lossless ones can, and maintain 99.999% fidelity when recalling that information. Probably, very probably, a pipe dream. But why do large on-device models seem to be able to remember just about everything from Wikipedia and store it in a smaller format than a direct archive of the source material? (Look at the current best from diffusion models as well.)
llm is a pretty good librarian who has read a ton of books (and doesn't have perfect memory)
even more useful when allowed to think-aloud
even more useful when allowed to write stuff down and check in library db
even more useful when allowed to go browse and pick up some books
even more useful when given a budget for travel and access to other archives
even more useful when …
brrrrt