28m Hacker News Comments as Vector Embedding Search Dataset
Key topics
Regulars are buzzing about the release of a dataset containing 28 million comments, sparking a lively debate about the permanence of online content. Commenters riff on the idea that once something is posted online, it's effectively permanent, with some lamenting the lack of a delete option and others pointing out that replicas of the dataset are already scattered across the web. The discussion takes a philosophical turn, with some noting that even if links rot, the original content has likely been copied and embedded in AI models, making it virtually impossible to erase. As one commenter wryly observes, even granite decomposes – just not quickly.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 20m after posting
- Peak period: 153 comments (Day 1)
- Avg / period: 40 comments
- Based on 160 loaded comments
Key moments
- 01 Story posted: Nov 28, 2025 at 1:02 PM EST (about 1 month ago)
- 02 First comment: Nov 28, 2025 at 1:22 PM EST (20m after posting)
- 03 Peak activity: 153 comments in Day 1, the hottest window of the conversation
- 04 Latest activity: Dec 10, 2025 at 3:37 AM EST (23 days ago)
I've definitely heard that one before... Explain link rot to me then, or why the internet archive even exists?
tldr: both of these things can be true.
As an archive that supplements my personal archive, and the archives of many others, including the one being lamented in this very thread for HN, and others such as the one used for https://github.com/afiodorov/hn-search
The way to eliminate your comments would be to take over world government, use your copy of the archives of the entire internet in order to track down the people who most likely have created their own copies, and to utilize worldwide swat teams with trained searchers, forensics experts and memory-sniffing dogs. When in doubt, just fire missiles at the entire area. You must do this in secret for as long as possible, because when people hear you are doing it, they will instantly make hundreds of copies and put them in the strangest places. You will have to shut down the internet. When you are sure you have everything, delete your copy. You still may have missed one.
Source is at https://github.com/afiodorov/hn-search
Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.
Wanted to do this for my own upvotes so I can see the kinds of things I like, find them again more easily, or resurface them when relevant.
https://bsky.app/profile/reachsumit.com
This is funny to me in a number of ways. I doubt anyone would be interested in post-2023 data dumps for fear they would be too contaminated with content produced by LLMs. It's also funny that the archive was hosted on Hugging Face, which removes any sliver of doubt that they scraped the site.
No it didn't. As the top comment in that thread points out, there were a large number of false positives.
I wonder how many times the same discussion thread has been repeated across different posts. It would be quite interesting to see, before you respond to something, how others previously responded to what you are about to say.
Semantic threads or something would be the general idea... Pretty cool concept actually...
>With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein.
>By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose
https://clickhouse.com/deals/ycombinator
ah, if only I knew about this small little legal detail when I made my account...
The law. And the license agreed when I made the account.
The law? I don't know, copyright law I guess?
I don't know why they continue to stand by this massive breach of privacy.
Citizens of any country should have the right to sue to remove personal information from any website at any time, regardless of how it got there.
Right to be forgotten should be universal.
It's worse than that, it's an obvious GDPR violation. But it hasn't been tested in a (european) court yet. One day, it will be, and much rejoicing would be had then.
It's also a shitty provision in that it's not made clear when signing up for HN, as it is a pretty uncommon one.
You write like a Bible.
If you don't apply these values to yourself, you should not be so strict about demanding others forget this or delete that. It's human nature to want to keep track of other people's misdeeds and transgressions; in my opinion it's a human right that we be allowed to do so. The right to speak is more grounded than the right to not have other people speak about us.
If anyone owns this comment it's me IMO. So I don't see any reason why HN should be able to sue anyone for using this freely available information.
I just don't understand the public outrage. Why is everyone so worried about this? I write stuff knowing it's publicly available, and I don't give a crap about HN or Reddit or whomever's claims to my writings.
As far as I'm concerned it's all public domain, so what if OpenAI trains on it? Why should that bother me? I just don't understand, it really just feels like a witch hunt, like everyone just wants to hate AI companies and they'll jump on any bandwagon that's against them no matter how nonsensical it is.
Why wouldn’t someone be mad about that?
So the equation still balances for them to not give a damn
Consider the lengths taken by clean-room teams to legally reverse engineer without copyright infringement. The classic example is the IBM PC BIOS. The source code was published by IBM and was widely available, but it was copyrighted, so none of the programmers involved in making the clone could have seen the original source. The reverse engineering of the original source has to be documented by one team, and that documentation is then used by a second team to re-implement it. If you have direct flow (the same person reading one source and then writing something that looks the same), the copyright holder has an unpleasantly strong argument that you are infringing their copyright. The only way to definitively refute that assertion is: no, they have literally never seen the source; it was in fact a truly independent recreation of the same algorithm, and it naturally looks similar due to something akin to convergent evolution; it has to look that way because there are only so many ways to express it.
These modern brain prosthetics are darn good.
For a forum of users that's supposed to be smarter than Reddit, we sure make ourselves out to be just as unsmart as those Reddit users are purported to be. To not be able to understand the intent/meaning of "for commercial use" is so mind-boggling it has to be intentional. The purpose is what I'm still unclear on, though.
You can anthropomorphize all you want, but AI is not a human and the law will not see it as such.
I am not sure if it is that clear cut.
Embeddings are encodings of shared abstract concepts statistically inferred from many works or expressions of thoughts possessed by all humans.
With text embeddings, we get a many-to-one, lossy map: many possible texts ↝ one vector that preserves some structure about meaning and some structure about style, but not enough to reconstruct the original in general, and there is no principled way to say «this vector is derived specifically from that paragraph authored by XYZ».
Does the encoded representation of the abstract concepts constitute a derivative work? If yes, then every statement ever made by a human being is a work derivative of someone else's by virtue of having learned to speak in childhood – every speaker creates a derivative work of all prior speakers.
Technically, there is a strong argument against treating ordinary embedding vectors as derivative works, because:
- Embeddings are not uniquely reversible and, in general, it is not possible to reconstruct the original text from the embedding;
- The embedding is one of an uncountable number of vectors in a space where nearby points correspond to many different possible sentences;
- Any individual vector is not meaningfully «the same» as the original work in the way that a translation or an adaptation is.
Please note that this is a philosophical take and it glosses over the legally relevant differences between human and machine learning, as the legal question ultimately depends on statutes, case law and policy choices that are still evolving.
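The many-to-one, lossy nature of the map can be illustrated with a deliberately crude toy embedder (a hypothetical hashing bag-of-words; real models are learned, but the non-reversibility point is the same): distinct texts collapse onto the same vector, so there is no inverse map back to any single original.

```python
import hashlib

import numpy as np

def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Toy hashing bag-of-words embedder: many texts -> one small vector.

    Hypothetical illustration only, not a real embedding model.
    """
    vec = np.zeros(dim)
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0  # word order is discarded entirely
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

a = toy_embed("the cat sat on the mat")
b = toy_embed("on the mat the cat sat")   # different text, same token multiset
c = toy_embed("quarterly earnings report")

print(np.allclose(a, b))  # True: the map is many-to-one, so it has no inverse
print(float(a @ c))       # unrelated text typically lands far from a
```

Any reordering of the same words produces an identical vector here; a learned model is less extreme, but the map remains lossy and non-invertible in the same sense.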
Where it gets more complicated.
If the embedding model has been trained on a large number of languages, it makes cross-lingual search possible: you can search with an abstract concept expressed in any language the model has been trained on. The quality of such search results across languages X, Y and Z will be directly proportional to the scale and quality of the training corpus in those languages.
Therefore, I can search for «the meaning of life»[0] in English and arrive at a highly relevant cluster of search results written in different languages by different people at different times, and the question becomes «what exactly has it been statistically[1] derived from?».
[0] The cross-lingual search is what I tried with my engineers last year, to our surprise and delight at how well it actually worked.
[1] In the legal sense, if one can't trace a given vector uniquely back to a specific underlying copyrighted expression, and demonstrate substantial similarity of expression rather than idea, the «derivative work» argument in the legal sense becomes strained.
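Mechanically, cross-lingual search is just cosine nearest-neighbour lookup in the shared vector space. A minimal sketch, using made-up vectors standing in for the output of a multilingual embedding model (the document texts and coordinates here are hypothetical):

```python
import numpy as np

# Hypothetical pre-computed embeddings. A real multilingual model would
# place translations of the same idea near each other in this space.
docs = {
    "en: the meaning of life":  np.array([0.98, 0.15, 0.10]),
    "fr: le sens de la vie":    np.array([0.95, 0.20, 0.23]),
    "de: der Sinn des Lebens":  np.array([0.96, 0.18, 0.21]),
    "en: GPU memory bandwidth": np.array([0.05, 0.99, 0.10]),
}

def search(query_vec: np.ndarray, docs: dict, k: int = 3) -> list[str]:
    """Return the k texts whose vectors have the highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for text, vec in docs.items():
        v = vec / np.linalg.norm(vec)
        scored.append((float(q @ v), text))
    return [text for _, text in sorted(scored, reverse=True)[:k]]

query = np.array([0.97, 0.17, 0.15])  # "meaning of life", in any language
print(search(query, docs))
```

Because the language never enters the distance computation, the top hits are whichever documents sit closest in the space, regardless of the language they were written in.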
Sure and some people would want a "gun prosthesis" as an aid to quickly throw small metallic objects, and it wouldn't be allowed either.
Data of non-european users who just click the "delete" button in their user profile? Completely different beast.
I've never been convinced that my data will be deleted from any long-term backups. There's nothing preventing them from periodically restoring data from a previous backup without doing any due diligence to ensure hard-deleted data stays deleted.
Who in the EU is actually going in and auditing hard deletes? If you log in and can no longer see the data because a soft-delete flag prevents it from being displayed, and/or any "give me a report of the data you have on me" report comes back empty because of that flag, how does anyone prove their data was only soft deleted?
[1] https://www.ycombinator.com/legal/#tou
> IS THE PURPOSE OF COMMENTS ON THIS WEBSITE TO TRAIN SOME COMMERCIAL MODEL?
You’ve cracked it. Back in 2007 PG implemented the plot to gather THE smartest on one website with intention to execute order 66 and create genius LLM to earn gorriliard of dolares. Now that you’ve figured out the plot, you have a chance to stop it all, because with you it will all fall apart.
> this might affect my involvement here going forward
Please, keep everyone posted!
It used to be about helping strangers in some small way. Now it's helping people I don't like more than people I do like.
Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic model to use in sentence-transformers when vector stores were in their infancy, but it's old: it does not incorporate the newest advances in architectures and training data pipelines, and it has a low context length of 512 tokens when newer embedding models handle 2k+ with more efficient tokenizers.
For open-weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead which has incredible benchmarks and a 2k context window: although it's larger/slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
Open weights, multilingual, 32k context.
I have ~50 million sentences from English Project Gutenberg novels embedded with this.
Back in June and August I wrote some LLM-assisted blog posts about a few of the experiments.
They are here: sjsteiner.substack.com
Are there any solid models that can be downloaded client-side in less than 100MB?
https://huggingface.co/MongoDB/mdbr-leaf-ir
Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.
(Just took a look and it has the problem that it forbids certain "restricted uses" that are listed in another document which it says it "is hereby incorporated by reference into this Agreement" - in other words Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)
I have trouble navigating this space as there's so much choice, and I don't know exactly how to "benchmark" an embedding model for my use cases.
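One common way to benchmark an embedding model for your own use case is a small labelled retrieval set: queries paired with the document each should retrieve, scored by recall@k. A minimal sketch (the corpus, the `bow` stand-in embedder, and the helper names are all hypothetical; in practice you would swap in a wrapper around each candidate model):

```python
import numpy as np

def recall_at_k(embed, queries, corpus, relevant, k=1):
    """Fraction of queries whose known-relevant document lands in the
    top-k cosine neighbours. `embed` is any text -> vector function."""
    docs = np.stack([embed(d) for d in corpus])
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    hits = 0
    for query, rel in zip(queries, relevant):
        qv = embed(query)
        qv = qv / np.linalg.norm(qv)
        top = np.argsort(docs @ qv)[::-1][:k]
        hits += int(corpus.index(rel) in top)
    return hits / len(queries)

# Demo with a trivial bag-of-words embedder as a stand-in for a real model.
corpus = ["vector search with embeddings",
          "gpu memory bandwidth limits",
          "sourdough bread starter recipe"]
vocab = {w: i for i, w in enumerate(sorted({t for d in corpus for t in d.split()}))}

def bow(text):
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

print(recall_at_k(bow, ["vector search"], corpus, [corpus[0]], k=1))  # 1.0
```

A few dozen hand-labelled query/document pairs from your own data, run through each candidate model, usually separates them more reliably than public leaderboard numbers.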
AI: The problem with your question is that...
"Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data. You can find it here: https://huggingface.co/datasets/OpenPipe/hacker-news and, honestly, I'm not really sure how this was obtained, if using scraping or if HN makes this data public in some way."
For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp-style quants to it (e.g. Q4_K, MXFP4, etc.). An example of this is below.
https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
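A rough sketch of what a llama.cpp-style 4-bit block quant does to an embedding vector (simplified: the real GGUF formats pack two values per byte with specific bit layouts and variants like Q4_K; the function names here are hypothetical):

```python
import numpy as np

def quantize_q4_blocks(vec, block=32):
    """Symmetric 4-bit block quantization: each block of 32 values shares
    one scale, and values are rounded to integers in [-8, 7]."""
    vec = vec.astype(np.float32)
    pad = (-len(vec)) % block
    v = np.pad(vec, (0, pad)).reshape(-1, block)
    scales = np.abs(v).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    q = np.clip(np.round(v / scales), -8, 7).astype(np.int8)
    return q, scales.squeeze(1), len(vec)

def dequantize(q, scales, n):
    """Recover an approximate float vector from quants and block scales."""
    return (q * scales[:, None]).reshape(-1)[:n].astype(np.float32)

rng = np.random.default_rng(0)
v = rng.standard_normal(384).astype(np.float32)    # e.g. a MiniLM-sized vector
q, s, n = quantize_q4_blocks(v)
v2 = dequantize(q, s, n)
print(np.abs(v - v2).max())  # small per-value error from 4-bit rounding
```

Each value needs only 4 bits plus one scale per 32-value block, so packed storage runs roughly 8x smaller than float32, usually with little effect on cosine-similarity rankings.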