28m Hacker News Comments as Vector Embedding Search Dataset
Key topics
Regulars are buzzing about the release of a dataset containing 28 million comments, sparking a lively debate about the permanence of online content. Commenters riff on the idea that once something is posted online, it's effectively permanent, with some lamenting the lack of a delete option and others pointing out that replicas of the dataset are already scattered across the web. The discussion takes a philosophical turn, with some noting that even if links rot, the original content has likely been copied and embedded in AI models, making it virtually impossible to erase. As one commenter wryly observes, even granite decomposes – just not quickly.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 20m after posting
- Peak period: 153 comments (Day 1)
- Avg / period: 40 comments
- Based on 160 loaded comments
Key moments
- 01 Story posted: Nov 28, 2025 at 1:02 PM EST (about 1 month ago)
- 02 First comment: Nov 28, 2025 at 1:22 PM EST (20m after posting)
- 03 Peak activity: 153 comments in Day 1, the hottest window of the conversation
- 04 Latest activity: Dec 10, 2025 at 3:37 AM EST (23 days ago)
I've definitely heard that one before... Explain link rot to me then, or why the internet archive even exists?
tldr: both of these things can be true.
As an archive that supplements my personal archive, and the archives of many others, including the one being lamented in this very thread for HN, and others such as the one used for https://github.com/afiodorov/hn-search
The way to eliminate your comments would be to take over world government, use your copy of the archives of the entire internet in order to track down the people who most likely have created their own copies, and to utilize worldwide swat teams with trained searchers, forensics experts and memory-sniffing dogs. When in doubt, just fire missiles at the entire area. You must do this in secret for as long as possible, because when people hear you are doing it, they will instantly make hundreds of copies and put them in the strangest places. You will have to shut down the internet. When you are sure you have everything, delete your copy. You still may have missed one.
Source is at https://github.com/afiodorov/hn-search
Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.
Wanted to do this for my own upvotes so I can see the kinds of things I like, find them again more easily, or resurface them when relevant.
https://bsky.app/profile/reachsumit.com
This is funny to me in a number of ways. I doubt anyone would be interested in post-2023 data dumps for fear they would be too contaminated with content produced by LLMs. It's also funny that the archive was hosted on Hugging Face, which removes any sliver of doubt that they scraped the site.
No it didn't. As the top comment in that thread points out, there were a large number of false positives.
I wonder how many times the same discussion thread has been repeated across different posts. It would be quite interesting to see, before you respond to something, how others previously responded to what you are about to say.
Semantic threads or something would be the general idea... Pretty cool concept actually...
>With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein.
>By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose
https://clickhouse.com/deals/ycombinator
ah, if only I knew about this small little legal detail when I made my account...
The law. And the license agreed when I made the account.
The law? I don't know, copyright law I guess?
I don't know why they continue to stand by this massive breach of privacy.
Citizens of any country should have the right to sue to remove personal information from any website at any time, regardless of how it got there.
Right to be forgotten should be universal.
It's worse than that, it's an obvious GDPR violation. But it hasn't been tested in a (european) court yet. One day, it will be, and much rejoicing would be had then.
It's also a shitty provision in that it's not made clear when signing up for HN, as it is a pretty uncommon one.
You write like a Bible.
If you don't apply these values to yourself, you should not be so strict about demanding others forget this or delete that. It's human nature to want to keep track of other people's misdeeds and transgressions; in my opinion it's a human right that we be allowed to do so. The right to speak is more grounded than the right to not have other people speak about us.
If anyone owns this comment it's me IMO. So I don't see any reason why HN should be able to sue anyone for using this freely available information.
I just don't understand the public outrage. Why is everyone so worried about this? I write stuff knowing it's publicly available, and I don't give a crap about HN or Reddit or whomever's claims to my writings.
As far as I'm concerned it's all public domain, so what if OpenAI trains on it? Why should that bother me? I just don't understand, it really just feels like a witch hunt, like everyone just wants to hate AI companies and they'll jump on any bandwagon that's against them no matter how nonsensical it is.
Why wouldn’t someone be mad about that?
So the equation still balances for them to not give a damn
Consider the lengths taken by clean-room teams to legally reverse engineer without copyright infringement. The classic example is the IBM PC BIOS. The source code was published by IBM and was widely available, but it was copyrighted, so none of the programmers involved in making the clone could have seen the original source. The reverse engineering of the original source has to be documented by one team, and that documentation is then used by a second team to re-implement it. If you have direct flow (the same person reading one source and then writing something that looks the same), the copyright holder has an unpleasantly strong argument that you are infringing their copyright. The only way to definitively refute that assertion is: no, they have literally never seen the source; it was in fact a truly independent recreation of the same algorithm, and it naturally looks similar due to something akin to convergent evolution; it has to look that way because there are only so many ways to express it.
These modern brain prosthetics are darn good.
For a forum of users that's supposed to be smarter than Reddit, we sure make ourselves out to be just as unsmart as those Reddit users are purported to be. To not be able to understand the intent/meaning of "for commercial use" is so mind-boggling it has to be intentional. The purpose is what I'm still unclear on, though.
You can anthropomorphize all you want, but AI is not a human and the law will not see it as such.
I am not sure if it is that clear cut.
Embeddings are encodings of shared abstract concepts statistically inferred from many works or expressions of thoughts possessed by all humans.
With text embeddings, we get a many-to-one, lossy map: many possible texts ↝ one vector that preserves some structure about meaning and some structure about style, but not enough to reconstruct the original in general, and there is no principled way to say «this vector is derived specifically from that paragraph authored by XYZ».
Does the encoded representation of the abstract concepts constitute a derivative work? If yes, then every statement ever made by a human being is a work derivative of someone else's by virtue of having learned to speak in childhood – every speaker creates a derivative work of all prior speakers.
Technically, there is a strong argument against treating ordinary embedding vectors as derivative works, because:
- Embeddings are not uniquely reversible and, in general, it is not possible to reconstruct the original text from the embedding;
- The embedding is one of an uncountable number of vectors in a space where nearby points correspond to many different possible sentences;
- Any individual vector is not meaningfully «the same» as the original work in the way that a translation or an adaptation is.
Please note that this is a philosophical take and it glosses over the legally relevant differences between human and machine learning, as the legal question ultimately depends on statutes, case law and policy choices that are still evolving.
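The many-to-one, lossy nature of the map can be illustrated with a deliberately crude toy embedder (a hypothetical hashing bag-of-words; real models are learned, but the non-reversibility point is the same): distinct texts collapse onto the same vector, so there is no inverse map back to any single original.

```python
import hashlib

import numpy as np

def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Toy hashing bag-of-words embedder: many texts -> one small vector.

    Hypothetical illustration only, not a real embedding model.
    """
    vec = np.zeros(dim)
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0  # word order is discarded entirely
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

a = toy_embed("the cat sat on the mat")
b = toy_embed("on the mat the cat sat")   # different text, same token multiset
c = toy_embed("quarterly earnings report")

print(np.allclose(a, b))  # True: the map is many-to-one, so it has no inverse
print(float(a @ c))       # unrelated text typically lands far from a
```

Any reordering of the same words produces an identical vector here; a learned model is less extreme, but the map remains lossy and non-invertible in the same sense.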
Where it gets more complicated.
If the embedding model has been trained on a large number of languages, it makes cross-lingual search possible: you can search with an abstract concept expressed in any language the model has been trained on. The quality of such search results across languages X, Y and Z will be directly proportional to the scale and quality of the training corpus in those languages.
Therefore, I can search for «the meaning of life»[0] in English and arrive at a highly relevant cluster of search results written in different languages by different people at different times, and the question becomes «what exactly has it been statistically[1] derived from?».
[0] The cross-lingual search is what I tried with my engineers last year, to our surprise and delight at how well it actually worked.
[1] In the legal sense, if one can't trace a given vector uniquely back to a specific underlying copyrighted expression, and demonstrate substantial similarity of expression rather than idea, the «derivative work» argument in the legal sense becomes strained.
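Mechanically, cross-lingual search is just cosine nearest-neighbour lookup in the shared vector space. A minimal sketch, using made-up vectors standing in for the output of a multilingual embedding model (the document texts and coordinates here are hypothetical):

```python
import numpy as np

# Hypothetical pre-computed embeddings. A real multilingual model would
# place translations of the same idea near each other in this space.
docs = {
    "en: the meaning of life":  np.array([0.98, 0.15, 0.10]),
    "fr: le sens de la vie":    np.array([0.95, 0.20, 0.23]),
    "de: der Sinn des Lebens":  np.array([0.96, 0.18, 0.21]),
    "en: GPU memory bandwidth": np.array([0.05, 0.99, 0.10]),
}

def search(query_vec: np.ndarray, docs: dict, k: int = 3) -> list[str]:
    """Return the k texts whose vectors have the highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for text, vec in docs.items():
        v = vec / np.linalg.norm(vec)
        scored.append((float(q @ v), text))
    return [text for _, text in sorted(scored, reverse=True)[:k]]

query = np.array([0.97, 0.17, 0.15])  # "meaning of life", in any language
print(search(query, docs))
```

Because the language never enters the distance computation, the top hits are whichever documents sit closest in the space, regardless of the language they were written in.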
Sure and some people would want a "gun prosthesis" as an aid to quickly throw small metallic objects, and it wouldn't be allowed either.
Data of non-european users who just click the "delete" button in their user profile? Completely different beast.
I've never been convinced that my data will be deleted from any long-term backups. There's nothing preventing them from periodically restoring data from a previous backup without doing any due diligence to ensure hard-deleted data stays deleted.
Who in the EU is actually going in and auditing hard deletes? If you log in and can no longer see the data because a soft-delete flag prevents it from being displayed, and/or any "give me a report of the data you have on me" report comes back empty because of that flag, how does anyone prove their data was only soft deleted?
[1] https://www.ycombinator.com/legal/#tou
> IS THE PURPOSE OF COMMENTS ON THIS WEBSITE TO TRAIN SOME COMMERCIAL MODEL?
You’ve cracked it. Back in 2007 PG implemented the plot to gather THE smartest on one website with intention to execute order 66 and create genius LLM to earn gorriliard of dolares. Now that you’ve figured out the plot, you have a chance to stop it all, because with you it will all fall apart.
> this might affect my involvement here going forward
Please, keep everyone posted!
It used to be about helping strangers in some small way. Now it's helping people I don't like more than people I do like.
Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic model to use in sentence-transformers when vector stores were in their infancy, but it's old: it does not incorporate the newest advances in architectures and training data pipelines, and it has a low context length of 512 tokens when newer embedding models handle 2k+ with more efficient tokenizers.
For open-weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead which has incredible benchmarks and a 2k context window: although it's larger/slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
Open weights, multilingual, 32k context.
I have ~50 million sentences from English Project Gutenberg novels embedded with this.
Back in June and August I wrote some LLM-assisted blog posts about a few of the experiments.
They are here: sjsteiner.substack.com
Are there any solid models that can be downloaded client-side in less than 100MB?
https://huggingface.co/MongoDB/mdbr-leaf-ir
Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.
(Just took a look and it has the problem that it forbids certain "restricted uses" that are listed in another document which it says it "is hereby incorporated by reference into this Agreement" - in other words Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)
I have trouble navigating this space as there's so much choice, and I don't know exactly how to "benchmark" an embedding model for my use cases.
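One common way to benchmark an embedding model for your own use case is a small labelled retrieval set: queries paired with the document each should retrieve, scored by recall@k. A minimal sketch (the corpus, the `bow` stand-in embedder, and the helper names are all hypothetical; in practice you would swap in a wrapper around each candidate model):

```python
import numpy as np

def recall_at_k(embed, queries, corpus, relevant, k=1):
    """Fraction of queries whose known-relevant document lands in the
    top-k cosine neighbours. `embed` is any text -> vector function."""
    docs = np.stack([embed(d) for d in corpus])
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    hits = 0
    for query, rel in zip(queries, relevant):
        qv = embed(query)
        qv = qv / np.linalg.norm(qv)
        top = np.argsort(docs @ qv)[::-1][:k]
        hits += int(corpus.index(rel) in top)
    return hits / len(queries)

# Demo with a trivial bag-of-words embedder as a stand-in for a real model.
corpus = ["vector search with embeddings",
          "gpu memory bandwidth limits",
          "sourdough bread starter recipe"]
vocab = {w: i for i, w in enumerate(sorted({t for d in corpus for t in d.split()}))}

def bow(text):
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

print(recall_at_k(bow, ["vector search"], corpus, [corpus[0]], k=1))  # 1.0
```

A few dozen hand-labelled query/document pairs from your own data, run through each candidate model, usually separates them more reliably than public leaderboard numbers.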
AI: The problem with your question is that...
"Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data. You can find it here: https://huggingface.co/datasets/OpenPipe/hacker-news and, honestly, I'm not really sure how this was obtained, if using scraping or if HN makes this data public in some way."
For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp-style quants to it (e.g. Q4_K, MXFP4, etc.). An example of this is below.
https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
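A rough sketch of what a llama.cpp-style 4-bit block quant does to an embedding vector (simplified: the real GGUF formats pack two values per byte with specific bit layouts and variants like Q4_K; the function names here are hypothetical):

```python
import numpy as np

def quantize_q4_blocks(vec, block=32):
    """Symmetric 4-bit block quantization: each block of 32 values shares
    one scale, and values are rounded to integers in [-8, 7]."""
    vec = vec.astype(np.float32)
    pad = (-len(vec)) % block
    v = np.pad(vec, (0, pad)).reshape(-1, block)
    scales = np.abs(v).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    q = np.clip(np.round(v / scales), -8, 7).astype(np.int8)
    return q, scales.squeeze(1), len(vec)

def dequantize(q, scales, n):
    """Recover an approximate float vector from quants and block scales."""
    return (q * scales[:, None]).reshape(-1)[:n].astype(np.float32)

rng = np.random.default_rng(0)
v = rng.standard_normal(384).astype(np.float32)    # e.g. a MiniLM-sized vector
q, s, n = quantize_q4_blocks(v)
v2 = dequantize(q, s, n)
print(np.abs(v - v2).max())  # small per-value error from 4-bit rounding
```

Each value needs only 4 bits plus one scale per 32-value block, so packed storage runs roughly 8x smaller than float32, usually with little effect on cosine-similarity rankings.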