Unlocking a Million Times More Data for AI
Posted 3 months ago · Active 3 months ago
ifp.org · Tech · Story
Heated · Negative
Debate
85/100
Key topics
AI Data
Privacy
Data Security
The article proposes unlocking private data for AI training, sparking controversy and concerns about privacy, data security, and the value of such data.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 45m after posting
Peak period: 49 comments (0-3h)
Avg / period: 14 comments
Comment distribution: 56 data points (based on 56 loaded comments; chart omitted)
Key moments
1. Story posted: Sep 24, 2025 at 2:54 PM EDT (3 months ago)
2. First comment: Sep 24, 2025 at 3:39 PM EDT (45m after posting)
3. Peak activity: 49 comments in 0-3h (hottest window of the conversation)
4. Latest activity: Sep 26, 2025 at 4:51 AM EDT (3 months ago)
ID: 45364514 · Type: story · Last synced: 11/20/2025, 5:45:28 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
> What makes this vast private data uniquely valuable is its quality and real-world grounding. This data includes electronic health records, financial transactions, industrial sensor readings, proprietary research data, customer/population databases, supply chain information, and other structured, verified datasets that organizations use for operational decisions and to gain competitive advantages. Unlike web-scraped data, these datasets are continuously validated for accuracy because organizations depend on them, creating natural quality controls that make even a small fraction of this massive pool extraordinarily valuable for specialized AI applications.
Will there be a data exchange where one can buy and sell data, or even commodity-data markets where one can hedge/speculate on futures?
Asking for a friend.
Maybe that will change over time. But to hear OpenAI and Anthropic tell it, paying for training data will be the death knell of the industry[1].
1 - there are a number of statements to this effect on the record, for example: https://www.theguardian.com/technology/2024/jan/08/ai-tools-...
> Despite what their name might suggest, so-called “large language models” (LLMs) are trained on relatively small datasets.[1][2][3] For starters, all the aforementioned measurements are described in terms of terabytes (TBs), which is not typically a unit of measurement one uses when referring to “big data.” Big data is measured in petabytes (1,000 times larger than a terabyte), exabytes (1,000,000 times larger), and sometimes zettabytes (1,000,000,000 times larger).
How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? Its value is, in fact, negative. You don't want to train the LLM on that data: there's only so much room in those neurons, and we don't need it consumed by trying to predict temperature time series.
We don't need "more data", we need "more data of the specific types we're training on". That is not so readily available.
Although it doesn't really matter anyhow. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme, any more than the world created the Semantic Web by labeling its data with rich metadata and cross-linking. Especially since it would be of negative value to AI training anyway.
But to your point, a crucial question in AI right now is: how much quality data is still out there?
As far as the impracticality, it's a great point. I disagree and have spent about 10 years working in the area. But that can be a post for another day. I understand and appreciate the skepticism.
Why? Intelligence and compression might just be two sides of the same coin, and given that, I'd actually be very surprised if a future ASI couldn't make do with a fraction of that.
Just because current LLMs need tons of data doesn't mean that this is somehow an inherent requirement. Biological lifeforms seem able to train/develop general intelligence from much, much less.
"Biological lifeforms seem to be able to train/develop general intelligence from much, much less."
This statement is hard to defend. The brain takes in roughly 125 MB/second over a roughly 80-year life, which works out to 300+ petabytes of sensory input per lifetime.
But that's not the real kicker. It's pretty unfair to say that humans learn everything they know from birth -> death. A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.
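For what it's worth, the arithmetic here does check out under its own premise; the contested part is the 125 MB/s sensory-bandwidth figure, not the multiplication. A quick sanity check:

```python
# Sanity check of the parent's estimate. The 125 MB/s figure is the
# commenter's assumption; the arithmetic below simply follows it.
SECONDS_PER_YEAR = 365.25 * 24 * 3600          # ~3.16e7 s
lifetime_seconds = 80 * SECONDS_PER_YEAR       # ~2.52e9 s
intake_bytes = 125e6 * lifetime_seconds        # 125 MB/s for 80 years
print(f"{intake_bytes / 1e15:.0f} PB")         # ~316 PB, i.e. "300+ petabytes"
```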
That also seems several orders of magnitude off. Would you expect a human who only experiences life through H.264-compressing glasses, MP3-recompressing headphones, etc. to fail to develop a coherent world model?
What about a human only experiencing a high fidelity 3D rendering of the world based on an accurate physics simulation?
The claim that humans need petabytes of data to develop their mind seems completely indefensible to me.
> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.
Isn't that like saying that you only need the right data? In which case I'd completely agree :)
And yet every human you know is using petabytes of data to develop their mind. :)
I don't think a bigger hand wave has ever been made. Homomorphic encryption increases computational load severalfold. And I'm not aware of anyone using this (very interesting) technology for much of anything, let alone GPU ML algorithms.
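For a sense of what homomorphic computation looks like in practice, here is a toy sketch using the `phe` Paillier library (chosen for illustration only; Paillier supports just addition on ciphertexts, which is far simpler than the fully homomorphic encryption that ML training would require, and it already carries a heavy cost):

```python
# Toy illustration of homomorphic computation with the Paillier scheme
# (via the `phe` library). Additively homomorphic only -- already much
# slower than plaintext arithmetic, and far simpler than what GPU-scale
# ML training would need.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

readings = [21.5, 22.1, 20.8]                       # e.g. private sensor values
encrypted = [public_key.encrypt(r) for r in readings]

# A server can sum ciphertexts without ever seeing the plaintexts...
encrypted_sum = sum(encrypted[1:], encrypted[0])

# ...but only the key holder can decrypt the result.
print(private_key.decrypt(encrypted_sum))           # ~64.4
```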
I have a better idea: let's just cut the middlemen out and send every bit of data every computer generates to OpenAI. Sorry, to be fair, they want this to be a government-led operation... I'm sure that'll be fine too.
I am going to make a blank model, train it homomorphically to predict someone's name based on their butt cancer status, then prompt it to generate a list of the names of people who have butt cancer, and blackmail them by threatening to send it to their employers.
Pay them.
Otherwise, why on Earth should I care about "contributing to AI"? It's just another commercial venture that is trying to get something of high value for no money. A protocol that doesn't involve royalty payments is a non-starter.
This is a bold assumption. After Enron (financial transactions), Lehman Brothers (customer/population databases, financial transactions), Theranos (electronic health records), Nikola (proprietary research data), Juicero (I don't even know what this is), WeWork (umm ... everything), FTX (everything, and we know they didn't mind lying to themselves), I'm pretty sure we can all say for certain that "real world grounding" isn't a guarantee with regard to anything where money or ego is involved.
Not to mention that at this point we're actively dealing with processes being run (improperly) by AI (see the lawsuits against Cigna and United Health Care [1]), leading to self-training loops without revealing the "self" aspect of it.
[1]: https://www.afslaw.com/perspectives/health-care-counsel-blog...
FTFY
If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TB of data scraped from the internet to 200 TB of data scraped from the internet... does it tell you much more? Unclear.
But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.
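As an illustration of why copies add nothing, here is a minimal sketch of the exact-duplicate filter scraping pipelines typically apply before training (real pipelines also do near-duplicate detection, e.g. MinHash; everything here is illustrative):

```python
# Minimal exact-duplicate filter: a second copy of a document adds
# bytes to the corpus, but no new information to train on.
import hashlib

def dedup(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["the cat sat", "the cat sat", "on the mat"]
print(dedup(corpus))   # ['the cat sat', 'on the mat'] -- more bytes, same data
```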
I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone" version) of the original bitter lesson article, which says only that general, scalable algorithms do better than knowledge-carrying, problem-specific algorithms. And the last I heard, it was the "scaling hypothesis" that hardly had consensus among those in the field.
If any more scaling does happen, it will happen in the mid-training (using agentic/reasoning outputs from previous model versions) and RL training stages.
Recent progress on useful LLMs seems to involve slimming them down.[1] Does your customer-support LLM really need a petabyte of training data? Yes, now it can discuss everything from Kant to the latest Taylor Swift concert lineup. It probably just needs enough of that to make small talk, plus comprehensive data on your own products.
The future of business LLMs probably fits in a 1U server.
[1] https://mljourney.com/top-10-smallest-llm-to-run-locally/
All the top models are moving towards synthetic data - not because they want more data but because they want quality data that is structured to train useful capabilities.
Having zettabytes of "invisible" data is effectively pointless. You can't train on it because there is so much of it; it's way more expensive to train per byte because of homomorphic magic (if that's even possible); and most importantly, it's not quality training data!
The claim is that there is one million times more data to feed to LLMs, citing a few articles. Those articles estimate that there are 180-200 zettabytes (the number mentioned in TFA) of data total in the world, including all cloud services, all personal computers, etc. The vast majority of that data is not useful for training LLMs at all: it will be movies, games, databases. There is a massive amount of duplication in that data. Only a tiny fraction will be useful.
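To put the scale gap in perspective, a rough calculation with illustrative numbers (the ~50 TB figure for a curated LLM text corpus is an assumption for the sake of comparison, not a measurement):

```python
# Rough scale comparison; both numbers are illustrative.
world_data_bytes = 200e21      # ~200 ZB, the estimate cited in TFA
llm_corpus_bytes = 50e12       # assume ~50 TB of curated training text
print(f"ratio: {world_data_bytes / llm_corpus_bytes:.0e}")   # 4e+09
# Even at a 0.01% useful fraction, that's ~20 EB -- the bottleneck is
# usefulness and duplication, not raw volume.
```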
> Think of today’s AI like a giant blender where, once you put your data in, it gets mixed with everyone else’s, and you lose all control over it. This is why hospitals, banks, and research institutions often refuse to share their valuable data with AI companies, even when that data could advance critical AI capabilities.
This is not the reason; the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim. You cannot hand over the medical or bank records of real people: that would put them at very real risk.
Not to mention that a lot of it will be well-structured, yes, but completely useless for LLM training. You will not get any improvement in the perceived "intellect" of a model by overfitting it on terabytes of bank transaction tables.
(OP) You make great points. I think we're actually more in agreement than might be obvious. Part of the reason you need to "give" data to an LLM is because of the way LLMs are constructed... which creates the privacy risk.
The principle of attribution-based control suggested in this article would break that requirement, enabling each data owner to control which AI predictions they make more intelligent (as opposed to only controlling which AI models they help train).
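To make "attribution-based control" concrete, here is a hypothetical sketch of what an inference-time permission gate could look like. Every name and structure below is invented for illustration; the article does not specify an implementation:

```python
# Hypothetical sketch: each owner's data contributes to a prediction only
# if that owner's policy allows the stated purpose. All names are invented.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataSource:
    owner: str
    allows: Callable[[str], bool]        # does the owner permit this purpose?
    contribute: Callable[[str], float]   # owner's signal for a given query

def predict(query: str, purpose: str, sources: list[DataSource]) -> float:
    # Only sources whose owners permit this purpose influence the output.
    allowed = [s for s in sources if s.allows(purpose)]
    if not allowed:
        raise PermissionError(f"no data owner permits purpose {purpose!r}")
    return sum(s.contribute(query) for s in allowed) / len(allowed)

hospital = DataSource("hospital",
                      allows=lambda p: p == "diagnosis",
                      contribute=lambda q: 0.9)
bank = DataSource("bank",
                  allows=lambda p: p == "fraud-detection",
                  contribute=lambda q: 0.2)

# The hospital's data improves diagnostic predictions and nothing else.
print(predict("case 17", "diagnosis", [hospital, bank]))   # 0.9
```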
So to your point... this is a very rigorous privacy protection. Another way to TLDR the article is "if we get really good at privacy... there's a LOT more data out there... so let's start really caring about privacy"
Anyway... I agree with everything in your comment. Just thought I'd drop by and try to lend clarity to how the article agrees with you (sounds like there's room for improvement on how to describe attribution-based control though).