Anthropic Agrees to Pay $1.5B to Settle Lawsuit with Book Authors
Posted 4 months ago · Active 4 months ago
nytimes.com · Tech · Story · High profile
heated · mixed · Debate · 85/100
Key topics: Artificial Intelligence · Copyright · Lawsuit
Anthropic agrees to pay $1.5B to settle a lawsuit with book authors over alleged copyright infringement, sparking debate about the implications for AI companies and copyright law.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 496s before posting
Peak period: 112 comments in 0-6h
Avg per period: 16
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
1. Story posted: Sep 5, 2025 at 3:52 PM EDT (4 months ago)
2. First comment: Sep 5, 2025 at 3:44 PM EDT (496s before posting)
3. Peak activity: 112 comments in 0-6h (hottest window of the conversation)
4. Latest activity: Sep 9, 2025 at 9:04 AM EDT (4 months ago)
ID: 45142885 · Type: story · Last synced: 11/26/2025, 1:00:33 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
It’s not precedent setting but surely it’ll have an impact.
https://www.tomshardware.com/tech-industry/artificial-intell...
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
It has been admitted, and Anthropic knew that this trial could have bankrupted them outright had they maintained their innocence and continued to fight the case.
But of course, there's too much money on the line. Even though settling means admitting guilt and conceding that they profited off pirated books, Anthropic knew there was no way they could win that case, and the risk wasn't worth taking.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
It's important in the fair-use assessment to understand that the training itself is fair use; the pirating of the books is the issue at hand here, and is what Anthropic "whoopsied" into when acquiring the training data.
Buying used copies of books, scanning them, and training on it is fine.
Rainbows End was prescient in many ways.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
It is 500,000 books in total, so did they really scan all of them instead of using the pirated versions, even when they did not have much money in the early phases of the model race?
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
Reminds me of Permutation City.
“Marooned in Real Time” remains my fav.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
I'm surprised Google hasn't hit its competitors harder with the fact that it actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3, but I guess they all have an aligned incentive to benefit from a legal framework that doesn't look too closely at how source material was collected.
https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...
I'm pretty sure that's just a frontend for Uptime Kuma https://github.com/louislam/uptime-kuma
The whole incident is written up in detail at https://swartz-report.mit.edu/ by Hal Abelson (who co-wrote SICP, among other things). It is a well-researched document.
The report speculates about his motivations on page 31, but they seem not to be known with any certainty.
But prior to that he had also written the Guerilla Open Access Manifesto, so it wasn't great optics to be caught doing that.
Information may want to be free, but sometimes it takes a revolutionary to liberate it.
This is to teach a lesson because you cannot prosecute all thieves.
The Yale Law Journal actually writes about this: the goal is to deter crime, because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
Expected cost = (probability of not getting away with it) 0.01 * (cost if caught) 1000x = 10x the gain: not worth it.
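A minimal sketch of that deterrence arithmetic in Python (the 1% catch probability and 1000x penalty are the illustrative numbers from the comment above, not figures from this case):

    def expected_cost(p_caught: float, penalty_multiple: float) -> float:
        """Expected cost of infringement, as a multiple of the gain."""
        return p_caught * penalty_multiple

    # A 1% chance of being caught paired with a 1000x penalty still
    # yields a 10x expected cost, so a rational actor is deterred
    # even though enforcement is rare.
    print(expected_cost(0.01, 1000))  # -> 10.0: not worth it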
It's crazy to imagine, but there was surely a document or Slack thread discussing where to get thousands of books, and they just decided to pirate them, and that was OK. This was entirely a decision based on ease or cost, not on the assumption that it was legal. Piracy can result in jail time, IIRC, so honestly it's lucky that the employee who suggested this, or took the action, avoided direct legal liability.
Oh, and I'm pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that settling below the full legal limit would limit future revenue.
Well actively generating revenue at least.
Profits are still hard to come by.
It's not the same as debt from a loan, because people are buying a percentage stake in the company. If the value of the company happens to go to zero there's nothing left to pay.
But yeah, the amount of investment a company attracts should have something to do with the perception that it'll operate at a profit at some point.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k / plaintiff) for the lead plaintiffs, also subject to the court's approval.
If anything it's too little based on precedent.
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare, or is it factual information about Shakespeare? It is the latter, and as much as organizations like MLB might want to be able to copyright a fact, you simply cannot do that.
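A toy sketch of that reduction in Python (the whitespace tokenizer and hash-derived vectors are stand-ins, not any real model's pipeline):

    import hashlib

    def tokenize(text: str) -> list[str]:
        # Stand-in tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def embed(token: str, dims: int = 4) -> list[float]:
        # Stand-in "embedding": derive a few numbers from a hash of the token.
        digest = hashlib.sha256(token.encode()).digest()
        return [b / 255 for b in digest[:dims]]

    line = "To be, or not to be, that is the question"
    vectors = [embed(tok) for tok in tokenize(line)]
    print(vectors[0])  # numbers describing the text, not the text itself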
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that the model can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPLv3 or something like the Elasticsearch license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model Example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...
The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long-standing, sticky legal meaning (as opposed to things that were "property").
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use and their exploitation of authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors as they (legally) can without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.
Note that the law specifically regulates software differently, so what you cannot do is just willy nilly pirate games and software.
What distribution means in this case is defined in Swiss law. However, Swiss law as a whole is in some ways vague, leaving a lot up to interpretation by the judiciary.
To rephrase the question:
Is a PDF of the complete works of Shakespeare Shakespeare, or is it factual information about Shakespeare?
Reencoding human-readable information into a form that's difficult for humans to read without machine assistance is nothing new.
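For instance, in Python (gzip plus base64 chosen purely as a familiar example of machine-oriented re-encoding):

    import base64, gzip

    line = b"To be, or not to be, that is the question"
    encoded = base64.b64encode(gzip.compress(line))
    print(encoded)  # opaque to a human reader
    print(gzip.decompress(base64.b64decode(encoded)))  # round-trips losslessly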
I wasn't talking about distribution, and neither was the person whom I was replying to. But, thanks for wasting your time on publishing the rest of your comment, I guess.
So to me, if you are doing literally any human review, editing, or control over the AI, then I think you'll retain copyright. There may be a risk that if somebody can show they could produce exactly the same thing from a generic prompt with no interaction, then you may be in trouble, but let's face it: should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.
What is in a library, you can freely read. Find the most appropriate way. You do not need to have bought the book.
¹(Edit: or /may/ not be allowed, see posts below.)
They didn't think it would be a good idea to re-bind them and distribute them to libraries or someone in need.
As for needy people, they already have libraries and an endless stream of books being donated to thrift stores. Nothing of value was lost here.
But then they shouldn't have done that in the first place. It seems like a crime to destroy so many books.
Imagine, 10 more companies come to join the AI race and decide to do the same.
That being said, I’m sure these companies did not exclusively buy books at the end of their life.
They did not destroy old, valuable books which individually were worth millions.
https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
Every human has the right to read those books.
And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.
Additionally:
> Every human has the right to read those books.
Since when?
I strongly disagree - knowledge should be free.
I don't think the author's arrangement of the words should be free to reproduce (ie, I think some degree of copyright protection is ethical) but if I want to use a tool to help me understand the knowledge in a book then I should be able to.
[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Since in our legal system, only humans and groups of humans (the corporation is a convenient legal proxy for a group of humans that have entered into an agreement) have rights.
Property doesn't have rights. Land doesn't have rights. Books don't have rights. My computer doesn't have rights. And neither does an LLM.
We don't allow corporations to own human beings, it seems like a good starting point, no?
It does not matter that your screwdriver does not have rights: you will be using it for the purpose consistent with the principle of your freedom and encouragement to fix your cabling. You are not required to "hand-screw them drives".
In context, for example, you can take notes. That has nothing to do with the "rights of the paper".
Nothing forbids an automated reader by principle - especially when the automated reader is an intermediate tool for human operation.
If you use the commons to create your model, perhaps you should be obligated to distribute the model for free (or I guess for the cost of distribution) too.
A vacuum removes what it sucks in. The commons are still as available as they ever were, and the AI gives one more avenue of access.
That is false. As a direct consequence of LLMs:
1. The web is increasingly closed to automated scraping, and more marginally to people as well. Owners of websites like reddit now have a stronger incentive to close off their APIs and sell access.
2. The web is being inundated with unverified LLM output which poisons the well
3. More profoundly, increasingly basing our production on LLM outputs and making the human merely "in the loop" rather than the driver, and sometimes eschewing even the human in the loop, leads to new commons that are less adapted to the evolutions of our world, less original and of lesser quality
By this logic one shouldn't be able to research for a newspaper article at a library.
They'll either go out of business or make better models paid while providing only weaker models for free despite both being trained on the same data.
I presume you (as people do) have exploited the knowledge that society has made freely accessible, in principle and largely in practice, to build a profession that is now for-profit: you will charge parties for the skills that available knowledge has given you.
The "profit" part is not the problem.
As soon as OpenAI open-sources their model's source code, I'll agree.
(The "for sale" side does not limit the purpose to sales only, before somebody wants to attack that.)
Knowledge costs money to gain/research.
Are you saying people who do the most valuable work of pushing the boundaries of human knowledge should not be fairly compensated for their work?
An LLM isn't an index.
I think it is obvious instead that readers employed by humans fit the principle.
> rights
Societally, it is more of a duty. Knowledge is made available because we must harness it.
Also, at least so far, we don't call computers "someone".
Probably so, because with "library" I did not mean the "building". It is the decision of the society to make knowledge available.
> we don't call computers "someone"
We do instead, for this purpose. Why should we not. Anything that can read fits the set.
--
Edit: Come up with the arguments, sniper.
There is an asymmetry between agreement and disagreement: the latter requires arguments.
"Sneering and leaving" is antisocial, and that is what underlies most downvoting.
Stop this deficient, unproductive and disruptive culture.
Why just that one purpose? Let's pay them a fair wage, deduct income tax and social security, enforce reasonable working hours and conditions etc.
577 more comments available on Hacker News