Anthropic Agrees to Pay $1.5B to Settle Lawsuit with Book Authors
Posted 4 months ago · Active 4 months ago
nytimes.com · Tech · Story · High profile
heated · mixed · Debate · 85/100
Key topics: Artificial Intelligence · Copyright · Lawsuit
Anthropic agrees to pay $1.5B to settle a lawsuit with book authors over alleged copyright infringement, sparking debate about the implications for AI companies and copyright law.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 496s before posting
Peak period: 112 comments in 0-6h
Avg per period: 16
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
1. Story posted: Sep 5, 2025 at 3:52 PM EDT (4 months ago)
2. First comment: Sep 5, 2025 at 3:44 PM EDT (496s before posting)
3. Peak activity: 112 comments in 0-6h (hottest window of the conversation)
4. Latest activity: Sep 9, 2025 at 9:04 AM EDT (4 months ago)
ID: 45142885 · Type: story · Last synced: 11/26/2025, 1:00:33 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
It’s not precedent setting but surely it’ll have an impact.
https://www.tomshardware.com/tech-industry/artificial-intell...
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
It has been admitted, and Anthropic knew that this trial could have bankrupted them outright had they maintained their innocence and continued to fight the case.
But of course, there's too much money on the line. Even though settling means admitting guilt and conceding that they profited off pirated books, Anthropic knew there was no way they could win that case, and the risk wasn't worth taking.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
It's important in the fair-use assessment to understand that the training itself is fair use; the pirating of the books is the issue at hand here, and is what Anthropic "whoopsied" into when acquiring the training data.
Buying used copies of books, scanning them, and training on it is fine.
Rainbows End was prescient in many ways.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
It is 500,000 books in total, so did they really scan all of them instead of using the pirated versions, even when they did not have much money in the early phases of the model race?
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
Reminds me of Permutation City.
“Marooned in Real Time” remains my fav.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
I'm surprised Google hasn't hit its competitors harder with the fact that it actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3, but I guess they all have an aligned incentive to benefit from a legal framework that doesn't look too closely at how source material was collected.
https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...
I'm pretty sure that's just a frontend for Uptime Kuma https://github.com/louislam/uptime-kuma
The whole incident is written up in detail at https://swartz-report.mit.edu/ by Hal Abelson (who co-wrote SICP, among other things). It is a well-researched document.
The report speculates about his motivations on page 31, but they seem not to be known with any certainty.
But prior to that he had also written the Guerilla Open Access Manifesto, so it wasn't great optics to be caught doing that.
Information may want to be free, but sometimes it takes a revolutionary to liberate it.
This is to teach a lesson because you cannot prosecute all thieves.
The Yale Law Journal actually writes about this: the goal is to deter crime, because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
Expected cost = (probability of not getting away with it) 0.01 * (cost if caught) 1000x = 10x the gain: not worth it.
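A minimal sketch of that deterrence arithmetic in Python (the 1% catch probability and 1000x penalty are the illustrative numbers from the comment above, not figures from this case):

    def expected_cost(p_caught: float, penalty_multiple: float) -> float:
        """Expected cost of infringement, as a multiple of the gain."""
        return p_caught * penalty_multiple

    # A 1% chance of being caught paired with a 1000x penalty still
    # yields a 10x expected cost, so a rational actor is deterred
    # even though enforcement is rare.
    print(expected_cost(0.01, 1000))  # -> 10.0: not worth it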
It's crazy to imagine, but there was surely a document or Slack thread discussing where to get thousands of books, and they just decided to pirate them, and that was OK. This was entirely a decision based on ease or cost, not on the assumption that it was legal. Piracy can result in jail time, IIRC, so honestly it's lucky that the employee who suggested this, or took the action, avoided direct legal liability.
Oh, and I'm pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that settling below the full legal limit would limit future revenue.
Well actively generating revenue at least.
Profits are still hard to come by.
It's not the same as debt from a loan, because people are buying a percentage stake in the company. If the value of the company happens to go to zero there's nothing left to pay.
But yeah, the amount of investment a company attracts should have something to do with the perception that it'll operate at a profit at some point.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k / plaintiff) for the lead plaintiffs, also subject to the court's approval.
If anything it's too little based on precedent.
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare, or is it factual information about Shakespeare? It is the latter, and as much as organizations like MLB might want to be able to copyright a fact, you simply cannot do that.
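A toy sketch of that reduction in Python (the whitespace tokenizer and hash-derived vectors are stand-ins, not any real model's pipeline):

    import hashlib

    def tokenize(text: str) -> list[str]:
        # Stand-in tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def embed(token: str, dims: int = 4) -> list[float]:
        # Stand-in "embedding": derive a few numbers from a hash of the token.
        digest = hashlib.sha256(token.encode()).digest()
        return [b / 255 for b in digest[:dims]]

    line = "To be, or not to be, that is the question"
    vectors = [embed(tok) for tok in tokenize(line)]
    print(vectors[0])  # numbers describing the text, not the text itself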
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that the model can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPLv3 or something like the Elasticsearch license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model Example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...
The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long-standing, sticky legal meaning (as opposed to things that were "property").
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use and their exploitation of authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors as they (legally) can without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.
Note that the law specifically regulates software differently, so what you cannot do is just willy nilly pirate games and software.
What distribution means in this case is defined in Swiss law. However, Swiss law as a whole is in some ways vague, leaving a lot up to interpretation by the judiciary.
To rephrase the question:
Is a PDF of the complete works of Shakespeare Shakespeare, or is it factual information about Shakespeare?
Reencoding human-readable information into a form that's difficult for humans to read without machine assistance is nothing new.
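For instance, in Python (gzip plus base64 chosen purely as a familiar example of machine-oriented re-encoding):

    import base64, gzip

    line = b"To be, or not to be, that is the question"
    encoded = base64.b64encode(gzip.compress(line))
    print(encoded)  # opaque to a human reader
    print(gzip.decompress(base64.b64decode(encoded)))  # round-trips losslessly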
I wasn't talking about distribution, and neither was the person whom I was replying to. But, thanks for wasting your time on publishing the rest of your comment, I guess.
So to me, if you are doing literally any human review, editing, or control over the AI, then I think you'll retain copyright. There may be a risk that if somebody can show they could produce exactly the same thing from a generic prompt with no interaction, then you may be in trouble, but let's face it: should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.
What is in a library, you can freely read. Find the most appropriate way. You do not need to have bought the book.
¹(Edit: or /may/ not be allowed, see posts below.)
They didn't think it would be a good idea to re-bind them and distribute them to libraries or someone in need.
As for needy people, they already have libraries and an endless stream of books being donated to thrift stores. Nothing of value was lost here.
But then they shouldn't have done that in the first place. It seems like a crime to destroy so many books.
Imagine, 10 more companies come to join the AI race and decide to do the same.
That being said, I’m sure these companies did not exclusively buy books at the end of their life.
They did not destroy old, valuable books which individually were worth millions.
https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
Every human has the right to read those books.
And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.
Additionally:
> Every human has the right to read those books.
Since when?
I strongly disagree - knowledge should be free.
I don't think the author's arrangement of the words should be free to reproduce (ie, I think some degree of copyright protection is ethical) but if I want to use a tool to help me understand the knowledge in a book then I should be able to.
[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Since in our legal system, only humans and groups of humans (the corporation is a convenient legal proxy for a group of humans that have entered into an agreement) have rights.
Property doesn't have rights. Land doesn't have rights. Books don't have rights. My computer doesn't have rights. And neither does an LLM.
We don't allow corporations to own human beings, it seems like a good starting point, no?
It does not matter that your screwdriver does not have rights: you will be using it for the purpose consistent with the principle of your freedom and encouragement to fix your cabling. You are not required to "hand-screw them drives".
In context, for example, you can take notes. That has nothing to do with the "rights of the paper".
Nothing forbids an automated reader by principle - especially when the automated reader is an intermediate tool for human operation.
If you use the commons to create your model, perhaps you should be obligated to distribute the model for free (or I guess for the cost of distribution) too.
A vacuum removes what it sucks in. The commons are still as available as they ever were, and the AI gives one more avenue of access.
That is false. As a direct consequence of LLMs:
1. The web is increasingly closed to automated scraping, and more marginally to people as well. Owners of websites like reddit now have a stronger incentive to close off their APIs and sell access.
2. The web is being inundated with unverified LLM output which poisons the well
3. More profoundly, increasingly basing our production on LLM outputs and making the human merely "in the loop" rather than the driver, and sometimes eschewing even the human in the loop, leads to new commons that are less adapted to the evolutions of our world, less original and of lesser quality
By this logic one shouldn't be able to research for a newspaper article at a library.
They'll either go out of business or make better models paid while providing only weaker models for free despite both being trained on the same data.
I presume you (as people do) have exploited the knowledge that society has made freely accessible, in principle and largely in practice, to build a profession that is now for-profit: you will charge parties for the skills that available knowledge has given you.
The "profit" part is not the problem.
As soon as OpenAI open-sources their model's source code, I'll agree.
(The "for sale" side does not limit the purpose to sales only, before somebody wants to attack that.)
Knowledge costs money to gain/research.
Are you saying people who do the most valuable work of pushing the boundaries of human knowledge should not be fairly compensated for their work?
An LLM isn't an index.
I think it is obvious instead that readers employed by humans fit the principle.
> rights
Societally, it is more of a duty. Knowledge is made available because we must harness it.
Also, at least so far, we don't call computers "someone".
Probably so, because with "library" I did not mean the "building". It is the decision of the society to make knowledge available.
> we don't call computers "someone"
We do instead, for this purpose. Why should we not. Anything that can read fits the set.
--
Edit: Come up with the arguments, sniper.
There is an asymmetry between agreement and disagreement: the latter requires arguments.
"Sneering and leaving" is antisocial, and that is what underlies most downvoting.
Stop this deficient, unproductive and disruptive culture.
Why just that one purpose? Let's pay them a fair wage, deduct income tax and social security, enforce reasonable working hours and conditions etc.
577 more comments available on Hacker News