Gentoo AI Policy
Posted 4 months ago · Active 4 months ago
Source: wiki.gentoo.org · Tech story · High profile
Tone: heated, mixed · Debate: 80/100
Key topics: Gentoo, AI Policy, Open Source, LLMs
Gentoo's AI policy bans AI-generated contributions, sparking debate among HN users about the implications and enforceability of such a policy.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 2h after posting · Peak period: 65 comments in 0-6h · Avg per period: 14.5
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
- 01 Story posted: Sep 14, 2025 at 7:20 PM EDT (4 months ago)
- 02 First comment: Sep 14, 2025 at 9:35 PM EDT (2h after posting)
- 03 Peak activity: 65 comments in 0-6h (hottest window of the conversation)
- 04 Latest activity: Sep 18, 2025 at 2:58 PM EDT (4 months ago)
HN item ID: 45244295 · Type: story · Last synced: 11/20/2025, 7:40:50 PM
People say this every month.
The policy is dated 2024-04-14. After it was approved came a string of releases that were all dramatic advances for coding: 3.5 Sonnet (for taste + agentic coding), o1-preview (for reasoning), Claude Code (for developer experience), o3 (for debugging), Claude 4 Opus (for reliability), and now GPT-5 Pro (for code review).
In the last year alone, we have gone from AI that could unreliably help you look up documentation for tools like matplotlib to AI tools that can write and review large, complex programs. Sure, these tools still have a lot of deficiencies. But that doesn't negate the fact that the change in AI for coding over the last year has been dramatic.
> Commercial AI projects are frequently indulging in blatant copyright violations to train their models.
I thought we (FOSS) were anti copyright?
> Their operations are causing concerns about the huge use of energy and water.
This is massively overblown. If they'd specifically said that their concerns were around the concentrated impact of energy and water usage on specific communities, fine, but then you'd have to have ethical concerns about a lot of other tech including video streaming; but the overall energy and water usage of AI contributed to by the actual individual use of AI to, for instance, generate a PR, is completely negligible on the scale of tech products.
> The advertising and use of AI models has caused a significant harm to employees and reduction of service quality.
Is this talking about automation? You know what else automated employees and can often reduce service quality? Software.
> LLMs have been empowering all kinds of spam and scam efforts.
So did email.
> I thought we (FOSS) were anti copyright?
No free and open source software (FOSS) distribution model is "anti-copyright." Quite to the contrary, FOSS licenses are well defined[0] and either address copyright directly or rely on copyright being retained by the original author.
0 - https://opensource.org/licenses
https://www.gnu.org/philosophy/copyright-versus-community.en...
No, it's not like email, or a web server. I can run an email server or Apache on my rinky-dink computer and handle hundreds of requests per second.
I can't run ChatGPT; that requires a supercomputer. And with the stuff I can run, like DeepSeek, I'm getting very few tokens/s. Not requests! Tokens!
Yes, inference has an energy cost that is significantly more than other compute tasks.
As to actual numbers, they're not that hard to crunch, but we have a few good sources that have done so for us.
Simple first-principles estimate: https://epoch.ai/gradient-updates/how-much-energy-does-chatg...
Google report: https://arxiv.org/abs/2508.15734
Altman claim inside a blog post: https://blog.samaltman.com/the-gentle-singularity
FOSS still has to exist within the rules of the system the planet operates under. You can't just say "I downloaded that movie, but I'm a Linux user so I don't believe in copyright" and get away with it
>the overall energy and water usage of AI contributed to by the actual individual use of AI to, for instance, generate a PR, is completely negligible on the scale of tech products.
[citation needed]
>Is this talking about automation? You know what else automated employees and can often reduce service quality? Software.
Disingenuous strawman. Tech CEO's and the like have been exuberant at the idea that "AI" will replace human labor. The entire end-goal of companies like OpenAI is to create a "super-intelligence" that will then generate a return. By definition the AI would be performing labor (services) for capital, outcompeting humans to do so. Unless OpenAI wants it to just hack every bank account on Earth and transfer it all to them instead? Or something equally farcical
>So did email.
"We should improve society somewhat"
"Ah, but you participate in society! Curious!"
Sure, here ya go:
https://andymasley.substack.com/p/individual-ai-use-is-not-b...
https://blog.giovanh.com/blog/2024/08/18/is-ai-eating-all-th...
https://blog.giovanh.com/blog/2024/09/09/is-ai-eating-all-th...
The first comprehensive environmental audit and analysis performed in conjunction with French environmental agencies and environmental audit consultants, which includes every stage of the supply chain, including usually hidden upstream costs: https://mistral.ai/news/our-contribution-to-a-global-environ...
https://andymasley.substack.com/p/for-the-climate-little-thi...
> Disingenuous strawman. Tech CEO's and the like have been exuberant at the idea that "AI" will replace human labor. The entire end-goal of companies like OpenAI is to create a "super-intelligence" that will then generate a return. By definition the AI would be performing labor (services) for capital, outcompeting humans to do so
Isn't that literally the selling point of software: performing work that would otherwise have to be done by humans (calculating, researching, locating things, transferring information, and so on), using capital instead of labor, transforming labor into capital, and providing more profits as a result?
> Unless OpenAI wants it to just hack every bank account on Earth and transfer it all to them instead? Or something equally farcical
It's extremely funny that you pull this out of nowhere and say that this is the only way I could be justified in saying what I'm saying, while accusing me of making a disingenuous straw man. Consult the rod in your own eye before you concern yourself with the speck in mine.
> >So did email.
> "We should improve society somewhat"
> "Ah, but you participate in society! Curious!"
Disingenuous strawman. That comic is used to respond to people who claim you can't be against something if you also participate in it out of necessity. That's not what I'm doing. I would be fine with it if they blanket-condemned all things that enable spam on a massive scale, including email, social media, automated phone calls, mail, and so on, while still using those technologies because they have to in order to live in present society and get the word out. There are people who do that with rigorous intellectual consistency and have been since those things existed. My argument is that by condemning one but not the other, irrespective of whether they use them or not, they are being ethically inconsistent; it shows a double standard and a bias towards technologies they're used to over technologies they aren't. It shows a fundamental reactionary conservatism rather than an actually well-thought-through ethical position.
10 GPT prompts take the same energy as a wifi router operating for 30 minutes.
If Gentoo were so concerned for the environment, they would get more mileage from forbidding PRs from people who took a 10-hour flight. These flights, per person, emit as much carbon as a million prompts.
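A rough back-of-the-envelope check of that comparison, using illustrative figures that are assumptions rather than numbers from the thread (roughly 1 tCO2e per passenger for a long-haul flight, on the order of 1 gCO2e per prompt):

```latex
% Illustrative figures only: ~1 tCO2e per passenger for a ~10-hour flight,
% and on the order of 1 gCO2e per GPT-style prompt.
\frac{1\,\mathrm{tCO_2e\ per\ passenger}}{\sim 1\,\mathrm{gCO_2e\ per\ prompt}}
  = \frac{10^{6}\,\mathrm{g}}{\sim 1\,\mathrm{g/prompt}}
  \approx 10^{6}\ \text{prompts}
```

So the "million prompts per long-haul flight" claim is at least the right order of magnitude under those assumptions.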
For Free Software, copyright creates the ability to use licenses (like the GPL) to ensure source code availability.
Absolutely not! Every major FOSS license has copyright as its enforcement method -- "if you don't do X (share code with customers, etc depending on license) you lose the right to copy the code"
The code is of terrible quality and I am at 100+ comments on my latest PR.
That being said, my latest PR is my second-ever to LLVM and is an entire linter check. I am learning far more about compilers at a much faster pace than if I took the "normal route" of tiny bugfixes.
I also try to do review passes on my own code before asking for code review to show I care about quality.
LLMs increase review burden a ton but I would say it can be a fair tradeoff, because I'm learning quicker and can contribute at a level I otherwise couldn't. I feel like I will become a net-positive to the project much earlier than I otherwise would have.
edit: the PR in question. Unfortunately I've been on vacation and haven't touched it recently.
https://github.com/llvm/llvm-project/pull/146970
It's a community's decision whether to accept this tradeoff & I won't submit AI generated code if your project refuses it. I also believe that we can mitigate this tradeoff with strong social norms that a developer is responsible for understanding and explaining their AI-generated code.
If your project has a policy against AI usage I won't submit AI-generated code because I respect your decision.
That's not what the GP meant. Just because a community doesn't disallow something doesn't mean it's the right thing to do.
> I also try to mitigate the code review burden by doing as much review as possible on my end
That's great but...
> & flagging what I don't understand.
It's absurd to me that people should commit code they don't understand. That is the problem. Just because you are allowed to commit AI-generated/assisted code does not mean that you should commit code that you don't understand.
The overhead to others of committing code you don't understand and then asking someone to review it is a lot higher than asking someone for directions first, so you can understand the problem and the code you write.
> If your project has a policy against AI usage I won't submit AI-generated code because I respect your decision.
That's just not the point.
The industry-wide tsunami of tech debt arising from AI detritus[1] will be interesting to watch. Tech leadership is currently drunk on improved productivity metrics (via lines of code or number of PRs), but I bet velocity will slow down, and products will be more brittle due to extraneous AI-generated code, with a lag, so it won't be immediately apparent. Only teams with rigorous reviews will fare well in the long term, but they may be punished in the short term for "not being as productive" as others.
1. From personal observation: when I'm in a hurry, I accept code that does more than is necessary to meet the requirements, or is merely not succinct. Whereas pre-AI, less code would be merged with a "TBD" tacked on.
I also have no idea what the social norms are for AI. I posted the comment after a friend on Discord said I should disclose my use of AI.
The underlying purpose of the PR is ironically because Cline and Copilot keep trying to use `int` when modern C++ coding standards suggest `size_t` (or something similar).
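For illustration only (this is not code from the PR), the kind of pattern such a check would target, where an assistant reaches for `int` even though the quantity being compared is a `size_t`:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical example of the rule's intent, not the actual PR code.
void print_all(const std::vector<int>& values) {
    // Pattern assistants often emit: an `int` counter compared against a
    // size_t, which draws -Wsign-compare and breaks on very large containers.
    for (int i = 0; i < values.size(); ++i) {
        std::printf("%d\n", values[i]);
    }

    // Pattern the coding standard prefers.
    for (std::size_t i = 0; i < values.size(); ++i) {
        std::printf("%d\n", values[i]);
    }
}
```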
On top of all that every open source project has a gray hair problem.
Telling people excited about a new tech to never contribute makes sure that all projects turn into templeOS when the lead maintainer moves on.
I accept that some projects allow this, and if they invite it, I guess I can’t say anything other than “good luck,” but to me it feels like long odds that any one contributor who starts out eager to make others wade through enough code to generate that many comments purely as a one-sided learning exercise will continue to remain invested in this project to the point where I feel glad to have invested in this particular pedagogy.
No you don't. And if you're that entitled to people's time you will simply get no new contributors.
Trying to understand a problem and taking some time to work out a solution proves that you’re actually trying to learn and be helpful, even if you’re green. Using an LLM to generate a nearly-thousand-line PR and yeeting it at the maintainers with a note that says “I don’t really know what this does” feels less hopeful.
I feel like a better use of an LLM would be to use it for guidance on where to look when trying to see how pieces fit together, or maybe get some understanding of what something is doing, and then by one’s own efforts actually construct the solution. Then, even if one only has a partial implementation, it would feel much more reasonable to open a WIP PR and say “is this on the right track?”
And you can't go and turn this around into "but the gate keeping!" You just said that expecting someone to learn and be an asset to a project is entitlement, so by definition someone with this attitude won't stick around.
Lastly, the reason that the resume builder wants the "LLVM contributor" bullet point in the first place is precisely because that normally takes effort. If it becomes known in the industry that getting it simply requires throwing some AI PR over the wall - the value of this signal will quickly diminish.
While I legitimately do find templeOS to be a fascinating project, I don’t think there was anything to learn from it at a computer science level other than “oh look, an opinionated 64-bit operating environment that feels like classical computing and had a couple novel ideas”
I respect that instances like it are demonstrably few and far between, but don’t entertain its legacy far beyond that.
I disagree, actually.
I think that his approach has a lot to teach aspiring architects of impossibly large and complex systems, such as "create a suitable language for your use-case if one does not exist. It need not be a whole new language, just a variation of an existing one that smooths out all the rough edges specific to your complex software".
His approach demonstrated very large gains in an unusually complicated product. I can point to projects written in modern languages that come nowhere close to being as high-velocity as his, because his approach was fine-tuned to the use-case of "high-velocity while including only the bare necessities of safety."
It seems a little mean to tell him to stop coding forever when his intentions and efforts seem pretty positive for the health of the project.
This means that he did not put serious effort into understanding what others do, when, and why in a highly structured project like LLVM. He "wrote" the code and then dumped the "written" code into the community to catch his mistakes.
There are pitfalls everywhere. It’s not so small that you can get everything in your head with only a reading. You need to actually engage with the code via contributions to understand it. 100+ comments is not an exceptional amount for early contributions.
Anyway, LLVM is so complex I doubt you can actually vibe-code anything valuable, so there is probably a lot of actual work in the contribution.
There is a reason the community didn’t send them packing. Onboarding newcomers is hard but it pays off.
So, after reading code, one should write down what made him amazed and find out why it is so - whether it is a custom of a project or a peculiarity of code just read.
I actually have such a list for my work. Do you?
No, it is not. Dozens of comments on a PR is an exceptional amount. Early contributions should be small, so that one can learn the typical customs and mistakes for self-review before attempting a big code change. The PR we discuss here contains a maintainer's request to remove excessive commenting; the PR's author clearly did not clean up his code to match the codebase's style before submission.
> So, after reading code, one should write down what made him amazed and find out why it is so - whether it is a custom of a project or a peculiarity of code just read.
Sorry but that’s delusional.
The number of people actually able to meaningfully read code, somehow identify what was so incredible it should be analysed despite being unfamiliar with the code base, maintain a list of their own likely errors, and self-review is so vanishingly low it might as well not exist.
If that’s the bar a potential new contributor has to clear, you will get exactly none.
I’m personally glad LLVM disagrees with you.
LLVM, I just checked, does not have a formal list of code conventions and/or typical errors and mistakes. Had they such a list, we would not be having this discussion: the PR we are discussing would be much more polished, and there would be far fewer than several dozen comments.
You are making a very strong statement, again.
Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Please don't fulminate. Please don't sneer, including at the rest of the community.
Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.
https://news.ycombinator.com/newsguidelines.html
1. Good for you.
2. Ignore the haters in the comments.
> my latest PR is my second-ever to LLVM and is an entire linter check.
That is so awesome.
> The code is of terrible quality and I am at 100+ comments on my latest PR.
The LLVM reviewers are big kids. They know how to ignore a PR if they don't want to review it. Don't feel bad about wasting people's time. They'll let you know.
You might be surprised how many PRs even pre-LLMs had 100+ comments. There's a lot to learn. You clearly want to learn, so you'll get there and will soon be offering a net-positive contribution to this community (or the next one you join), if you aren't already.
Best of luck on your journey.
How well does that scale as the number of such contributions increases and the triage process itself becomes a sizable effort?
LLMs can inadvertently create a sort of DDoS even with the best intentions, and mitigating it costs something.
I sort of doubt that all of a sudden there's going to be tons of people wanting to make complex AI contributions to LLVM, but if there are just ban them at that point.
This is a different decision made by the LLVM project than the one made by Gentoo, which is neither right nor wrong IMHO.
> The code is of terrible quality and I am at 100+ comments on my latest PR.
This may be part of the justification of the published Gentoo policy. I am not a maintainer of same so cannot say for certain. I can say it is implied within their policy:
> LLMs increase review burden a ton ...
Hence the Gentoo policy.
> ... but I would say it can be a fair tradeoff, because I'm learning quicker and can contribute at a level I otherwise couldn't.
I get it. I really do.
I would also ask: of the changes reviewers have requested, what percentage are due to LLM-generated code? If more than zero, does this corroborate the Gentoo policy position of:
If "erroneous" or "invalid" where the adjective used instead of "meaningless"?Previous codebases I've worked on during internships linted the first two in CI. And the documentation being formatted incorrectly is because I hand-wrote it without AI.
Out of the AI-related issues that I didn't catch, the biggest flaws were redundant comments and the use of string manipulation/parsing instead of AST manipulation. Useless comments are very common and I've gotten better at pruning them. The AI's insistence on hand-rolling stuff with strings was surprising and apparently LLVM-specific.
However, there was plenty of erroneous and invalid behaviour in the original AI-generated code, such as flagging `uint32_t` because the underlying type was an `unsigned int` (which wouldn't make sense as we want to replace `unsigned int` with `uint32_t`).
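As an illustration of that failure mode (a hypothetical sketch, not the actual check's logic): on typical platforms `uint32_t` is just a typedef for `unsigned int`, so a check that matches on the desugared canonical type ends up flagging the very spelling the rule is trying to encourage.

```cpp
#include <cstdint>

// Hypothetical targets for a "prefer fixed-width types" style check.
unsigned int raw_counter = 0;   // should be flagged: spell it uint32_t
uint32_t     fixed_counter = 0; // should NOT be flagged, but a naive check
                                // that looks through the typedef sees
                                // "unsigned int" and flags it anyway.
```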
I prevented most of this from reaching the PR by writing good unit tests and having a clear vision of what the final result should look like. I believe this should be a basic requirement for trying to contribute AI-generated code to an open-source project but other people might not share the same belief.
> However, there was plenty of erroneous and invalid behaviour in the original AI-generated code ...
> I prevented most of this from reaching the PR by writing good unit tests and having a clear vision of what the final result should look like.
This identifies an interesting question in my mind: if an LLM code generator is used, is it better to use it for generating production code and writing tests to verify, or to write production code and use LLM-generated code to produce tests to verify?
Assuming LLM code generation, my initial answer is the approach you took, as the test suite would serve as an augmentation to whatever prompt(s) were used. But I could also see a strong case for using LLM-generated test suites in order to maximize functional coverage. Maybe this question would be a good candidate for an "Ask HN".
> I believe this should be a basic requirement for trying to contribute AI-generated code to an open-source project but other people might not share the same belief.
FWIW, I completely concur.
> If an LLM code generator is used, is it better to use it for generating production code and writing tests to verify or write production code and use LLM generated code to produce tests to verify?
I do both.
1. Vibe code the initial design with input on the API/architecture.
2. Use the AI to write tests.
3. Carefully scrutinize the test cases, which are much easier to review than the code.
4. Save both.
5. Go do something else and let the AI modify the code until the tests/linting/etc passes.
6. Review the final product, make edits, and create the PR.
The output of step 1 is guaranteed to be terrible/buggy and difficult to review for correctness, which is why I review the test cases instead because they provide concrete examples.
Step 5 eliminates most of the problems and frees me to review important stuff.
The whole reason I wrote the check is because AI keeps using `int` and I don't want it to.
> and especially for not mentioning that it's AI generated code
https://github.com/llvm/llvm-project/pull/146970#issuecommen...
irony really is dead
I find that latter part particularly relevant, considering the hoopla is about AI bros being lazy dogs who can't be bothered to put in the hard work before attempting to contribute. Irony being then that the person above just took an intentionally cut short citation to paint the person in a somehow even more negative light than they'd have otherwise appeared in, while simultaneously not even bothering to review the conduct they're proposing to police to confirm it actually matches their knowingly uncharitable conjecture. Two wrongs not making a right or whatever.
It should be clear that my objection is to the mix of coc + ai in the context of llvm, not to this specific instance where someone is acting within the rules llvm has written down.
I think there are a couple of good signals in what you've said but also some stuff (at least by implication/phrasing) that I would be mindful of.
The reason why I think your head is fundamentally in a good place is that you seem to be shooting for an outcome where already high effort stays high, and with the assistance of the tools your ambition can increase. That's very much my aspiration with it, and I think that's been the play for motivated hackers forever: become as capable as possible as quickly as possible by using every effort and resource. Certainly in my lifetime I've seen things like widely distributed source code in the 90s, Google a little later, StackOverflow indexed by Google, the mega-grep when I did the FAANG thing, and now the language models. They're all related (and I think less impressive/concerning to people who remember pre-SEO Google, that was up there with any LLM on "magic box with reasonable code").
But we all have to self-police on this because with any source of code we don't understand, the abstraction almost always leaks, and it's a slippery slope: you get a little tired or busy or lazy, it slips a bit, next thing you know the diff or project or system is jeopardized, and you're throwing long shots that compound.
I'm sure the reviewers can make their own call about whether you're in an ok place in terms of whether you're making a sincere effort or if you've slipped into the low-integrity zone (LLVM people are serious people), just be mindful that if you want the most out of it and to be welcome on projects and teams generally, you have to keep the gap between ability and scope in a band: pushing hard enough to need the tools and reviewers generous with their time is good, it's how you improve, but go too far and everyone loses because you stop learning and they could have prompted the bot themselves.
It’s just what every other tech bro on here wants to believe, that using LLM code is somehow less pure than using free-range-organic human written code.
No, made. Which is a very important difference.
Humans are capable of generating extremely poor code. Improperly supervised LLMs are capable of generating extremely poor code.
How is this is an LLM-specific problem?
I believe part of (or perhaps the entirety of) the argument here is that LLMs certainly enable more unqualified contributors to generate larger quantities of low-quality code than they would have been able to otherwise. Which... is true.
But still I'm not sure that LLMs are the problem here? Nobody should be submitting unexpected, large, hard-to-review quantities of code in the first place, LLM-aided or otherwise. It seems to me that LLMs are, at worst, exposing an existing flaw in the governance process of certain projects?
Without LLMs, people are less likely to submit such PRs. With LLMs they're more likely to do so. This is based on recent increases in such PRs pretty much all projects have seen. Current LLMs are extremely sycophantic & encourage people to think they're brilliant revolutionary thinkers coming up with the best <ideas, code, etc> ever. Combined with the marketing of LLMs as experts it's pretty easy to see why some people fall for the hype & believe they're doing valuable work when they're really just dumping slop on the reviewers.
As for humans who can’t write code, their code doesn’t tend to look like they can.
There was a time that I used Gentoo, and I may again one day, but for the past N years I’ve not had time to compile everything from source. Compiling from source also gives a false sense of security: you still don’t know what’s been compromised (it could be the compiler, etc.), and few have the time or expertise to adequately review all of the code.
It can be a waste of energy and time to compile everything from source for standard hardware.
But, when I’m retired, maybe I’ll use it again just for the heck of it. And I’m glad that Gentoo exists.
The security argument for recompiling from source is addressed by the input-addressed package cache. The customization aspect is mostly covered by Nix package overrides and overlays. You can also set up your own package cache.
Then use the official binary packages?
> and compiling from source is a false sense of security, since you still don’t know what’s been compromised (it could be the compiler, etc.), and few have the time or expertise to adequately review all of the code.
That would still leave you in a strictly better position, surely? Any other distro would pull the same code and build it with the same compilers, so that attack surface exists regardless.
The end result is not necessarily a bad one, and I think reasonable for a project like Gentoo to go for, but the policy could be stated in a much better way.
For example: thou shalt only contribute code that is unencumbered by copyright issues, contributions must be of a high quality and repeated attempts to submit poor quality contributions may result in new contributions not being reviewed/accepted. As for the ethical concerns, they could just take a position by buying infrastructure from companies that align with their ethics, or not accepting corporate donations (time or money) from companies that they disagree with.
This isn't a court system, anyone intentionally trying to test the boundaries probably isn't someone you want to bother with in the first place.
I have friends and colleagues who I trust as good engineers who take different positions on this (letter vs spirit) and I think there are good faith contributions negatively impacted by both sides of this.
this is a bad faith comment.
Defining policy on the outcomes, rather than the inputs, makes it more resilient and ultimately more effective. Defining policy on the inputs is easy to dismantle.
Highly disingenuous. First, AI being trained on copyrighted data is considered fair use because it transforms the underlying data rather than distributing it as is. Though I have to agree that this is the relatively strongest ethical claim for not using AI, it stands weak if looked at on the whole.
The fact that they mentioned "energy and water use" should tell you that they are really looking for reasons to disparage AI. AI doesn't use any more water or energy than any other tool. An hour of Netflix uses same energy as more than 100 GPT questions. A single 10-hour flight (per person*) emits as much as around 100k GPT prompts. It is strange that one would repeat the same nonsense about AI without the primary motive being ideological.
"The advertising and use of AI models has caused a significant harm to employees and reduction of service quality." this is just a shoddy opinion at this point.
To be clear - I understand why they might ban AI for code submissions. It reduces the barrier significantly and increases the noise. But the reasoning is motivated from a wrong place.
Also, half the problem isn’t distribution, it’s how those works were acquired. Even if you suppose models are transformative, you can’t just download stuff from piratebay. Buy copies, scan them, rip them, etc.
It’s super not cool that billion dollar vc companies can just do that.
"The training use was a fair use," he wrote. "The use of the books at issue to train Claude and its precursors was exceedingly transformative."
I agree it is debatable, but it is not so clear-cut that it is _not_ transformative when a judge has ruled that it is.
I don't follow.
For one, all works have a copyright status I believe (under US jurisdiction; this of course differs per jurisdiction, although there are international IP laws), some are just extremely permissive. Models rely on a wide range of works, some with permissive, some with restrictive licensing. I'd imagine Wikipedia and StackOverflow are pretty important resources for these models for example, and both are licensed under CC BY-SA 4.0, a permissive license.
Second, despite your claim thus being false, dropping restrictively copyrighted works would of course make a dent, I'm pretty sure, although how much I'm not sure. I don't see why this would be a surprise: restrictively licensed works do contribute value, but not all of the value. So their removal would take away some of the value, but not all of it. It's not binary.
And finally, I'm not sure these aspects solely or even primarily determine whether these models are legally transformative. But then I'm also not a lawyer, and the law is a moving target, so what do I know. I'd imagine it's less legal transformativeness and more colloquial transformativeness you're concerned about anyhow, but then these are not necessarily the best aspects to interrogate either.
It's not a question of if feeding all the worlds books into a blender and eating the resulting slurry paste is copyright infringement. It's that they stole the books in the first place by getting them from piracy websites
If they'd purchased every book ever written, scanned them in and fed that into the model? That would be perfectly legal
https://apnews.com/article/anthropic-copyright-authors-settl...
> A federal judge dealt the case a mixed ruling in June, finding that training AI chatbots on copyrighted books wasn’t illegal but that Anthropic wrongfully acquired millions of books through pirate websites.
With more details about how they later did it legally, and that was fine, but it did not excuse the earlier piracy:
> But documents disclosed in court showed Anthropic employees’ internal concerns about the legality of their use of pirate sites. The company later shifted its approach and hired Tom Turvey, the former Google executive in charge of Google Books, a searchable library of digitized books that successfully weathered years of copyright battles.
> With his help, Anthropic began buying books in bulk, tearing off the bindings and scanning each page before feeding the digitized versions into its AI model, according to court documents. That was legal but didn’t undo the earlier piracy, according to the judge.
It's not a binary. Sometimes it fully reproduces works in violation of copyright and other times it modifies it just enough to avoid claims against it's output. Using AI and just _assuming_ it would never lead you to a copyright violation is foolish.
> uses same energy as more than 100 GPT questions.
Are you including training costs or just query costs?
> But the reasoning is motivated from a wrong place.
That does not matter. What matters is if the outcome is improved in the way they predict. This is actually measurable.
Ok lets discuss facts.
>It's not a binary. Sometimes it fully reproduces works in violation of copyright and other times it modifies it just enough to avoid claims against it's output. Using AI and just _assuming_ it would never lead you to a copyright violation is foolish.
In the Anthropic case the judge ruled that AI training is transformative. It is not binary, as you said, but I'm criticising what appears to be treated as binary in the original policy. When the court ruling itself has shown that training is not a violation of copyright, it is reasonable to criticise the policy now, although I acknowledge the post was written before the ruling.
>Are you including training costs or just query costs?
The training costs are very, very small because they are amortised over all the queries. I think training accounts for around 0.001% to 0.1% of each query, depending on how many training runs are done over a year.
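The amortisation argument, stated generically (the symbols here are illustrative, not figures from the thread): the energy attributable to a single query is the inference energy plus the training energy spread over however many queries the model ends up serving.

```latex
% Per-query energy under amortisation (illustrative formulation):
E_{\text{per query}} \;=\; E_{\text{inference}}
  \;+\; \frac{E_{\text{training}}}{N_{\text{queries served}}}
```

Whether the training term ends up negligible depends entirely on the assumed training energy and query volume, which is exactly where the linked estimates differ.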
[0] https://trends.builtwith.com/Server/Gentoo-Linux
Your legal argument aside, they downloaded torrents and trained their AI on them. You can't get much more blatant than that.
We should stop doing those things too. I'm still surprised that so many people are flying.
Why do you care? Their sandbox their rules, and if you care because you want to contribute you’re still free to do so. Unless you’re an LLM I guess, but the rest of us should have no problem.
The negativity just seems overblown. More power to them, and if this was a bad call they’ll revisit it.
how would they know? - this is (one of) the ways for people to let them know
What we're looking at is mostly JavaScript monkeys who feel personally offended because they're unable to differentiate criticism of their tools from criticism of their own personal character.
The outrage is purely theoretical.
How many contributors to gentoo are upset by this? Probably none.
How many potential contributors to gentoo are upset by this? Maybe dozens?
I'll be amazed if this has any notable negative outcomes for Gentoo and their contributions.
The only way it'll be revisited is if active Gentoo developers and/or contributors really start to push with a justification to get it changed and they agree to revisit discussing it again. I can tell you every maintainer has heard the line: 'I would have contributed if you did X thing'.
Only downside is my last psychiatrist dropped me as a patient when he left his practice to start an AI company providing regulatory compliance for, essentially, Dr. ChatGPT.
https://jamanetwork.com/journals/jamanetworkopen/fullarticle...
And here: https://arxiv.org/html/2503.10486v1
The other one isn't peer reviewed. Your précis doesn't appear to be warranted.
> The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group.
Basically they set up the experiment with a control group and an LLM-assisted group. There was no difference between the two groups, and that is what was reported in the top-level finding that you quote.
Then they went back and said “wait, what if we just blindly trusted the LLM? What if we had a third group that had no doctor involved — just let the LLM do the diagnosis?” This retroactively synthesized group did significantly better than either of the actual experimental groups:
> The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group … The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice.
AI is not a new tool - transformer-based LLMs are. Which is what this post is about.
The latter are well known to be a LOT LESS accurate, and still very prone to hallucination. This is just a fact. For your health, I hope no one on your medical team is using the current generation for anything other than casual questions.
I'm not an opponent, and I don't think straight up banning LLM-generated code commits is the right thing, but I can understand their stance.
Some people in this thread are already interpreting policies that allow contributions of AI-generated code to mean it's OK not to understand the code they write and to offload that work onto the reviewers.
If you have ever had to review code that an author doesn't understand or written code that you don't understand for others to review, you should know how bad it is even without an LLM.
> Why do you care? Their sandbox their rules...
* What if it's a piece of software or dependency that I use and support? That affects me.
* What if I have to work with these people in this community? That affects me.
* What if I happen to have to mentor new software engineers who were conditioned to think that bad practices are OK? That affects me.
Things are usually less sandboxed than you think.
Exactly this. It's their decision to make; their consequences as well.
Then again, I would have bet $1000 that Gentoo disappeared 15 years ago. Probably around 2009? I legitimately haven't even heard about them in at least that long.
So rejecting contributions from whoever might even still be around seems like a bad decision.
By the way, I'm in no way against these kinds of policy: I've seen what happened to curl, and I think it's fully in their rights to outright ban any usage of LLMs. I'm just concerned about the enforceability of these policies.
One of the parties that decided on Gentoo's policy effectively said the same thing. If I get what you're really asking... the reality is, there's no way for them to know if an LLM tool was used internally; it's an honor system. But enforcement is just: ban the contributor if they become a problem. They've banned or otherwise restricted other contributors for being disruptive or spamming low-quality contributions in the past.
It's worded the way it is because most of the parties understand this isn't going away and might get revisited eventually. At least one of them hardline opposes LLM contributions in any form and probably won't change their mind.
To add a bit more context, when I was writing the original comment, I was mainly thinking of first-time contributors that don't have any track records, and how the policy would work against them.
They aren't a government and it's not that bureaucratic. As with any group, if you break the guidelines/rules they just won't want to work with you.
> To add a bit more context, when I was writing the original comment, I was mainly thinking of first-time contributors that don't have any track records, and how the policy would work against them.
No matter what, somebody has to review the contribution. First time contributors get feedback, most of them correct their mistakes and some go on to be regular contributors (like me). Others never respond, and still more others make the same mistakes over and over again.
On the topic, Gentoo has projects like GURU where users can contribute new packages that maybe aren't ready for main tree or where a full developer wouldn't be interested, it's a good place to learn if interested in working towards becoming a developer: https://wiki.gentoo.org/wiki/Project:GURU
[generated by ChatGPT] Source: https://news.ycombinator.com/item?id=45217858
To me the point is that I want to see effort from a person asking me to review their PR. If it's obvious LLM generated bullshit, I outright ignore it. If they put in the time and effort to mold the LLM output so that it's high quality and they actually understand what they're putting in the PR (meaning they probably replace 99% of the output), then good, that's the point
If it turns out to be incorrectly called out, well that sucks, but I submit that patches have been refused before LLMs came to be.
There is nothing inherently different about these policies that make them more or less difficult to enforce than other kinds of polices.
Several projects have rejected "AI" policies using your argument even though those projects themselves have contributor agreements or similar.
This inconsistency makes it likely that the cheating argument, when only used for "AI" contributions, is a pretext and these projects are forced to use or promote "AI" for a number of reasons.
Of course, if someone has used LLM during development as a helper tool and done the necessary work of properly reviewing and fixing the generated code, then it can be borderline impossible to detect, but such PRs are much less problematic.
My perspective is that this criticism is only valid for “single-shot in spirit” / “prompt and forget” LLM powered contributions.
I’d be curious how much energy gentoo consumes versus a binary distro.
Unfortunately one caveat would be it will be difficult to separate the maintainers from the financial incentives, so it won’t be a fair comparison. (e.g. the labs funding full time maintainers with salaries and donations that other distros can only dream of)
28 more comments available on Hacker News