I Ported Justhtml From Python to Javascript with Codex CLI and GPT-5.2 in Hours

https://www.google.com/search?q=cucumber+testing+framework

17 days ago

10 replies

I think the most interesting thing about this is how it demonstrates that a very particular kind of project is now massively more feasible: library porting projects that can be executed against implementation-independent tests.

The big unlock here is https://github.com/html5lib/html5lib-tests - a collection of 9,000+ HTML5 parser tests that are their own independent file format, e.g. this one: https://github.com/html5lib/html5lib-tests/blob/master/tree-...

The Servo html5ever Rust codebase uses them. Emil's JustHTML Python library used them too. Now my JavaScript version gets to tap into the same collection.

This meant that I could set a coding agent loose to crunch away on porting that Python code to JavaScript and have it keep going until that enormous existing test suite passed.

Sadly conformance test suites like html5lib-tests aren't that common... but they do exist elsewhere. I think it would be interesting to collect as many of those as possible.

gwking

17 days ago

3 replies

I’ve idly wondered about this sort of thing quite a bit. The next step would seem to be taking a project’s implementation dependent tests, converting them to an independent format and verifying them against the original project, then conducting the port.

skissane

17 days ago

1 reply

Give coding agent some software. Ask it to write tests that maximise code coverage (source coverage if you have source code; if not, binary coverage). Consider using concolic fuzzing. Then give another agent the generated test suite, and ask it to write an implementation that passes. Automated software cloning. I wonder what results you might get?

gaigalas

17 days ago

1 reply

> Ask it to write tests that maximise code coverage

That is significantly harder to do than writing an implementation from tests, especially for codebases that previously didn't have any testing infrastructure.

skissane

17 days ago

2 replies

Give a coding agent a code base with no tests, and tell it to write some, it will - if you don’t tell it which framework to use, it will just pick one. No denying you’ll get much better results if an experienced developer provides it with some prompting on how to test than if you just let it decide for itself.

joshstrange

16 days ago

2 replies

This is a hilariously naive take.

If you’ve actually tried this, and actually read the results, you’d know this does not work well. It might write a few decent tests but get ready for an impressive number of tests and cases but no real coverage.

I did this literally 2 days ago and it churned for a while and spit out hundreds of tests! Great news right? Well, no, they did stupid things like “Create an instance of the class (new MyClass), now make sure it’s the right class type”. It also created multiple tests that created maps then asserted the values existed and matched… matched the maps it created in the test… without ever touching the underlying code it was supposed to be testing.

I’ve tested this on new codebases, old codebases, and vibe coded codebases, the results vary slightly and you absolutely can use LLMs to help with writing tests, no doubt, but “Just throw an agent at it” does not work.

lsaferite

15 days ago

This highlights something that I wish was more prevalent, Path Coverage. I'm not sure of what testing suites handle path coverage, but I know XDebug for PHP could manage it back when I was doing PHP work. Simple line coverage doesn't tell you enough of the story while path coverage should let you be sure you've tested all code paths of a unit. Mix that with input fuzzing and you should be able to develop comprehensive unit tests for critical units in your codebase. Yes, I'm aware that's just one part of a large puzzle.

skissane

12 days ago

But, did you actually give the agent access to a tool to measure code coverage?

If it can't measure whether it is succeeding in increasing code coverage, no wonder it doesn't do that great a job in increasing it.

Also, it can help if you have a pair of agents (which could even be just two different instances of the same agent with different prompting) – one to write tests, and one to review them. The test-writing agent writes tests, and submits them as a PR; the PR-reviewing agent read the PR and provides feedback; the test-writing agent updates the tests in response to the feedback; iterate until the PR-reviewing agent is satisfied. This can produce much better tests than just an agent writing tests without any automated review process.

gaigalas

17 days ago

Have you tried? Beyond the first tests, going all the way up to decent coverage.

pbowyerAuthor

17 days ago

3 replies

I think I've asked this before on HN but is there a language-independent test format? There are multiple libraries (think date/time manipulation for a good example) where the tests should be the same across all languages, but every library has developed its own test suite.

Having a standard test input/output format would let test definitions be shared between libraries.

sfjailbird

17 days ago

Like Cucumber?

sciurus

17 days ago

https://testanything.org/ ?

k__

17 days ago

Maybe tape?

cr125rider

17 days ago

I’ve got to imagine a suite of end to end tests (probably most common is fixture file in, assert against output fixture file) would be very hard to nail all of the possible branches and paths. Like the example here, thousands of well made tests are required.

heavyset_go

17 days ago

1 reply

This is one of the reasons I'm keeping tests to myself for a current project. Usually I release libraries as open source, but I've been rethinking that, as well.

17 days ago

1 reply

Oddly enough my conclusion is the opposite: I should invest more of my open source development work in creating language-independent test suites, because they can be used to quickly create all sorts of useful follow-on projects.

heavyset_go

17 days ago

1 reply

I'm not that generous lol

cortesoft

17 days ago

3 replies

Isn't the point that you might be one of the people who benefits from one of those follow on projects? That is kind of the whole point of open source.

Why are you making your stuff open source in the first place if you don't want other people to build off of it?

bgwalter

17 days ago

2 replies

Open source has three main purposes, in decreasing order of importance:

1) Ensuring that there is no malicious code and enabling you to build it yourself.

2) Making modifications for yourself (Stallman's printer is the famous example).

3) Using other people's code in your own projects.

Item 3) is wildly over-propagandized as the sole reason for open source. Hard forks have traditionally led to massive flame wars.

We are now being told by corporations and their "AI" shills that we should diligently publish everything for free so the IP thieves can profit more easily. There is no reason to oblige them. Hiding test suites in order to make translations more difficult is a great first step.

inejge

17 days ago

> Hard forks have traditionally led to massive flame wars.

Provided that the project is popular and has a community, especially a contributor community (the two don't have to go together.) Most projects aren't that prominent.

visarga

17 days ago

I think the only non-slop parts of the web are: open source, wikipedia, arXiv and social network comments in well behaved/moderated communities. What do they share in common? They all allow building on top, they are social first, people come together for interaction and collaboration.

The rest is enshittified web, focused on attention grabbing, retention dark patterns and misinformation. They all exist to make a profit off our backs.

nicoburns

17 days ago

If you don't trust the AI generated code yourself, then you wont benefit from it. And in fact all it does is take resources from the project that you work on, the one that's generating all the value in the first place.

There are strong parallels to the image generation models that generate images in the style of studio ghibli films. Does that benefit studio ghibli? I'd argue not. And if we're not careful, it will to undermine the business model that produced the artwork in the first place (which the AI is not currently capable of doing).

heavyset_go

17 days ago

> Why are you making your stuff open source in the first place if you don't want other people to build off of it?

Because I enjoy the craft. I will enjoy it less if know I'm being ripped off, likely for profit, hence my deliberate choices of licenses, what gets released and what gets siloed.

I'm happy if someone builds off of my work, as long as it's on my own terms.

aadishv

17 days ago

2 replies

I wonder if this makes AI models particularly well-suited to ML tasks, or at least ML implementation tasks, where you are given a target architecture and dataset and have to implement and train the given architecture on the given dataset. There are strong signals to the model, such as loss, which are essentially a slightly less restricted version of "tests".

17 days ago

1 reply

I'm certain this is the case. Iterating on ML models can actually be pretty tedious - lots of different parameters to try out, then you have to wait a bunch, then exercise the models, then change parameters and try again.

Coding agents are fantastic at these kinds of loops.

aadishv

3d ago

A rudimentary form of self-improving intelligence :D

montroser

17 days ago

1 reply

We've been doing this at work a bunch with great success. The most impressive moment to me was when the model we were training did a type of overfitting, and rather than just claiming victory (as it all too often) this time Claude went and just added a bunch more robust, human-grade examples to our training data and hold out set, and kept iterating until the model effectively learned the actual crux of what we were trying to teach it.

aadishv

3d ago

That's genuinely impressive. Excited to see how the rapid progress will make it more and more autonomous in the future

avsm

17 days ago

2 replies

The html5lib conformance tests when combined with the WHATWG specs are even more powerful! I managed to build a typed version of this in OCaml in a few hours ( https://anil.recoil.org/notes/aoah-2025-15 ) yesterday, but I also left an agent building a pure OCaml HTML5 _validator_ last night.

This run has (just in the last hour) combined the html5lib expect tests with https://github.com/validator/validator/tree/main/tests (which are a complex mix of Java RELAX NG stylesheets and code) in order to build a low-dependency pure OCaml HTML5 validator with types and modules.

This feels like formal verification in reverse: we're starting from a scattered set of facts (the expect tests) and iterating towards more structured specifications, using functional languages like OCaml/Haskell as convenient executable pitstops while driving towards proof reconstruction in something like Lean.

leafmeal

16 days ago

This totally makes me thing of Martin Kleppmann's recent blog post about how AI will make verified software much easier to use in practice! https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...

yuppiemephisto

15 days ago

I’m doing similar with porting shellcheck Haskell -> Lean

Havoc

17 days ago

1 reply

Was struggling yesterday with porting something (python->rust). LLM couldn't figure out what was wrong with rust one no matter how I came at it (even gave it wireshark traces). And being vibecoded I had no idea either. Eventually copied in python source into rust project asked it to compare...immediate success

Turns out they're quite good at that sort of pattern matching cross languages. Makes sense from a latent space perspective I guess

wuschel

14 days ago

Could you elaborate a bit on your example? What do mean by "that sort of pattern matching" and the argument of "latent space perspective?"

Thanks!

pplonski86

17 days ago

This is amazing. Porting library from one language to one language are easy for LLMs, LLMs are tired-less and aware of coding syntax very well. What I like in machine learning benchmarks is that agents develop and test many solutions, and this search process is very human-alike. Yesterday, I was looking into MLE-Bench for benchamrking coding Agents on machine learning tasks from Kaggle https://github.com/openai/mle-bench There are many projects that provide agents which performance is simply incredible, they can solve several Kaggle competitions under 24 hours and be on medal place. I think this is already above human level. I was reading ML-Master article and they describe AI4AI where AI is used to create AI systems: https://arxiv.org/abs/2506.16499

tracnar

17 days ago

If you're porting a library, you can use the original implementation as an 'oracle' for your tests. Which means you only need a way to write/generate inputs, then verify the output matches the original implementation.

It doesn't work for everything of course but it's a nice way to bug-for-bug compatible rewrites.

bzmrgonz

17 days ago

I see it as a learning or training tool for AI. The same way we use mock exams/tests, to verify our skill and knowledge absorption ans prepare for the real thing or career. This could one of many obstacles in an obstacle course which a coding AI would have to navigate in order to "graduate"

exclipy

17 days ago

Can you port tsc to go in a few hours?

cies

17 days ago

This is an interesting case. It may be good to feed it to other model and see how they do.

Also: it may be interesting to port it to other languages too and see how they do.

JS and Py are but runtime-typed and very well "spoken" by LLMs. Other languages may require a lot more "work" (data types, etc.) to get the port done.

cxr

17 days ago

4 replies

[delayed]

https://github.com/mozilla-firefox/firefox/tree/main/parser/...

17 days ago

2 replies

Whoa... it looks like the Firefox HTML5 parser is still maintained as Java to this day!

Here's the relevant folder:

  make translate        # perform the Java-to-C++ translation from the remote
                        # sources

And active commits to that javasrc folder - the last was in November: https://github.com/mozilla-firefox/firefox/commits/main/pars...

17 days ago

3 replies

I just blogged about this https://simonwillison.net/2025/Dec/17/firefox-parser/

... and then when I checked the henri-sivonen tag https://simonwillison.net/tags/henri-sivonen/ found out I'd previously written about the exact same thing 16 years earlier!

17 days ago

It's very nice to have written for so long... I often think I should write more for myself than for others.

johanyc

16 days ago

The power of blogging

cxr

17 days ago

[delayed]

cxr

17 days ago

[delayed]

QuantumNomad_

17 days ago

1 reply

IANAL. In my opinion, porting code to a different language is still derivative work of the code you are porting it from. Whether done by hand or with an LLM. And in my opinion, the license of the original code still applies. Which means that not only should one link to the repo for the code that was ported, but also make sure to adhere to the terms to the license.

The MIT family of licenses state that the copyright notice and terms shall be included in all copies of the software.

Porting code to a different language is in my opinion not much different from forking a project and making changes to it, small or big.

I therefore think the right thing to do is to keep the original copyright notice and license file, and adding your additional copyright line to it.

So for example if the original project had an MIT license file that said

Permission is hereby granted and so on

You should keep all of that and add your copyright year and author name on the next line after the original line or lines of the authors of the repo you took the code from.

17 days ago

1 reply

I added Emil to my license file: https://github.com/simonw/justjshtml/blob/main/LICENSE

I'm not certain I should add the html5ever copyright holders, since I don't have a strong understanding of how much of their IP ended up in Emil's work - see https://news.ycombinator.com/item?id=46264195#46267059

17 days ago

My feeling is that my code depends more on the html5lib-tests work than on html5ever. While inspired by, I think the macro-based Rust code is different enough from the source so that its new work. I’m guessing we’ll never know.

17 days ago

There are certainly dozens of better ways to do what I did here.

I picked JustHTML as a base because I really liked the API Emil had designed, and I also thought it would be darkly amusing to take his painstakingly (1,000+ commits, 2 months+ of work) constructed library and see if I could port it directly to Python in an evening, taking advantage of everything he had already figured out.

fergie

17 days ago

Surely for debugging and auditing it's always better to write libs in JavaScript? Also, given that much of TypeScripts utilty is for improving the developer experience- is it still as relevant for machine-generated code?

f311a

17 days ago

1 reply

From original repository:

     Verified Compliance: Passes all 9k+ tests in the official html5lib-tests suite (used by browser vendors).

Yes, browsers do you use it. But they handle a lot of stuff differently.

    selectolax  68%  No  Very Fast  CSS selectors C-based (Lexbor). Very fast but less compliant.

The original author compares selectolax to html5lib-tests, but the reality is that when you compare selectolax to Chrome output, you get 90%+.

One of the tests:

  INPUT: <svg><foreignObject></foreignObject><title></svg>foo

It fails for selectolax:

  Expected:
  | <html>
  |   <head>
  |   <body>
  |     <svg svg>
  |       <svg foreignObject>
  |       <svg title>
  |     "foo"
  Actual:
  | <html>
  |   <head>
  |   <body>
  |     <svg>
  |       <foreignObject>
  |       <title>
  |     "foo"

But you get this in Chrome and selectolax:

    <html><head></head><body><svg><foreignObject></foreignObject><title></title></svg>foo
    </body></html>

17 days ago

This is a namespacing test. The reason the tag is <svg title> is that the parser is handling the title tag as the svg version of it. SVG has other handling rules, so unless the parser knows that it won't work right. I would be interesting to run the tests against Chrome as well!

You are also looking at the test format of the tag, when serialized to HTML the svg prefixes will disappear.

minimaxir

17 days ago

2 replies

My opinion on the ending open questions:

> Does this library represent a legal violation of copyright of either the Rust library or the Python one? Even if this is legal, is it ethical to build a library in this way?

Currently, I am experimenting with two projects in Claude Code: a Rust/Python port of a Python repo which necessitates a full rewrite to get the desired performance/feature improvements, and a Rust/Python port of a JavaScript repo mostly because I refuse to install Node (the speed improvement is nice though).

In both of those cases, the source repos are permissively licensed (MIT), which I interpret as the developer intent as to how their code should used. It is in the spirit of open source to produce better code by iterating on existing code, as that's how the software ecosystem grows. That would be the case whether a human wrote the porting code or not. If Claude 4.5 Opus can produce better/faster code which has the same functionality and passes all the tests, that's a win for the ecosystem.

As courtesy and transparency, I will still link and reference the original project in addition to disclosing the Agent use, although those things aren't likely required and others may not do the same. That said, I'm definitely not using an agent to port any GPL-licensed code.

17 days ago

1 reply

That's about where I'm settled on this right now. I feel like authors who select the GPL have made a robust statement about their intent. It may be legal for me to copyright-launder their library (maybe using the trick where one LLM turns their code into a spec and another turns that spec into fresh code) but I wouldn't do that because it would subvert the spirit of the license.

17 days ago

1 reply

Would it be a problem if you maintained the GPL license and released your code as open source?

17 days ago

1 reply

Good point, that might actually be fine (especially if you kept copyright for the original authors too.)

ZeroGravitas

17 days ago

1 reply

Can a human even put GPL on bot written code since it relies on copyright to protect it? Is that like museums adding copyright to scans of public domain paintings in their holdings? Which was fought about in courts for years.

fluidcruft

15 days ago

Probably a human could put a copyright on a prompt (that would be the "source" and the LLM would be a compiler or interpreter) and the generated code would be derivative of the prompt and any inputs.

It would probably get into whether the prompt itself is considered copyrightable. There is some threshold for that since I have heard some patches are considered insignificant and uncopyrightable.

throwup238

17 days ago

1 reply

> As courtesy and transparency, I will still link and reference the original project in addition to disclosing the Agent use, although those things aren't likely required and others may not do the same. That said, I'm definitely not using an agent to port any GPL-licensed code.

IANAL but regardless of the license, you have to respect their copyright and it’s hard to argue that an LLM ported library is anything but a derivative work. You would still have to include the original copyright notices and retain the license (again IANAL).

minimaxir

17 days ago

2 replies

A similar argument could be made about generative AI and whether text/image outputs themselves are derivative works, which is a legal point of contention still being argued. It's unclear if code text from a generative AI is in scope.

fluidcruft

15 days ago

This sort of translation is probably well trodden the status of something like "Translate Jules Verne's 'Vingt Mille Lieues sous Les Mers' to English" has plenty of predicates.

In terms of images, this seems more like a translation. "Translate this photo into the style of George Seurat". Whether George Seurat would have a copyright claim is not as clear but it seems pretty intuitive that the result is a derivative of the photo.

throwup238

17 days ago

That’s a legal point of contention because the nature of language/image models is hard to fit into the existing copyright framework. That only really applies to cleanroom-ish one shot requests where the inference input doesn’t contain the copyrighted material in question.

It’s a lot easier to argue that it’s a derivative work when you feed the copyrighted code directly into the context and ask it to port it to another language. If the copyrighted code is literally an input to the inference request, that would not escape any judge’s notice.

17 days ago

1 reply

> Can I even assert copyright over this, given how much of the work was produced by the LLM?

No, because it's a derivative work of the base library.

17 days ago

2 replies

That doesn't sound right to me. If it's a derivative work I can still assert copyright over the modifications I have made, but not over the original material.

17 days ago

1 reply

You're right that derivative works are copyrightable. I got that wrong.

I think you can claim the prompt itself. But you didn't create the new code. I'd argue copyright belongs to the original author.

17 days ago

1 reply

Something I'm particularly interested in understanding is where the tipping point here is. At what point is a prompt or the input that accompanies a prompt enough for the result to be copyrightable?

This project is the absolute extreme: I handed over exactly 8 prompts, and several of those were just a few words. I count the files on disk as part of the prompts, but those were authored by other people.

The US copyright office say "the resulting work is copyrightable only if it contains sufficient human-authored expressive elements" - https://perkinscoie.com/insights/update/copyright-office-sol... - but what does that actually mean?

Emil's JustHTML project involved several months of work and 1,000+ commits - almost all of the code was written by agents but there was an enormous amount of what I"d consider "human-authored expressive elements" guiding that work.

Many of my smaller AI-assisted projects use prompts like this one:

> Fetch https://observablehq.com/@simonw/openai-clip-in-a-browser and analyze it, then build a tool called is-it-a-bird.html which accepts a photo (selected or drag dropped or pasted) and instantly loads and runs CLIP and reports back on similarity to the word “bird” - pick a threshold and show a green background if the photo is likely a bird

Result: https://tools.simonwillison.net/is-it-a-bird

It was a short prompt, but the Observable notebook it references was authored by me several years ago. The agent also looked at a bunch of other files in my tools repo as part of figuring out what to build.

I think that counts as a great deal of "human-authored expressive elements" by me.

So yeah, this whole thing is really complicated!

17 days ago

1 reply

This is, of course, forgetting the fact that the model was trained on heaps and heaps of copyrighted work.

Laying claim to anything generated is very likely to fail.

17 days ago

1 reply

If it turns out you can't copyright code that was generated with the help of LLMs a whole bunch of billion+ dollar companies are going to have to throw away 18+ months of their work.

brailsafe

17 days ago

1 reply

[delayed]

16 days ago

Indeed, the risk would be you try to sue another company for copyright infringement, and in discovery it comes out you generated that code.

leprechaun1066

17 days ago

1 reply

In this case the majority of the work was done by another company on your instruction. When you signed up was there anything in the terms that said you get ownership over the output?

https://help.openai.com/en/articles/5008634-will-openai-clai...

17 days ago

All of the notable generative AI companies have policies that the won't claim copyright over your outputs.

They also frequently offer "liability shields" where their legal teams will go to bat for you if you get sued for copyright infringement based on your usage of their terms.

https://www.anthropic.com/news/expanded-legal-protections-ap...

https://ai.google.dev/gemini-api/terms#use-generated

StarterPro

17 days ago

1 reply

YOU didn't port shit, the ai did all the work.

17 days ago

1 reply

That's kind of the whole point of this exercise and my write-up of it.

kjgkjhfkjf

17 days ago

2 replies

I'm glad you wrote it up. Thanks! But I feel like the folks behind the HTML5 spec and the comprehensive test suite deserve the lion's share of the credit for this (very neat) achievement.

Most projects don't have a detailed spec at the outset. Decades of experience have shown that trying to build a detailed spec upfront does not work out well for a vast class of projects. And many projects don't even have a comprehensive test suite when they go into production!

17 days ago

1 reply

I completely agree. I hope I gave them enough credit in the blog post and the GitHub repo.

kjgkjhfkjf

17 days ago

Yep, and I think it is a great way to draw attention to their work!

visarga

17 days ago

Having a comprehensive spec and test suite is an absolute requirement, without it all you got is vibe-testing, LGTM feels. As shown by the OP, you can throw away the code and regenerate it back from tests and specs.

mirthturtle

17 days ago

1 reply

Wild to ask, "Is it legal, ethical, responsible or even harmful to build in this way and publish it?" AFTER building and publishing it. Author made up his mind already, or doesn't actually care. Ethics and responsibility should guide one's actions, not just be engagement fodder after the fact.

[0] https://ammil.industries/the-port-i-couldnt-ship/

17 days ago

If I thought this was clear-cut 100% unethical and irresponsible I wouldn't have done it. I think there's ample room for conversation about this. I'd like to help instigate that conversation.

I'm ready to take a risk to my own reputation in order to demonstrate that this kind of thing is possible. I think it's useful to help people understand that this kind of thing isn't just feasible now, it's somewhat terrifyingly easy.

ethanpil

17 days ago

3 replies

Using a random calculator I found this amounts to

  It took two initial prompts and a few tiny follow-ups. GPT-5.2 running in Codex CLI ran uninterrupted for several hours, burned through 1,464,295 input tokens, 97,122,176 cached input tokens and 625,563 output tokens and ended up producing 9,000 lines of fully tested JavaScript across 43 commits.

Source: https://www.llm-prices.com/#it=1464295&cit=97123000&ot=62556...

almostgotcaught

17 days ago

1 reply

> I am now confident that within 5-10 years (most/all?) junior & mid and many senior dev positions are going to drop out enormously.

yes because this is what we do all day every day (port existing libraries from one language to another)....

like do y'all hear yourselves or what?

hatefulheart

17 days ago

1 reply

I’m afraid the boosters hear nothing.

The commenter you’re replying to, in their heart of hearts, truly believes in 5 years that an LLM will be writing the majority of the code for a project like say Postgres or Linux.

Worth bearing in mind the boosters said this 5 years ago, and will say this in 5 years time.

igouy

16 days ago

1 reply

I would guess that the vast majority are not writing code for a project like Postgres of Linus.

> (most/all?) junior & mid and many senior dev positions

hatefulheart

16 days ago

What purpose does this statement serve?

Everyone working in programming is writing code for a project more like Postgres or Linux than they are a project like making a wood cabinet or a life drawing.

elcritch

17 days ago

This is for porting an existing project. It’s an ideal case for LLMs. The results are still pretty different for building up a library from scratch.

However this changes the economics for languages with smaller ecosystems!

afro88

17 days ago

People say this kind of thing a lot, but in reality the concept of "software engineer" will change and there will still be experience levels with different expectations

cjlm

17 days ago

1 reply

Not all AI-assisted ports are quite so successful[0]

zamadatix

17 days ago

I think a big factor (of many probably) is there is a ~150x difference in bytes of source vs number of tests for them. I.e. I wonder what other projects are easy wins, which are hard ones, and which can be accomplished quickly with a certain approach.

It'd be really interesting if Simon gave a crack at the above and wrote about his findings in doing so. Or at least, I'd find it interesting :).

bgwalter

17 days ago

1 reply

I think the decision of SQLite to keep its large test suite private is very wise in the presence of thieves.

17 days ago

Talking about "thieves" is very much going back to the idea that software is the same thing as physical things. When talking about software we have a very simple concept to guide us: the license.

The license of html5ever is MIT, meaning the original authors are OK that people do whatever they want with it. I've retained that license and given them acknowledgement (not required by the license) in the README. Simon has done the same, kept the license and given acknowledgement (not required) to me.

We're all good to go.

aster0id

17 days ago

1 reply

> Code is so cheap it’s practically free. Code that works continues to carry a cost, but that cost has plummeted now that coding agents can check their work as they go.

I personally think that even before LLMs, the cost of code wasn't necessarily the cost of typing out the characters in the right order, but having a human actually understand it to the extent that changes can be made. This continues to be true for the most part. You can vibe code your way into a lot of working code, but you'll inevitably hit a hairy bug or a real world context dependency that the LLM just cannot solve, and that is when you need a human to actually understand everything inside out and step in to fix the problem.

monkpit

17 days ago

4 replies

I wonder if we will trend towards a world where maintainability is just a waste of time and money, when you can just knock together a new flimsy thing quicker and cheaper than maintaining one thing over multiple iterations.

doganugurlu

17 days ago

1 reply

Without maintainability, adding a new type of input or feature will break existing features.

Doesn’t matter how quick it is to write from scratch, if you want varying inputs handled by the same piece of code, you need maintainability.

In a way, software development is all about adding new constraints to a system and making sure the old constraints are still satisfied.

ranger_danger

8 days ago

Don't adequate tests make this much less of an issue?

skydhash

17 days ago

I don’t think that will ever be true. Let’s take a shell session as an example of ad-hoc code: People are still writing programs and scripts. Stuff doesn’t really change that often to warrant starting from scratch. Easier to add a new format to a music player than writing a new player from scratch.

hombre_fatal

17 days ago

This kind of thinking is provocative today but inevitable tomorrow. I think you’re a lot closer to where we’ll be in 5 years than most estimations I see on HN regarding AI.

The only thing I’d tweak is that it won’t be flimsy. If you can one-shot well-tested production grade software with a prompt, what else would that world look like, and what would be the equivalent in applications beyond software?

killingtime74

17 days ago

I don't think most business processes can afford to have that many issues with their code. Customers and contracts will be lost. Reputations will be lost

mNovak

17 days ago

1 reply

While this example is explicitly asking for a port (thus a copy), I also find in general that LLM's default behavior is to spit out new code from their vast pre-trained encyclopedia, vs adding an import to some library that already serves that purpose.

I'm curious if this will implicitly drive a shift in the usage of packages / libraries broadly, and if others think this is a good or bad thing. Maybe it cuts down the surface of upstream supply-chain attacks?

17 days ago

1 reply

As a corollary, it might also increase the surface of upstream supply-chain attacks (patched or not)

The package import thing seems like a red herring

17 days ago

1 reply

It's going to be fun if someone finds a security vulnerability in a commonly-emitted-by-LLMs code pattern. That'll be a lot harder to remediate than "Update dependency xyz"

16 days ago

1 reply

> if someone finds a security vulnerability in a commonly-emitted-by-LLMs code pattern

how do you distinguish this from injecting a vulnerable dependency to a dependency list?

16 days ago

1 reply

You can more easily check for known-vulnerable dependencies

16 days ago

1 reply

Right, but if you can embed bad packages in LLMs, you can surely embed any kind of vulnerability imaginable.

16 days ago

1 reply

I'm not thinking about deliberately embedded vulnerabilities, just accidental/emergent ones. The modern equivalent of devs copy-pasting stackoverflow answers that happen to contain SQL injection vulns.

16 days ago

1 reply

Does the distinction make any difference?

16 days ago

Yes, you'd take different actions to avoid each.

orange_puff

17 days ago

1 reply

This seems really impressive. I am too lazy to replicate this, but I do wonder how important the test suite is for a a port that likely uses straight forward, dependency free python code https://github.com/EmilStenstrom/justhtml/tree/main/src/just...

It is enormously useful for the author to know that the code works, but my intuition is if you asked an agent to port files slowly, forming its own plan, making commits every feature, it would still get reasonably close, if not there.

Basically, I am guessing that this impressive output could have been achieved based on how good models are these days with large amounts of input tokens, without running the code against tests.

17 days ago

1 reply

I think the reason this was an evening project for Simon is based on both the code and the tests and conjunction. Removing one of them would at least 10x the effort is my guess.

17 days ago

1 reply

The biggest value I got from JustHTML here was the API design.

I think that represents the bulk of the human work that went into JustHTML - it's really nice, and lifting that directly is the thing that let me build my library almost hands-off and end up with a good result.

Without that I would have had to think a whole lot more about what I was doing here!

orange_puff

16 days ago

1 reply

Do you mind elaborating? By API design, do you mean how they structured their classes, methods, etc. or something else?

16 days ago

I mean the design of the user-facing API: https://github.com/EmilStenstrom/justhtml/blob/main/docs/api...

See also the demo app I vibe-coded against their library here: https://tools.simonwillison.net/justhtml - that's what initially convinced me that the API design was good.

I particularly liked the design of JustHTML's core DOM node: https://github.com/EmilStenstrom/justhtml/blob/main/docs/api... - and the design of the streaming API: https://github.com/EmilStenstrom/justhtml/blob/main/docs/api...

xarope

17 days ago

1 reply

"If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed"

I'm a bit sad about this; I'd rather have "had fun" doing the coding, and get AI to create the test cases, than vice versa.

17 days ago

The other way around works as well! ”Get me to 100% test coverage using only integration tests” is a fun prompt!

febed

17 days ago

2 replies

What was your prompt to get it to run the test suite and heal tests at every step? I didn’t see that mentioned in your write up. Also, any specific reason you went with Codex over Claude Code?

17 days ago

For me (original author of JustHTML), it was enough the put the instructions on how to run tests in the AGENTS.md. It knows enough about coding to run tests by itself.

https://martinalderson.com/posts/has-the-cost-of-software-ju...

17 days ago

All of the prompts I used are in the article. The two most relevant to testing were:

  We are going to create a JavaScript port of ~/dev/justhtml - an HTML parsing library that passes the full ~/dev/html5lib-tests test suite. [...]

And later:

  Configure GitHub Actions test.yml to run that on every commit, then commit and push

Good coding models don't need much of a push to get heavily into automated testing.

I used Codex for a few reasons:

1. Claude was down on Sunday when I kicked off tbis project

2. Claude Code is my daily driver and I didn't want to burn through my token allowance on an experiment

3. I wanted to see how well the new GPT-5.2 could handle a long running project

seinecle

17 days ago

1 reply

Remarkable that it echoes, from a different angle, this post from just a few days ago on HN:

This last post was largely dismissed in the comments here on HN. Simon's experiment brings new ground for the argument.

akie

17 days ago

2 replies

The reason is that the post you link to is overly simplistic. The only reason why Simon's experiment works is because there is a pre-existing language agnostic testing framework of 9000 tests that the agent can hold itself accountable to. Additionally, there is a pre-existing API design that it can reuse/reappropriate.

These two preconditions don't generally apply to software projects. Most of the time there are vague, underspecified, frequently changing requirements, no test suite, and no API design.

If all projects came with 9000 pre-existing tests and fleshed-out API, then sure, the article you linked to could be correct. But that's not really the case.

baq

17 days ago

1 reply

> pre-existing language agnostic testing framework of 9000 tests

if there exists a language specific test harness, you can ask the LLMs to port it before porting the project itself.

if there doesn't, you can ask the LLM to build one first, for the original project, according to specs.

if there are no specs, you can ask the LLM to write the specs according to the available docs.

if there are no docs, you can ask the LLM to write them.

if all the above sounds ridiculous, I agree. it's also effective - go try it.

philipwhiuk

17 days ago

> if it doesn't, you can ask the LLM to build one first, for the original project, according to specs.

And you have no idea if that is necessary and sufficient at this point.

You are building on sand.

jillesvangurp

17 days ago

If you start with some working software, you could make an LLM generate a lot of tests for the existing functionality and ensure they pass against the existing software and have excellent test coverage. Generating tests and specifications from existing software is relatively easy. It's very tedious to do manually but LLMs excel at that type of job.

Once you have that, you port over the tests to a new language and generate an implementation that passes all those tests. You might want to do some reviews of the tests but it's a good approach. It will likely result in bug for bug compatible software.

Where it gets interesting is figuring out what to do with all the bugs you might find along the way.

leroman

17 days ago

1 reply

The biggest challenge an agent will face with tasks like these is the diminishing quality in relation to the size of the input, specifically I find input of above say 10k tokens dramatically reduced quality of generated output.

This specific case worked well, I suspect, since LLMs have a LOT of previous knowledge with HTML, and saw multiple impl and parsing of HTML in the training.

Thus I suspect that in real world attempts of similar projects and any non well domain will fail miserably.

adastra22

17 days ago

1 reply

In my experience it is closer to 25k, but that’s a minor point. What task do you need to do that requires more than that many tokens?

No, seriously. If you break your task into bite sized chunks, do you really need more than that at a time? I rarely do.

leroman

17 days ago

1 reply

What model are you working with where you still get good results at 25k?

To your q, I make huge effort in making my prompts as small as possible (to get the best quality output), I go as far as removing imports from source files, writing interfaces and types to use in context instead of fat impl code, write task specific project / feature documentation.. (I automate some of these with a library I use to generate prompts from code and other files - think templating language with extra flags). And still for some tasks my prompt size reaches 10k tokens, where I find the output quality not good enough

adastra22

17 days ago

1 reply

I'm working with Anthropic models, and my combined system prompt is already 22k. It's a big project, lots of skill and agent definitions. Seems to work just fine until it reaches 60k - 70k tokens.

leroman

17 days ago

Interesting, thanks!

RobertoG

17 days ago

1 reply

[delayed]

bambax

17 days ago

But the SQLite test suite is proprietary (and it seems nobody ever tried to buy it).

jackfranklyn

16 days ago

1 reply

The oracle approach mentioned downthread is what makes this practical even without conformance test suites. Run the original, capture input/output pairs, use those as your tests. Property-based testing tools like Hypothesis can generate thousands of edge cases automatically.

For solo devs this changes the calculus entirely. Supporting multiple languages used to mean maintaining multiple codebases - now you can treat the original as canonical and regenerate ports as needed. The test suite becomes the actual artifact you maintain.

solvedd

16 days ago

I wonder if I could actually build an app entirely from a set of working acceptance tests...

deanc

16 days ago

1 reply

I’m sorry but how on earth were you able to get this level of usage out of a 20 dollar per month plan? Am I totally missing something?

https://developers.openai.com/codex/pricing#what-are-the-usa...

16 days ago

I was surprised by that too.

ChatGPT Plus with Codex CLI provides "45-225 local messages per 5 hour period".

The https://chatgpt.com/codex/settings/usage is pretty useless right now - it shows that I used "100%" on December 14th - the day I ran this experiment - which presumably matches that Codex stopped working at 6:30pm but then started again when the 5 hour window reset at 7:14pm.

Running this command:

  npx @ccusage/codex@latest

Reports these numbers for December 14th:

  │ Date         │ Models                                  │       Input │     Output │  Reasoning │   Cache Read │  Total Tokens │  Cost (USD) │
  │ Dec 14, 2025 │ - gpt-5.2                               │   2,988,774 │  1,271,970 │    908,526 │  194,963,328 │   199,224,072 │      $57.16 │

p0w3n3d

16 days ago

1 reply

   burned through 1,464,295 input tokens, 97,122,176 cached input tokens and 625,563 output tokens

How much did it cost?

dimava

16 days ago

$30 in API pricing > I was running this against my $20/month ChatGPT Plus account

vessenes

17 days ago

Couple quick points from the read - cool, btw! It's not trivial that Simon poked the LLM to get something up and running and working ASAP - that's always been a good engineering behavior in my opinion - building on a working core - but I have found it's extra helpful/needed when it comes to LLM coding - this brings the compiler and tests "in the loop" for the LLM, and helps keep it on the rails - otherwise you may find you get 1,000s of lines of code that don't work or are just sort of a goose chase, or all gilding of lilies.

As is mentioned in the comments, I think the real story here is two fold - one, we're getting longer uninterrupted productive work out of frontier models - yay - and a formal test suite has just gotten vastly more useful in the last few months. I'd love to see more of these made.