I Failed to Recreate the 1996 Space Jam Website with Claude
Key topics
The quest to recreate the nostalgic 1996 Space Jam website using AI model Claude has sparked a lively debate about the limitations of modern AI. The original poster, thecr0w, was stumped by Claude's inability to accurately replicate the site's table-based layout, prompting commenters to chime in with insights on the challenges of working with early web design techniques and the constraints of AI models. Some suggested that Claude's struggles stemmed from its lack of training data on outdated specs and its weakness in processing visual information, with one commenter noting that the model's decomposition of images into semantic vector spaces destroys pixel-level information. As the discussion unfolded, a consensus emerged that understanding the inner workings of AI models is crucial for effective "prompt engineering," and thecr0w's experiment has shed light on the complexities of working with these tools.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 36m after posting
- Peak period: 117 comments (0-6h)
- Avg / period: 22.9 comments
- Based on 160 loaded comments
Key moments
- 01. Story posted: Dec 7, 2025 at 12:18 PM EST (26 days ago)
- 02. First comment: Dec 7, 2025 at 12:54 PM EST (36m after posting)
- 03. Peak activity: 117 comments in the 0-6h window, the hottest stretch of the conversation
- 04. Latest activity: Dec 9, 2025 at 4:27 PM EST (24 days ago)
In 1996, we had only CSS1. Ask it to use tables to do this, perhaps.
Nonetheless, here is a link to a list of the specs you asked for: https://www.w3.org/Style/History/Overview.en.html
Websites that started before 2000 tend to stick around and are comparatively well archived.
What does model training have to do with SEO? It's outright detrimental to it for the most part.
This is the spec. I can guarantee you it's in every large language model that's actually large:
https://www.rfc-editor.org/rfc/rfc1866
https://www.w3.org/TR/2018/SPSD-html32-20180315/
Why even bring up CSS? This was before that.
You can also just work with the knowledge of 1996.
SELFHTML exists; it makes it pretty easy to limit the scope of the authoring language to a given HTML version and target browser. Your LLM should have no problem with German.
https://wiki.selfhtml.org/wiki/Museum
some other docs:
https://rauterberg.employee.id.tue.nl/publications/WWW-Desig...
Just spec the year, target browser, and target standard, and you will get something better than just asking for visual accuracy.
The OG prompt is simply poor and loose:
"Your job is to recreate the landing page as faithfully as possible, matching the screenshot exactly."
I tried your suggestion and also tried giving it various more general versions of the limitations presented by earlier generations.
Claude's instinct initially was actually to limit itself to less modern web standards.
Unfortunately, nothing got those planets to be in the right place.
The right way to handle this is not to hand it grids and whatnot, which all get blown away by the embedding encoding, but to instruct it to build image processing tools of its own and to mandate their use in constructing the required coordinates and computing the eccentricity of the pattern, etc., in code and language space. Done this way, you can even get it to write assertive tests comparing the original layout to the final one across various image processing metrics. This would assuredly work better, take far less time, be more stable on iteration, and fit neatly into how a multimodal agentic programming tool actually functions.
But this isn't hugely different from your vision. You don't see the pixel grid either. You have to use tools to measure things. You have the ability to iteratively interact with the image over time, perhaps by counting grid lines, but the LLM does not - it's a one-shot inference against this highly transformed image. Models have gotten better at complex visual tasks, including some types of counting, but they can't examine the image in any analytical way, or even in its original representation. It's just not possible.
It can, however, make tools that can. It's very good at working with PIL and other image processing libraries, or even writing image processing code de novo, and then using those to ground itself. Likewise, it cannot do math itself, but it can write a calculator that does highly complex mathematics on its behalf.
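To make that concrete, here is a minimal sketch of such a self-written measuring tool (the planet coordinates are invented for illustration, and the least-squares circle fit is my own choice of method, not something from the thread):

import numpy as np

def fit_circle(points):
    """Fit a circle to (x, y) points via least squares; returns (cx, cy, r)."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Solve x^2 + y^2 = 2*cx*x + 2*cy*y + c for cx, cy, c.
    A = np.column_stack([2 * x, 2 * y, np.ones(len(pts))])
    b = x ** 2 + y ** 2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return cx, cy, np.sqrt(c + cx ** 2 + cy ** 2)

# Made-up planet centers "measured" from the screenshot.
centers = [(512, 180), (760, 290), (790, 560), (512, 700), (235, 560), (260, 290)]
cx, cy, r = fit_circle(centers)
for px, py in centers:
    dev = np.hypot(px - cx, py - cy) - r
    print(f"planet at ({px}, {py}): {dev:+.1f}px off the fitted orbit")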
That's not how to successfully use LLMs for coding, in my experience. It is, however, perhaps a good demonstration of Claude's poor spatial reasoning skills. Another good demonstration is twitch.tv/ClaudePlaysPokemon, where Claude has been failing to beat Pokémon for months now.
https://github.com/anthropics/claude-code/blob/main/plugins/...
> That's not how to successfully use LLMs for coding, in my experience.
Yeah agree. I think I was just a little surprised it couldn't one-shot given the simplicity.
This article is a bit negative. Claude gets close; it just can't get the order right, which is something OP can manually fix.
I prefer GitHub Copilot because it's cheaper and integrates with GitHub directly. I'll have times where it'll get it right, and times when I have to try 3 or 4 times.
This is also fairly contrived, you know? Rebuilding HTML from a screenshot isn't a realistic task, because of course if I have the website loaded I can just download the HTML.
???
This is precisely the workflow when a traditional graphic designer mocks up a web/app design, which still happens all the time.
They sketch a design in something like Photoshop or Illustrator, because they're fluent in these tools and many have been using them for decades, and somebody else is tasked with figuring out how to slice and encode that design in the target interactive tech (HTML+CSS, SwiftUI, QT, etc).
Large companies, design agencies, and consultancies with tech-first design teams have a different workflow, because they intentionally staff graphic designers with a tighter specialization/preparedness, but that's a much smaller share of the web and software development space than you may think.
There's nothing contrived at all about this test and it's a really great demonstration of how tools like Claude don't take naturally to this important task yet.
I've tried these tools a number of times and spent a good bit of effort on learning to maximize the return. By the time you know what prompt to write you've solved the problem yourself.
what if the LLM gets something wrong that the operator (a junior dev, perhaps) doesn't even know is wrong? that's the main issue: if it fails here, it will fail with other things, in not-so-obvious ways.
the same thing that always happens if a dev gets something wrong without even knowing it's wrong - either code review/QA catches it, or the user does, and a ticket is created
>if it fails here, it will fail with other things, in not-so-obvious ways.
is infallibility a realistic expectation of a software tool or its operator?
https://news.ycombinator.com/item?id=46185957
As the post shows, you can't trust them when they think they solved something, but you also can't trust them when they think they haven't[0]. These things are optimized for human preference, which ultimately means they're optimized to hide mistakes. After all, we can't penalize mistakes in training when we don't know the mistakes are mistakes. The de facto bias is that we prefer mistakes we don't know are mistakes over mistakes we do[1].
Personally, I think a well-designed tool makes errors obvious. As a tool user, that's what I want, and it's what makes tool use effective. But LLMs flip this on its head, making errors difficult to detect. Which is incredibly problematic.
[0] I frequently see this in a thing it thinks is a problem but actually isn't, which makes steering more difficult.
[1] Yes, conceptually unknown unknowns are worse. But you can't measure unknown unknowns; they are indistinguishable from knowns. So you always optimize for deception (along with other things) when you don't have clear objective truths (which is most situations).
If the tool needs you to check up on it and fix its work, it's a bad tool.
It’s not that binary imo. It can still be extremely useful and save a ton of time if it does 90% of the work and you fix the last 10%. Hardly a bad tool.
It's only a bad tool if you spend more time fixing the results than you would have spent building it yourself, which sometimes used to be the case for LLMs but is happening less and less as they get more capable.
I agree that there are domains for which 90% good is very, very useful. But 99% isn't always better. In some limited domains, it's actually worse.
Humans don't get it right 100% or the time.
This isn't about whether AI is statistically safer; it's about the user experience of AI: if we can provide the same guidance without lulling a human backup into complacency, we will have an excellent augmented capability.
All tools have failure modes and truthfully you always have to check the tool's work (which is your work). But being a master craftsman is knowing all the nuances behind your tools, where they work, and more importantly where they don't work.
That said, I think that also highlights the issue with LLMs and most AI. Their failure modes are inconsistent and difficult to verify. Even with agents and unit tests, you still have to verify, and it isn't easy. Most software bugs come from subtle things, which often compound, and those two things, nuance and compounding effects, are the greatest weaknesses of LLMs.
So I still think they aren't great tools, but I do think they can be useful. That said, people commonly use them well outside the bounds of where they are generally useful. It'll be fine a lot of the time, but the problem is that it's like an alcohol fire[0]: you don't know what's on fire because the flame is invisible. And isn't that, after all, the hardest part of programming? Figuring out where the fire is?
[0] https://www.youtube.com/watch?v=5zpLOn-KJSE
It's remarkable how steadily our expectations have been creeping upward.
What's with the panicked pleas and need to preserve the site, assuming locally...?
It seems to me the post is about how Claude fails to recreate a very simple website from 1996.
- "First, calculate the orbital radius. To do this accurately, measure the average diameter of each planet, p, and the average distance from the center of the image to the outer edge of the planets, x, and calculate the orbital radius r = x - p"
- "Next, write a unit test script that we will run that reads the rendered page and confirms that each planet is on the orbital radius. If a planet is not, output the difference you must shift it by to make the test pass. Use this feedback until all planets are perfectly aligned."
That said, I love this project. haha
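A minimal sketch of the unit test that second prompt describes (the image center, radius, tolerance, and planet positions are all invented here; a real run would take them from the measuring step):

import math

IMAGE_CENTER = (400, 300)   # assumed center of the rendered page
ORBITAL_RADIUS = 220.0      # r = x - p, from the measuring step above
TOLERANCE_PX = 2.0

planet_centers = {
    "jupiter": (400, 80),
    "mars": (590, 190),
    "saturn": (620, 410),   # deliberately misplaced for the demo
}

def test_planets_on_orbit():
    cx, cy = IMAGE_CENTER
    for name, (px, py) in planet_centers.items():
        error = math.hypot(px - cx, py - cy) - ORBITAL_RADIUS
        # The failure message is the feedback the prompt tells Claude to apply.
        assert abs(error) <= TOLERANCE_PX, (
            f"{name} is {error:+.1f}px off the orbit; shift it "
            f"{abs(error):.1f}px {'inward' if error > 0 else 'outward'}"
        )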
The loop here, imo, refers to the feedback loop. And it's true that ideally there should be no human involvement there. A tight feedback loop is as important for LLMs as it is for humans. The more automated you make it, the better.
One of the keys to being productive with LLMs is learning how to recognize when it's going to take much more effort to babysit the LLM into getting the right result as opposed to simply doing the work yourself.
Wrt the unit test script: let's take Claude out of the equation; how would you design the unit test? I kept running into either Claude or some library being unable to consistently identify planet vs. non-planet, which was hindering Claude's ability to make decisions based on fine detail or "pixel coordinates", if that makes sense.
But if that could be done deterministically, I totally agree this is the way to go. I'll put some more time into it over the next couple weeks.
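For what it's worth, here's one way the planet-vs-non-planet step could be made deterministic, taking the model out of the loop entirely (the threshold and size bounds are guesses, and the connected-components approach is my suggestion, not something validated against the actual screenshot):

import numpy as np
from PIL import Image
from scipy import ndimage

img = np.asarray(Image.open("spacejam.png").convert("L"))  # hypothetical filename
mask = img > 40                      # anything brighter than the starfield background
labels, n = ndimage.label(mask)      # connected-component labeling
planets = []
for i in range(1, n + 1):
    ys, xs = np.nonzero(labels == i)
    if 200 < xs.size < 20000:        # drop tiny stars and the huge center logo
        planets.append((float(xs.mean()), float(ys.mean())))  # blob centroid
print(f"found {len(planets)} planet-sized blobs: {planets}")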
> What he produces
I feel like personifying LLMs more than they currently warrant is a mistake people make (though humans always do this); they're not entities, and they don't know anything. If you treat them as too human, you might eventually fool yourself a little too much.
Aside from that point: if you are reading this and making people do a project as part of the hiring process, you should absolutely be paying them for their time (even a token amount).
I don't doubt that it is possible eventually, but I haven't had much luck.
Something that seemed to help was drawing a multi-coloured transparent chequerboard: if the AI knows the positions of the grid colours, it can pick out some relative information from the grid.
I have also not had luck with any kind of iterative/guess-and-check approach. I assume the models are all trained to one-shot this kind of thing and struggle to generalize to what are effectively relative measurements.
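In case it's useful to anyone trying the same trick, a minimal sketch of the overlay (the cell size, palette, and alpha are arbitrary choices, and the filenames are placeholders):

from PIL import Image, ImageDraw

CELL = 64
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]

base = Image.open("screenshot.png").convert("RGBA")
overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
for row in range(0, base.height, CELL):
    for col in range(0, base.width, CELL):
        color = PALETTE[(row // CELL + col // CELL) % len(PALETTE)]
        # Low alpha keeps the underlying page visible through the grid.
        draw.rectangle([col, row, col + CELL - 1, row + CELL - 1], fill=color + (48,))
Image.alpha_composite(base, overlay).save("screenshot_grid.png")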
> After these zoom attempts, I didn't have any new moves left. I was being evicted. The bank repo'd my car. So I wrapped it there.
https://knowyourmeme.com/memes/my-father-in-law-is-a-builder...
This was soon moved to a static table layout with higher quality images: https://web.archive.org/web/19970412180040/http://www.spacej...
You say that as if that’s uncommon.
I am lucky that I don't depend on this for work at a corporation. I'd be pulling my hair out if some boss said "You are going to be doing 8 times as much work using our corporate AI from now on."
Why not use wget to mirror the website? Unless you're being sarcastic.
$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Source: https://superuser.com/questions/970323/using-wget-to-copy-we...
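For reference, what those flags do: --mirror turns on recursion and timestamping; --convert-links rewrites links so the copy browses locally; --adjust-extension appends .html where the server didn't; --page-requisites fetches the images, CSS, and scripts each page needs; and --no-parent stops wget from ascending above the starting directory.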
And
> I'm an engineering manager
I can't tell if this is an intentional or unintentional satire of the current state of AI mandates from management.
- read an .icc file from disk
- parsed the file and extracted the VCGT (video card gamma table)
- wrote the VCGT to the video card for a specified display via amdgpu driver APIs
The only thing I had to fix was the ICC parsing, where it would parse header strings in the wrong byte-order (they are big-endian).
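For anyone curious about the byte-order detail: ICC profiles are big-endian throughout, so every multi-byte field needs a ">" struct format. A rough sketch of locating the vcgt tag (offsets follow the ICC spec's 128-byte header and tag table; the filename is a placeholder):

import struct

with open("monitor.icc", "rb") as f:
    data = f.read()

# Header: 4-byte profile size at offset 0, 'acsp' signature at offset 36.
size, = struct.unpack_from(">I", data, 0)
assert data[36:40] == b"acsp", "not an ICC profile"

# Tag table at offset 128: a count, then 12-byte (signature, offset, size) rows.
count, = struct.unpack_from(">I", data, 128)
for i in range(count):
    sig, off, length = struct.unpack_from(">4sII", data, 132 + 12 * i)
    if sig == b"vcgt":
        print(f"vcgt tag: {length} bytes at offset {off}")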
Because the rendered output (pixels, not HTML/CSS) is not fed in as training data. You will find tons of UI snippets and questions, but they rarely include screenshots. And when they do, the screenshots are not scraped.
I don’t think it’s fair to call someone who used Stack Overflow to find a similar answer with samples of code to copy to their project an asshole.
Stack Overflow offers access to other peoples’ work, and developers combined those snippets and patterns into their own projects. I suspect attribution is low.
How do you describe the “reckless” use of information?
Or are you saying that every piece of code you ever wrote was 100% original and not adapted from any previous codebase you ever worked in or any book / reference you ever read?
There are court cases where this is being addressed currently, and if you think about how LLMs operate, a reasonable person typically sees that it looks an awful lot like plagiarism.
If you want to claim it is not plagiarism, that requires a good argument, because it is unclear that LLMs can produce novelty, since they're literally trying to recreate the input data as faithfully as possible.
> since they're literally trying to recreate the input data as faithfully as possible.
Is that how they are able to produce unique code based on libraries that didn't exist in their training set? Or that they themselves wrote? Is that how you can give them the documentation for an API and it writes code that uses it? Your desire to make LLMs "not special" has made you completely blind to reality. Come back to us.
The LLM is trained on a corpus of text, and when it is given a sequence of tokens, it finds the set of tokens that, when one of them is appended, make the resulting sequence most like the text in that corpus.
If it is given a sequence of tokens that is unlike anything in its corpus, all bets are off and it produces garbage, just like machine learning models in general: if the input is outside the learned distribution, quality goes downhill fast.
The fact that they've added a Monte Carlo feature to the sequence generation, which makes it sometimes select a token that is slightly less like the most exact match in the corpus, does not change this.
LLMs are fuzzy lookup tables for existing text, that hallucinate text for out-of-distribution queries.
This is LLM 101.
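For concreteness, the "Monte Carlo feature" being described is temperature sampling over the model's output distribution. A toy sketch (the logits are made up; a real model emits one score per vocabulary entry):

import numpy as np

def sample_token(logits, temperature=0.8, rng=np.random.default_rng()):
    scaled = np.asarray(logits, dtype=float) / temperature  # T < 1 sharpens the distribution
    probs = np.exp(scaled - scaled.max())                   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)                  # draw a token index

logits = [2.1, 1.9, 0.3, -1.0]   # toy scores for a 4-token vocabulary
print(sample_token(logits))      # usually 0 or 1, occasionally 2 or 3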
If the LLM were trained only on documentation, there would be no problem: it would generate a design, look at the documentation, understand the semantics of both, and translate the design to code using the documentation as a guide.
But that's not how it works. It has open source repositories in its corpus, which it then recreates by chaining together examples in the stochastic-parrot method I described above.
You have the whole burden of proof thing backwards.
Do you have papers to back this up? That was also my reaction when I saw some crazily accurate comments on a vibe-coded piece of code, but I couldn't prove it, and thinking about it now, I think my intuition was wrong (i.e., LLMs do produce original complex code).
If that does not hold, then the moment you introduce AI you cap its capabilities, unless humans continue to create original works to feed it. The conclusion - to me, at least - is that these pieces of software regurgitate their inputs; they are effectively whitewashing plagiarism, or, alternatively, their ability to generate new content is capped by some arbitrary limit relative to the inputs.
There’s something in what you’re saying, but until you refine it to something actually true, it’s just more slop.
We all stand on the shoulders of giants and learn by looking at others’ solutions.
To me that's proof positive they know their output is mangled inputs: they need that originality, otherwise they will sooner or later drown in nonsense and noise. It's essentially a very complex game of Chinese whispers.
If so, I'm not sure it's a useful framing.
When I apply that machine (with its giant pool of pirated knowledge) _to my inputs and context_, I can get results applicable to my modestly novel situation, which is not in the training data. Perhaps the output is garbage. Naturally, if my situation is way out of distribution, I cannot expect very good results.
But I often don't care if the results are garbage some (or even most!) of the time if I have a way to ground-truth whether they are useful to me. This might be via running a compile, a test suite, a theorem prover or mk1 eyeball. Of course the name of the game is to get agents to do this themselves and this is now fairly standard practice.
¹https://chatgpt.com/share/69367c7a-8258-8009-877c-b44b267a35...
It does this all the time, but as often as not then outputs nonsense again, just different nonsense, and if you keep it running long enough it starts repeating previous errors (presumably because some sliding window is exhausted).
“Reference the original uploaded image. Between each image in the clock face, create lines to each other image. Measure each line. Now follow that same process on the app we’ve created, and adjust the locations of each image until all measurements align exactly.”
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
296 more comments available on Hacker News