Optimizing a 6502 Image Decoder, From 70 Minutes to 1 Minute
Posted 3 months ago · Active 3 months ago
colino.net · Tech · story
calm · positive
Debate: 20/100
Key topics
Retrocomputing
Optimization
Image Processing
The author optimized a 6502 image decoder, reducing its processing time from 70 minutes to 1 minute, sparking discussion on the value of optimization and the nostalgia of working with old hardware.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 2h after posting · Peak period: 23 comments in 0-12h · Avg per period: 9.7
Comment distribution: 29 data points, based on 29 loaded comments
Key moments
1. Story posted: Sep 29, 2025 at 6:11 AM EDT (3 months ago)
2. First comment: Sep 29, 2025 at 7:53 AM EDT (2h after posting)
3. Peak activity: 23 comments in 0-12h, the hottest window of the conversation
4. Latest activity: Oct 3, 2025 at 5:00 PM EDT (3 months ago)
ID: 45412022 · Type: story · Last synced: 11/20/2025, 3:47:06 PM
Refreshing at times, isn't it?
As for the abstraction, when does it get to the point where the compiler can't undo it? And at what point does one hit something that simply cannot be done at that level?
The early 8-bit systems were so constrained in everything from memory to registers, instruction set, and clock speed, that using a high level language wasn't an option if you were trying to optimize performance or squeeze a lot of functionality into available memory. An 8-bit system would have a 64KB address space, but maybe only 16-32KB of RAM, with the rest used by the "OS" and mapped to the display, etc.
The 6502 was especially impoverished, having only three 8-bit registers and a very minimalistic instruction set. Writing performant software for it depended heavily on using "zero-page" memory (a special addressing mode for the first 256 bytes of memory) to hold variables, rather than passing stack-based parameters to functions, etc. It was really a completely different style of programming and mindset from using a high-level language - not about language abstraction, but a painful awareness of the bare metal you were running on all the time.
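As a small illustration of why zero page mattered (a sketch with made-up names and addresses, not code from the article): a load is both shorter and faster when the variable lives in page zero, and the pointer-based addressing modes only work through zero page at all.

    count  = $80          ; hypothetical variable kept in zero page
    bigval = $0300        ; the same kind of variable in ordinary RAM
    srcptr = $FA          ; hypothetical 16-bit pointer; it must live in zero page

           lda count      ; 2 bytes, 3 cycles (zero-page addressing)
           lda bigval     ; 3 bytes, 4 cycles (absolute addressing)

           ldy #0         ; index into the buffer
           lda (srcptr),y ; indirect indexed: only available through a zero-page pointer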
This doesn’t require a formal education. I was self-taught long before studying CS institutionally. And per the article I am super grateful to the 6502 for being the platform that I learned from.
Simply put, the vast majority of developers will never need this information, nor be so resource-constrained that they'll need to spend their time understanding the issue more deeply.
In the mid-2000s, intro CS classes started focusing on Java. These days Python occupies a similar niche. It's not until students take an operating systems class, possibly by junior year, that they might be confronted with manual memory management.
> I first calculated e to 47 K bytes of precision in January 1978. The program ran for 4.5 days, and the binary result was saved on cassette tape. Because I had no way of detecting lost-bit errors on the Apple (16 K-byte dynamic memory circuits were new items back then), a second result, matching the first, was required. Only then would I have enough confidence in the binary result to print it in decimal. Before I could rerun the 4.5 day program successfully, other projects at Apple, principally the floppy-disk controller, forced me to deposit the project in the bottom drawer. This article, already begun, was postponed along with it. Two years later, in March 1980, I pulled the e project out of the drawer and reran it, obtaining the same results. As usual (for some of us), writing the magazine article consumed more time than that spent meeting the technical challenges.
See page 392 of https://archive.org/details/byte-magazine-1981-06.
Edit: this is the latency project I was thinking about https://danluu.com/input-lag/
If you put a bit of load on the modern hardware, things get dramatically worse. As if there is some excuse for it.
I had this thought long ago that the picture on the monitor could be stitched together from many different sources. You would have a box someplace on the screen with an application or widget rendered in it by physically isolated hardware. An input area needs a font and a color scheme. The text should still be sent to the central processor and/or to the active process, but it can be sent to the display stitcher simultaneously.
You could even type text and have it encrypted without the rest of the computer learning what the words say.
I looked at and clicked around KolibriOS one time; everything was so responsive it made me slightly angry.
Even the small Apple II screen takes 7.5 kilobytes of RAM. Reading https://imapenguin.com/2022/06/how-fast-can-a-6502-transfer-..., just writing all of that to the screen takes a tenth of a second, for just over 50,000 pixels, and that’s ignoring the idiosyncratic video memory layout of the Apple II.
Going below ten seconds for decompressing that would mean you must produce 5 output pixels every millisecond, which means you have about 200 CPU cycles per pixel. On a 6502, that’s less than 100 instructions.
That makes me doubt it can get under 10 seconds.
⇒ if you want to get down to seconds, I think you’ll have to drop even more image data than this does, and do that fast.
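To get a feel for how tight roughly 200 cycles per pixel is, here is a purely hypothetical inner-loop skeleton (not from the article; srcptr, dstptr and lookup are invented names, and page-crossing penalties are ignored) with per-instruction cycle counts for a 1 MHz 6502. Even this bare fetch/translate/store loop, with no real decoding math in it, spends about 22 cycles per pixel on bookkeeping:

    pixel:  lda (srcptr),y  ; 5  fetch an input byte (srcptr is a zero-page pointer)
            tax             ; 2  use it as a table index
            lda lookup,x    ; 4  translate through a 256-byte table
            sta (dstptr),y  ; 6  store the output pixel
            iny             ; 2  step to the next pixel in this page
            bne pixel       ; 3  taken branch (2 on fall-through)
                            ; ~22 cycles of bookkeeping before any real decode work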
> I’ll be using the revised version as I think it’s a well-established example of doing a real-world block transfer. Sure there may be faster ways, but this is a realistic way, which is what we’re going for.
Something unrolled like the sketch below isn't even the fastest solution, but 3 out of 5 instructions are gone and I think the two remaining are faster. The transfer in the book is of course really practical, while this one is already almost unworkable. You can do worse, though: for the truly insane solution you would have to consider whether you even have to read the data. Maybe you can present the image as-is and modify it... or worse... turn it into code...
(Decoding and dithering are done in two passes for memory reasons and floppy-disk space reasons, but it brings auto-levelling.)
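A sketch of the kind of fully unrolled copy that comment gestures at (a reconstruction, not the commenter's original snippet; src and dst are placeholder labels). Each byte costs just two instructions, LDA and STA with absolute addressing at 4 cycles apiece, and the index, compare, and branch of the loop version disappear entirely:

    lda src+0   ; 4 cycles
    sta dst+0   ; 4 cycles
    lda src+1   ; 4 cycles
    sta dst+1   ; 4 cycles
    lda src+2   ; 4 cycles
    sta dst+2   ; 4 cycles
    ; ...and so on for every byte of the block, at the cost of
    ; code size that grows with the size of the transfer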
It's about 450 cycles per pixel for decoding QT100 (1200 for QT150), and 230 cycles per pixel for dithering.
And yet both images contain the exact same amount of information, because all the pixels that have been removed are simply black.
The effect is so pronounced that I wonder whether there wasn't some additional processing between the two images. Compare the window with the sky reflection between the two: in the image without black pixels, it looks distorted and aliased, while in the one with them, it looks pristine.
If only the black pixels have actually been removed (and the result nearest-neighbor scaled), I think the black-pixel grid is a form of dithering, although neither random nor the error-diffusion kind one usually thinks of when hearing "dithering". There is no error being diffused here, and all the added pixels are constant black.
Maybe the black pixels allow the mind to fill in the gaps (essentially shifting the interpolation that was also removed, prior to the black-pixel image, onto our brain). It is known that our brain interpolates and even outright makes up highly non-linear stuff as part of our vision system. A famous example of that is how our blind spot, the point where the optic nerve leaves the retina, is "hidden" from us that way.
The aliasing would "disappear" because we sort of have twice the number of samples (pixels), and therefore twice the Nyquist frequency, except that half of the samples have been made up by our vision system through interpolation. (This is a very simplified and crude way to look at it. Pun unintended.)
But before jumping to such lofty conclusions, I still wonder whether something more happened between the two images...
But that's a good point nonetheless: that 640x480 picture is certainly upscaled somehow, otherwise it would be absolutely tiny on a retina display. That upscaling might well lead to very different results for the two images.
However, when I download the images and open them in Preview, I get the same result. When I zoom in, I do see individual pixels. It looks like it just did nearest neighbor upscaling, so I'm not sure upscaling is at fault here.