Optimizing a 6502 Image Decoder, From 70 Minutes to 1 Minute
Posted 3 months ago · Active 3 months ago
colino.net · Tech · story
calm · positive
Debate: 20/100
Key topics
Retrocomputing
Optimization
Image Processing
The author optimized a 6502 image decoder, reducing its processing time from 70 minutes to 1 minute, sparking discussion on the value of optimization and the nostalgia of working with old hardware.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 2h after posting · Peak period: 23 comments in 0-12h · Avg per period: 9.7
Comment distribution: 29 data points, based on 29 loaded comments
Key moments
1. Story posted: Sep 29, 2025 at 6:11 AM EDT (3 months ago)
2. First comment: Sep 29, 2025 at 7:53 AM EDT (2h after posting)
3. Peak activity: 23 comments in 0-12h, the hottest window of the conversation
4. Latest activity: Oct 3, 2025 at 5:00 PM EDT (3 months ago)
ID: 45412022 · Type: story · Last synced: 11/20/2025, 3:47:06 PM
Refreshing at times, isn't it?
As for the abstraction, when does it get to the point where the compiler can't undo it? And at what point does one hit something that simply cannot be done at that level?
The early 8-bit systems were so constrained in everything from memory to registers, instruction set, and clock speed, that using a high level language wasn't an option if you were trying to optimize performance or squeeze a lot of functionality into available memory. An 8-bit system would have a 64KB address space, but maybe only 16-32KB of RAM, with the rest used by the "OS" and mapped to the display, etc.
The 6502 was especially impoverished, having only three 8-bit registers and a very minimalistic instruction set. Writing performant software for it depended heavily on using "zero-page" memory (a special addressing mode for the first 256 bytes of memory) to hold variables, rather than passing stack-based parameters to functions, etc. It was really a completely different style of programming and mindset from using a high-level language - not about language abstraction, but a painful awareness of the bare metal you were running on all the time.
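As a small illustration of why zero page mattered (a sketch with made-up names and addresses, not code from the article): a load is both shorter and faster when the variable lives in page zero, and the pointer-based addressing modes only work through zero page at all.

    count  = $80          ; hypothetical variable kept in zero page
    bigval = $0300        ; the same kind of variable in ordinary RAM
    srcptr = $FA          ; hypothetical 16-bit pointer; it must live in zero page

           lda count      ; 2 bytes, 3 cycles (zero-page addressing)
           lda bigval     ; 3 bytes, 4 cycles (absolute addressing)

           ldy #0         ; index into the buffer
           lda (srcptr),y ; indirect indexed: only available through a zero-page pointer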
This doesn’t require a formal education. I was self-taught long before studying CS institutionally. And per the article I am super grateful to the 6502 for being the platform that I learned from.
Simply put, the vast majority of developers will never need this information, nor be so resource-constrained that they'll need to spend their time understanding the issue more deeply.
In the mid-2000s, intro CS classes started focusing on Java. These days Python occupies a similar niche. It's not until students take an operating systems class, possibly by junior year, that they might be confronted with manual memory management.
> I first calculated e to 47 K bytes of precision in January 1978. The program ran for 4.5 days, and the binary result was saved on cassette tape. Because I had no way of detecting lost-bit errors on the Apple (16 K-byte dynamic memory circuits were new items back then), a second result, matching the first, was required. Only then would I have enough confidence in the binary result to print it in decimal. Before I could rerun the 4.5 day program successfully, other projects at Apple, principally the floppy-disk controller, forced me to deposit the project in the bottom drawer. This article, already begun, was postponed along with it. Two years later, in March 1980, I pulled the e project out of the drawer and reran it, obtaining the same results. As usual (for some of us), writing the magazine article consumed more time than that spent meeting the technical challenges.
See page 392 of https://archive.org/details/byte-magazine-1981-06.
Edit: this is the latency project I was thinking about https://danluu.com/input-lag/
If you put a bit of load on the modern hardware, things get dramatically worse. As if there is some excuse for it.
I had this thought long ago that the picture on the monitor could be stitched together from many different sources. You would have a box someplace on the screen with an application or widget rendered in it by physically isolated hardware. An input area needs a font and a color scheme. The text should still be sent to the central processor and/or to the active process, but it can be sent to the display stitcher simultaneously.
You could even type text and have it encrypted without the rest of the computer learning what the words say.
I looked at and clicked around KolibriOS one time; everything was so responsive it made me slightly angry.
Even the small Apple II screen takes 7.5 kilobytes of RAM. Reading https://imapenguin.com/2022/06/how-fast-can-a-6502-transfer-..., just writing all of that to the screen takes a tenth of a second, for just over 50,000 pixels, and that’s ignoring the idiosyncratic video memory layout of the Apple II.
Going below ten seconds for decompressing that would mean you must produce 5 output pixels every millisecond, which means you have about 200 CPU cycles per pixel. On a 6502, that’s less than 100 instructions.
That makes me doubt it can get under 10 seconds.
⇒ if you want to get down to seconds, I think you’ll have to drop even more image data than this does, and do that fast.
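To get a feel for how tight roughly 200 cycles per pixel is, here is a purely hypothetical inner-loop skeleton (not from the article; srcptr, dstptr and lookup are invented names, and page-crossing penalties are ignored) with per-instruction cycle counts for a 1 MHz 6502. Even this bare fetch/translate/store loop, with no real decoding math in it, spends about 22 cycles per pixel on bookkeeping:

    pixel:  lda (srcptr),y  ; 5  fetch an input byte (srcptr is a zero-page pointer)
            tax             ; 2  use it as a table index
            lda lookup,x    ; 4  translate through a 256-byte table
            sta (dstptr),y  ; 6  store the output pixel
            iny             ; 2  step to the next pixel in this page
            bne pixel       ; 3  taken branch (2 on fall-through)
                            ; ~22 cycles of bookkeeping before any real decode work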
> I’ll be using the revised version as I think it’s a well-established example of doing a real-world block transfer. Sure there may be faster ways, but this is a realistic way, which is what we’re going for.
Something unrolled like the sketch below isn't even the fastest solution, but 3 out of 5 instructions are gone and I think the two remaining are faster. The transfer in the book is of course really practical, while this one is already almost unworkable. You can do worse, though: for the truly insane solution you would have to consider whether you even have to read the data. Maybe you can present the image as-is and modify it... or worse... turn it into code...
(Decoding and dithering are done in two passes for memory reasons and floppy-disk space reasons, but it brings auto-levelling.)
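A sketch of the kind of fully unrolled copy that comment gestures at (a reconstruction, not the commenter's original snippet; src and dst are placeholder labels). Each byte costs just two instructions, LDA and STA with absolute addressing at 4 cycles apiece, and the index, compare, and branch of the loop version disappear entirely:

    lda src+0   ; 4 cycles
    sta dst+0   ; 4 cycles
    lda src+1   ; 4 cycles
    sta dst+1   ; 4 cycles
    lda src+2   ; 4 cycles
    sta dst+2   ; 4 cycles
    ; ...and so on for every byte of the block, at the cost of
    ; code size that grows with the size of the transfer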
It's about 450 cycles per pixel for decoding QT100 (1200 for QT150), and 230 cycles per pixel for dithering.
And yet both images contain the exact same amount of information, because all the pixels that have been removed are simply black.
The effect is so pronounced that I wonder whether there wasn't some additional processing between the two images. Compare the window with the sky reflection between the two: in the image without black pixels, it looks distorted and aliased, while in the one with them, it looks pristine.
If only the black pixels have actually been removed (and the result nearest-neighbor scaled), I think the black-pixel grid is a form of dithering, although neither random nor the error-diffusion kind one usually thinks of when hearing "dithering". There is no error being diffused here, and all the added pixels are constant black.
Maybe the black pixels allow the mind to fill in the gaps (essentially shifting the interpolation that was also removed, prior to the black-pixel image, onto our brain). It is known that our brain interpolates and even outright makes up highly non-linear stuff as part of our vision system. A famous example of that is how our blind spot, the point where the optic nerve leaves the retina, is "hidden" from us that way.
The aliasing would "disappear" because we sort of have twice the number of samples (pixels), and therefore twice the Nyquist frequency, except that half of the samples have been made up by our vision system through interpolation. (This is a very simplified and crude way to look at it. Pun unintended.)
But before jumping to such lofty conclusions, I still wonder whether something more happened between the two images...
But that's a good point nonetheless: that 640x480 picture is certainly upscaled somehow, otherwise it would be absolutely tiny on a retina display. That upscaling might well lead to very different results for the two images.
However, when I download the images and open them in Preview, I get the same result. When I zoom in, I do see individual pixels. It looks like it just did nearest neighbor upscaling, so I'm not sure upscaling is at fault here.