Speed-Coding for the 6502 – a Simple Example

Posted4 months agoActive4 months ago

mmphosis

53 points

13 comments

colino.netTech Discussionstory

informativepositive

Debate

0/100

CPU DesignCoding ParadigmsRetrocomputing

Key topics

CPU Design

Coding Paradigms

Retrocomputing

Diving into the world of 6502 microprocessor optimization, a recent article showcased a simple example of "speed-coding" that sparked a lively discussion. Commenters debated the merits of pre-calculating lookup tables versus building them on-demand as a cache, with some pointing out that the article actually explored the former, while others suggested the latter could be a valuable next step. As one commenter noted, the table is used to scale X/Y coordinates, making every entry necessary, while another observed that the technique of using lookup tables to trade off memory for runtime is a timeless optimization trick. The conversation highlighted the creative problem-solving that went into working with the 6502's constraints, with some sharing their own experiences squeezing complex projects into tight memory limits.

Snapshot generated from the HN discussion

Discussion Activity

Moderate engagement

First comment

22m

Peak period

0-6h

Avg / period

2.6

Comment distribution13 data points

Loading chart...

Based on 13 loaded comments

Key moments

01Story posted
Aug 28, 2025 at 5:24 PM EDT
4 months ago
Step 01
02First comment
Aug 28, 2025 at 5:46 PM EDT
22m after posting
Step 02
03Peak activity
7 comments in 0-6h
Hottest window of the conversation
Step 03
04Latest activity
Sep 1, 2025 at 10:45 PM EDT
4 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (13 comments)

Showing 13 comments

rbanffy

4 months ago

1 reply

What a delightful short read.

They could go one step further and calculate the table as needed and use it as a cache.

For an single image scaling it might get a little bit better.

spc476

4 months ago

1 reply

If you read the entire article, they do that at the end of the article.

anyfoo

4 months ago

2 replies

Not quite. They build the entire table upfront, whether any individual entry is needed or not.

Making it an on-demand cache instead is a neat next step. Whether it helps or hurts depends on the actual input: If the input image uses every pixel value anyway, the additional overhead of checking whether the table entry is computed is just unnecessary extra with no value.

But if a typical image only uses a few pixel values, then the amortized cost of just calculating the few needed table entries may very well be significantly below the cost of the current algorithm.

If images are somewhere in between, or their characteristics not well known, then simply trying out both approaches with typical input data is a good approach!

Unless you’re perfectly happy with 0.2 seconds, for example because the runtime of some other parts take so long that dwarfs those 0.2s, then why bother.

egypturnash

4 months ago

1 reply

Even if you could get the "is this cell of the table calculated?" check down to one cycle (which you can't, 6502 branch instructions take 2-4 cycles, plus a few more operations to check the table calculation status), it's still gonna add one cycle to a hot loop that's done 43k times. Pre-calcing the entire table only burns 6k cycles, and can probably be done while displaying some transitional effect that requires very little CPU.

If we presume this scaling operation will be done more than one time (which it probably will be if it's getting this level of optimization) then it gets even worse.

anyfoo

4 months ago

1 reply

Good analysis. I guess I lost my cycle-counting foo, because on modern CPUs cycle-counting is a) infeasible on a superscalar, speculative, out-of-order CPU, and b) even if you did manage to do it, it hardly matters when a CPU cycle is less than a nanosecond, but any memory access that slips through the caches means many orders of magnitude higher latency.

Of course none of that applies here, but it colors the way you think about things…

egypturnash

4 months ago

meanwhile the 6502 is all "cache? what's cache", every 6502 optimization is basically a pessimization for modern CPUs.

colinlm

4 months ago

1 reply

Every entry will be used, it's to scale X/Y coordinates :) And indeed, those 0.2s are dwarfed by the rest of the algorithm (10 seconds)

rbanffy

4 months ago

Good point. I was thinking of pixel values.

Joker_vD

4 months ago

1 reply

I wonder if building this table can be sped up by noticing a recurring pattern?

    x    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 ... 254 255
    f(x) 0  0  1  2  3  3  4  5  6  6  7  8  9  9 10 11 12 12 ... 190 191

So something like

    sta table,y
    iny
    sta table,y
    adc $0
    iny    
    sta table,y
    adc $0
    iny    
    sta table,y
    adc $0
    iny

used as the loop body that should be repeated 64 times, should work. Will it take less than 6000 cycles total?

FatalLogic

4 months ago

I didn't check your code worked.. just copied it and ran the inner part of the loop once, but according to https://www.masswerk.at/6502/

It's about 2x faster. Your code uses 44 CPU cycles x 64

Edit: plus a branch instruction, maybe that adds 3 cycles x 64 I guess

anyfoo

4 months ago

Very neat. Breaking up multiplications and divisions into bit shifts, and lookup table to trade off memory for runtime, are indeed nothing new to engineers working on the low level, but this paints a very pretty picture of how this looks in practice.

JKCalhoun

4 months ago

Games that needed trig in the 90's often used SIN/COS lookup-tables. To keep memory down:

1) you only need the first 1/4 of the Sine table since the remaining 3/4 are either the first 1/4 in reverse and/or with the sign flipped.

2) and of course Sine can also be used as a Cosine lookup if you add pi/2 radians to the cosine angle (wrapping around of course).

3) to avoid the size needed for a table of floats you can of course use integers (scaled by some factor) or fixed-point values.

4) and simple interpolation would get you seemingly more precision.

(Combining all the above was a bit gross so documentation and a good SPI helped.)

lttlrck

4 months ago

I wrote a token ring networking simulator with error simulation (token dropping etc) on a 6502. 128 bytes of RAM. It was a squeeze.

I used a Hewlett Packard development system with a 12" hard disk.

Good times.

View full discussion on Hacker News

ID: 45057209Type: storyLast synced: 11/20/2025, 5:45:28 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN