The Stack Circuitry of the Intel 8087 Floating Point Chip, Reverse-Engineered
Diving into the intricacies of the Intel 8087 floating-point chip, a recent reverse-engineering effort has shed light on its stack circuitry, sparking a lively discussion among tech enthusiasts. Commenters are weighing in on the reasoning behind the chip's unusual stack architecture, with some pointing to the influence of William Kahan, who had previously worked on HP scientific calculators, and others citing the limitations of instruction encoding. The conversation reveals a mix of admiration for the chip's innovative design and criticism of its stack-based architecture, with some commenters sharing personal anecdotes about working with the 8087. As one commenter notes, the 8087's influence can still be seen in modern computers, making it "one of the most influential chips ever created."
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion; first comment 16m after posting
- Peak period: 47 comments in the first 12 hours
- Average: 11.2 comments per period
- Based on 67 loaded comments

Key moments
- Story posted: Dec 9, 2025 at 1:16 PM EST (24 days ago)
- First comment: Dec 9, 2025 at 1:32 PM EST, 16m after posting
- Peak activity: 47 comments in the first 12 hours, the hottest window of the conversation
- Latest activity: Dec 13, 2025 at 10:43 PM EST (20 days ago)
Then again... they did try to force VLIW and APX on us, so Intel has a history of "interesting" ideas about processor design.
I don't know about other backend guys, but I disliked the stack architecture because it's just incompatible with enregistering variables, register allocation by live range analysis, common subexpression elimination, etc.
The real disadvantage is that the stack operations share the output operand, which introduces a resource dependency between otherwise independent operations and prevents their concurrent execution.
There are hardware workarounds even for this, but the hardware would become much more complex, which is unlikely to be worthwhile.
x86 has a general pattern for encoding operands, the ModR/M byte(s), which gives you either two register operands, or a register and a memory operand. Intel also did this trick that uses one of the register operands for extra opcode bits, at the cost of sacrificing one of the operands.
There are 8 escape opcodes, each followed by a ModR/M byte. If you use two-address instructions, that gives you just 8 instructions you can implement... not enough to do anything useful! But if you're happy with one-address instructions, you get 64 instructions with a register operand and 64 instructions with a memory operand.
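To make the arithmetic concrete, here is a small sketch of my own (NASM syntax, not from the thread) showing how the reg field of the ModR/M byte becomes opcode bits in the x87 escape range D8-DF:

```nasm
bits 16
; Each of the 8 escape opcodes D8-DF is followed by a ModR/M byte.
; In register form (mod=11) the reg field selects one of 8 operations
; and r/m names the stack slot ST(i); in memory form the reg field
; still selects the operation while mod/r/m encode the address.
; 8 escapes x 8 reg values = 64 opcodes in each form.
fadd st1            ; D8 C1: mod=11, reg=000 (FADD), r/m=001 -> ST(1)
fmul st1            ; D8 C9: same escape byte, reg=001 selects FMUL
fadd dword [bx]     ; D8 07: mod=00, reg=000 (FADD), r/m=111 -> [bx]
fld  qword [bx]     ; DD 07: another escape/reg combination, memory form
```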
A stack itself is pretty easy to compile for, until you have to spill a register because there are too many live variables on the stack. Then the spill logic becomes a nightmare. My guess is that the designers were thinking along these lines--organizing the registers in the stack is an efficient way to use the encoding space, and a fairly natural way to write expressions--and didn't have the expertise or the communication to realize that the design came with some edge cases that were painfully sharp to deal with.
When writing in assembly language, the stack architecture is very convenient and it minimizes the program size. That is why most virtual machines used for implementing interpreters for programming languages have been stack-based.
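As an illustration (my own sketch in NASM syntax, assuming a, b, c, d, and r are qword variables), evaluating (a+b)*(c-d) is straight-line stack code with no register-assignment decisions, and every register-form instruction is just two bytes:

```nasm
fld   qword [a]     ; push a
fadd  qword [b]     ; st0 = a+b
fld   qword [c]     ; push c
fsub  qword [d]     ; st0 = c-d, st1 = a+b
fmulp st1           ; pop: st0 = (a+b)*(c-d)
fstp  qword [r]     ; pop the result to memory
```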
The only real disadvantage of the stack architecture is that it prevents the concurrent execution of operations, because all operations have a resource dependency by sharing the stack as output location.
At the time when 8087 was designed, the possibility of implementing parallel execution of instructions in hardware was still very far in the future, so this disadvantage was dismissed.
Replacing the stack by individually addressable registers is not the only possible method for enabling concurrent execution of instructions. There are 2 alternatives that can continue to use a stack architecture.
One can have multiple operand stacks and each instruction must contain a stack number. Then the compiler assigns each chain of dependent operations to one stack and the CPU can execute in parallel as many independent chains of dependent instructions as there are stacks.
The other variant is to also have multiple operand stacks but keep the same instruction set with only one implicit stack, while implementing simultaneous multi-threading (SMT). Each hardware thread then uses its own stack while sharing the parallel execution units, so one can execute in parallel as many instructions as there are threads. This variant would need many more threads than a register-based CPU that combines superscalar execution with SMT; one would need 8 or more SMT threads to be competitive.
That... seems like a stretch. Does x64 even support the legacy x87 stack anymore? Certainly no other major architectures worked like that. If they ever did, they don't now.
If you're referring to the fact that the 8087 was the first proto-IEEE754 implementation, that's historic enough, but I wouldn't count the FPU stack.
I had a 10 MHz XT, and ran an 8087-8 at a bit higher clock rate than it was rated for. I used it both for Lotus 1-2-3 and Turbo Pascal-87. It made Turbo Pascal significantly faster.
Nowadays, flash uses multiple voltage levels to store four bits per cell (QLC, Quad Level Cell), which is a similar concept.
I wrote a whole blog post about the 2-bit-per-transistor technique, back in 2018: https://www.righto.com/2018/09/two-bits-per-transistor-high-...
The R4200 FPU performance suffered for this reason. [1]

[1] https://en.wikipedia.org/wiki/R4200
It's all about that 80-bit/82-bit floating point format with the explicit mantissa bit just to be extra different. ;) Not only is it a 1:15:1:63, it's (2(tag)):1:15:1:63, whereas binary64 is 1:11:0:52, with the leading significand bit left implicit.
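A quick sketch of that explicit integer bit (my own illustration; NASM's dt directive emits 80-bit extended-precision constants):

```nasm
one:  dt 1.0        ; bytes: 00 00 00 00 00 00 00 80 FF 3F
                    ; significand 8000000000000000: the leading 1 is
                    ; stored explicitly; exponent 3FFF, sign 0
half: dt 0.5        ; bytes: 00 00 00 00 00 00 00 80 FE 3F
                    ; same explicit-1 significand, exponent 3FFE
; binary64 stores 1.0 as 3FF0000000000000 instead: the leading 1 is
; implied, and only the 52 fraction bits are kept.
```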
Looks like those tag bits are bunched together in FPUTagWord in the 387+ save state; on the 8087 they must be sanitized by doing an FINIT or some FFREEs prior to task switching. This absence makes multitasking and interrupt handling extra hard on the plain 80287 too, but not as much with the 80C187 and i80287XL/T (I have a couple of these), which have the 387SX core.
Other pre-P5 ISA idiosyncrasies: Only the 8087 has FDISI/FNDISI, FENI/FNENI. Only the plain 287 has a functional FSETPM. Most everything else looks like a 387 ISA-wise, more or less until MMX arrived. That's all I know.
I'm curious what the CX-83D87 and Weiteks look like.
Keep up the good work!
The Weiteks were memory-mapped (at least those built for x86 machines). This essentially increased bandwidth by using the address bus as a source for floating point instructions. It was really a very cool idea, although I don't know what the performance realities were when using one.
http://www.bitsavers.org/components/weitek/dataSheets/WTL-31...
> The operand fields of a WTL 3167 address have been specifically designed so that a WTL 3167 address can be given as either the source or the destination to a REP MOVSD instruction.
> Single-precision vector arithmetic is accomplished by applying the 80386 block move instruction REP MOVSD to a WTL 3167 address involving arithmetic instead of loading or storing.
This feature had not been included in the IEEE standard, so it was no longer implemented.
Testing whether this feature works or not was used in programs running on an 80386 CPU to detect whether the attached FP coprocessor was a 287 or a 387 (the hardware allowed both: the 387 launched later than the 386, so early 386 systems had to be coupled with a 287 if a hardware FPU was needed).
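My reading, which may be off since the parent comment isn't shown, is that the dropped feature is projective infinity: the 8087/287 default in which +inf and -inf are treated as a single unsigned infinity, which the final IEEE 754 standard omitted and the 387 never implemented. Here is a sketch of the classic detection sequence, my own reconstruction in NASM syntax with is_287 as a placeholder label; FINIT leaves exceptions masked, so 1/0 quietly produces +inf:

```nasm
finit               ; default control word: exceptions masked; infinity
                    ; control = projective (the bit is ignored on a 387)
fld1                ; st0 = 1.0
fldz                ; st0 = 0.0, st1 = 1.0
fdivp st1           ; st0 = 1.0/0.0 = +inf (masked zero-divide response)
fld   st0           ; duplicate it
fchs                ; st0 = -inf, st1 = +inf
fcompp              ; compare and pop both
fstsw ax            ; condition codes to AX (287+; the 8087 lacks FSTSW AX)
sahf
je    is_287        ; projective: +inf and -inf compare equal -> 287
                    ; not equal: affine-only -> 387
```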
Running this code, the 8087 emitted a high-pitched whine. I could tell when my code was broken and it had gone into an infinite loop by the sound. Which was convenient because, of course, there was no debugger.
Thanks for bringing back this memory.
- Coincidentally, Forth promotes a fixed point philosophy.
- Forth people defined the IEEE754 standard on floating point, because they knew how to do that well in software.
IEEE 754 was principally developed by Kahan (in collaboration with his grad student, Coonen, and a visiting professor, Stone, whence the name KCS draft), none of whom were involved with Forth in any way that I am aware. And the history is pretty clear that the greatest influence on IEEE 754 before its release was Kahan's work with Intel developing the 8087.
The signalling NaN, however, turned out to be quite useless and I abandoned it.
I think the Zortech C++ compiler was the first one to fully support NaN with the Standard library.
As I got older, not only did computers stop doing that, my hearing also got worse (entirely normal for my age, but still), so that's mostly a thing of the past.
I thought I was protecting my ears from loud noises like rock concerts and gunshots. But I didn't know that driving with the window down damages the hearing. I crossed the country many times with the window down. I'm pretty sure that was the cause as my left ear is much worse off than my right.
I don't need a hearing aid yet, but I'm pretty careful in wearing ear plugs whenever there are loud noises.
On the other hand, skilled humans can do very very well with the x87; this 256-byte demo makes use of it excellently: https://www.pouet.net/prod.php?which=53816
Complicating this further, doing this in a loop requires that the stack state match between the start and end of the loop. This can be challenging to do with minimal FXCH instructions. I've seen compilers emit 3+ FXCH instructions in a row at the end of a loop to match the stack state, where with some hairy rearrangement it was possible to get it down to 2 or 1.
Finally, the performance characteristics of different x87 implementations varied in annoying ways. The Intel Pentium, for instance, required very heavy use of FXCH to keep the add and multiply pipelines busy. Other x87 FPUs at the time, however, were non-pipelined, some taking 4 cycles for an FADD and another 4 cycles for FXCH. This meant that rearranging x87 code for Pentium could _halve_ the speed on other CPUs.
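For illustration, here is the shape of a two-accumulator loop scheduled this way (my own reconstruction in NASM syntax, not code from the thread; it assumes esi and edi point at two arrays of doubles and ecx counts pairs, and real Pentium schedules interleaved even further). Note that the second FXCH exists only to make the stack state at the bottom of the loop match the top, which is exactly the loop-matching problem described above:

```nasm
; entry: st0 = acc0, st1 = acc1 (two partial sums of a dot product)
dotloop:
    fld   qword [esi]   ; st0 = a[i]
    fmul  qword [edi]   ; st0 = a[i]*b[i]
    faddp st1           ; fold into acc0; its FADD latency is now in flight
    fxch  st1           ; bring acc1 to the top while acc0 settles
    fld   qword [esi+8]
    fmul  qword [edi+8]
    faddp st1           ; fold into acc1
    fxch  st1           ; restore acc0/acc1 order for the next iteration
    add   esi, 16
    add   edi, 16
    dec   ecx
    jnz   dotloop
    faddp st1           ; combine the two accumulators
```

On a Pentium the FXCHes pair with the floating-point operations and are effectively free; on a non-pipelined FPU of the same era each one costs real cycles, which is the halving effect mentioned above.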
Quake wouldn't have happened until the Pentium II if Intel hadn't pipelined the FPU.
K6 did have the advantage of being OOO, which reduced the importance of instruction scheduling a lot, and having good integer performance. It also had some advantage with 3DNow! starting with K6-2, for the limited software that could use it.
Once you have a parse tree, visiting it in post order (left subtree, right subtree, operation) produces the RPN. For (a+b)*c, for example, post order visits a, b, +, c, *, which is exactly the order of the stack instructions: load a, load b, add, load c, multiply.
As far as the layout, the outputs from the microcode ROM are the control signals that go to all parts of the chip, so it makes sense to give it a central location. There's not a lot of communication between the upper half of the chip (the bus interface to the 8086 and memory) and the lower half of the chip (the 80-bit datapath), so it doesn't get in the way too much. That said, I've been tracing out the chip and there is a surprising amount of wiring to move signals around. The wiring in the 8087 is optimized to be as dense as possible: things like running some parallel signals in silicon and some in polysilicon because the lines can get squeezed together just a bit more that way.
(I'm just commenting on interviews in general, and this is in no way a criticism of your response.)
Edit: jogging my memory, I believe they were explicit at the end of the interview that they were looking for a Masters candidate. They did say I was on a good path IIRC. It wasn't a bad interview, but I was very clearly not what they were looking for.
The 80287 (AKA 287) and 80387 (AKA 387) floating point coprocessors started to pick up some competition from Weitek 1167 and 4167 chips and Inmos Transputer chips, so Intel integrated the FPU into the CPU with the 80486 processor (I question whether this was a monopoly move on Intel's part). This was also the first time that Intel made multiple versions of a CPU: there was a 486DX and a 486SX (colloquially referred to as the "sucks" model at the time), which disabled the FPU.
The 486 was also interesting because it was the first Intel x86 series chip able to operate at a multiple of the base frequency, with the release of the DX2 and DX4 variants allowing clock rates of 50MHz, 66MHz, 75MHz, and 100MHz based on the 25MHz and 33MHz base clock rates. I had a DX2-66 for a while and a DX4-100. The magic of these higher clock rates came from the introduction of cache memory. The 486 was the first Intel CPU to utilize a cache.
Even though Intel had superseded the 8087/287/387 floating point coprocessor by including the latest version in the 80486, they introduced the 80860 (AKA i860), a 64-bit RISC processor with VLIW features and a significantly faster FPU; it was also the first microprocessor to exceed 1 million transistors.
The history of the FPU is that it eventually became superseded by the GPU. Some of the first powerful GPUs from companies like Silicon Graphics utilized a number of i860 chips on a card in a very similar structure to more modern GPUs. You can think of each of the 12x i860 chips on an SGI Onyx / RealityEngine2 like a Streaming Multiprocessor node in an NVIDIA GPU.
Obviously, modern computers run at significantly faster clock speeds with significantly more cache and many kinds of cache, but it's good to look at the history of where these devices started to appreciate where we are now.
Well, I was happy about that because I no longer had to deal with compiler switches to generate x87 code or emulate it.
nit: IBM PC used the 8088, the "8-bit external bus" version of the 16-bit 8086
I don't think it was; transistor density simply became sufficient to integrate such a hefty chunk of circuitry on-die. Remember that earlier CPUs had even things like MMUs as separate chips, like the Motorola 68851.
The 486 was the first Intel CPU to integrate a cache on its die (following the competing Motorola CPUs MC68020 and MC68030).
Systems built around previous Intel CPUs already utilized caches, otherwise they could not have achieved 0-wait-state memory access cycles.
The cheaper 80286 and 80386SX motherboards usually omitted the cache to minimize the costs, but any decent higher-end 80386DX motherboard included an external write-through cache, with a size typically between 32 kB and 64 kB, so significantly bigger than the internal 8 kB write-through cache of 80486.
- Vette!
- Falcon 3.0
- Quake
And some others:
https://www.mobygames.com/attributes/attribute/115/
Apparently the 87/287/387 weren't that good as a gaming co-pro, as the marshaling of data to and from the CPU was too slow. It was a lot better from the 486DX onwards, I guess.
The field of the instruction that selects the stack offset.
Would be cool to hear a real designer compare it to the Weitek 1064.