Unusual circuits in the Intel 386's standard cell logic
I would love to know more about this – how much info is publicly available on how Intel used mainframes to design the 386? Did they develop their own software, or use something off-the-shelf? And I'm somewhat surprised they used IBM mainframes, instead of something like a VAX.
The 386 used a placement program called Timberwolf, developed by a Berkeley grad student, and a proprietary routing tool.
Also see "Intel 386 Microprocessor Design and Development Oral History Panel" page 13. https://archive.computerhistory.org/resources/text/Oral_Hist...
"80386 Tapeout: Giving Birth to an Elephant" by Pat Gelsinger, Intel Technology Journal, Fall 1985, discusses how they used an Applicon system for layout and an IBM 3081 running UTS unix for chip assembly, faster than the VAX they used earlier. Timberwolf also ran on the 3081.
"Design And Test of the 80386" (https://doi.org/10.1109/MDT.1987.295165) describes some of the custom software they used, including a proprietary RTL simulator called Microsim, the Mossim switch-level simulator, and the Espresso PLA minimizer.
You can still find the software for Espresso (I ran it a few years ago):
https://en.wikipedia.org/wiki/Espresso_heuristic_logic_minim...
Knowing Intel SW, and given that it was successful, I really doubt it.
Top of the line VAX in 1984 was the 8600 with a 12.5 MHz internal clock, doing about 2 million instructions per second.
IBM 3084 from 1984 - quad SMP (four processors) at 38 MHz internal clock, about 7 million instructions per second, per processor.
Though the VAX was about $50K and the mainframe about $3 million.
Does that schedule include all the revisions they did too? The first few were almost uselessly buggy:
It's not just the lack of branch prediction, but the primitive pipeline, no register renaming, and of course it's integer only.
A Pentium Pro with modern design size would at least be on the same playing field as today's cores. Slower by far, but recognisably doing the same job - you could see traces of the P6 design in modern Intel CPUs until quite recently, in the same way as the Super Hornet has traces of predecessors going back to the 1950s F-5. The CPUs in most battery chargers and earbuds would run rings around a 386.
Bear in mind that with a 386 you can barely decode an MP2 file, while with a 486 DX you can play most MP3 files at least in mono audio and maybe run Quake at the lowest settings if you own a 100 MHz one. A 166 MHz Pentium can at least multitask a little while playing your favourite songs.
Also, under Linux, a 386 would manage relatively well with just terminal and SVGAlib tools (now framebuffer) and 8MB of RAM. With a 486 and 16MB of RAM, you can run X at sane speeds, even FVWM in wireframe mode to avoid window repaints when moving/resizing windows.
If you emulate some old i440FX-based PC under QEMU, switching between the 386 and 486 with the -cpu flag gives you clear results. Just set one up with the Cirrus VGA and 16MB and you'll understand upon firing up X.
This is a great old distro to test how well 386s and 486s behaved:
Nowadays I think it's still doable in theory, but the Linux kernel has some kind of hard-coded limit of 4MB (something to do with memory paging size).
ELF supports loading a shared library to some arbitrary memory address and fixing up references to symbols in that library accordingly, including dynamically after load time with dlopen(3).
a.out did not support this. The executable format didn't have relocation entries, which meant every address in the binary was fixed at link time. Shared libraries were supported by maintaining a table of statically-assigned, non-overlapping address spaces, and at link time resolving external references to those fixed addresses.
Loading is faster and simpler when all you do is copy sections into memory then jump to the start address.
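As a rough sketch of the ELF-side mechanism described above, here is a minimal Python example; ctypes.CDLL calls dlopen(3)/dlsym(3) under the hood on Linux, and libm.so.6 is an assumption that holds on glibc systems (adjust the name for your platform):

    # Load a shared library at run time and resolve a symbol from it.
    import ctypes

    libm = ctypes.CDLL("libm.so.6")         # dlopen: map the shared object now, at run time
    libm.cos.restype = ctypes.c_double      # declare the signature of the resolved symbol
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))                    # call through the symbol dlsym resolved: prints 1.0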
ISTR the cheap "Pentium clones" at the time - Cyrix, early AMDs before the K5/K6 and Athlon - were basically souped-up 486 designs.
(As an aside - it's very noticeable how much innovation happened from one CPU generation to the next at that time, compared to today, even if some of the designs were buggy or had performance regressions. 5x86 to K5 was a complete redesign, and the same again between K6 and K7.)
The 386, both SX and DX, runs 16-bit code at roughly the same clock-for-clock speed as the 286. The 286 topped out at 25 MHz, the Intel 386 at 33 MHz. Now add the fact that early Intel 386 chips had broken 32-bit operation and it's not so beastly after all :)
In one of the Computer History Museum videos, someone from Intel mentioned they managed to cost-reduce the 386SX so hard that it cost Intel $5 out the door; the rest of the initial 1988 $219 price was a pure money printer. Only in 1992 did Intel finally calm down, with the i386SX-25 going from $184 in Q1 1990 to $59 in Q4 1992 after losing the AMD Am386 lawsuit, and only to screw with AMD by relegating its Am386DX-40, a $231 flagship in Q2 1991, to the role of a $51 bottom feeder by Q1 1993.
A large reason why out-of-order speculative execution is needed for performance is to deal with the memory latencies that appear in such a system.
By the time of the 80486, motherboard cache sizes had increased to the range of 128 to 256 kB, while the 80486 also had an internal cache of 8 kB (much later increased to 16 kB in the 80486DX4, at a time when the Pentium already existed).
So except for the lower-end motherboards, a memory hierarchy already existed in 80386-based computers, because DRAM was already not fast enough.
"Showing one's work" would need details that are verifiable and reproducible.
> He walked across the street from Santa Clara 4 to Amdahl and they had a Unix that ran on 370 computers. So he went over there and got a tape and brought it back, sent it over to Phoenix where the mainframes were and told 'em to load it. They did, not knowing what was on that tape because they never would have done it if they had known
It's wild to read that Intel's flagship product, the part that basically defined the next 40 years of computing, might have turned out very differently if management and/or IT knew what the engineers were doing.
Everything old is new again, I guess.
Standard cell libraries often implement multiplexers using transmission gates (CMOS switches) with inverters to buffer the input and restore the signal drive. This implementation has the advantage of eliminating static hazards (glitches) in the output that can occur with conventional gates.
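To get a feel for the glitch being avoided, here is a toy Python sketch (my own illustration, not from the article) of the static-1 hazard in a mux built from ordinary gates, out = (a AND s) OR (b AND NOT s), when both data inputs are 1 and the select falls; the inverted select lags by one gate delay, so both product terms are momentarily 0:

    a = b = 1
    s_wave = [1, 1, 0, 0, 0]        # the select input falls at t = 2
    sn = 0                          # NOT s, one gate delay behind (s has been 1)
    for t, s in enumerate(s_wave):
        out = (a & s) | (b & sn)    # AND-OR mux built from plain gates
        print(f"t={t} s={s} sn={sn} out={out}")   # out dips to 0 at t=2: a static-1 hazard
        sn = 1 - s                  # the inverter's output only updates on the next step

A transmission-gate mux avoids this particular glitch because during the handover the output node is either driven by one of the (equal) inputs or briefly floats and holds its value, rather than being pulled low by two momentarily-false product terms.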
Before Thompson’s experiment, many researchers tried to evolve circuit behaviors on simulators. The problem was that simulated components are idealized, i.e. they ignore noise, parasitics, temperature drift, leakage paths, cross-talk, etc. Evolved circuits would therefore fail in the real world because the simulation behaved too cleanly.
Thompson instead let evolution operate on a real FPGA device itself, so evolution could take advantage of real-world physics. This was called “intrinsic evolution” (i.e., evolution in the real substrate).
The task was to evolve a circuit that could distinguish between a 1 kHz and a 10 kHz square-wave input, outputting high for one and low for the other.
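To make "intrinsic evolution" concrete, here is a minimal Python sketch of such a loop; measure_on_fpga() is a hypothetical stand-in for configuring the real device with a candidate bitstream and scoring how well its output separates the two tones, and none of the parameters are Thompson's actual values:

    import random

    GENOME_BITS = 1800                      # illustrative genome length, not the real encoding

    def measure_on_fpga(genome):
        # Placeholder: in the real experiment this programs the FPGA and measures
        # its response to the 1 kHz and 10 kHz inputs; here it just returns noise.
        return random.random()

    def mutate(genome, rate=1.0 / GENOME_BITS):
        return [bit ^ (random.random() < rate) for bit in genome]

    population = [[random.randint(0, 1) for _ in range(GENOME_BITS)] for _ in range(50)]
    for generation in range(100):           # far fewer generations than a real run
        ranked = sorted(population, key=measure_on_fpga, reverse=True)
        elite = ranked[:10]                 # keep the best candidates
        population = elite + [mutate(random.choice(elite))
                              for _ in range(len(population) - len(elite))]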
The final evolved solution:
- Used fewer than 40 logic cells
- Had no recognisable structure, no pattern resembling filters or counters
- Worked only on that exact FPGA and that exact silicon patch.
Most astonishingly:
The circuit depended critically on five logic elements that were not logically connected to the main path.
Removing them should not have affected a digital design (they were not wired to the output), but in practice the circuit stopped functioning when they were removed.
Thompson determined via experiments that evolution had exploited:
- Parasitic capacitive coupling
- Propagation delay differences
- Analogue behaviours of the silicon substrate
- Electromagnetic interference from neighbouring cells
In short: the evolved solution used the FPGA as an analog medium, even though engineers normally treat it as a clean digital one.
Evolution had tuned the circuit to the physical quirks of the specific chip. It demonstrated that hardware evolution could produce solutions that humans would never invent.
Now that we have vastly faster compute, open FPGA bitstream access, on-chip monitoring, cheap and dense temperature/voltage sensing, and reinforcement learning + evolution hybrids, it becomes possible to select explicitly for robustness and generality, not just for functional correctness.
The fact that human engineers could not understand how this worked in 1996 made researchers incredibly uncomfortable, and the same remains true today, but now we have vastly better tooling than back then.
- Parasitic capacitive coupling
- Propagation delay differences
- Analogue behaviours of the silicon substrate
...are not just influenced by the chip design, they're influenced by substrate purity and doping uniformity -- exactly the parts of the production process that we don't control. Or rather: we shrink the technology node to right at the edge where these uncontrolled factors become too big to ignore. You can't design a circuit based on the uncontrolled properties of your production process and still expect to produce large volumes of working circuits.
Yes, we have better tooling today. If you use today's 14A machinery to produce a 1µ chip like the 80386, you will get amazingly high yields, and it will probably be accurate enough that even these analog circuits are reproducible. But the analog effects become more unpredictable as the node size decreases, and so will the variance in your analog circuits.
Also, contrary to what you said: the GA fitness process does not design for robustness and generality. It designs for the specific chip you're measuring, and you're measuring post-production. The fact that it works for reprogrammable FPGAs does not mean it translates well to mass production of integrated circuits. The reason we use digital circuitry instead of analog is not because we don't understand analog: it's because digital designs are much less sensitive to production variance.
We're seeing this shift already in software testing around GenAI. Trying to write a test around non-deterministic outcomes comes with its own set of challenges, so we need to plan for deterministic variances, which seems like an oxymoron but is not in this context.
You may need to train on a smaller number of FPGAs and gradually increase the set. Genetic algorithms have been finicky to get right, and you might find that more devices would massively increase the iteration count.
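A sketch (my own, not from the thread) of what that could look like, tying in the robustness point a few comments up: score each candidate on every board in a currently active set, take the worst case so evolution can't overfit one chip, and widen the set as training goes on. evaluate_on() is a hypothetical hardware-measurement hook:

    import random

    def evaluate_on(device, genome):
        return random.random()          # placeholder for a real per-device, on-hardware measurement

    def robust_fitness(genome, devices):
        return min(evaluate_on(d, genome) for d in devices)    # reward the worst case, not the average

    all_devices = ["fpga0", "fpga1", "fpga2", "fpga3"]          # hypothetical board names
    active = all_devices[:1]                                    # start with a single board
    for generation in range(1000):
        if generation % 250 == 249 and len(active) < len(all_devices):
            active = all_devices[:len(active) + 1]              # gradually widen the device set
        # selection and mutation proceed as usual, ranking by robust_fitness(genome, active)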
Answering another commenter's question: yes, the final result was dependent on temperature. The author did try using it at different temperatures; it was only able to operate in the temperature range it was trained at.
Fig. 8 goes into detail.
This implementation is sometimes called a "jam latch" (the new value is "jammed" into the inverter loop).