OrthoRoute – GPU-accelerated autorouting for KiCad
Mood: thoughtful
Sentiment: positive
Category: tech
Key topics: KiCad, GPU-accelerated autorouting, PCB design

The post introduces OrthoRoute, a GPU-accelerated autorouting tool for KiCad, sparking discussion on its potential applications, design decisions, and the role of automation in PCB design.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion. First comment: 41m after posting. Peak period: 4 comments in Hour 2. Average per period: 1.7. Based on 22 loaded comments.
Key moments
- Story posted: 11/18/2025, 6:54:54 PM (1d ago)
- First comment: 11/18/2025, 7:36:13 PM (41m after posting)
- Peak activity: 4 comments in Hour 2, the hottest window of the conversation
- Latest activity: 11/19/2025, 5:54:53 PM (1h ago)
heart attack
There are videos from JLCPCB (one of the biggest). That stuff is 90% automated.
I was inspired by this video from bitluni, a cluster of $0.10-0.20 RISC-V microcontrollers: https://www.youtube.com/watch?v=HRfbQJ6FdF0. For ten or twenty cents, these chips have a lot of GPIOs compared to other extremely low-cost microcontrollers: 18 GPIOs on the CH32V006F4U6. This got me thinking: what if I built a cluster of these chips? Basically re-doing bitluni's build.
But then I started thinking: at ten cents a chip, you could scale this to thousands. But how do you connect them? That problem was already solved in the 80s, with the Connection Machine. The basic idea here is to get 2^(whatever) chips and connect them so each chip connects to (whatever) many other chips. The Connection Machine sold this as a hypercube, but it's better described as a Hamming-distance-one graph or something.
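The wiring rule is compact enough to state in code: chip i links to every index at Hamming distance one, i.e. i XOR (1 << k) for each bit position k. A quick sketch (the function name is mine, not from the project):

```python
# "Hamming-distance-one" wiring: in a d-dimensional hypercube, chip i
# links to every chip whose index differs from i in exactly one bit.
# Illustrative only; d=12 matches the 4096-chip build described here.

def hypercube_neighbors(i: int, d: int = 12) -> list[int]:
    """Indices of the d chips directly linked to chip i."""
    return [i ^ (1 << k) for k in range(d)]

# Chip 0 talks to one chip per dimension: 1, 2, 4, 8, ..., 2^(d-1).
print(hypercube_neighbors(0, 4))  # [1, 2, 4, 8]
```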
So I started building that. I did the LEDs first, just to get a handle on dealing with thousands of parts: https://x.com/ViolenceWorks/status/1987596162954903808 and started laying out the 'cards' of this thing. With a hypercube topology you can split the cube into different parts, so this thing is made of sixteen cards (2^4), with 256 chips on each card (2^8), meaning 4096 (2^12) chips in total. This requires a backplane. A huge backplane with 8196 nets. Non-trivial stuff.
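A quick way to see why the backplane dominates: assume, purely for illustration, that the low 8 bits of a chip's index select its position on a card and the high 4 bits select the card. Then any hypercube link that flips a card-select bit has to cross the backplane, and counting those links lands right around the net figure quoted above:

```python
# Split the 2^12 chip index across the hardware (illustrative bit
# assignment, not the actual board layout): low 8 bits = chip on card,
# high 4 bits = card. A link flipping a low bit stays on-card; a link
# flipping a high bit must cross the backplane.

D, CARD_BITS = 12, 4

def crosses_backplane(i: int, k: int) -> bool:
    """Does the hypercube link from chip i along dimension k leave its card?"""
    return k >= D - CARD_BITS  # flipping one of the high card-select bits

backplane_links = sum(
    1
    for i in range(1 << D)
    for k in range(D)
    if crosses_backplane(i, k)
) // 2  # each link is counted once from each end

print(backplane_links)  # 8192 inter-card links under this split
```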
So the real stumbling block for this project is the backplane, and writing an autorouter was basically the only way I could figure out to build it. It's a fun project that really couldn't have been done before the launch of KiCad 9; the new IPC API was a necessity to make this a reality. After that it's just some CuPy for the sparse matrices and a few blockers while adapting PathFinder to circuit boards.
Last week I finished up the 'cloud routing' functionality and was able to run this on an A100 80GB instance on Vast.ai; the board wouldn't fit in the 16GB 5080 I used for testing. That instance took 41 hours to route the board, and now I have the result back on my main battlestation, ready for the bit of hand routing that's still needed. No, it's not perfect, but it's an autorouter. It's never going to be perfect.
This was a fun project but what I really should have been doing the past three months or so is grinding leetcode. It's hard out there, and given that I've been rejected from every technician job I've applied to, I don't think this project is going to help me. Either way, this project.... is not useful. There's probably a dozen engineers out there in the world that this _could_ help.
So, while it's working for my weird project, this is really not what hiring managers want to see.
I read the write-up with a LOT of interest; this is really amazing work. There aren't a lot of good options for auto-routing with open-source PCB tools (i.e. KiCad). I have also used the other autorouter you mentioned for "low-complexity" boards in KiCad; it helped do the job but was painful.
In my career I've also used the autorouters built into the "high-end" PCB tools, and they could handle the complexity of boards you outlined WITHOUT needing a massive GPU, but they also paid people to improve this stuff for 15 to 20 years, and development happened when single-core computers with limited RAM were the norm.
On the technical side: somewhat more recent FPGA 'placement' algorithms used simulated annealing. While what you did isn't about placement, that approach could possibly help with 'net cross-over reduction' type passes, and maybe with designs where you can do port swap / pin swap.
I'm amused you made a RISC-V array with discrete parts -- I'm sure you considered using an FPGA? Jan Gray has done 1000+ RISC-V cores (https://fpga.org/grvi-phalanx/) in "older" Xilinx FPGAs.
If you're trying to emulate Thinking Machines / CM-x or anything else, frankly I think a "mondo" FPGA is still the way to go.
Job-wise: A suggestion might be to reach out to the guys at AllSpice ( allspice.io ) who make revision control software for Altium and possibly KiCad. The work you did to enable IPC, etc seems like exactly the type of skillset these guys might need (contractor, maybe full-time?) to interoperate with KiCad.
If I see anything that might be up your alley I'd also reach out. I'm not in a position to hire anyone and while "some companies" may not be impressed by what you did, the right organization WOULD be.
I share your sentiment that at "modern" companies like Apple, MSFT, etc., the hiring process is really tailored to "I want a guy who can do X" and rarely "I want a guy who's shown he can learn Y and Z, so he can certainly do X".
Yeah, that was the first step in creating the netlist for the backplane. Simulated annealing on the 8196 nets. TO BE FAIR, this would be a lot easier to route if I didn't explicitly want each of the 16 cards to be identical, but I think that's the most cost-effective way to do it.
As far as an FPGA.... I don't know if I see the point. The nodes in the original CM-1 were basically _only_ ALUs. Very little processing power. The CM-5 was a little better, but this entire thing is batshit crazy. I might as well go for four thousand individually programmable cores and see what it can actually do.
The reason an FPGA is a more suitable platform is that you can translate the "physical effort of making PCBs" into "creating a design in an infinitely re-programmable platform" and change your design as needed to your heart's content.
In fact, the original design of RISC-V included a bus called 'TileLink' to enable 'Many core' arrays of RISC-V processors.
Translation: you can pare down open-source RISC-V cores, use TileLink, and emulate a CM or build something more complex as you see fit, since that was built into the original open-source RISC-V specs.
FPGAs are their own joy and pain for sure and it's not as "cool" to re-program a blackbox on a PCB as it might be to make your own thing, so all depends on your goals.
Either option is cool, though.
Where the GPU router comes in is the geometric part: obeying layer stack, via rules, keepouts, blind-via constraints, etc. You can absolutely hand-encode one or two nice symmetric patterns in code; this board is ‘what if we made the search space big enough that you want Dijkstra + PathFinder + sparse GPU data structures to do it for you’.
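For readers unfamiliar with PathFinder: it resolves congestion by letting nets negotiate, re-pricing overused routing nodes each iteration so that Dijkstra gradually detours all but one net away. A minimal sketch of that cost model, with names and constants of my own choosing rather than OrthoRoute's actual code:

```python
# Sketch of PathFinder-style negotiated congestion (the classic FPGA
# routing scheme this project adapts to PCB routing nodes).

def node_cost(base: float, history: float, occupancy: int,
              capacity: int, pres_fac: float) -> float:
    """Cost of pushing one more net through a routing node this iteration."""
    over = max(0, occupancy + 1 - capacity)   # would adding us overuse it?
    present = 1.0 + over * pres_fac           # present-congestion penalty
    return (base + history) * present

def update_history(history: float, occupancy: int, capacity: int,
                   hist_fac: float = 1.0) -> float:
    """After each rip-up-and-reroute pass, overuse leaves a permanent mark."""
    return history + hist_fac * max(0, occupancy - capacity)

# A node shared by two nets (capacity 1) gets steadily more expensive,
# so on later passes Dijkstra naturally routes one net around it.
history, costs = 0.0, []
for pres_fac in (0.5, 1.0, 2.0):   # pres_fac typically grows per pass
    costs.append(node_cost(1.0, history, occupancy=2, capacity=1,
                           pres_fac=pres_fac))
    history = update_history(history, occupancy=2, capacity=1)
print(costs)  # strictly increasing: the shared node prices itself out
```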
In that case, can't you exploit the inherent symmetry in the design here to only route a quarter of your connectors and then mirror/rotate the result for the rest? Or, if you have an X*X matrix, route one side minus the corners and replicate it to the other sides?
Also, with such a huge connection board, it smells like an NIH issue here. I think you'd better serialize the IO to a bus (whatever) and a few lines and perform the connection in software (in a GoWin FPGA for example, both extremely cheap and quite powerful). Just think of the harness you'll need to build to fit the connectors in, the obvious routing bugs, and so on. Any maintenance will be a nightmare if you need to swap 2 pins on a connector or re-run the routing.
As far as symmetry goes, there really isn't any. For example, Board 0 connects to 1, 2, 4, and 8. Board 1 connects to 0, 3, 5, and 9. Board 3 connects to 1, 2, 7, and 11.
There's one way I can think of to make this routing easier: of the 16 daughter boards, make the pinout unique to each one. If I were doing this as a product, for manufacturing, this is exactly what I would do: rearrange the pins on each daughter card so it would be easier to route. The drawback of this technique is that there would be 16 different varieties of daughter card, which isn't economical if you're just building one of these things.
So, with those constraints the only real optimization I have left is ensuring that the existing net plan is optimal. I already did that when I generated the netlist; used simulated annealing to ensure the minimal net length for the board before I even imported it into KiCad.
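A hedged sketch of what that annealing pass might look like, using a toy 1-D length model and names of my own invention (the real objective is routed backplane length over 8196 nets, not this):

```python
# Simulated annealing over a net->pin assignment: propose random pin
# swaps, always accept improvements, and accept uphill moves with a
# probability that shrinks as the temperature cools.
import math
import random

def total_length(assign: list[int], pos: list[int]) -> int:
    """Toy cost: net i wants a pin near coordinate i; sum the misses."""
    return sum(abs(pos[p] - i) for i, p in enumerate(assign))

def anneal(n: int, steps: int = 20000, t0: float = 5.0, seed: int = 0):
    rng = random.Random(seed)
    pos = list(range(n))            # pin coordinates (toy: a 1-D row)
    assign = list(range(n))         # assign[i] = pin carrying net i
    rng.shuffle(assign)             # start from a random assignment
    cost = total_length(assign, pos)
    for s in range(steps):
        t = t0 * (1 - s / steps) + 1e-9              # linear cooling
        a, b = rng.randrange(n), rng.randrange(n)
        assign[a], assign[b] = assign[b], assign[a]  # propose a pin swap
        new = total_length(assign, pos)
        if new <= cost or rng.random() < math.exp((cost - new) / t):
            cost = new                               # accept (maybe uphill)
        else:
            assign[a], assign[b] = assign[b], assign[a]  # revert the swap
    return assign, cost

shuffled_cost = anneal(32, steps=0)[1]  # cost before any annealing
best, cost = anneal(32)                 # cost after 20k swap proposals
```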
And yeah, serializing the IO would be better, but even better than that would be to emulate the entire system in a giant black box of compute. But then I wouldn't have written a GPU autorouter. I'm trying not to, but there is some optimization for _cool_ here, you know?