AMD's EPYC 9355P: Inside a 32 Core Zen 5 Server Chip
Posted 3 months ago · Active 3 months ago
chipsandcheese.com · Tech story · High profile
Tone: calm, positive
Debate: 40/100
Key topics
AMD EPYC
Zen 5 Architecture
Server Processors
Hardware Analysis
The article analyzes AMD's EPYC 9355P server chip, revealing its 32-core Zen 5 architecture, and sparking discussion on its features, performance, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
- Active discussion; first comment: 2h after posting
- Peak period: 12 comments in the 6-12h window
- Avg per period: 4.6
- Comment distribution: 55 data points (based on 55 loaded comments)
Key moments
1. Story posted: Oct 3, 2025 at 4:01 PM EDT (3 months ago)
2. First comment: Oct 3, 2025 at 6:00 PM EDT (2h after posting)
3. Peak activity: 12 comments in the 6-12h window (hottest window of the conversation)
4. Latest activity: Oct 7, 2025 at 10:55 AM EDT (3 months ago)
ID: 45467166 · Type: story · Last synced: 11/20/2025, 6:30:43 PM
I know it's a server, but I'd be so ready to use all of that as a RAM disk. A crazy amount at crazy high speed. Even 1% would be enough just to play around with something.
This has been the basic pattern for ages, particularly with large C++ projects. C++ builds became IO-bound workflows with the introduction of multi-CPU and multi-core systems, especially during linking.
Creating RAM disks to speed up builds is one of the most basic and lowest-effort strategies for improving build times, and I think it was the main driver for a few commercial RAM drive apps.
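On Linux this is a one-liner with tmpfs; a minimal sketch (the mountpoint path and 16 GiB size are illustrative, and mounting needs root):

```shell
# Create a 16 GiB RAM disk backed by tmpfs (pages live in RAM, spill to swap).
RAMDISK=/mnt/ramdisk
SIZE=16g

mkdir -p "$RAMDISK"
mount -t tmpfs -o size="$SIZE" tmpfs "$RAMDISK" \
    || echo "mount failed (needs root / CAP_SYS_ADMIN)"

# Point the build's object/scratch directory at it, e.g.:
#   cmake -B "$RAMDISK/build" -S .
# Contents are lost on unmount or reboot:
#   umount "$RAMDISK"
```

On Windows this requires a third-party driver (ImDisk and similar tools), which is presumably what those commercial RAM drive apps were selling.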
The RAMsan line, for example, started in 2000 with a 64GB DRAM-based SSD with up to fifteen 1Gbit FC interfaces, providing a shared SAN SSD for multiple hosts (very well utilized by some of the beefier clustered SQL databases like Oracle RAC), but the company itself had been providing high-speed specialized DRAM-based SSDs since 1978.
Last time I saw one was with a mainframe, which kind of makes sense if adding cheaper third party memory to the machine would void warranties or breach support contracts. People really depend on company support for those machines.
A fast scratch pad that can be shared between multiple machines can be ideal at times.
You are arguing hypotheticals, whereas for decades the world had to deal with practicalities. I recommend you spend a few minutes looking into how to create RAM drives on, say, Windows, and think through how to achieve that when your build workstation has 8GB of RAM and you need a scratchpad of, say, 16GB.
Recommended reading: https://en.wikipedia.org/wiki/RAM_drive
These are only for when the OS and the machine itself can't deal with the extra memory and wouldn't know what to do with it, things you buy when you run out of sensible options (such as adding more memory to your machine and/or configuring a RAM disk).
A) this technique precedes the existence of Linux.
B) Linux is far from the most popular OS in use today.
C) some software development projects are developed on and target non-Linux platforms (see Windows)
Nowadays NVMe drives might indeed be able to get close, but we'd probably still need to span multiple SSDs (reducing the cost savings), and the developers there are incredibly sensitive to build times. If a 5-minute build suddenly takes 30 seconds more, we have some unhappy developers.
Another reason is that it'd eat SSDs like candy. Current enterprise SSDs have something like a 10000 TBW rating, which we'd exceed in the first month. So we'd either get cheap consumer SSDs and replace them every few days, or enterprise SSDs and replace them every few months, or stick with the RAM setup, which over the life of the build system will be cheaper than constantly buying SSDs.
Wow. What’s your use case?
We actually did try SSDs about 15 years ago and had a lot of dead SSDs in a very short time. After that we went back to estimating data written; it's cheaper. While SSD durability has increased a lot since then, everything else got faster as well, so our SSDs would last a bit longer now (back then it was a weekly thing), but still nowhere near the point where it'd be a sensible thing to do.
They sound incredibly spoiled. Where should I send my CV?
They indeed are quite spoiled - and that's not necessarily a good thing. Part of the issue is that our CI was good and fast enough that at some point a lot of the new hires never bothered to figure out how to build the code - so for quite a few the workflow is "commit to a branch, push it, wait for CI, repeat". And as they often just work on a single problem the "wait" is time lost for them, which leads to the unhappiness if we are too slow.
Running the numbers to verify: a read/write-mixed enterprise SSD will typically have 3 DWPD (drive writes per day) across its 5-year warranty. At 2TB, that would be 10950 TBW, so that sort of checks out. If endurance were a concern, upgrading to a higher capacity would linearly increase the endurance. For example, the Kioxia CD8P-V. https://americas.kioxia.com/en-us/business/ssd/data-center-s...
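The arithmetic in that comment can be checked directly (the 3 DWPD, 2TB, and 5-year figures are from the comment above):

```shell
# TBW = drive writes per day x capacity (TB) x warranty period in days
dwpd=3
capacity_tb=2
warranty_years=5
tbw=$(( dwpd * capacity_tb * 365 * warranty_years ))
echo "${tbw} TBW"   # 10950 TBW, consistent with the ~10000 TBW figure upthread
```

Since capacity enters the product linearly, doubling the drive size doubles the TBW rating at the same DWPD, which is the "upgrade capacity for endurance" point above.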
Finding it a bit hard to imagine build machines working that hard, but I could believe it!
I don't know where you're buying your NVMe drives, but mine usually respond within a hundred microseconds.
I assume the same would be true for any project that is configure-heavy.
this kit? https://www.newegg.com/nemix-ram-1tb/p/1X5-003Z-01930
I also have an M920q with an 8500T, an HP ProDesk with a 10500T, and a Lenovo P520; these three are truly for home purposes.
If I were to do the price-tracker machine again, I'd go much smaller and get a JBOD and probably a P520.
So just those components would be just over $12k.
That's just from regular consumer shops, and includes 25% VAT. Without the VAT it's about $9800.
The problem for consumers is that just about all the shops that sell such gear, and from which you might get a deal, are geared towards companies and not interested in dealing with consumers due to consumer protection laws.
I found a used server with 768 GB DDR4 and dual Intel Gold 6248 CPUs for $4200 including 25% VAT.
That's a complete 2U server, the CPUs are a bit weak but not too bad all in all.
That's 300GB/s slower than my old Mac Studio (M1 Ultra). Memory speeds in 2025 remain thoroughly unimpressive outside of high-end GPUs and fully integrated systems.
The M1 Ultra doesn't have 800GB/s because it's "integrated", it simply has 16 channels of DDR5-6400, which it could have whether it was soldered or not. And none of the more recent Apple chips have any more than that.
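Treating the interface as 16 channels of 64 bits at 6400 MT/s (the figures in the comment above), the headline number falls out of a one-line calculation:

```shell
# Peak bandwidth = channels x bytes per transfer per channel x MT/s
channels=16
bytes_per_transfer=8   # 64-bit channel
mts=6400
bw_mb=$(( channels * bytes_per_transfer * mts ))
echo "${bw_mb} MB/s"   # 819200 MB/s, i.e. ~819 GB/s, the ~800GB/s figure cited
```

The same formula explains the server numbers: 12 channels of DDR5-6400 gives 12 x 8 x 6400 = 614400 MB/s, the roughly 600GB/s class the EPYC parts sit in.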
It's the GPUs that use integrated memory, i.e. GDDR or HBM. That actually gets you somewhere -- the RTX 5090 has 1.8TB/s with GDDR7, the MI300X has 5.3TB/s with HBM3. But that stuff is also more expensive which limits how much of it you get, e.g. the MI300X has 192GB of HBM3, whereas normal servers support 6TB per socket.
And it's the same problem with Apple even though there's no great reason for it to be. The 2019 Intel Xeon Mac Pro supported 1.5TB of RAM -- still in slots -- but the newer ones barely reach a third of that at the top end.
The M1 Ultra has LPDDR5, not DDR5. And the M1 Ultra was running its memory at 6400MT/s about two and a half years before any EPYC or Xeon parts supported that speed—due in part to the fact that the memory on a M1 Ultra is soldered down. And as far as I can tell, neither Intel nor AMD has shipped a CPU socket supporting 16 channels of DRAM; they're having enough trouble with 12 channels per socket often meaning you need the full width of a 19-inch rack for DIMM slots.
Existing servers typically have 12 channels per socket, but they also have two DIMMs per channel, so you could double the number of channels per socket without taking up any more space for slots. You could also use CAMM which takes up less space.
They don't currently use more than 12 channels per socket even though they could because that's enough to not be a constraint for most common workloads, more channels increase costs, and people with workloads that need more can get systems with more sockets. Apple only uses more because they're using the same memory for the GPU and that is often constrained by memory bandwidth.
Usually this comes at a pretty sizable hit to MHz available. For example STH notes that their Zen5 ASRock Rack EPYC4000D4U goes from DDR5-5600 down to DDR5-3600 with the second slot populated, a 35% drop in throughput. https://www.servethehome.com/amd-epyc-4005-grado-is-great-an...
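The quoted drop checks out against the two transfer rates in the STH note (integer division rounds down from 35.7%):

```shell
# Throughput penalty for populating the second DIMM slot per channel
one_dimm=5600   # DDR5-5600, one DIMM per channel
two_dimm=3600   # DDR5-3600, two DIMMs per channel
drop_pct=$(( 100 * (one_dimm - two_dimm) / one_dimm ))
echo "${drop_pct}% drop"   # 35% drop in transfer rate
```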
(It's also because of servers being ultra-cautious again. The desktops say the same thing in the manual but then don't enforce it in the BIOS and people run two sticks per channel at the full speed all over the place.)
So they have been really optimising that IO die for latency.
NUMA is already workload sensitive, you need to benchmark your exact workload to know if it’s worth enabling or not, and this change is probably going to make it even less worthwhile. Sounds like you will need a workload that really pushes total memory bandwidth to make NUMA worthwhile.
It says 16 cores per die with up to 16 Zen 5 dies per chip. For Zen 5 it's 8 cores per die, 16 dies per chip, giving a total of 128 cores.
For Zen 5c it's 16 cores per die, 12 dies per chip, giving a total of 192 cores.
Weirdly it's correct on the right side of the image.
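The corrected totals from that comment multiply out as follows:

```shell
# Cores per die x dies per chip, per the comment above
zen5=$(( 8 * 16 ))     # classic Zen 5 CCDs
zen5c=$(( 16 * 12 ))   # dense Zen 5c CCDs
echo "Zen 5: ${zen5} cores, Zen 5c: ${zen5c} cores"   # 128 and 192
```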
2 more comments available on Hacker News