Towards Memory Specialization: a Case for Long-Term and Short-Term RAM
Posted 4 months ago · Active 4 months ago
Source: arxiv.org · Tech · story
Key topics
- Memory Technology
- Computer Architecture
- Hardware Design
A research paper proposes a new memory architecture with separate long-term and short-term RAM, sparking discussion on its potential applications, feasibility, and relevance to current technology trends.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement · First comment: 45m after posting · Peak period: 7 comments in 1-2h · Avg per period: 2.5
Comment distribution: 27 data points (based on 27 loaded comments)
Key moments
- Story posted: Sep 1, 2025 at 4:05 PM EDT (4 months ago)
- First comment: Sep 1, 2025 at 4:51 PM EDT (45m after posting)
- Peak activity: 7 comments in 1-2h (hottest window of the conversation)
- Latest activity: Sep 2, 2025 at 3:46 AM EDT (4 months ago)
ID: 45096140 · Type: story · Last synced: 11/20/2025, 2:49:46 PM
Want the full context? Read the primary article or dive into the live Hacker News thread; selected comments from the discussion follow below.
For "hot instruction paths", caching is already the answer. Not sure about locality of reference for model weights. Do LLMs blow the cache?
There's no fundamental difference in gate technology between the two, so a flash that is addressable at a finer granularity will always be larger than the coarser equivalent. That's the trade-off.
I don't know if anyone has applied this to neural networks.
Device physics-wise, you could probably make SRAM faster by dropping the transistor threshold voltage. It would also make it harder / slower to write. The bigger downside is that it would have higher leakage power, but if it's a small portion of all the SRAM, it might be worth the tradeoff.
For DRAM, there isn't as much "device" involved because the storage element isn't transistor-based. You could probably make some design tradeoff in the sense amplifier to reduce read times by trading off write times, but I doubt it would make a significant change.
And lower power usage. Datacenters and mobile devices will always want that.
Sometimes very yes?
If you've got 1GB of weights, those are coming through the caches on their way to the execution units somehow.
Many caches are smart enough to recognize these accesses as a strided, streaming, heavily prefetchable, evictable read, and optimize for that.
Many models are now quantized too, to reduce the overall memory bandwidth needed for execution, which also helps with caching.
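For illustration, here is a minimal C sketch (not from the thread) of the access pattern being described: a linear, read-only pass over a large weight array with an explicit prefetch hint. The prefetch distance and the non-temporal locality hint are assumptions about what a particular core prefers, not anything the commenters specified.

```c
/* Minimal sketch: streaming, read-only pass over model weights.
 * The prefetch distance (256 floats, roughly 16 cache lines ahead) and the
 * locality hint of 0 ("use once, evict soon") are illustrative assumptions. */
#include <stddef.h>

float dot_stream(const float *weights, const float *activations, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Hint the upcoming stretch of the weight stream into cache;
         * GCC/Clang lower __builtin_prefetch to a prefetch instruction. */
        if (i + 256 < n)
            __builtin_prefetch(&weights[i + 256], /*rw=*/0, /*locality=*/0);
        acc += weights[i] * activations[i];
    }
    return acc;
}
```

Hardware prefetchers usually pick this pattern up on their own; the explicit hint just makes the "strided, streaming, evictable" intent visible in the code.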
> "The key insight motivating LtRAM is that long data lifetimes and read heavy access patterns allow optimizations that are unsuitable for general purpose memories. Primary applications include model weights in ML inference, code pages, hot instruction paths, and relatively static data pages—workloads that can tolerate higher write costs in exchange for lower read energy and improved cost per bit. This specialization addresses fundamental mismatches in current systems where read intensive data competes for the same resources as frequently modified data."
Essentially I guess they're calling for more specific hardware for LLM tasks, much like what was done in networking equipment, where dedicated packet processing uses specialized SRAM/DRAM/TCAM tiers to keep latency to a minimum.
While there's an obvious need for this in handling traffic flow across the internet, the practical issue is whether LLMs are really going to scale like that, or whether a massive AI/LLM bubble is about to pop. Who knows? The tea leaves are unclear.
https://www.intel.com/content/www/us/en/products/details/mem...
https://en.wikipedia.org/wiki/3D_XPoint
The same could be said for, say, SIMD/vectorization, which 99% of ordinary application code has no use for, but it quietly provides big performance benefits whenever you resample an image, or use a media codec, or display 3D graphics, or run a small AI model on the CPU, etc. There are lots of performance microfeatures like this that may or may not be worth it to include in a system, but just because they are only useful in certain very specific cases does not mean they should be dismissed out of hand. Sometimes the juice is worth the squeeze (and sometimes not, but you can't know for sure unless you put it out into the world and see if people use it).
However, dedicated read-optimized memory would come instead of a comparable amount of general-purpose memory, as data stored in one need not be stored in the other. The only increase in memory used would be what is necessary to account for fragmentation overhead when your actual usage ratio differs from what the architect assumed. Even then, the OS could use the more plentiful form of memory as swap space for the more in-demand form (or just place low-priority memory regions in the less optimal form). This will open up a new and exciting class of resource management problems for kernel developers to eke out a few extra percentage points of performance.
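As a rough sketch of what that placement could look like with interfaces that exist today, the snippet below binds a read-mostly region to one NUMA node, standing in for a hypothetical read-optimized LtRAM tier (much as CXL-attached memory shows up as a separate node now). The node number, region size, and the idea of using mbind for this at all are assumptions for illustration, not anything proposed in the paper.

```c
/* Sketch only: steer a read-heavy region toward a designated memory node.
 * Node 1 is an assumed stand-in for a read-optimized tier. Link with -lnuma. */
#include <numaif.h>      /* mbind, MPOL_BIND, MPOL_MF_MOVE */
#include <sys/mman.h>
#include <stdio.h>

#define LTRAM_NODE 1     /* hypothetical read-optimized tier */

int main(void)
{
    size_t len = 64UL << 20;                  /* 64 MiB of model weights */
    void *weights = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 1UL << LTRAM_NODE;
    /* Ask the kernel to place (and keep) these pages on the chosen node. */
    if (mbind(weights, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_MOVE) != 0)
        perror("mbind");

    /* ... fill the weights once, then serve read-mostly traffic ... */
    munmap(weights, len);
    return 0;
}
```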
Also SIMD is just one example. Modern DMA controllers are probably another good example but I know less about them (although I did try some weird things with the one in the Raspberry Pi). Or niche OS features like shared memory--pipes are usually all you need, and don't break the multitasking paradigm, but in the few cases where shared memory is needed it speeds things up tremendously.
If you’re not in manual memory management land, then you probably don’t care about this optimization just like you barely think of stack vs heap. Maybe the compiler could guess something for you, but I wouldn’t be worrying about it in that problem space.
I'm not sure how good applications are at properly annotating it, but for most applications, assets are also effectively read-only.
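For what it's worth, the annotation an application can already make today looks roughly like the sketch below: map the asset file read-only and pass an access-pattern hint. The filename and the MADV_RANDOM choice are placeholders; the point is only that a read-only, read-mostly mapping is exactly the signal an OS could reuse to pick a read-optimized memory class.

```c
/* Sketch: annotate an asset file as read-only, read-mostly data.
 * "assets.pak" is a placeholder name. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("assets.pak", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_PRIVATE: these pages will never be written back. */
    void *assets = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (assets == MAP_FAILED) { perror("mmap"); return 1; }

    /* Tell the kernel the accesses are read-mostly and scattered. */
    madvise(assets, (size_t)st.st_size, MADV_RANDOM);

    /* ... use the asset data ... */
    munmap(assets, (size_t)st.st_size);
    close(fd);
    return 0;
}
```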
You don't even need most of the RAM usage to be able to take advantage of this. If you can reasonably predict what portion of RAM usage will be heuristically read-heavy, then you can allocate your RAM budget accordingly and probably eke out a measurable performance improvement. In a world with Moore's law, this type of heterogeneous architecture has proven to not really be worth it. However, that calculus changes once we lose the ability to throw more transistors at the problem.
https://dsf.berkeley.edu/cs286/papers/fiveminute-tr1986.pdf
and a revisit of the rule 20 years later (it still held).
https://cs-people.bu.edu/mathan/reading-groups/papers-classi...
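For anyone skimming, the linked papers boil down to a one-line break-even calculation: keep a page in RAM if it is re-read more often than the interval below. The prices and IOPS in this sketch are rough placeholders chosen to be in the spirit of the 1997 revisit, not figures quoted from either paper.

```c
/* Back-of-the-envelope version of the five-minute rule's break-even formula.
 * All numbers are illustrative placeholders, not quoted from the papers. */
#include <stdio.h>

int main(void)
{
    double pages_per_mb     = 1024.0 / 8.0;  /* 8 KB pages per MB of RAM */
    double disk_iops        = 64.0;          /* assumed random accesses/s per drive */
    double price_per_drive  = 2000.0;        /* assumed $ per disk drive */
    double price_per_mb_ram = 15.0;          /* assumed $ per MB of DRAM */

    /* Break-even interval (s) = (pages/MB / IOPS) * ($/drive / $/MB of RAM) */
    double break_even_s = (pages_per_mb / disk_iops)
                        * (price_per_drive / price_per_mb_ram);

    printf("break-even interval: %.0f s (~%.1f minutes)\n",
           break_even_s, break_even_s / 60.0);
    return 0;
}
```

With these placeholder numbers the break-even comes out around four to five minutes, which is the ballpark the rule is named after; the same style of cost accounting is presumably what a long-term/short-term RAM split would have to redo with per-tier read and write costs.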