Reduce Bandwidth Costs with Dm-Cache: Fast Local SSD Caching for Network Storage
Posted 4 months ago · Active 4 months ago
devcenter.upsun.com · Tech · story
Key topics
Storage Optimization
Cloud Computing
Caching Strategies
The article discusses using dm-cache to reduce bandwidth costs by caching network storage on local SSDs, sparking a discussion on caching strategies, data integrity, and cloud architecture.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 3d after posting
- Peak period: 12 comments in 84-96h
- Avg / period: 6
- Comment distribution: 24 data points (based on 24 loaded comments)
Key moments
- 01 Story posted: Sep 9, 2025 at 8:14 AM EDT (4 months ago)
- 02 First comment: Sep 12, 2025 at 7:31 PM EDT (3d after posting)
- 03 Peak activity: 12 comments in 84-96h (hottest window of the conversation)
- 04 Latest activity: Sep 14, 2025 at 9:04 AM EDT (4 months ago)
ID: 45180876 · Type: story · Last synced: 11/20/2025, 1:32:57 PM
An expense in the age of 100 Gbit networking that exists entirely because AWS can get away with charging the suckers, um, customers for it.
The internet egress price is where they're bastards.
Getting terabits and terabits of 'private' interconnect is unbelievably cheap at amazon scale. AWS even own some of their own cables and have plans to build more.
There is _so_ much capacity available on fiber links. For example, one newish cable (Anjana) between the US and Europe has 480 Tbit/sec of capacity. That's just one cable, and it could probably already be upgraded to 10-20x that with newer modulation techniques.
Another option I haven't tried is tmpfs with an overlay. Initial access is RAM, and it falls back to the underlying slower storage. Since I'm mostly doing reads it should be fine; writes can go to the slower disk mount. No block storage changes needed.
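For reference, here is a minimal sketch of that kind of overlay-on-tmpfs setup, wrapped in Python so the steps are explicit. All paths and the tmpfs size are hypothetical, and the slow network storage is assumed to already be mounted at /mnt/slow. Note that with a tmpfs upperdir, writes made through the overlay land in RAM rather than on the slow mount unless you write to the slow mount directly.

```python
import subprocess

# Hypothetical paths: /mnt/slow is the existing network-backed mount (lower
# layer), /mnt/fast is a tmpfs holding the overlay's upper/work dirs, and
# /mnt/merged is where the combined view gets exposed.
SLOW, FAST, MERGED = "/mnt/slow", "/mnt/fast", "/mnt/merged"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# RAM-backed upper layer (4G is an arbitrary example size).
run(["mkdir", "-p", FAST, MERGED])
run(["mount", "-t", "tmpfs", "-o", "size=4G", "tmpfs", FAST])
run(["mkdir", "-p", f"{FAST}/upper", f"{FAST}/work"])

# Overlay: reads of files not present in the tmpfs upper layer fall through
# to the slow lower layer; writes made through the overlay land in the upper layer.
run(["mount", "-t", "overlay", "overlay",
     "-o", f"lowerdir={SLOW},upperdir={FAST}/upper,workdir={FAST}/work",
     MERGED])
```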
It's maintained by Intel and Huawei and the devs were very responsive.
I’ve been under the impression that Intel got rid of pretty much all of their storage software employees.
My head goes to the xz attack when I hear that Intel decided to stop supporting an open-source tool and a Chinese company known to sell backdoored equipment "steps in" to continue development; it makes me suspicious and concerned.
This is to say nothing of the quality of the software they write or its functionality. They may be "good stewards" of it, but does it seem paranoid to be unsure of that arrangement?
Unless the writer is always blindly overwriting entire files at once (doesn't read-then-write), consistency requires consistent reads AND writes. Even then, potential ordering issues creep in. It would be really interesting to hear how they deal with it.
If so, safe enough, though if they're going to do that, why stop at 512MB? The big win of Flash would be that you could go much bigger.
I used writeback mode, but expected to wipe the machine if the caching layer ever collapsed. In the end, the SSDs outlived my interest in the machine, though I think I did fail over an HDD or two while the rest remained in normal operating mode.
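For anyone who wants to experiment with the writeback-vs-writethrough trade-off mentioned above, here is a hedged sketch that attaches a dm-cache layer via lvmcache (LVM's front end for dm-cache), again as a Python wrapper around the CLI. The volume group name (vg0), origin LV (data), cache sizes, and SSD path are all assumptions for illustration, not details from the article.

```python
import subprocess

# Assumptions: a volume group "vg0" containing an origin LV "data" that sits
# on the slow network block device, plus a local SSD already added to vg0.
VG, ORIGIN, SSD = "vg0", "data", "/dev/nvme0n1"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Carve a cache-data LV and a small metadata LV out of the local SSD.
run(["lvcreate", "-n", "cache0", "-L", "100G", VG, SSD])
run(["lvcreate", "-n", "cache0meta", "-L", "1G", VG, SSD])

# Combine them into a cache pool, then attach the pool to the origin LV.
run(["lvconvert", "--yes", "--type", "cache-pool",
     "--poolmetadata", f"{VG}/cache0meta", f"{VG}/cache0"])

# writethrough keeps the slow origin authoritative (safer if the SSD dies);
# switching to writeback absorbs writes on the SSD first, with the failure
# mode described in the comment above.
run(["lvconvert", "--yes", "--type", "cache", "--cachepool", f"{VG}/cache0",
     "--cachemode", "writethrough", f"{VG}/{ORIGIN}"])
```

Under the hood lvmcache builds a dm-cache target, with LVM managing the cache metadata and device-mapper tables for you.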
Does this ring any bells? I’ve searched for this a time or two and can’t find it again.
As I recall it was to change the current mirrored read strategy to be aware of the speed of the underlying devices, and prefer the faster one if it has capacity. Perhaps a fixed pool property to always read from a given device was also discussed; it's been a while, so my memory is hazy.
The use-case was similar IIRC, where a customer wanted to combine local SSD with remote block device.
So, might come to ZFS.
(Somehow the name "SuperDisks" was burned into my brain for this. Although Discord's post does use 'Super-Disks' in a section header, if you search the Internet for SuperDisks, everything you find is about the LS-120 floppies that went by that name.)
I've used it before for a low-downtime migration of VMs between two machines. It was a personal project and I could have just kept the VM offline for the migration, but it was fun to play around with it.
You give it a read-only backing device and a writable device that's at least as big. It will slowly copy the data from the read-only device to the writable device. If a read is issued to the dm-clone target it's either gotten from the writable device if it's already cloned or forwarded to the read-only device. Writes are always going to the writable device and afterwards the read-only device is ignored for that block.
It's not the fastest, but it's relatively easy to set up, even though using device mapper directly is a bit clunky. It's also not super efficient: IIRC, if a read goes to a chunk that hasn't been copied yet, the data is fetched from the read-only device and handed to the reading program, but it isn't stored on the writable device, so it has to be fetched again later. If the file system being copied isn't full, it's a good idea to run trimming after creating the dm-clone target, as discarded blocks are marked as not needing to be fetched.
[1] https://docs.kernel.org/admin-guide/device-mapper/dm-clone.h...
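Going off the dm-clone kernel documentation linked above, here is a rough sketch of what creating a clone target looks like with dmsetup, wrapped in Python. The device paths, the target name, and the 4 KiB region size are hypothetical; the small metadata device has to be provisioned separately.

```python
import subprocess

# Hypothetical devices: SOURCE is the read-only (e.g. network-backed) device,
# DEST is a local writable device at least as large, and META is a small
# device for dm-clone's metadata.
SOURCE = "/dev/mapper/remote-ro"
DEST = "/dev/nvme0n1p2"
META = "/dev/nvme0n1p1"

def run(cmd, capture=False):
    print("+", " ".join(cmd))
    result = subprocess.run(cmd, check=True, capture_output=capture, text=True)
    return result.stdout.strip() if capture else None

# The table wants the device length in 512-byte sectors.
sectors = run(["blockdev", "--getsz", SOURCE], capture=True)

# Table format per the dm-clone doc:
#   <start> <len> clone <metadata dev> <destination dev> <source dev> <region size>
# Region size is in sectors too; 8 sectors = 4 KiB regions.
table = f"0 {sectors} clone {META} {DEST} {SOURCE} 8"
run(["dmsetup", "create", "cloned-disk", "--table", table])

# Hydration (background copying from source to destination) is on by default;
# it can be paused and resumed at runtime via messages, e.g.:
run(["dmsetup", "message", "cloned-disk", "0", "disable_hydration"])
```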
That said, for this use, I would be very concerned about coherency issues putting any cache in front of the actual distributed filesystem. (Unless this is the only node doing writes, I guess?)
For local disks though? bcache
1. How is the cache invalidated to avoid reading stale data?
2. If the multi-AZ setup is for high availability, then I guess the only traffic between zones must be replication from the active zone to the standby zones; in such a setup a read cache doesn't make much sense.