CDC File Transfer
Posted 3 months ago · Active 3 months ago
Source: github.com · Tech story · High profile · Tone: calm/mixed · Debate: 60/100
Key topics: Content Defined Chunking, File Transfer, Rsync
Google open-sourced CDC File Transfer, a tool using Content Defined Chunking for fast incremental file transfer, sparking discussion on its potential, limitations, and comparisons to existing tools like rsync.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 45m after posting
Peak period: 26 comments in the 3-6h window
Average per period: 8.5 comments
Comment distribution: 102 data points (based on 102 loaded comments)
Key moments
1. Story posted: Sep 30, 2025 at 10:38 PM EDT (3 months ago)
2. First comment: Sep 30, 2025 at 11:23 PM EDT (45m after posting)
3. Peak activity: 26 comments in the 3-6h window, the hottest period of the conversation
4. Latest activity: Oct 2, 2025 at 11:28 PM EDT (3 months ago)
...which is more important / needed than ever. I encourage everyone who asks to get my music from BitTorrent instead of Spotify.
I'm not above piracy if there's no DRM free option (or if the music is very old or the artist is long dead), but I still believe in supporting artists who actively support freedom.
Even better though, is a P2P service that is censorship resistant.
But yeah I like Bandcamp plenty.
> artists who actively support freedom.
The bluegrass world is quickly becoming this.
https://pickipedia.xyz/wiki/DRM-free
At its core is a corpus of traditional songs which have been handed down across generations, especially Irish fiddle tunes and West African banjo music.
The concepts of digital freedom trace multiple clear lineages to this tradition of music. For example, John Perry Barlow founded The Electronic Frontier Foundation and The Freedom of the Press Foundation, and participated heavily in the discussions on The WELL that laid the cryptological groundwork for what eventually became blockchains. He was also a member of The Grateful Dead (who, while more of a rock or country band than a traditional string band, stewarded and celebrated this corpus of music across several decades) and was himself an aficionado of the history of IP-unencumbered music.
If you see the word "Traditional" on a bluegrass setlist (usually listed next to a song where an author normally goes), it effectively means "I assert that this song is not subject to intellectual property."
I need to revisit this in the next few weeks as I release my second record (which, if I may boast, has an incredible ensemble of most of my favorite bluegrass musicians on it; it was a really fun few days at the studio).
Currently I do pin all new content to IPFS and put the hashes in the content description, as with this video of Drowsy Maggie with David Grier: https://www.youtube.com/watch?v=yTI1HoFYbE0
Another note: our study of Drowsy Maggie was largely made possible by finding old-and-nearly-forgotten versions in the Great78 project, which of course the industry attempted to sue out of existence on an IP basis. This is another example of how IP is a conceptual threat to traditional music - we need to be able to hear the tradition in order to honor it.
This is unlike the model that PlayStation, Xbox and even Nvidia are following - I don’t know about Amazon Luna.
They recently added "Install to Play" where you can install games from Steam that aren't modified for the service. They charge for storage for this though.
Sadly, there are still tons of games unavailable because publishers need to opt in and many don't.
Stadia required special versions of games, so it wouldn't be that useful.
Artemis is a bit better, but it still requires per-device setup of displays since it somehow doesn't disable the physical output next to the virtual one. Those drivers also add latency to the capture (the author of looking glass really dislikes them because they undo all the hard work of near-zero latency).
[1]: https://github.com/acuteaura/universe/blob/main/systems/_mod...
> Built-in Virtual Display with HDR support that matches the resolution/framerate config of your client automatically
It includes a virtual screen driver, and it handles all the crap (it can disable your physical screen when streaming and re-enable it afterwards, it can generate the virtual screen per client to match the client's needs, or do it per game, or ...)
I stream from my main PC to both my laptop and my Steam Deck, and each gets a screen that matches it without my having to do anything more than connect with Moonlight.
> virtual_display (charp)
> Set to enable virtual display feature. This feature provides a virtual display hardware on headless boards or in virtualized environments. It will be set like xxxx:xx:xx.x,x;xxxx:xx:xx.x,x. It’s the pci address of the device, plus the number of crtcs to expose. E.g., 0000:26:00.0,4 would enable 4 virtual crtcs on the pci device at 26:00.0. The default is NULL.
[1] https://www.kernel.org/doc/html/latest/gpu/amdgpu/module-par...
https://bugzilla.kernel.org/show_bug.cgi?id=203339
edit: didn't realize you're the OP lol
Speaking of which, who thought up the idea to use custom hardware for this that would _already be obsolete_ a year later? Who considered using Linux native instead of a compat layer? Why did the original Stadia website not even have a search bar??
Steam has game streaming built in, and it works very well. Both Nvidia and AMD built this into their GPU drivers at one point or another (I think the AMD one was shut down?)
Those are just the solutions I accidentally have installed despite not using that functionality. You can even stream games from the steam deck!
Sony even has a system to let you stream your PS4 to your computer anywhere and play it. I think Microsoft built something similar for Xbox.
The use case is copying a file over a slow network where the previous version is already on the other end, so one can save time by sending only the changed parts of the file.
Not to be confused with USB CDC ("communications device class"), a USB device protocol used to present serial ports and network cards. It can also be used to transfer files; the old PC-to-PC cables used it by implementing two network cards connected to each other.
Instead, with CDC the block boundaries are defined by the content, so an insertion doesn't shift the boundaries of subsequent blocks, and it can tell those blocks are unchanged. I haven't read the CDC paper, but I'm guessing they just use some probabilistic hash function to define certain strings as block boundaries.
You choose a number of bits (say, 12) and then evenly distribute these in a 48-bit mask; if the hash at any point has all these bits on, that defines a boundary.
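For illustration, here is a minimal Go sketch of that boundary rule: a GEAR-style rolling hash (shift, then add a per-byte constant) with 12 mask bits spread across the upper 48 bits of the hash. The table seed, mask layout, and absence of min/max chunk limits are arbitrary choices for the sketch, not necessarily what cdc_rsync actually uses.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // 256 random 64-bit constants, one per byte value. Any fixed
    // pseudo-random table works; the seed here is arbitrary.
    var gearTable = func() [256]uint64 {
        var t [256]uint64
        r := rand.New(rand.NewSource(1))
        for i := range t {
            t[i] = r.Uint64()
        }
        return t
    }()

    // 12 bits spread evenly across the upper 48 bits of the hash (one bit
    // every 4 positions). A boundary is declared when all masked bits are
    // set, which happens with probability ~2^-12, i.e. ~4 KiB average chunks.
    var boundaryMask = func() uint64 {
        var m uint64
        for i := 0; i < 12; i++ {
            m |= 1 << (16 + 4*i)
        }
        return m
    }()

    // chunk splits data into content-defined chunks: each boundary depends
    // only on the bytes near it, so an insertion early in the file does not
    // shift the boundaries found later.
    func chunk(data []byte) [][]byte {
        var chunks [][]byte
        var hash uint64
        start := 0
        for i, b := range data {
            hash = (hash << 1) + gearTable[b] // GEAR step: old bytes age out after 64 shifts
            if hash&boundaryMask == boundaryMask {
                chunks = append(chunks, data[start:i+1])
                start = i + 1
                hash = 0
            }
        }
        if start < len(data) {
            chunks = append(chunks, data[start:])
        }
        return chunks
    }

    func main() {
        for _, c := range chunk([]byte("example input, replace with real file contents")) {
            fmt.Println(len(c))
        }
    }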
[0] https://en.wikipedia.org/wiki/Cult_of_the_Dead_Cow
https://partner.steamgames.com/doc/sdk/uploading#AppStructur...
Does this work Linux to Linux too?
I wonder if Steam will ever decide to supercharge their content handling with some user-space filesystem stuff. With fast connections, there isn't really a reason they couldn't launch games in seconds, streaming data on demand with smart pre-caching steered by automatically trained access-pattern data. And especially with finely tuned delta patching like this, online game pauses for patching could be almost entirely eliminated. Stop & go instead of a pit stop.
[1] https://web.archive.org/web/20250517130138/https://venusoft....
[2] https://venusoft.net/#home
This would be extra cool for LAN parties with good network hardware
With low bandwidth, sure, just downloading the whole thing, with enough compression to keep the local system around 80% saturated, would be optimal instead.
- only works on a weird combo of (src platform / dst platform). Why???? How hard is it to write platform-independent code to read/write bytes and send them over the wire in 2025?
- uses bazel, an enormous, Java-based abomination, to build.
Fingers crossed that these can be fixed, or this project is dead in the water.
Stadia ran on Linux, and 99.9999999% of game development is done on Windows (and cross-compiled for Linux).
> Fingers crossed that these can be fixed, or this project is dead in the water.
The project was archived 9 months ago, and hasn't had a commit in 2 years. It's already dead.
The great thing is that Googlers could make such a tool and publish it in the first place. So you can improve it to fit your scenario, or become the maintainer of such a tool.
Literally tonight my buddy was talking about his months-long plan to introduce Bazel into his company's infra.
https://github.com/buildbarn/go-cdc
Did you compare it to Buzhash? I assume gearhash is faster given the simpler per iteration structure. (also, rand/v2's seeded generators might be better for gear init than mt19937)
Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2048 binary digits of π) would work as well.
If it doesn't work equally well with any random numbers, then some seeds work better than others, and intuitively you could find a best seed (or a set of best seeds).
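To make the per-iteration comparison a few comments up concrete, here is a rough Go sketch of the two inner loops, with the 256-entry (2 KB) table generated once from a seeded math/rand/v2 generator. Both step functions and the table are illustrative, not the exact code in go-cdc.

    package main

    import (
        "fmt"
        "math/bits"
        "math/rand/v2"
    )

    // 256 x 64-bit constants (2 KiB total), one per byte value, generated
    // once from a seeded PCG. Which seed is used matters little in practice.
    var table = func() [256]uint64 {
        var t [256]uint64
        r := rand.New(rand.NewPCG(0, 42)) // arbitrary seed
        for i := range t {
            t[i] = r.Uint64()
        }
        return t
    }()

    // gearStep: one shift and one add per byte; old bytes fall out of the
    // hash implicitly after 64 shifts, so no "remove" lookup is needed.
    func gearStep(h uint64, in byte) uint64 {
        return (h << 1) + table[in]
    }

    // buzStep: a Buzhash (cyclic polynomial) step over a window of w bytes
    // needs a rotate plus lookups for both the incoming and outgoing byte,
    // which is why its inner loop is heavier than Gear's.
    func buzStep(h uint64, in, out byte, w int) uint64 {
        return bits.RotateLeft64(h, 1) ^ bits.RotateLeft64(table[out], w) ^ table[in]
    }

    func main() {
        data := []byte("some example input bytes")
        const w = 8
        var hg, hb uint64
        for i, b := range data {
            hg = gearStep(hg, b)
            if i < w {
                hb = bits.RotateLeft64(hb, 1) ^ table[b] // warm-up: fill the window
            } else {
                hb = buzStep(hb, b, data[i-w], w)
            }
        }
        fmt.Printf("gear=%016x buz=%016x\n", hg, hb)
    }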
If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.
(Please don't hurt me.)
AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).
Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.
In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think the MaxCDC implementation I shared strikes a good balance in that regard.
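A rough sketch of why determinism matters for that kind of lookup: the cache key is derived from the chunk digests, so two producers that chunk the same file differently will compute different keys. This is a generic two-level digest-of-digests, not Buildbarn's actual Merkle tree layout.

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // chunkDigests hashes each chunk individually. The chunker must be
    // deterministic, or two producers of the same file will disagree here.
    func chunkDigests(chunks [][]byte) [][32]byte {
        out := make([][32]byte, len(chunks))
        for i, c := range chunks {
            out[i] = sha256.Sum256(c)
        }
        return out
    }

    // cacheKey derives a single key from the per-chunk digests by hashing
    // their concatenation (a flat, two-level tree of hashes).
    func cacheKey(digests [][32]byte) [32]byte {
        h := sha256.New()
        for _, d := range digests {
            h.Write(d[:])
        }
        var key [32]byte
        copy(key[:], h.Sum(nil))
        return key
    }

    func main() {
        chunks := [][]byte{[]byte("chunk-a"), []byte("chunk-b")}
        fmt.Printf("cache key: %x\n", cacheKey(chunkDigests(chunks)))
    }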
I just wanted to let you know, this is really cool. Makes me wish I still used Bazel.
This is an interesting enough problem, with huge potential benefits for humanity if it manages to improve anything, which it did.
[1] - https://github.com/google/cdc-file-transfer/issues/56#issuec...
[2] - https://github.com/librsync/librsync/issues/242
A git blob is hashed with a header containing its decimal length, so if you change even a small bit of content, you have to calculate the hash from the start again.
Something like CDC would improve this a lot.
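For context, this is the hashing being described: git's blob ID is a SHA-1 over the header "blob <decimal length>\0" followed by the entire content, so there is nothing to reuse when a few bytes change. A small Go sketch:

    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    // gitBlobID computes the object ID git assigns to a blob: SHA-1 over
    // the header "blob <decimal length>\x00" followed by the full content.
    // Any edit, however small, requires rehashing from the first byte.
    func gitBlobID(content []byte) [sha1.Size]byte {
        h := sha1.New()
        fmt.Fprintf(h, "blob %d\x00", len(content))
        h.Write(content)
        var id [sha1.Size]byte
        copy(id[:], h.Sum(nil))
        return id
    }

    func main() {
        // Matches `echo "hello world" | git hash-object --stdin`.
        fmt.Printf("%x\n", gitBlobID([]byte("hello world\n")))
    }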
https://joshleeb.com/posts/content-defined-chunking.html
https://joshleeb.com/posts/gear-hashing.html
Looking forward to reading those.
https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
The speed improvements over rsync seem related to a more efficient rolling hash algorithm, and possibly to using native Windows executables instead of Cygwin (Windows file systems are notoriously slow; maybe that plays a role here).
Or am I missing something?
In any case, the performance boost is interesting. Glad the source was opened, and I hope it finds its way into rsync.
That said, VIM 8 was terrific.
The original is the GPL variant [today displaying "Upgrade required"]:
https://rsync.samba.org/
The second is the BSD clone:
https://www.openrsync.org/
The BSD version would be used on platforms that are intolerant of later versions of the GPL (Apple, Android, etc.).
No, it operates on fixed-size blocks of the destination file. However, by using a rolling hash, it can detect those blocks at any offset within the source file and avoid re-transferring them.
https://rsync.samba.org/tech_report/node2.html
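The weak checksum from that report is two small running sums (Adler-32-like) that can be slid one byte at a time in constant time, which is what makes checking every source offset affordable. A Go sketch of the idea, with an illustrative block size and data:

    package main

    import "fmt"

    // weakSum computes the rsync-style weak checksum of one block:
    // s1 is the sum of the bytes, s2 weights earlier bytes more heavily.
    func weakSum(block []byte) (s1, s2 uint32) {
        for i, b := range block {
            s1 += uint32(b)
            s2 += uint32(len(block)-i) * uint32(b)
        }
        return s1 & 0xffff, s2 & 0xffff
    }

    // roll slides the window one byte to the right in constant time:
    // drop the outgoing byte, add the incoming one. This lets a block
    // match be tested at every offset without rehashing the window.
    func roll(s1, s2 uint32, out, in byte, blockLen int) (uint32, uint32) {
        s1 = (s1 - uint32(out) + uint32(in)) & 0xffff
        s2 = (s2 - uint32(blockLen)*uint32(out) + s1) & 0xffff
        return s1, s2
    }

    func main() {
        data := []byte("abcdefghij")
        const blockLen = 4
        s1, s2 := weakSum(data[:blockLen])
        fmt.Printf("offset 0: %04x%04x\n", s2, s1)
        for i := blockLen; i < len(data); i++ {
            s1, s2 = roll(s1, s2, data[i-blockLen], data[i], blockLen)
            fmt.Printf("offset %d: %04x%04x\n", i-blockLen+1, s2, s1)
        }
    }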
> scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I haven't tried it myself, but doesn't this already suit that requirement? https://docs.rc.fas.harvard.edu/kb/rsync/
> Compression If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred. This will cause more CPU to be used on both ends, but it is usually faster.
Maybe it's not fast enough, but it seems a better place to start than scp, imo.
Game development, in particular, often involves truly enormous sizes and numbers of assets, particularly for dev build iteration, where you're sometimes working with placeholder or unoptimized assets and debug-symbol-bloated things, and in my experience rsync scales poorly for copying large numbers of files. (In the past, I've used naive wrapper scripts with pregenerated lists of the files on one side and GNU parallel to partition the list into subsets and hand those to N different rsync jobs, then run a sync pass at the end to clean up any deletions.)
Just last week, I was trying to figure out a more effective way to copy a directory tree of ~250k files, varying in size between 128b and 100M, spread out across a complicatedly nested structure of 500k directories, because rsync would serialize badly around the cost of creating files and directories. After a few rounds of trying many-way rsync partitions, I finally just gave the directory to Syncthing and let its pregenerated index and file watching handle it.
> The key insight is that file operations in separate directories don’t (for the most part) interfere with each other, enabling parallel execution.
It really is magically fast.
EDIT: Sorry, that tool is only for local copies. I just remembered you're doing remote copies. Still worth keeping in mind.
Interesting, so unlike rsync there is no need to set up a service on the destination Linux machine. That always annoyed me a bit about rsync.
You were misinformed if you thought using rsync required setting up an rsync service.
https://en.wikipedia.org/wiki/Remote_Differential_Compressio...
https://www.ibm.com/products/aspera
The Google repo has been archived. Did they give it up?
This is a cool idea!
[1] https://github.com/claytongulick/bit-sync