CDC File Transfer
Posted 3 months ago · Active 3 months ago
Source: github.com · Tech story · High profile · Tone: calm/mixed · Debate: 60/100
Key topics: Content Defined Chunking, File Transfer, Rsync
Google open-sourced CDC File Transfer, a tool using Content Defined Chunking for fast incremental file transfer, sparking discussion on its potential, limitations, and comparisons to existing tools like rsync.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 45m after posting
Peak period: 26 comments in the 3-6h window
Average per period: 8.5 comments
Comment distribution: 102 data points (based on 102 loaded comments)
Key moments
1. Story posted: Sep 30, 2025 at 10:38 PM EDT (3 months ago)
2. First comment: Sep 30, 2025 at 11:23 PM EDT (45m after posting)
3. Peak activity: 26 comments in the 3-6h window, the hottest period of the conversation
4. Latest activity: Oct 2, 2025 at 11:28 PM EDT (3 months ago)
...which is more important / needed than ever. I encourage everyone who asks to get my music from BitTorrent instead of Spotify.
I'm not above piracy if there's no DRM free option (or if the music is very old or the artist is long dead), but I still believe in supporting artists who actively support freedom.
Even better though, is a P2P service that is censorship resistant.
But yeah I like Bandcamp plenty.
> artists who actively support freedom.
The bluegrass world is quickly becoming this.
https://pickipedia.xyz/wiki/DRM-free
At its core is a corpus of traditional songs which have been handed down across generations, especially Irish fiddle tunes and West African banjo music.
The concepts of digital freedom trace multiple clear lineages to this tradition of music. For example, John Perry Barlow founded The Electronic Frontier Foundation and The Freedom of the Press Foundation, and participated heavily in the discussions on The WELL that laid the cryptological groundwork for what eventually became blockchains. He was also a member of The Grateful Dead (who, while more of a rock or country band than a traditional string band, stewarded and celebrated this corpus of music across several decades) and was himself an aficionado of the history of IP-unencumbered music.
If you see the word "Traditional" on a bluegrass setlist (usually listed next to a song where an author normally goes), it effectively means "I assert that this song is not subject to intellectual property."
I need to revisit this in the next few weeks as I release my second record (which, if I may boast, has an incredible ensemble of most of my favorite bluegrass musicians on it; it was a really fun few days at the studio).
Currently I do pin all new content to IPFS and put the hashes in the content description, as with this video of Drowsy Maggie with David Grier: https://www.youtube.com/watch?v=yTI1HoFYbE0
Another note: our study of Drowsy Maggie was largely made possible by finding old-and-nearly-forgotten versions in the Great78 project, which of course the industry attempted to sue out of existence on an IP basis. This is another example of how IP is a conceptual threat to traditional music - we need to be able to hear the tradition in order to honor it.
This is unlike the model that PlayStation, Xbox and even Nvidia are following - I don’t know about Amazon Luna.
They recently added "Install to Play" where you can install games from Steam that aren't modified for the service. They charge for storage for this though.
Sadly, there are still tons of games unavailable because publishers need to opt in and many don't.
Stadia required special versions of games, so it wouldn't be that useful.
Artemis is a bit better, but it still requires per-device setup of displays since it somehow doesn't disable the physical output next to the virtual one. Those drivers also add latency to the capture (the author of looking glass really dislikes them because they undo all the hard work of near-zero latency).
[1]: https://github.com/acuteaura/universe/blob/main/systems/_mod...
> Built-in Virtual Display with HDR support that matches the resolution/framerate config of your client automatically
It includes a virtual screen driver, and it handles all the crap (it can disable your physical screen when streaming and re-enable it afterwards, it can generate the virtual screen per client to match the client's needs, or do it per game, or ...)
I stream from my main PC to both my laptop and my Steam Deck, and each gets a screen that matches it without my having to do anything more than connect with Moonlight.
> virtual_display (charp)
> Set to enable virtual display feature. This feature provides a virtual display hardware on headless boards or in virtualized environments. It will be set like xxxx:xx:xx.x,x;xxxx:xx:xx.x,x. It’s the pci address of the device, plus the number of crtcs to expose. E.g., 0000:26:00.0,4 would enable 4 virtual crtcs on the pci device at 26:00.0. The default is NULL.
[1] https://www.kernel.org/doc/html/latest/gpu/amdgpu/module-par...
https://bugzilla.kernel.org/show_bug.cgi?id=203339
edit: didn't realize you're the OP lol
Speaking of which, who thought up the idea to use custom hardware for this that would _already be obsolete_ a year later? Who considered using Linux native instead of a compat layer? Why did the original Stadia website not even have a search bar??
Steam has game streaming built in, and it works very well. Both Nvidia and AMD built this into their GPU drivers at one point or another (I think the AMD one was shut down?)
Those are just the solutions I accidentally have installed despite not using that functionality. You can even stream games from the steam deck!
Sony even has a system to let you stream your PS4 to your computer anywhere and play it. I think Microsoft built something similar for Xbox.
The use case is copying a file over a slow network where the previous version is already on the other end, so one can save time by sending only the changed parts of the file.
Not to be confused with USB CDC ("communications device class"), a USB device protocol used to present serial ports and network cards. It can also be used to transfer files; the old PC-to-PC cables used it by implementing two network cards connected to each other.
Instead, with CDC the block boundaries are defined by the content, so an insertion doesn't shift the boundaries of subsequent blocks, and it can tell those blocks are unchanged. I haven't read the CDC paper, but I'm guessing they just use some probabilistic hash function to define certain strings as block boundaries.
You choose a number of bits (say, 12) and then evenly distribute these in a 48-bit mask; if the hash at any point has all these bits on, that defines a boundary.
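For illustration, here is a minimal Go sketch of that boundary rule: a GEAR-style rolling hash (shift, then add a per-byte constant) with 12 mask bits spread across the upper 48 bits of the hash. The table seed, mask layout, and absence of min/max chunk limits are arbitrary choices for the sketch, not necessarily what cdc_rsync actually uses.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // 256 random 64-bit constants, one per byte value. Any fixed
    // pseudo-random table works; the seed here is arbitrary.
    var gearTable = func() [256]uint64 {
        var t [256]uint64
        r := rand.New(rand.NewSource(1))
        for i := range t {
            t[i] = r.Uint64()
        }
        return t
    }()

    // 12 bits spread evenly across the upper 48 bits of the hash (one bit
    // every 4 positions). A boundary is declared when all masked bits are
    // set, which happens with probability ~2^-12, i.e. ~4 KiB average chunks.
    var boundaryMask = func() uint64 {
        var m uint64
        for i := 0; i < 12; i++ {
            m |= 1 << (16 + 4*i)
        }
        return m
    }()

    // chunk splits data into content-defined chunks: each boundary depends
    // only on the bytes near it, so an insertion early in the file does not
    // shift the boundaries found later.
    func chunk(data []byte) [][]byte {
        var chunks [][]byte
        var hash uint64
        start := 0
        for i, b := range data {
            hash = (hash << 1) + gearTable[b] // GEAR step: old bytes age out after 64 shifts
            if hash&boundaryMask == boundaryMask {
                chunks = append(chunks, data[start:i+1])
                start = i + 1
                hash = 0
            }
        }
        if start < len(data) {
            chunks = append(chunks, data[start:])
        }
        return chunks
    }

    func main() {
        for _, c := range chunk([]byte("example input, replace with real file contents")) {
            fmt.Println(len(c))
        }
    }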
[0] https://en.wikipedia.org/wiki/Cult_of_the_Dead_Cow
https://partner.steamgames.com/doc/sdk/uploading#AppStructur...
Does this work Linux to Linux too?
I wonder if Steam will ever decide to supercharge their content handling with some user-space filesystem stuff. With fast connections, there isn't really a reason they couldn't launch games in seconds, streaming data on demand with smart pre-caching steered by automatically trained access-pattern data. And especially with finely tuned delta patching like this, online game pauses for patching could be almost entirely eliminated. Stop & go instead of a pit stop.
[1] https://web.archive.org/web/20250517130138/https://venusoft....
[2] https://venusoft.net/#home
This would be extra cool for LAN parties with good network hardware
With low bandwidth, sure, just downloading the whole thing, with enough compression to keep the local system around 80% saturated, would be optimal instead.
- only works on a weird combo of (src platform / dst platform). Why???? How hard is it to write platform-independent code to read/write bytes and send them over the wire in 2025?
- uses bazel, an enormous, Java-based abomination, to build.
Fingers crossed that these can be fixed, or this project is dead in the water.
Stadia ran on Linux, and 99.9999999% of game development is done on Windows (and cross-compiled for Linux).
> Fingers crossed that these can be fixed, or this project is dead in the water.
The project was archived 9 months ago, and hasn't had a commit in 2 years. It's already dead.
The great thing is that Googlers could make such a tool and publish it in the first place. So you can improve it to fit your scenario, or become the maintainer of such a tool.
Literally tonight my buddy was talking about his months-long plan to introduce Bazel into his company's infra.
https://github.com/buildbarn/go-cdc
Did you compare it to Buzhash? I assume gearhash is faster given the simpler per iteration structure. (also, rand/v2's seeded generators might be better for gear init than mt19937)
Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2048 binary digits of π) would work as well.
If it doesn't work equally well with any random numbers, then some seeds work better than others, and intuitively you could find a best seed (or a set of best seeds).
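To make the per-iteration comparison a few comments up concrete, here is a rough Go sketch of the two inner loops, with the 256-entry (2 KB) table generated once from a seeded math/rand/v2 generator. Both step functions and the table are illustrative, not the exact code in go-cdc.

    package main

    import (
        "fmt"
        "math/bits"
        "math/rand/v2"
    )

    // 256 x 64-bit constants (2 KiB total), one per byte value, generated
    // once from a seeded PCG. Which seed is used matters little in practice.
    var table = func() [256]uint64 {
        var t [256]uint64
        r := rand.New(rand.NewPCG(0, 42)) // arbitrary seed
        for i := range t {
            t[i] = r.Uint64()
        }
        return t
    }()

    // gearStep: one shift and one add per byte; old bytes fall out of the
    // hash implicitly after 64 shifts, so no "remove" lookup is needed.
    func gearStep(h uint64, in byte) uint64 {
        return (h << 1) + table[in]
    }

    // buzStep: a Buzhash (cyclic polynomial) step over a window of w bytes
    // needs a rotate plus lookups for both the incoming and outgoing byte,
    // which is why its inner loop is heavier than Gear's.
    func buzStep(h uint64, in, out byte, w int) uint64 {
        return bits.RotateLeft64(h, 1) ^ bits.RotateLeft64(table[out], w) ^ table[in]
    }

    func main() {
        data := []byte("some example input bytes")
        const w = 8
        var hg, hb uint64
        for i, b := range data {
            hg = gearStep(hg, b)
            if i < w {
                hb = bits.RotateLeft64(hb, 1) ^ table[b] // warm-up: fill the window
            } else {
                hb = buzStep(hb, b, data[i-w], w)
            }
        }
        fmt.Printf("gear=%016x buz=%016x\n", hg, hb)
    }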
If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.
(Please don't hurt me.)
AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).
Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.
In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think the MaxCDC implementation I shared strikes a good balance in that regard.
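A rough sketch of why determinism matters for that kind of lookup: the cache key is derived from the chunk digests, so two producers that chunk the same file differently will compute different keys. This is a generic two-level digest-of-digests, not Buildbarn's actual Merkle tree layout.

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // chunkDigests hashes each chunk individually. The chunker must be
    // deterministic, or two producers of the same file will disagree here.
    func chunkDigests(chunks [][]byte) [][32]byte {
        out := make([][32]byte, len(chunks))
        for i, c := range chunks {
            out[i] = sha256.Sum256(c)
        }
        return out
    }

    // cacheKey derives a single key from the per-chunk digests by hashing
    // their concatenation (a flat, two-level tree of hashes).
    func cacheKey(digests [][32]byte) [32]byte {
        h := sha256.New()
        for _, d := range digests {
            h.Write(d[:])
        }
        var key [32]byte
        copy(key[:], h.Sum(nil))
        return key
    }

    func main() {
        chunks := [][]byte{[]byte("chunk-a"), []byte("chunk-b")}
        fmt.Printf("cache key: %x\n", cacheKey(chunkDigests(chunks)))
    }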
I just wanted to let you know, this is really cool. Makes me wish I still used Bazel.
This is an interesting enough problem, with huge potential benefits for humanity if it manages to improve anything, which it did.
[1] - https://github.com/google/cdc-file-transfer/issues/56#issuec...
[2] - https://github.com/librsync/librsync/issues/242
A git blob is hashed with a header containing its decimal length, so if you change even a small bit of content, you have to calculate the hash from the start again.
Something like CDC would improve this a lot.
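For context, this is the hashing being described: git's blob ID is a SHA-1 over the header "blob <decimal length>\0" followed by the entire content, so there is nothing to reuse when a few bytes change. A small Go sketch:

    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    // gitBlobID computes the object ID git assigns to a blob: SHA-1 over
    // the header "blob <decimal length>\x00" followed by the full content.
    // Any edit, however small, requires rehashing from the first byte.
    func gitBlobID(content []byte) [sha1.Size]byte {
        h := sha1.New()
        fmt.Fprintf(h, "blob %d\x00", len(content))
        h.Write(content)
        var id [sha1.Size]byte
        copy(id[:], h.Sum(nil))
        return id
    }

    func main() {
        // Matches `echo "hello world" | git hash-object --stdin`.
        fmt.Printf("%x\n", gitBlobID([]byte("hello world\n")))
    }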
https://joshleeb.com/posts/content-defined-chunking.html
https://joshleeb.com/posts/gear-hashing.html
Looking forward to reading those.
https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
The speed improvements over rsync seem related to a more efficient rolling hash algorithm, and possibly to using native Windows executables instead of Cygwin (Windows file systems are notoriously slow; maybe that plays a role here).
Or am I missing something?
In any case, the performance boost is interesting. Glad the source was opened, and I hope it finds its way into rsync.
That said, VIM 8 was terrific.
The original is the GPL variant [today displaying "Upgrade required"]:
https://rsync.samba.org/
The second is the BSD clone:
https://www.openrsync.org/
The BSD version would be used on platforms that are intolerant of later versions of the GPL (Apple, Android, etc.).
No, it operates on fixed-size blocks of the destination file. However, by using a rolling hash, it can detect those blocks at any offset within the source file and avoid re-transferring them.
https://rsync.samba.org/tech_report/node2.html
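The weak checksum from that report is two small running sums (Adler-32-like) that can be slid one byte at a time in constant time, which is what makes checking every source offset affordable. A Go sketch of the idea, with an illustrative block size and data:

    package main

    import "fmt"

    // weakSum computes the rsync-style weak checksum of one block:
    // s1 is the sum of the bytes, s2 weights earlier bytes more heavily.
    func weakSum(block []byte) (s1, s2 uint32) {
        for i, b := range block {
            s1 += uint32(b)
            s2 += uint32(len(block)-i) * uint32(b)
        }
        return s1 & 0xffff, s2 & 0xffff
    }

    // roll slides the window one byte to the right in constant time:
    // drop the outgoing byte, add the incoming one. This lets a block
    // match be tested at every offset without rehashing the window.
    func roll(s1, s2 uint32, out, in byte, blockLen int) (uint32, uint32) {
        s1 = (s1 - uint32(out) + uint32(in)) & 0xffff
        s2 = (s2 - uint32(blockLen)*uint32(out) + s1) & 0xffff
        return s1, s2
    }

    func main() {
        data := []byte("abcdefghij")
        const blockLen = 4
        s1, s2 := weakSum(data[:blockLen])
        fmt.Printf("offset 0: %04x%04x\n", s2, s1)
        for i := blockLen; i < len(data); i++ {
            s1, s2 = roll(s1, s2, data[i-blockLen], data[i], blockLen)
            fmt.Printf("offset %d: %04x%04x\n", i-blockLen+1, s2, s1)
        }
    }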
> scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I haven't tried it myself, but doesn't this already suit that requirement? https://docs.rc.fas.harvard.edu/kb/rsync/
> Compression If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred. This will cause more CPU to be used on both ends, but it is usually faster.
Maybe it's not fast enough, but it seems a better place to start than scp, imo.
Game development, in particular, often involves truly enormous sizes and numbers of assets, particularly for dev build iteration, where you're sometimes working with placeholder or unoptimized assets and debug-symbol-bloated things, and in my experience rsync scales poorly for copying large numbers of files. (In the past, I've used naive wrapper scripts with pregenerated lists of the files on one side and GNU parallel to partition the list into subsets and hand those to N different rsync jobs, then run a sync pass at the end to clean up any deletions.)
Just last week, I was trying to figure out a more effective way to copy a directory tree of ~250k files, varying in size between 128b and 100M, spread out across a complicatedly nested structure of 500k directories, because rsync would serialize badly around the cost of creating files and directories. After a few rounds of trying many-way rsync partitions, I finally just gave the directory to Syncthing and let its pregenerated index and file watching handle it.
> The key insight is that file operations in separate directories don’t (for the most part) interfere with each other, enabling parallel execution.
It really is magically fast.
EDIT: Sorry, that tool is only for local copies. I just remembered you're doing remote copies. Still worth keeping in mind.
Interesting, so unlike rsync there is no need to set up a service on the destination Linux machine. That always annoyed me a bit about rsync.
You were misinformed if you thought using rsync required setting up an rsync service.
https://en.wikipedia.org/wiki/Remote_Differential_Compressio...
https://www.ibm.com/products/aspera
The Google repo has been archived. Did they give it up?
This is a cool idea!
[1] https://github.com/claytongulick/bit-sync