Replica_db – Synthetic Data Generator Using Rust and Gaussian Copulas
Posted Dec 15, 2025 at 3:30 PM EST · github.com · Hacker News story 46280158
I built this because I kept running into the same bottleneck on data projects: staging environments are always either empty or dangerous. Using production dumps puts you at risk of PII leaks, while generating meaningful test data with Python tools (like Faker or SDV) often hit OOM errors or took hours once I tried to simulate anything complex.
I spent the last week writing replica_db to solve this. It's a CLI tool written in Rust that reverse-engineers your existing Postgres schema and foreign-key topology, then creates a "statistical genome" of your data using reservoir sampling.
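For anyone who hasn't run into reservoir sampling: the point is that you can keep a fixed-size, uniformly random sample of a column while streaming the table exactly once, so memory stays constant no matter how many rows there are. Here's a generic Algorithm R sketch assuming the rand crate; it's an illustration of the technique, not code from the repo.

```rust
use rand::Rng;

// Keep a uniform random sample of at most `k` items from a stream of
// unknown length, using O(k) memory (classic Algorithm R).
fn reservoir_sample<T, I, R>(rows: I, k: usize, rng: &mut R) -> Vec<T>
where
    I: IntoIterator<Item = T>,
    R: Rng,
{
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, value) in rows.into_iter().enumerate() {
        if i < k {
            // Fill the reservoir with the first k rows.
            reservoir.push(value);
        } else {
            // Keep row i with probability k / (i + 1), replacing a
            // uniformly chosen existing entry.
            let j = rng.gen_range(0..=i);
            if j < k {
                reservoir[j] = value;
            }
        }
    }
    reservoir
}
```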
The cool part (for me) was implementing Gaussian copulas to handle correlations. Most generators treat columns independently, which produces uncorrelated data (like a user with age 5 earning $200k). I used nalgebra to compute the covariance matrix of the numeric columns, so the engine actually learns the shape of the data.
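The copula step, very roughly: estimate the correlation structure of the numeric columns, Cholesky-factor it, draw correlated standard normals, map them to uniforms through the normal CDF, then invert each column's empirical CDF. The sketch below is my own illustration of that pipeline, not code from the repo; it assumes the nalgebra, rand, and rand_distr crates, uses a correlation matrix for simplicity, and the marginals are stand-ins for the sorted reservoir samples mentioned above.

```rust
use nalgebra::{DMatrix, DVector};
use rand::Rng;
use rand_distr::StandardNormal;

// Standard normal CDF via the Abramowitz–Stegun erf approximation
// (absolute error below ~1.5e-7), used to turn correlated normals
// into correlated uniforms.
fn std_normal_cdf(x: f64) -> f64 {
    let z = x / std::f64::consts::SQRT_2;
    let sign = if z < 0.0 { -1.0 } else { 1.0 };
    let z = z.abs();
    let t = 1.0 / (1.0 + 0.3275911 * z);
    let poly = t
        * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    0.5 * (1.0 + sign * (1.0 - poly * (-z * z).exp()))
}

// Draw one synthetic row: correlate standard normals through the
// Cholesky factor, convert each to a uniform, then look up the
// matching empirical quantile in that column's sorted sample.
fn sample_row<R: Rng>(
    rng: &mut R,
    chol_l: &DMatrix<f64>,   // lower-triangular Cholesky factor of the correlation matrix
    marginals: &[Vec<f64>],  // one sorted sample per numeric column
) -> Vec<f64> {
    let dim = chol_l.nrows();
    let z: DVector<f64> = DVector::from_fn(dim, |_, _| rng.sample::<f64, _>(StandardNormal));
    let correlated = chol_l * z; // preserves the pairwise correlations
    correlated
        .iter()
        .zip(marginals)
        .map(|(&g, sample)| {
            let u = std_normal_cdf(g); // in (0, 1)
            let idx = ((u * sample.len() as f64) as usize).min(sample.len() - 1);
            sample[idx] // empirical quantile of the real column
        })
        .collect()
}

fn main() {
    // Toy correlation matrix for two columns (e.g. latitude/longitude).
    let corr = DMatrix::from_row_slice(2, 2, &[1.0, 0.8, 0.8, 1.0]);
    let chol_l = corr.cholesky().expect("positive definite").l();

    // Tiny sorted "genomes" standing in for real reservoir samples.
    let marginals = vec![
        vec![40.70, 40.72, 40.75, 40.78, 40.80],      // latitude
        vec![-74.02, -74.00, -73.98, -73.96, -73.95], // longitude
    ];

    let mut rng = rand::thread_rng();
    for _ in 0..3 {
        println!("{:?}", sample_row(&mut rng, &chol_l, &marginals));
    }
}
```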
I tested this on the Uber NYC trip dataset, and it automatically detected the correlation between latitude and longitude. When I generated 5 million fake trips, they respected the actual geography of NYC instead of placing points randomly in the ocean.
Benchmarks on my laptop have been encouraging: scanning 564k real-world rows takes about 2.2 seconds, and generating 10 million synthetic rows takes under five minutes (~49k rows/sec) with constant memory usage. The output streams standard COPY format directly to stdout, so you can pipe it straight into psql.
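For concreteness, COPY text format is just tab-separated lines on stdout, so the generator can stream without buffering the whole dataset. This is a minimal made-up example of that output style, not the repo's code; the table and column names are hypothetical.

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Buffer writes so streaming rows doesn't pay a syscall per row.
    let mut out = BufWriter::new(stdout.lock());

    // COPY text format: one row per line, columns separated by tabs,
    // NULL spelled as \N.
    for i in 0..10 {
        let lat = 40.75 + i as f64 * 1e-4;
        let lon = -73.98;
        let fare = 12.50;
        writeln!(out, "{}\t{}\t{}", lat, lon, fare)?;
    }
    out.flush()
}
```

Something like `./generator | psql mydb -c "\copy trips(lat, lon, fare) from stdin"` would then load the stream straight into a (hypothetical) trips table.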
The repo isn't licensed yet. It's my first project involving this level of systems programming and statistical math in Rust, so I'd appreciate any feedback on the implementation or the math strategy!
https://github.com/Pragadeesh-19/replica_db