Replica_db – Synthetic Data Generator Using Rust and Gaussian Copulas
Posted Dec 15, 2025 at 3:30 PM EST · github.com · Hacker News story 46280158
I built this because I kept running into the same bottleneck on data projects: staging environments are always either empty or dangerous. Using production dumps puts you at risk of PII leaks, while generating meaningful test data with Python tools (like Faker or SDV) often hit OOM errors or took hours once I tried to simulate anything complex.
I spent the last week writing replica_db to solve this. It's a CLI tool written in Rust that reverse-engineers your existing Postgres schema and foreign-key topology, then creates a "statistical genome" of your data using reservoir sampling.
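For anyone who hasn't run into reservoir sampling: the point is that you can keep a fixed-size, uniformly random sample of a column while streaming the table exactly once, so memory stays constant no matter how many rows there are. Here's a generic Algorithm R sketch assuming the rand crate; it's an illustration of the technique, not code from the repo.

```rust
use rand::Rng;

// Keep a uniform random sample of at most `k` items from a stream of
// unknown length, using O(k) memory (classic Algorithm R).
fn reservoir_sample<T, I, R>(rows: I, k: usize, rng: &mut R) -> Vec<T>
where
    I: IntoIterator<Item = T>,
    R: Rng,
{
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, value) in rows.into_iter().enumerate() {
        if i < k {
            // Fill the reservoir with the first k rows.
            reservoir.push(value);
        } else {
            // Keep row i with probability k / (i + 1), replacing a
            // uniformly chosen existing entry.
            let j = rng.gen_range(0..=i);
            if j < k {
                reservoir[j] = value;
            }
        }
    }
    reservoir
}
```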
The cool part (for me) was implementing Gaussian copulas to handle correlations. Most generators treat columns independently, which produces uncorrelated data (like a user with age 5 earning $200k). I used nalgebra to compute the covariance matrix of the numeric columns, so the engine actually learns the shape of the data.
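The copula step, very roughly: estimate the correlation structure of the numeric columns, Cholesky-factor it, draw correlated standard normals, map them to uniforms through the normal CDF, then invert each column's empirical CDF. The sketch below is my own illustration of that pipeline, not code from the repo; it assumes the nalgebra, rand, and rand_distr crates, uses a correlation matrix for simplicity, and the marginals are stand-ins for the sorted reservoir samples mentioned above.

```rust
use nalgebra::{DMatrix, DVector};
use rand::Rng;
use rand_distr::StandardNormal;

// Standard normal CDF via the Abramowitz–Stegun erf approximation
// (absolute error below ~1.5e-7), used to turn correlated normals
// into correlated uniforms.
fn std_normal_cdf(x: f64) -> f64 {
    let z = x / std::f64::consts::SQRT_2;
    let sign = if z < 0.0 { -1.0 } else { 1.0 };
    let z = z.abs();
    let t = 1.0 / (1.0 + 0.3275911 * z);
    let poly = t
        * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    0.5 * (1.0 + sign * (1.0 - poly * (-z * z).exp()))
}

// Draw one synthetic row: correlate standard normals through the
// Cholesky factor, convert each to a uniform, then look up the
// matching empirical quantile in that column's sorted sample.
fn sample_row<R: Rng>(
    rng: &mut R,
    chol_l: &DMatrix<f64>,   // lower-triangular Cholesky factor of the correlation matrix
    marginals: &[Vec<f64>],  // one sorted sample per numeric column
) -> Vec<f64> {
    let dim = chol_l.nrows();
    let z: DVector<f64> = DVector::from_fn(dim, |_, _| rng.sample::<f64, _>(StandardNormal));
    let correlated = chol_l * z; // preserves the pairwise correlations
    correlated
        .iter()
        .zip(marginals)
        .map(|(&g, sample)| {
            let u = std_normal_cdf(g); // in (0, 1)
            let idx = ((u * sample.len() as f64) as usize).min(sample.len() - 1);
            sample[idx] // empirical quantile of the real column
        })
        .collect()
}

fn main() {
    // Toy correlation matrix for two columns (e.g. latitude/longitude).
    let corr = DMatrix::from_row_slice(2, 2, &[1.0, 0.8, 0.8, 1.0]);
    let chol_l = corr.cholesky().expect("positive definite").l();

    // Tiny sorted "genomes" standing in for real reservoir samples.
    let marginals = vec![
        vec![40.70, 40.72, 40.75, 40.78, 40.80],      // latitude
        vec![-74.02, -74.00, -73.98, -73.96, -73.95], // longitude
    ];

    let mut rng = rand::thread_rng();
    for _ in 0..3 {
        println!("{:?}", sample_row(&mut rng, &chol_l, &marginals));
    }
}
```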
I tested this on the Uber NYC trip dataset, and it automatically detected the correlation between latitude and longitude. When I generated 5 million fake trips, they respected the actual geography of NYC instead of placing points randomly in the ocean.
Benchmarks on my laptop have been encouraging: scanning 564k real-world rows takes about 2.2 seconds, and generating 10 million synthetic rows takes under five minutes (~49k rows/sec) with constant memory usage. The output streams standard COPY format directly to stdout, so you can pipe it straight into psql.
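For concreteness, COPY text format is just tab-separated lines on stdout, so the generator can stream without buffering the whole dataset. This is a minimal made-up example of that output style, not the repo's code; the table and column names are hypothetical.

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Buffer writes so streaming rows doesn't pay a syscall per row.
    let mut out = BufWriter::new(stdout.lock());

    // COPY text format: one row per line, columns separated by tabs,
    // NULL spelled as \N.
    for i in 0..10 {
        let lat = 40.75 + i as f64 * 1e-4;
        let lon = -73.98;
        let fare = 12.50;
        writeln!(out, "{}\t{}\t{}", lat, lon, fare)?;
    }
    out.flush()
}
```

Something like `./generator | psql mydb -c "\copy trips(lat, lon, fare) from stdin"` would then load the stream straight into a (hypothetical) trips table.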
The repo isn't licensed yet. It's my first project involving this level of systems programming and statistical math in Rust, so I'd appreciate any feedback on the implementation or the math strategy!
https://github.com/Pragadeesh-19/replica_db