You Think You Have Data Lineage, but You Probably Don't
The data itself carries no history. You end up scraping logs, parsing queries, and making guesses across tools that were never built to retain this kind of context. By the time you try to trace something back, it's already too late or too incomplete to be useful.
The layered nature of modern stacks makes this worse. Each stage drops context. Ingest, transform, orchestrate, deliver: somewhere along the way, you lose track of what actually happened.
After years of dealing with this in real-world systems, I started building something different. In our system (called Tabsdata), every dataset is versioned as it flows from where it is produced to where it is finally consumed. Each transformation, join, and enrichment step is recorded along the way. As a result, lineage is accurate, explainable, and always available, without scraping anything. And so is provenance.
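To make the idea concrete, here is a minimal sketch of what write-time lineage capture looks like in Python. This is not the Tabsdata API; the Catalog and DatasetVersion names, fields, and methods are all hypothetical, and the point is only to show provenance being recorded as data is produced rather than reconstructed from logs afterward.

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import Any
    import uuid


    @dataclass(frozen=True)
    class DatasetVersion:
        """One immutable version of a dataset plus its provenance."""
        dataset: str
        version_id: str
        operation: str                      # e.g. "ingest", "join", "enrich"
        inputs: tuple[str, ...]             # version_ids this version was derived from
        created_at: datetime
        data: Any                           # the payload itself (rows, a frame, a file path)


    class Catalog:
        """Stores dataset versions and answers lineage queries, no log scraping."""

        def __init__(self) -> None:
            self._versions: dict[str, DatasetVersion] = {}

        def write(self, dataset: str, data: Any, operation: str,
                  inputs: tuple[DatasetVersion, ...] = ()) -> DatasetVersion:
            # Provenance is recorded at the moment the data is written.
            version = DatasetVersion(
                dataset=dataset,
                version_id=str(uuid.uuid4()),
                operation=operation,
                inputs=tuple(v.version_id for v in inputs),
                created_at=datetime.now(timezone.utc),
                data=data,
            )
            self._versions[version.version_id] = version
            return version

        def lineage(self, version_id: str) -> list[DatasetVersion]:
            # Walk upstream from a version to every version it was derived from.
            out, stack = [], [version_id]
            while stack:
                v = self._versions[stack.pop()]
                out.append(v)
                stack.extend(v.inputs)
            return out


    # Lineage is a property of the data itself, captured as it is produced.
    catalog = Catalog()
    orders = catalog.write("orders", data=[...], operation="ingest")
    customers = catalog.write("customers", data=[...], operation="ingest")
    report = catalog.write("sales_report", data=[...], operation="join",
                           inputs=(orders, customers))

    for step in catalog.lineage(report.version_id):
        print(step.dataset, step.operation, step.version_id)

The design choice this sketch illustrates is the one described above: because every write carries its own version and input references, a lineage query is a lookup, not a forensic reconstruction across tools.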
Curious how others are thinking about this. Has anyone found a clean way to make lineage reliable across tools, without turning it into a forensic exercise?