Data Engineering and Software Engineering Are Converging
Key topics
The lines between data engineering and software engineering are blurring, sparking a lively debate about the future of data infrastructure. Some commenters, like zurfer and craneca0, argue that while no-code tooling remains prevalent, a code-first approach will become more prominent to enable LLMs and agents to automate data work. Others, like giantg2 and CalRobert, contend that data engineering has always been a form of software engineering, with the distinction being more about job titles and focus areas than fundamental differences. As rawgabbit's experience with Snowflake's Python Stored Procedures illustrates, the convergence is already underway, with data engineers leveraging programming languages and software engineering techniques to tackle complex data tasks.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 1h after posting
- Peak period: 35 comments in the 0-6h window
- Average per period: 11 comments
Based on 66 loaded comments
Key moments
- Story posted: Aug 29, 2025 at 2:43 PM EDT (4 months ago)
- First comment: Aug 29, 2025 at 4:02 PM EDT (1h after posting)
- Peak activity: 35 comments in the 0-6h window, the hottest period of the conversation
- Latest activity: Aug 31, 2025 at 8:18 PM EDT (4 months ago)
I've been in interviews where, after reading my resume, they say, "Oh, you're an embedded developer." Another said front-end, no, back-end, no, a systems developer, and another a desktop developer. In reality, I did all of those to get the job done and create a viable product.
Likewise, we had to steer HR away from “data engineer” because we got very mixed results with candidates.
By the same token, that's why there are DBAs, DevOps engineers, etc.
Obviously the same can apply in any given title, and does with data engineers like you pointed out, but it's not as simple as just title inflation.
There it is! I found the post title was strange. Thanks for setting the record straight so succinctly.
Ridiculous.
But I think what's interesting from the post is looking at SEs adopting data infra into their workflow, as opposed to DEs writing more software.
Anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.
For Spark, Glue works quite well. We use it as "Spark as a service", keeping our code as close to vanilla PySpark as possible. This leaves us free to write our code in normal Python files, write our own (tested) libraries for use in our jobs, use GitHub for version control and CI, and so on.
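A minimal sketch of that pattern, with all names hypothetical: the business logic lives in a plain, importable module that the Glue job calls into, so it can be unit tested locally without a cluster. With PySpark the function would take and return a DataFrame; plain dicts keep the example self-contained.

```python
# Hypothetical module, e.g. jobs/transforms.py, imported by the Glue job.
# Keeping logic in ordinary Python files (not a notebook) means it can be
# version-controlled and exercised by CI without any Spark runtime.

def add_net_amount(rows, tax_rate=0.2):
    """Derive a net amount from the gross amount on each row."""
    return [{**row, "net": round(row["gross"] / (1 + tax_rate), 2)} for row in rows]

# A unit test runs the logic directly in CI, no cluster required.
assert add_net_amount([{"id": 1, "gross": 120.0}])[0]["net"] == 100.0
```

The Glue job itself then reduces to glue code: read the source, apply the tested functions, write the result.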
Personally, you couldn't pay me to run Spark myself these days (and I used to work for the biggest Hadoop vendor in the mid-2010s, doing a lot of Spark!).
There are plenty of us out here with many repos, dozens of contributors, and thousands of lines of terraform, python, custom GitHub actions, k8s deployments running airflow and internal full stack web apps that we're building, EMR spark clusters, etc. All living in our own Snowflake/AWS accounts that we manage ourselves.
The data scientists we service use notebooks extensively, but it's my team's job to clean that up and make it testable and efficient. You can't develop real software in a notebook; it sounds like they need to upskill into a real orchestration platform like Airflow and run everything through it.
Unit test the utility functions and helpers; data-quality test the data flowing in and out. Build diff reports to understand big swings in the data before signing off on changes.
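A diff report of that kind can be as simple as comparing aggregates between the current and candidate pipeline outputs and flagging large relative swings. A hedged sketch (the function and metric names are made up):

```python
def diff_report(before, after, threshold=0.10):
    """Flag metrics whose relative change between two runs exceeds `threshold`."""
    flagged = []
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key, 0), after.get(key, 0)
        if old == 0:
            change = float("inf") if new else 0.0
        else:
            change = abs(new - old) / abs(old)
        if change > threshold:
            flagged.append((key, old, new, change))
    return flagged

# A 60% swing in one metric is flagged for manual sign-off; the 2% one passes.
report = diff_report({"orders": 100, "revenue": 50}, {"orders": 102, "revenue": 80})
```

In practice the inputs would be per-partition row counts, null rates, or sums pulled from the old and new runs.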
My email is in my profile I'm happy to discuss further! :-)
You're still dealing with notebooks. Back then there was a tool to connect your IDE to a Databricks cluster, but that got killed; not sure if they have something new.
1. You use a real programming language that supports all the abstractions software engineers rely on, not (just) SQL.
2. The data is not too big, so the feedback cycle is not too horrendously slow.
#2 can't ever be fully solved, but testing a data pipeline on randomly subsampled data can help a lot in my experience.
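One way to do that subsampling deterministically, so test runs are reproducible, is to hash a stable row key into [0, 1) and keep rows below the target fraction. A sketch, where the key is whatever uniquely identifies a row:

```python
import hashlib

def in_sample(key, fraction=0.01):
    """Deterministically keep ~`fraction` of rows, stable across runs and machines."""
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2.0**64 < fraction

# The same keys are selected on every run, so a failing test is reproducible,
# and related tables stay consistent if they are sampled on the same key.
kept = sum(in_sample(user_id) for user_id in range(100_000))
```

Sampling on a join key (e.g. user ID) rather than per-row keeps referential integrity across tables, which random per-row sampling would break.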
In all of these packages, the base tabular object you get is a local in-memory table. For manipulating remote SQL database tables, the best full-featured object API is provided by R's dbplyr package, IMHO.
I think Apache Spark, Ibis, and some other big-data packages can be configured to do this too, but IMHO their APIs are not nearly as helpful. For those who (understandably) don't want to use R and need an alternative to dbplyr, Ibis is probably the best one to look at.
Another anecdatum: the data engineer role at Zillow is called "Software Development Engineer, Big Data".
Their organization often insists they must use standard tools, and their idea of a good job is that the task works fine within their personal version. No automatic testing, no automated deployment, no version control, and handcrafted environments. And then they get yelled at when things break and yelled at for taking too long. And most DEs want to quit the field after a few years.
The real question is not that DE and software engineering are converging. It's why most DEs don't have the self-respect and confidence to engineer systems so that their lives don't suck.
My view is that it isn't so much a lack of "self-respect and confidence" but an acknowledgment that the path of least resistance is often the best one. Often data teams are something that was tacked on as an afterthought and the organizational environment is oriented towards buying off-the-shelf solutions rather than developing things in house.
That said, version control and replicable environments are becoming standard in the profession, and as data professionals become first-class citizens in organizations, we may find that orgs orient themselves toward a more production-focused environment.
My title is senior data engineer at GAMMA/FAANG/whatever we're calling them. I have a CS degree and am firmly in the engineering camp. My passion, though, is using software engineering and computer science principles to make very large-scale data processing as stupid fast as we can. To the extent I can ignore it, I don't personally care much about the tooling and frameworks (CI/CD, Airflow, Kafka, whatever). I care about how we affinitize our data, how we index it, whether and when we can use data sketches to achieve a good tradeoff between accuracy and compute/memory, and so on.
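For readers unfamiliar with data sketches, a k-minimum-values (KMV) estimator illustrates the accuracy/memory tradeoff: it approximates a distinct count in O(k) memory, with error shrinking roughly as 1/√k. A simplified, hedged sketch; production systems typically use a library such as Apache DataSketches rather than hand-rolled code like this:

```python
import hashlib
import heapq

def kmv_estimate(items, k=1024):
    """Approximate a distinct count by tracking only the k smallest hash values."""
    in_heap = set()
    heap = []  # stores negated hashes: a max-heap of the k smallest values
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") / 2.0**64  # uniform in [0, 1)
        if h in in_heap:
            continue  # duplicate of a tracked value
        if len(heap) < k:
            heapq.heappush(heap, -h)
            in_heap.add(h)
        elif h < -heap[0]:  # smaller than the current kth-smallest value
            evicted = -heapq.heappushpop(heap, -h)
            in_heap.discard(evicted)
            in_heap.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    # The spacing of the k smallest hashes estimates the overall density.
    return int((k - 1) / -heap[0])
```

With k=1024 the sketch uses a few kilobytes regardless of input size and typically lands within a few percent of the true count; quadrupling k roughly halves the error.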
While there are plenty of folks in this thread bashing analysts, one could also bash other "proper" engineers that can do the CI/CD but don't know shit about how to be efficient with petabyte-scale processing.
Make up a person and attack him, literal strawman. You sound pleasant to work with.
They’re not engineers and shouldn’t have been labeled data engineers. They have some other value to the company, presumably, but trying to repackage them as data engineers does cause issues. That’s the topic of this thread.
That job taught me a lot.
But that would be SWEs no?
I was a 'data engineer' (until they changed the terrible title) at a startup and I ended up having to fight with Spark and Apache Beam at times, eventually contributing back to improve throughput for our use cases.
That's not the same thing as a Business Analyst who can run a little pyspark query.
Python is dynamically typed, which you can patch a bit with type hints, but it's still easy to go to production with incompatible types, leading to crashes in prod. Its interpreted nature also makes it very slow.
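As a small illustration of that failure mode (the function and values here are made up): type hints document intent, but nothing enforces them at runtime, so a bad value from, say, an unvalidated input file only fails when that code path actually executes. A static checker like mypy would reject the call below; the interpreter runs it happily.

```python
def total_cents(amounts: list[float]) -> int:
    """Sum monetary amounts and convert to integer cents."""
    return int(sum(amounts) * 100)

# A string slips in from an unvalidated source. The hint says list[float],
# but Python only raises TypeError when the sum is actually computed.
bad_batch = [1.50, "2.25", 3.00]
try:
    total_cents(bad_batch)  # type: ignore[arg-type]
except TypeError as exc:
    error = str(exc)  # "unsupported operand type(s) for +: 'float' and 'str'"
```

Running mypy in CI catches this class of bug before deployment, but only for code paths where the types are actually annotated and checked.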
SQL is pretty much impossible to unit test, yet you will often end up with logic that you want to test, e.g. to optimize a query.
For SQL I don't have a solution. It's a 50 year old language that lacks a lot of features you would expect. It's also the defacto standard for database access.
For Python, I would say that we should start adopting statically typed, compiled languages. Rust has Polars as a dataframe package, but the language itself isn't that easy to pick up. Go is very easy to learn but has no serious dataframe package, so you end up doing a lot of that work yourself in goroutines. Maybe there are better options out there.
In general, choice of language isn't that important; if you're using Spark, your dataframe schema defines that structure, Python or not.
Most folks confuse pandas with "data engineering". It isn't; most data engineering is Spark.
But when you create a dataframe in Spark, that schema needs to be defined, or, if it's SQL, it takes the form of the columns returned.
Use of Python can create hotspots with data transfers between Spark and the Python gateway; Python UDFs are a common culprit.
Either way, my point is that there are architectural and design decisions in your data solution that can cause many more problems than choice of language.
SQL is the most beautiful, expressive, get stuff done language I've used.
It is perfect for whatever data engineering is defined as.
- input data should be pseudorandom, so the chance of a test being “accidentally correct” is minimized
- you need a way to verify only part of the result set. Or, at the very least, a way to write tests so that if you add a column to the result set, your test doesn’t automatically break
In addition, I added CSV exports so you can verify the results by hand, and hot-reload for queries with CTEs — if you change a .sql file then it will immediately rerun each CTE incrementally and show you which ones’ output changed.
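Both properties above can be had with nothing more than the standard library and an in-memory SQLite database. A hedged sketch, with a trivially simple query standing in for real pipeline SQL:

```python
import random
import sqlite3

random.seed(42)  # pseudorandom input data, but reproducible run to run
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
rows = [(i, random.choice(["eu", "us"]), round(random.uniform(1, 100), 2))
        for i in range(1000)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Query under test: a per-region aggregation.
result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Verify only the columns we care about, recomputing totals in Python, so
# adding a column to the SELECT later won't silently break the test.
expected = {}
for _, region, amount in rows:
    expected[region] = expected.get(region, 0.0) + amount
for row in result:
    region, total = row[0], row[1]
    assert abs(total - expected[region]) < 1e-6
```

SQLite's dialect differs from warehouse SQL, so this only covers portable logic; for vendor-specific queries you would point the same harness at a scratch schema in the real engine.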
https://cs.brown.edu/~sk/Publications/Papers/Published/kf-da...
There's a specific type of "data engineer" (quotes to indicate this is what they're called by the business, not to contest their legitimacy) that just writes lots of SQL, but they're usually a bad hire for businesses. They're approximately as expensive as what people call platform engineers, but platform engineers in the data space can usually do modelling as well.
When organizations split teams between the most SWE-type DEs and the pure-SQL ones, the latter all jockey to join the former team, which causes a lot of drama too.