Data Engineering and Software Engineering Are Converging
Key topics
The lines between data engineering and software engineering are blurring, sparking a lively debate about the future of data infrastructure. Some commenters, like zurfer and craneca0, argue that while no-code tooling remains prevalent, a code-first approach will become more prominent to enable LLMs and agents to automate data work. Others, like giantg2 and CalRobert, contend that data engineering has always been a form of software engineering, with the distinction being more about job titles and focus areas than fundamental differences. As rawgabbit's experience with Snowflake's Python Stored Procedures illustrates, the convergence is already underway, with data engineers leveraging programming languages and software engineering techniques to tackle complex data tasks.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 1h after posting
- Peak period: 35 comments in the 0-6h window
- Average per period: 11 comments
Based on 66 loaded comments
Key moments
- Story posted: Aug 29, 2025 at 2:43 PM EDT (4 months ago)
- First comment: Aug 29, 2025 at 4:02 PM EDT (1h after posting)
- Peak activity: 35 comments in the 0-6h window, the hottest period of the conversation
- Latest activity: Aug 31, 2025 at 8:18 PM EDT (4 months ago)
I've been in interviews where, after reading my resume, they say, "Oh, you're an embedded developer." Another said front-end, no, back-end, no, a systems developer, and another a desktop developer. In reality, I did all of those to get the job done and create a viable product.
Likewise, we had to steer HR away from “data engineer” because we got very mixed results with candidates.
By the same token, that's why there are DBAs, DevOps engineers, etc.
Obviously the same can apply in any given title, and does with data engineers like you pointed out, but it's not as simple as just title inflation.
There it is! I found the post title was strange. Thanks for setting the record straight so succinctly.
Ridiculous.
But I think what's interesting from the post is looking at SEs adopting data infra into their workflow, as opposed to DEs writing more software.
Anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.
For Spark, Glue works quite well. We use it as "Spark as a service", keeping our code as close to vanilla PySpark as possible. This leaves us free to write our code in normal Python files, write our own (tested) libraries for use in our jobs, use GitHub for version control and CI, and so on.
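A minimal sketch of that pattern, with all names hypothetical: the business logic lives in a plain, importable module that the Glue job calls into, so it can be unit tested locally without a cluster. With PySpark the function would take and return a DataFrame; plain dicts keep the example self-contained.

```python
# Hypothetical module, e.g. jobs/transforms.py, imported by the Glue job.
# Keeping logic in ordinary Python files (not a notebook) means it can be
# version-controlled and exercised by CI without any Spark runtime.

def add_net_amount(rows, tax_rate=0.2):
    """Derive a net amount from the gross amount on each row."""
    return [{**row, "net": round(row["gross"] / (1 + tax_rate), 2)} for row in rows]

# A unit test runs the logic directly in CI, no cluster required.
assert add_net_amount([{"id": 1, "gross": 120.0}])[0]["net"] == 100.0
```

The Glue job itself then reduces to glue code: read the source, apply the tested functions, write the result.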
Personally, you couldn't pay me to run Spark myself these days (and I used to work for the biggest Hadoop vendor in the mid-2010s, doing a lot of Spark!).
There are plenty of us out here with many repos, dozens of contributors, and thousands of lines of terraform, python, custom GitHub actions, k8s deployments running airflow and internal full stack web apps that we're building, EMR spark clusters, etc. All living in our own Snowflake/AWS accounts that we manage ourselves.
The data scientists we service use notebooks extensively, but it's my team's job to clean that up and make it testable and efficient. You can't develop real software in a notebook; it sounds like they need to upskill into a real orchestration platform like Airflow and run everything through it.
Unit test the utility functions and helpers; data-quality test the data flowing in and out. Build diff reports to understand big swings in the data before signing off on changes.
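A diff report of that kind can be as simple as comparing aggregates between the current and candidate pipeline outputs and flagging large relative swings. A hedged sketch (the function and metric names are made up):

```python
def diff_report(before, after, threshold=0.10):
    """Flag metrics whose relative change between two runs exceeds `threshold`."""
    flagged = []
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key, 0), after.get(key, 0)
        if old == 0:
            change = float("inf") if new else 0.0
        else:
            change = abs(new - old) / abs(old)
        if change > threshold:
            flagged.append((key, old, new, change))
    return flagged

# A 60% swing in one metric is flagged for manual sign-off; the 2% one passes.
report = diff_report({"orders": 100, "revenue": 50}, {"orders": 102, "revenue": 80})
```

In practice the inputs would be per-partition row counts, null rates, or sums pulled from the old and new runs.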
My email is in my profile I'm happy to discuss further! :-)
You're still dealing with notebooks. Back then there was a tool to connect your IDE to a Databricks cluster, but that got killed; not sure if they have something new.
1. You use a real programming language that supports all the abstractions software engineers rely on, not (just) SQL.
2. The data is not too big, so the feedback cycle is not too horrendously slow.
#2 can't ever be fully solved, but testing a data pipeline on randomly subsampled data can help a lot in my experience.
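One way to do that subsampling deterministically, so test runs are reproducible, is to hash a stable row key into [0, 1) and keep rows below the target fraction. A sketch, where the key is whatever uniquely identifies a row:

```python
import hashlib

def in_sample(key, fraction=0.01):
    """Deterministically keep ~`fraction` of rows, stable across runs and machines."""
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2.0**64 < fraction

# The same keys are selected on every run, so a failing test is reproducible,
# and related tables stay consistent if they are sampled on the same key.
kept = sum(in_sample(user_id) for user_id in range(100_000))
```

Sampling on a join key (e.g. user ID) rather than per-row keeps referential integrity across tables, which random per-row sampling would break.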
In all of these packages, the base tabular object you get is a local in-memory table. For manipulating remote SQL database tables, the best full-featured object API is provided by R's dbplyr package, IMHO.
I think Apache Spark, Ibis, and some other big-data packages can be configured to do this too, but IMHO their APIs are not nearly as helpful. For those who (understandably) don't want to use R and need an alternative to dbplyr, Ibis is probably the best one to look at.
Another anecdatum: the data engineer role at Zillow is called "Software Development Engineer, Big Data".
Their organization often insists they must use standard tools, and their idea of a good job is that the task works fine within their personal version. No automatic testing, no automated deployment, no version control, and handcrafted environments. And then they get yelled at when things break and yelled at for taking too long. And most DEs want to quit the field after a few years.
The real question is not that DE and software engineering are converging. It's why most DEs don't have the self-respect and confidence to engineer systems so that their lives don't suck.
My view is that it isn't so much a lack of "self-respect and confidence" but an acknowledgment that the path of least resistance is often the best one. Often data teams are something that was tacked on as an afterthought and the organizational environment is oriented towards buying off-the-shelf solutions rather than developing things in house.
That said, version control and replicable environments are becoming standard in the profession, and as data professionals become first-class citizens in organizations, we may find that orgs orient themselves toward a more production-focused environment.
My title is senior data engineer at GAMMA/FAANG/whatever we're calling them. I have a CS degree and am firmly in the engineering camp. My passion, though, is using software engineering and computer science principles to make very large-scale data processing as stupid fast as we can. To the extent I can ignore it, I don't personally care much about the tooling and frameworks (CI/CD, Airflow, Kafka, whatever). I care about how we affinitize our data, how we index it, whether and when we can use data sketches to achieve a good tradeoff between accuracy and compute/memory, and so on.
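For readers unfamiliar with data sketches, a k-minimum-values (KMV) estimator illustrates the accuracy/memory tradeoff: it approximates a distinct count in O(k) memory, with error shrinking roughly as 1/√k. A simplified, hedged sketch; production systems typically use a library such as Apache DataSketches rather than hand-rolled code like this:

```python
import hashlib
import heapq

def kmv_estimate(items, k=1024):
    """Approximate a distinct count by tracking only the k smallest hash values."""
    in_heap = set()
    heap = []  # stores negated hashes: a max-heap of the k smallest values
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") / 2.0**64  # uniform in [0, 1)
        if h in in_heap:
            continue  # duplicate of a tracked value
        if len(heap) < k:
            heapq.heappush(heap, -h)
            in_heap.add(h)
        elif h < -heap[0]:  # smaller than the current kth-smallest value
            evicted = -heapq.heappushpop(heap, -h)
            in_heap.discard(evicted)
            in_heap.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    # The spacing of the k smallest hashes estimates the overall density.
    return int((k - 1) / -heap[0])
```

With k=1024 the sketch uses a few kilobytes regardless of input size and typically lands within a few percent of the true count; quadrupling k roughly halves the error.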
While there are plenty of folks in this thread bashing analysts, one could also bash other "proper" engineers that can do the CI/CD but don't know shit about how to be efficient with petabyte-scale processing.
Make up a person and attack him, literal strawman. You sound pleasant to work with.
They’re not engineers and shouldn’t have been labeled data engineers. They have some other value to the company, presumably, but trying to repackage them as data engineers does cause issues. That’s the topic of this thread.
That job taught me a lot.
But that would be SWEs no?
I was a 'data engineer' (until they changed the terrible title) at a startup and I ended up having to fight with Spark and Apache Beam at times, eventually contributing back to improve throughput for our use cases.
That's not the same thing as a Business Analyst who can run a little pyspark query.
Python is dynamically typed, which you can patch a bit with type hints, but it's still easy to go to production with incompatible types, leading to crashes in prod. Its interpreted nature also makes it very slow.
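As a small illustration of that failure mode (the function and values here are made up): type hints document intent, but nothing enforces them at runtime, so a bad value from, say, an unvalidated input file only fails when that code path actually executes. A static checker like mypy would reject the call below; the interpreter runs it happily.

```python
def total_cents(amounts: list[float]) -> int:
    """Sum monetary amounts and convert to integer cents."""
    return int(sum(amounts) * 100)

# A string slips in from an unvalidated source. The hint says list[float],
# but Python only raises TypeError when the sum is actually computed.
bad_batch = [1.50, "2.25", 3.00]
try:
    total_cents(bad_batch)  # type: ignore[arg-type]
except TypeError as exc:
    error = str(exc)  # "unsupported operand type(s) for +: 'float' and 'str'"
```

Running mypy in CI catches this class of bug before deployment, but only for code paths where the types are actually annotated and checked.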
SQL is pretty much impossible to unit test, yet you will often end up with logic that you want to test, e.g. to optimize a query.
For SQL I don't have a solution. It's a 50 year old language that lacks a lot of features you would expect. It's also the defacto standard for database access.
For Python, I would say that we should start adopting statically typed, compiled languages. Rust has Polars as a dataframe package, but the language itself isn't that easy to pick up. Go is very easy to learn but has no serious dataframe package, so you end up doing a lot of that work yourself in goroutines. Maybe there are better options out there.
In general, choice of language isn't that important; if you're using Spark, your dataframe schema defines that structure, Python or not.
Most folks confuse pandas with "data engineering". It isn't; most data engineering is Spark.
But when you create a dataframe in Spark, that schema needs to be defined, or, if it's SQL, it takes the form of the columns returned.
Use of Python can create hotspots with data transfers between Spark and the Python gateway; Python UDFs are a common culprit.
Either way, my point is that there are architectural and design decisions in your data solution that can cause many more problems than choice of language.
SQL is the most beautiful, expressive, get stuff done language I've used.
It is perfect for whatever data engineering is defined as.
- input data should be pseudorandom, so the chance of a test being “accidentally correct” is minimized
- you need a way to verify only part of the result set. Or, at the very least, a way to write tests so that if you add a column to the result set, your test doesn’t automatically break
In addition, I added CSV exports so you can verify the results by hand, and hot-reload for queries with CTEs — if you change a .sql file then it will immediately rerun each CTE incrementally and show you which ones’ output changed.
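Both properties above can be had with nothing more than the standard library and an in-memory SQLite database. A hedged sketch, with a trivially simple query standing in for real pipeline SQL:

```python
import random
import sqlite3

random.seed(42)  # pseudorandom input data, but reproducible run to run
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
rows = [(i, random.choice(["eu", "us"]), round(random.uniform(1, 100), 2))
        for i in range(1000)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Query under test: a per-region aggregation.
result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Verify only the columns we care about, recomputing totals in Python, so
# adding a column to the SELECT later won't silently break the test.
expected = {}
for _, region, amount in rows:
    expected[region] = expected.get(region, 0.0) + amount
for row in result:
    region, total = row[0], row[1]
    assert abs(total - expected[region]) < 1e-6
```

SQLite's dialect differs from warehouse SQL, so this only covers portable logic; for vendor-specific queries you would point the same harness at a scratch schema in the real engine.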
https://cs.brown.edu/~sk/Publications/Papers/Published/kf-da...
There's a specific type of "data engineer" (quotes to indicate this is what they're called by the business, not to contest their legitimacy) that just writes lots of SQL, but they're usually a bad hire for businesses. They're approximately as expensive as what people call platform engineers, but platform engineers in the data space can usually do modelling as well.
When organizations split teams between the most SWE-type DEs and the pure-SQL ones, the latter all jockey to join the former team, which causes a lot of drama too.