Why Was Apache Kafka Created?
Source: bigdata.2minutestreaming.com
Key topics: Apache Kafka, Data Streaming, Event-Driven Architecture
The article discusses the origins of Apache Kafka and its use cases, sparking a discussion on its strengths, weaknesses, and alternatives in the HN community.
Snapshot generated from the HN discussion
The look of the page is Substack's default UI; you can't customize it much. The other images were created by me.
I'm simply curious which parts give that "cheap" look so I can improve. On Reddit I've gotten massive amounts of downvotes because people assume the content is AI-generated, when in fact no AI is used in the creation process at all.
One guess I have is the bullet points + bolding combo. Most AIs use a ton of that, and rightly so, because it aids in readability.
From Wikipedia.
I know, Claude was also enthusiastic about it.
"To follow the hype train, Bro" is often the real answer.
> If you need a queue, great, go get RabbitMQ, ZMQ, Redis, SQS, named pipes, pretty anything but Kafka.
Or just freaking MQTT.
MQTT has been battle-proven for 25 years, is simple, and does the job perfectly if you do not ship GBs of blobs through your messaging system (which you should not do anyway).
Companies get standard tech stacks people are happy to work with, because working with them gets people experience with tech stacks that are standard at many companies. It's a virtuous cycle.
And sure, even if you only need one specific thing, it's often better to go slightly overkill with something that has millions of Stack Overflow solutions for common issues figured out, versus picking some niche thing where you're now one of, like, six total people in the entire world using it in prod.
Obviously the dose makes the poison: don't use Kafka for your small internal app thing, and don't use k8s where Docker will do. But if you do need more than Docker, probably use k8s instead of some weird other thing nobody will know about.
And five years later, the person responsible for the decision has left the company with a giant pile of mess behind them.
But let's see things positively: they can now add "Kafka at scale" to their CV.
Disclaimer: I'm a dev and I'm not very familiar with the actual maintenance of Kafka clusters. But we run the AWS managed service version (MSK), and it seems to pretty much just work.
We send terabytes of data through Kafka asynchronously, because of its HA properties and persistent log, allowing consumers to consume in their own time and put the data where it needs to be. So imagine: many apps across our entire stack have the same basic requirement, publish a lot of data which people want to analyse somewhere later. Kafka gives us a single mechanism to do that.
So now my question. I've never used MQTT before. What are the benefits of using MQTT in our setup vs using kafka?
MQTT is a publish/subscribe protocol for large-scale distributed messaging, often used in small embedded devices or factories. It is made for efficient transfer of small, often byte-sized payloads of IoT device data. It does not replace Kafka or RabbitMQ - messages should be read off of the MQTT broker as quickly as possible. (I know this from experience - MQTT brokers get bogged down rapidly if there are too many messages "in flight".)
A very common pattern is to use MQTT for communications, and then Kafka or RabbitMQ for large-scale queuing of those messages for downstream applications.
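For what it's worth, that bridge pattern is only a few lines in Java. A minimal sketch assuming the Eclipse Paho MQTT client and the standard Kafka producer; the broker addresses and topic names ("sensors/#", "iot.telemetry") are made up for illustration:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.eclipse.paho.client.mqttv3.MqttClient;

    public class MqttToKafkaBridge {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

            MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", "bridge-1");
            mqtt.connect();

            // Drain small MQTT messages quickly and hand them to Kafka, which keeps the
            // retained, replayable log for downstream consumers.
            mqtt.subscribe("sensors/#", (topic, message) ->
                    producer.send(new ProducerRecord<>("iot.telemetry", topic, message.getPayload())));
        }
    }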
That is currently the problem.
A lot of the Kafka usage I have seen in the wild is not for log streaming or queuing, but deployed as a simple pub/sub messaging service because "why not".
Kafka isn’t a queue. It’s overkill to use it as one.
Kafka is a great place to persist data for minutes, hours or days before it’s processed. It fully decouples producers and consumers. It’s also stupidly complex and very hard to operate reliably in an HA configuration.
MQTT is good for when data needs to leave or enter your cloud, but persistence is bolted on (at least it is in mosquitto), so a crash means lost data even though you got a PUBACK.
If you need all those things, there just are not a lot of options.
In this case it's something different - this was an honest question, and it received two useful replies, so why downvote?! The mental model of people using Kafka is useful to know - in this case, the published data is more log-like than stream-like, since it's retained per a TTL policy, with each "subscriber" keeping its own controllable read index.
https://news.ycombinator.com/newsguidelines.html#comments
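To make the "controllable read index" point above concrete: each Kafka consumer (group) owns its offsets and can rewind or commit them without affecting other readers. A minimal sketch with the standard Java client; the topic, group, and partition choice are made up:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplayingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "profile-view-analytics");   // each group keeps its own offsets
            props.put("enable.auto.commit", "false");          // we decide when the "read index" advances
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("page-views", 0);
                consumer.assign(List.of(tp));
                consumer.seek(tp, 0L);                         // rewind: reprocess whatever is still retained

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                    consumer.commitSync();                     // advance this group's read index only
                }
            }
        }
    }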
The data team I'd inherited had started with NFS and shell scripts, before a brief detour into GlusterFS after NFS proved to be, well, NFS. GlusterFS was no better.
Using S3 was better, but we still hit data loss problems (on our end, not S3's, to be clear), which isn't great when you need to bill on some of that data.
Then I heard about Kafka, bought a copy of I <3 Logs, and decided that maybe Kafka was worth the complexity, and boom, it was. No more data loss, and a happier business management.
I was headhunted for my current company for my Kafka experience. First thing I realised when I looked at the product was - "Ah, we don't need Kafka for this."
But the VP responsible was insistent. So now I spend a lot of time doing education on how to use Kafka properly.
And the very first thing I start with is: "Kafka is not a queue. It's a big dumb pipe that does very smart things to move data efficiently and with minimal risk of data loss - and the smartest thing it does is choosing to be very dumb.
Want to synchronously know if your message was consumed? Kafka don't care. You need a queue."
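A small illustration of that last point, assuming the standard Java producer and a made-up topic: the only acknowledgement the producer ever sees is that the brokers durably wrote the record, never that anybody consumed it.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerAckDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all");   // wait until the in-sync replicas have persisted the write
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "order-42", "created"), (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();           // the write failed
                    } else {
                        // This only tells you the log stored the record at this offset.
                        // Whether any consumer ever processes it is invisible to the producer.
                        System.out.printf("stored at partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
                producer.flush();
            }
        }
    }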
Good luck!
[0] https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A...
Sure, it can do it. But it's not efficient or what it's good at.
However, my latest wrapper for Jedis does seem to be holding up, and I haven't had too many issues, but I have very robust checking for dropped connections.
Also, SQS isn't pub/sub. Kafka and SQS really have very different use cases.
I used RabbitMQ a few years back on a C++ project, and at the time (has anything changed?) the best-supported C++ client library seemed to be AMQP-CPP, which isn't multi-thread safe, so an application interface layer had to be written to address this in a performant way.
In our case we wanted to migrate a large legacy system from CORBA (point to point) to a more flexible bus-based architecture, so we also had to implement a CORBA-like RPC layer on top of Rabbit, including support for synchronous delivery failure detection, which required more infrastructure to be built on top of AMQP-CPP. In the end the migration was successful, but it felt like we were fighting AMQP-CPP a lot of the way.
You're running a distributed system. They aren't simple.
Especially on AWS. AWS is really a double-edged sword. Yeah, you'll get tutorials to set up whatever distributed system pretty quickly, but your nodes aren't nearly as reliable, your networking isn't nearly as reliable, your costs aren't nearly as predictable, and administration headaches go up in the long run.
But if you don't understand distributed systems, it almost makes it worse, because it's tempting to segment the system across dozens of microservices which all have to talk to each other and synchronize, and the whole thing becomes a buggy, slow clusterfuck.
That's not code for "you're reinventing the wheel"; WarpStream had some significant drawbacks, so I'm truly curious about different approaches in the message-log-backed-by-blob-store space.
The dimensions we focus on are number of streams (unlimited, so you can do granular streams like per user or session), internet accessibility (you can generate finely-scoped access tokens that can be safely used from clients like CLIs or browsers), and soon also massive read fanout for feed-like use cases.
I'm curious, what exactly feels bloated about Java? I don't feel like the Java language or runtime are particularly bloated, so I'm guessing you're referring to some practices/principles that you often see around Java software?
Yes, yes, I'm sure there are exceptions somewhere, but for 25-ish years I've been reading Java fans using benchmarks to try to convince me that I can't tell which programs on my computer are Java just by looking for the weirdly slow ones, when in fact I very much can.
Java programs have a feel and it’s “stuttery resource hog”. Whatever may be possible with the platform, that’s the real-world experience.
Jokes aside, we got a shift in the industry where many Java programs were replaced by Electron-like programs, which now take 20x the memory.
Whereas for typical backend situations, reference counting has a crazy high throughput overhead: it does atomic inc/decs left and right, which instantly trashes any kind of cache, and it does that in the mutator thread that should be doing the actual work, all for the negligible benefit of using less memory. Meanwhile a tracing GC can do (almost) all its work in another thread, not slowing down the actually important business task, and with generational GCs cleaning up is basically a no-op (just saying that this region can now be reused).
It's a tradeoff as everything in IT.
Also, iPhone CPUs are always a generation ahead of any Android CPU, if not more. So it's not really Apples to oranges.
* Languages like C++ and Rust simply don't allocate as much as Java, instead using value types. Even C# is better here, with value types being better integrated.
* Languages like C++ and Rust do not force atomic reference counting. Rust even offers non-atomic ref counting in the standard library. You also only need to atomically increment/decrement when ownership is being transferred to a thread, which isn't all that common depending on the structure of your code. Even Swift doesn't do too badly here, because the compiler can often prove that reference counting can be elided altogether, and it offers escape hatches of data types that don't need it.
* C++, Rust, and Swift can access lower-level capabilities (e.g. SIMD and atomics) that let them get significantly higher throughput.
* Java's memory model implies and requires the JVM to insert atomic accesses all over the place where you wouldn't expect them (e.g. reading an integer field of a class is an atomic read and writing it is an atomic write). This is going to absolutely swamp any advantage of the GC. Additionally, a lot of Java code declares methods synchronized, which requires taking a "global" lock on the object; that is expensive and pessimistic for performance compared with the fine-grained access other languages offer.
* There's lots of research into ways of offering atomic reference counts more cheaply (called biased RC), which can safely and completely transparently avoid an atomic operation in places where the conditions are met.
I’ve yet to see a Java program that actually gets higher throughput than Rust so the theoretical performance advantage you claim doesn’t appear to manifest in practice.
AFAIK that doesn't really happen. They won't insert atomic accesses anywhere on real hardware, because the CPU is capable of doing that atomically anyway.
> Additionally, a lot of Java code declares methods synchronized which requires taking a “global” lock on the object which is expensive and pessimistic for performance as compared with the fine-grained access other languages offer.
What does this have to do with anything? Concurrency requires locks. Arc<T> is a global lock on references. “A lot” of Java objects don’t use synchronized. I’d even bet that 95-99% of them don’t.
Of course, with manual memory management you may be able to write more efficient programs, though it is not a given, and it comes at the price of a more complicated and less flexible programming model. At least with Rust, it is actually memory safe, unlike C++.
- Ref counting still has worse throughput than a tracing GC, even if it is single-threaded and doesn't have to use atomic instructions. This may or may not matter, I'm not claiming it's worse, especially when used very rarely as is the case with typical C++/Rust programs.
> You also only need to atomic increment / decrement when ownership is being transferred to a thread
Java can also do on-stack replacement... sometimes.
- Regarding lower-level capabilities, Java does have an experimental Vector API for SIMD (a small sketch follows after this comment). Atomics are readily available in the language.
- Java's memory model only requires 32-bit writes to be "atomic" (though in actuality the only requirement is to not tear - there is no happens-before relation in the general case, and that's what is expensive), though in practice 64-bit is also atomic, and both are free on modern hardware. Field access is not different from what Rust or C++ does, AFAIK, in the general case. And `synchronized` is only used when needed - it's just syntactic convenience. This depends on the algorithm at hand; there is no difference between the same algorithm written in Rust/C++ vs Java from this perspective. If it's lockless, it will be lockless in Java as well. If it's not, then all of them will have to add a lock.
The point is not that manual memory management can't be faster or more efficient. It's that it is not free, and it comes at a non-trivial extra effort on the developer's side, which is not even a one-time thing but applies for the lifetime of the program.
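Since the Vector API came up in the list above, this is roughly what explicit SIMD looks like in Java today. It is an incubator module (run with --add-modules jdk.incubator.vector) and may change between releases; a minimal sketch, not a benchmark:

    // Requires: --add-modules jdk.incubator.vector (incubating API, subject to change)
    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SimdSum {
        private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

        // Element-wise a[i] += b[i], vectorized in chunks of SPECIES.length() lanes.
        static void addInPlace(float[] a, float[] b) {
            int i = 0;
            int upper = SPECIES.loopBound(a.length);
            for (; i < upper; i += SPECIES.length()) {
                FloatVector va = FloatVector.fromArray(SPECIES, a, i);
                FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
                va.add(vb).intoArray(a, i);
            }
            for (; i < a.length; i++) {   // scalar tail for the leftover elements
                a[i] += b[i];
            }
        }
    }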
Is C++ bloated because of the memory Chrome uses?
I’ve never seen another basic tech used to develop other programs that’s so consistently obvious from its high resource use and slowness, aside from the modern web platform (Chrome, as you put it). It was even more obvious back when we had slower machines, of course, but Java still stands out. It may be able to calculate digits of Pi in a tight loop about as fast as C, but real programs are bloated and slow.
Especially since something like half of the web runs on Java; you just have absolutely no idea when it silently does its job perfectly.
Source?
W3Techs seems to think it's more like ~5%.
https://w3techs.com/technologies/comparison/pl-java
From the website.
And 76% of these websites are PHP, which seems to mean they can detect PHP more easily for a website (nonetheless, there are indeed a lot of WordPress sites, but not this many).
With over 15 years of professional experience since then, my perspective has shifted: Java demonstrates its strength when stability, performance, and scalability are required (e.g. bloody enterprise).
A common misconception comes from superficial benchmarking. Many focus solely on memory consumption, which often provides a distorted picture of actual system efficiency.
I can point to EU-scale platforms that have reliably served over 100 million users for more than a decade without significant issues. The bottleneck is rarely the language itself; it is the depth of the team's experience.
When other languages can do the same thing with an order of magnitude less RAM, any other efficiencies in the system tend to be overshadowed by that and become the sticking point in people's memories.
You may argue that holding on to this extra memory makes subsequent calls and reads quicker etc, but in my experience generally people are willing to sacrifice milliseconds to gain gigabytes of memory.
The JVM tends to hold onto memory in order to make things faster when it does wind up needing that memory for actual work. However, how much it holds on to, how the GC is set up, etc. are all tunable parameters. Further, if it's holding onto memory that's not being used, those pages are prime candidates to be stored in virtual memory, which is effectively free.
Nonetheless, tracing GCs do have some memory overhead in exchange for better throughput. This is basically the same concept as using a buffer.
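For reference, the tuning knobs mentioned a paragraph up are ordinary JVM flags. A hedged example of the kind of configuration people reach for; the specific values are arbitrary placeholders, not recommendations:

    # Cap the heap relative to available RAM (the JVM's default MaxRAMPercentage is 25)
    java -XX:MaxRAMPercentage=60.0 -jar app.jar

    # Or give explicit bounds and pick a collector: G1 with a pause-time goal...
    java -Xms512m -Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar app.jar

    # ...or ZGC when latency matters more than raw throughput
    java -Xmx2g -XX:+UseZGC -jar app.jar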
-----
And can you tell which of these websites use Java from "the feel"? AWS cloud infra, a significant chunk of Google, Apple's backends, Alibaba, Netflix?
I haven't worked with too much Java, but I suspect that the distaste many have for it is due to its wide adoption by large organizations and the obfuscating "dressed up" tendency of the coding idioms used in large organizations.
The runtime isn't inherently slow, but maybe it's easier to write slow programs in Java.
The way most Java code is written is terrible Enterprise factory factory factory.
Go's GC is tuned more for latency at the expense of throughput (not sure if it still applies, but Go would quite literally stop the "business" mutator threads when utilisation got higher, to be able to keep up with the load). Java's default GC is tuned for a more balanced approach, but it can deliver that at very high congestion rates as well. Plus it has a low-latency-focused GC which has much better latency guarantees and trades off some throughput in a consistent manner, so you can choose what fits best. The reason Go might sometimes be more efficient than Java is simply value types: it doesn't create as much garbage, so it doesn't need as good a GC in certain settings.
Rust code can indeed be better at both metrics for a particular application, but it is not automatically true; e.g. if the requirements have funny lifetimes and you put a bunch of Arcs everywhere, then you might actually end up worse off than a modern tracing GC could do. Also, future changes to the lifetimes may be more expensive (even though the compiler will guide you, you still have to make a lot of recursive changes all across the codebase, even if it might be a local change only in, say, Java), so for often-changing requirements, like most business software, it may not be the best choice (even though I absolutely love Rust).
Worse latency every ten minutes tends to be fine.
Such as? The only area where you have to "drop" features is high-frequency trading, where they often want to reach a steady-state for the trading interval with absolutely no allocations. But for HFT you would have to do serious tradeoffs for every language.
In my experience, vanilla Java is more than fine for almost every application - you might just benchmark your code and maybe prefer int arrays over an Integer list, but Java's GC is an absolute beast; you don't have to baby it at all.
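The int-array point in practice: a List<Integer> stores pointers to boxed objects scattered around the heap, while an int[] is one contiguous block with no per-element allocation. A minimal sketch with made-up names:

    import java.util.Arrays;

    public class LatencySamples {
        // Instead of List<Integer>: contiguous memory, no boxing, nothing for the GC to chase.
        private int[] samples = new int[1024];
        private int count = 0;

        void record(int micros) {
            if (count == samples.length) {
                samples = Arrays.copyOf(samples, samples.length * 2);  // grow rarely
            }
            samples[count++] = micros;
        }

        long total() {
            long sum = 0;
            for (int i = 0; i < count; i++) {
                sum += samples[i];
            }
            return sum;
        }
    }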
The reason is quite well known. Supporting multiple languages is a cost. If you only have to support one language, everything is simpler and cheaper.
With Java, you can write elegant code these days, rely on ZGC, not really worry too much about GC and get excellent performance with quick development cycles for most of your use cases. Then with the same language and often in the same repo (monorepo is great) you can write smarter code for your hot path in a GC free manner and get phenomenal performance.
And you get that with only having one build system, one CI pipeline, one deployment system, some amazing profiling and monitoring tooling, a bunch of shared utility code that you don't have to duplicate, and a lot more benefits.
That's the reason to choose Java.
Of course, if you're truly in the HFT space, then they'll be writing in C, C++, or on FPGAs.
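A minimal sketch of what "GC-free on the hot path" tends to mean in practice: preallocate once, reuse buffers, and keep accumulators primitive. The class name and buffer size are made up, and it assumes payloads fit the scratch buffer:

    import java.nio.ByteBuffer;

    public class HotPathAggregator {
        // Allocated once at startup; reused for every message, so the GC sees no garbage here.
        private final ByteBuffer scratch = ByteBuffer.allocateDirect(64 * 1024);
        private long count;
        private long byteTotal;

        public void onMessage(ByteBuffer payload) {
            scratch.clear();
            scratch.put(payload);          // copy into the reusable buffer, no new objects
            scratch.flip();
            count++;                       // primitive accumulators, nothing boxed
            byteTotal += scratch.remaining();
        }

        public double averageSize() {
            return count == 0 ? 0.0 : (double) byteTotal / count;
        }
    }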
Maybe I have been lucky, or maybe the practice is more common in certain countries or ecosystems? Java has been a very productive language for me, and the code has been far from the forced pattern usage that I have read horror stories about.
Depending on how much memory is used by each thread stack (presumably 1 MB-512 KB by default, allegedly 128 KB with Alpine base images), that's your 1 GB-500 MB memory usage improvement right off the bat.
The migration from JDK 17 to JDK 21 was uneventful in production. The only issue is limited monitoring, as a thread dump will not show most virtual threads and the Micrometer metrics will not even collect the total number of active virtual threads. It's supposed to work better in JDK 24.
The Spring Framework directly supports virtual threads with "spring.threads.virtual.enabled=true" but I haven't tried it to comment.
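For anyone who hasn't tried them yet, the core API is tiny. A minimal JDK 21 sketch; the sleep stands in for whatever blocking I/O your request handling actually does:

    import java.time.Duration;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.IntStream;

    public class VirtualThreadDemo {
        public static void main(String[] args) {
            // One cheap virtual thread per task, instead of a bounded platform-thread pool.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                IntStream.range(0, 10_000).forEach(i ->
                    executor.submit(() -> {
                        // Blocking here parks the virtual thread; its carrier thread is freed.
                        Thread.sleep(Duration.ofMillis(100));
                        return i;
                    }));
            } // close() waits for the submitted tasks to finish
        }
    }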
Everything.
Why do you think Kubernetes is NOT written in Java?
Golang has little to distinguish itself technically. It has a more modern std lib (for now) and isn't Oracle.
Which aren't trivial, but they aren't trump cards.
Nope.
None of what you said matches the reasons actually given: it WAS written in Java already [0], but they rewrote it all in Go explicitly because of its performance, concurrency, and single-binary distribution characteristics.
Those were enough technical advantages to abandon any thought of a production-grade version of k8s in Java.
[0] https://archive.fosdem.org/2019/schedule/event/kubernetesclu...
It seems to me that perhaps it wasn't the language's fault but the authors'.
Because someone wanted a new, shiny toy.
https://docs.spring.io/spring-framework/docs/2.5.x/javadoc-a...
They hog RAM, are slow, and are a bitch to configure.
Kafka is used because the Java folks don't want to learn something new, for job-security reasons, even though faster, compatible alternatives exist today.
Rather use Redpanda than continue to use Kafka, then complain about how resource-intensive it is alongside ZooKeeper and all the circus that comes with it, and make AWS smile as you're losing hundreds of thousands a month.
1: https://cwiki.apache.org/confluence/display/kafka/kip-500:+r...
https://www.confluent.io/blog/zookeeper-to-kraft-with-conflu...
Like the article says, fan-out is a key design characteristic. There are "Redis streams" now, but they didn't exist back then. The durability and clustering stories aren't as good either, I believe, so they can probably take you only so far and won't be as generally suitable, depending on where your system goes in the future. There are also things like Redpanda that speak Kafka without the Java.
However, if you CAN run on a single node without worrying about partitioning, you should do that as long as you can get away with it. Once you add multiple partitions, ordering becomes hard to reason about, and while there are things like message keys to address that (see the sketch below), they have limitations and can lead to hotspotting and scaling bottlenecks.
But the push/pop-based systems also aren't going to give you at-least-once guarantees (it looks like Redis at least has a "pop+push" thing to move items to a DIFFERENT list that a single consumer would manage, but that seems like it gets hairy for scaling out even a little bit...).
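On the message-key point above: Kafka only guarantees order within a partition, and the default partitioner routes by key hash, so records sharing a key stay in order while the key distribution decides whether one partition becomes a hotspot. A minimal sketch with a made-up topic and keys:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedEvents {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String userId = "user-123";
                // Same key => same partition => these three events are consumed in this order.
                producer.send(new ProducerRecord<>("user-activity", userId, "login"));
                producer.send(new ProducerRecord<>("user-activity", userId, "view-profile"));
                producer.send(new ProducerRecord<>("user-activity", userId, "logout"));
                // A different key may land on a different partition; there is no ordering across keys.
                producer.send(new ProducerRecord<>("user-activity", "user-456", "login"));
            }
        }
    }

If a single key dominates the traffic, its partition is exactly the hotspot the comment above warns about.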
You don’t need push, it’s just a performance optimization that almost never justifies using a whole new tool.
> LinkedIn used site activity data (e.g. someone liked this, someone posted this)1 for many things - tracking fraud/abuse, matching jobs to users, training ML models, basic features of the website (e.g who viewed your profile, the newsfeed), warehouse ingestion for offline analysis/reporting and etc.
Who controls the database? Is it the fraud/abuse team responsible for the migrations? Does the ML team tell the Newsfeed team to stop doing so many writes because it's slowing things down?
Go back to the article, it wasn't about event-sourcing or replacing a DB for application code.
Dismissing this as «just a social network» understates the real constraints: enormous scale, global privacy rules, graph queries, near-real-time feeds and abuse controls. Periodic DB queries can work at small scale, but at high volume they either arrive late or create bursts that starve the primary. Capturing changes once and pushing them through a distributed transaction log such as Kafka evens out load, improves data timeliness and lets multiple consumers process events safely and independently. It does add operational duties – schema contracts, idempotency and retention – yet those are well-understood trade-offs. The question is not push versus pull in the abstract, but which approach meets the timeliness, fan-out and reliability required.
> You don’t need push, it’s just a performance optimization that almost never justifies using a whole new tool.
It is not about drama but about fit for purpose at scale.
Pull can work well for modest workloads or narrow deltas, especially with DB features such as incremental materialised views or change tables. At large scale, periodic querying becomes costly and late: you either poll frequently and hammer the primary, or poll infrequently and accept stale data. Even with cursoring and jitter, polls create bursty load and poor tail latencies.
Push via change data capture into a distributed log such as Kafka addresses such pain points. The log decouples producers from consumers, smooths load, improves timeliness and lets multiple processors scale independently and replay for backfills. It also keeps the OLTP database focused on transactions rather than fan-out reads.
This is not free: push introduces operational work and design care – schema contracts, per-partition ordering, duplicate delivery and idempotency, back-pressure, and retention governance including data-protection deletes. The usual mitigations are the outbox pattern, idempotent consumers, DLQs, and documented data contracts. The data processing complexity now belongs in each consumer, not the data processing engine (e.g. a DB).
Compute–storage separation in modern databases raises single-cluster ceilings for storage and read scale, yet it does not solve single-writer limits or multi-region active-active writes. For heavy write fan-out and near-real-time propagation, a CDC-to-log pipeline remains the safer bet.
To sum it up, both pull and push are valid – engineering is all about each specific use case assessment and the trade-off analysis. For small or bounded scopes, a well-designed pull loop is simpler. As scale, fan-out and timeliness requirements grow, push delivers better timeliness, correctness and operability.
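To make the "idempotent consumers" mitigation mentioned above concrete: the consumer treats an event ID as a dedupe key so that at-least-once delivery never applies the same effect twice. A minimal sketch; the event shape is made up, and the in-memory set stands in for a durable store or a unique constraint in the sink:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class IdempotentHandler {
        // In production this would be a unique constraint or keyed upsert in the sink,
        // not an in-memory set (which is lost on restart).
        private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

        /** Applies the event at most once per eventId, even if delivery repeats it. */
        public void handle(String eventId, String payload) {
            if (!processedEventIds.add(eventId)) {
                return; // duplicate delivery from the at-least-once pipeline; skip
            }
            apply(payload);
        }

        private void apply(String payload) {
            // e.g. upsert into a warehouse table keyed by eventId
            System.out.println("applied: " + payload);
        }
    }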
At least this was the reason we decided to use Kafka instead of simple queues.
It was useful when we built new consumer types for the same data we had already processed, or data we knew we were going to have later but couldn't build for yet due to priorities.
Essentially, their base product (NATS) has a lot of performance but trades off reliability for it. So they add JetStream to NATS to get reliability back, but use the performance numbers of pure NATS.
I got burned by MongoDB for doing this to me, I won’t work with any technology that is marketed in such a disingenuous way again.
"Unavailable Due to the UK Online Safety Act"
:(
Can you point to where they are using core NATS numbers to describe Jetstream?
The issue in the docs was that there are no JetStream numbers available, so I talked to the field CTO over a video call, who cited the base NATS numbers to me, and when I pressed him on whether they were with JetStream he said they were without; so I asked for numbers with JetStream enabled and he cited the same numbers back to me. Even when I pressed him again that "you just said those numbers are without JetStream", he said it was not an issue.
So I got a bit miffed after the call ended; we spent about 45 minutes on the call, and this was the main reason to have the call in the first place, so I am a bit bent about it. Maybe it's better now; this was a year ago.
I'm telling you why I am skeptical of any tech that intentionally obfuscates trade-offs; I'm not comparing which of these is worse, and I don't really care whether people take my anecdote seriously either, because they should draw their own conclusions.
However, it might help people go into a discussion about performance and reliability from a more informed position.
The other day I was listening to a podcast with their CEO from maybe 6 months ago, and he talked quite openly about how JetStream and consumers add considerable drag compared to normal pub/sub. And, more generally, how users unexpectedly use and abuse NATS, and how they've been able to improve things as a result.
It looks like you got frustrated by my refusing to give performance figures for JetStream. I always say in meetings that because there are too many factors that greatly affect JetStream performance (especially compared to Core NATS, which mostly just depends on network I/O), I can not just give a number, as it would likely not accurately reflect (better or worse!) the number that you would actually see in your own usage. Rather, you should use the built-in `nats bench` tool to measure the performance for yourself, for your kind of traffic requirements and usage patterns, in your target deployment environment and HA requirements.
On top of that, the performance of the software itself is still evolving as we release new versions that improve things and introduce new features (e.g. JetStream publication batches, batched direct gets) that greatly improve some performance numbers.
I assure you that I just don't want to give anyone some number and then have you try it for yourself and be unable to match it, nothing more! We literally want you to measure the performance for yourself rather than give you some large number. And that's also why the docs don't have any JetStream performance numbers. There is no attempt at any kind of disingenuity, marketing, or pulling wool over anyone's eyes.
And I would never ever claim that JetStream yields the same performance numbers as Core NATS, that's impossible! JetStream does a lot more and involves a lot more I/O than Core NATS.
However, if I get pressed for numbers in a meeting: I do know the orders of magnitude that NATS and JS operate at, and I will even be willing to say with some confidence that Core NATS performance numbers are pretty much always going to be up in the 'millions of messages per second'. But I will remain very resistant to claiming any specific JS performance numbers, because in the end the answers are 'it depends' and 'how long is a piece of string', and you can scale JetStream throughput horizontally using more streams, just like you can scale Kafka's throughput by using more partitions.
Now in some meetings some people don't like that non-answer and really want to hear some kind of performance number, so I normally turn the question around and ask them what their target message rates and sizes are going to be. If their answer is in the 'few thousands of messages per second' (like it is in your case, if I'm not mistaken about the call in question) then, as I do know that JetStream typically comfortably provides performance well in excess of that, I can say with confidence that _at those kinds of message rates_ it doesn't matter whether you use Core NATS or JetStream: JetStream is plenty fast enough. That's all I mean!
The delivery guarantees section alone doesn't make me trust it. You can do at-least-once or at-most-once with Kafka. Exactly-once is mostly a lie; it depends on the downstream system: unless you're going back to the same system, the best you can do is at-least-once with idempotency.
https://www.synadia.com/blog/nats-and-kafka-compared
61 more comments available on Hacker News