Intelligent Kubernetes Load Balancing at Databricks
Posted 3 months ago · Active 3 months ago
databricks.com · Tech story
Sentiment: calm, positive · Debate: 40/100
Key topics
Kubernetes
Load Balancing
gRPC
Databricks shares their approach to intelligent Kubernetes load balancing, sparking a discussion on various load balancing techniques and their trade-offs.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 1h after posting
Peak period: 11 comments in 0-3h
Average per period: 2.8 comments
Comment distribution: 25 data points (based on 25 loaded comments)
Key moments
1. Story posted: Oct 1, 2025 at 1:06 AM EDT (3 months ago)
2. First comment: Oct 1, 2025 at 2:08 AM EDT (1h after posting)
3. Peak activity: 11 comments in the 0-3h window, the hottest period of the conversation
4. Latest activity: Oct 2, 2025 at 5:38 PM EDT (3 months ago)
ID: 45434417 · Type: story · Last synced: 11/20/2025, 6:36:47 PM
For the full context, read the primary article or dive into the live Hacker News thread.
Comment highlights
From the recent gRPConf ( https://www.youtube.com/playlist?list=PLj6h78yzYM2On4kCcnWjl... ) it seems gRPC as a standard is also moving toward this "proxyless" model: gRPC will read xDS itself.
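For a rough idea of what that proxyless setup looks like from a Go client, here is a minimal sketch; it assumes an xDS bootstrap file is supplied via the GRPC_XDS_BOOTSTRAP environment variable, and the target name and port are placeholders:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and LB policies
)

func main() {
	// With the xds resolver registered, the client fetches endpoints and
	// load-balancing config from the xDS control plane described in the
	// bootstrap file, instead of going through a sidecar proxy.
	conn, err := grpc.Dial(
		"xds:///my-service.my-namespace:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// ... create service stubs from conn as usual
}
```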
https://nginx.org/en/docs/http/ngx_http_upstream_module.html...
1: https://github.com/sercand/kuberesolver
In the README.md file, they compare it with a ClusterIP Service, but not with a headless Service ("clusterIP: None").
The advantage of kuberesolver is that you do not need to change DNS refresh and cache settings. Still, I think that approach is preferable to having the application call the Kubernetes API directly.
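For reference, client-side resolution with kuberesolver looks roughly like this. This is a sketch based on the project's README; the import path (major version), target syntax, and port name are assumptions to verify against the repo:

```go
package main

import (
	"log"

	"github.com/sercand/kuberesolver/v5" // check the repo for the current major version
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Registers the "kubernetes" resolver scheme; it watches the service's
	// endpoints via the Kubernetes API instead of relying on DNS TTLs.
	kuberesolver.RegisterInCluster()

	conn, err := grpc.Dial(
		"kubernetes:///my-service.my-namespace:grpc", // placeholder service and port name
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```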
The code in question was: https://github.com/grpc/grpc-go/blob/b597a8e1d0ce3f63ef8a7b6...
That meant that deploying a service which drained in less than 30s would have a brief mini-outage for that service until the in-process DNS cache expired, with of course no way to configure it.
Kuberesolver streams updates, and thus lets clients talk to new pods almost immediately.
I think things are a little better now, but based on my reading of https://github.com/grpc/grpc/issues/12295, it looks like the DNS resolver still might not resolve new pod names quickly in some cases.
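For comparison, the plain DNS path against a headless Service looks like this; the hostname and port are placeholders, and re-resolution behaviour depends on the grpc-go version in use:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// dns:/// against a headless Service returns one A/AAAA record per pod,
	// and round_robin opens a connection to each. The catch discussed above:
	// the resolver only re-queries DNS on connection errors or channel
	// re-resolution, historically no more often than every 30 seconds.
	conn, err := grpc.Dial(
		"dns:///my-service-headless.my-namespace.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```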
[1] https://github.com/wongnai/xds
the "impact" can be reduced by configuring an overall connection-ttl, so it takes some time when new pods come up but it works out over time.
--
That said, I'm not surprised that even a company as large as Databricks feels that adding a service mesh would add operational complexity.
It looks like they've taken the best parts (endpoint watching, syncing to clients with xDS) and moved them client-side. Compared to the failure modes of a service mesh, this seems better.
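A minimal sketch of the "endpoint watch" half of that design, using a client-go EndpointSlice informer; the namespace is a placeholder, and the xDS push to clients is left out:

```go
package main

import (
	"fmt"
	"time"

	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Watch EndpointSlices instead of polling DNS: the informer keeps a local
	// cache in sync and fires events as pods come and go.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 30*time.Second, informers.WithNamespace("my-namespace"))
	inf := factory.Discovery().V1().EndpointSlices().Informer()

	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			es := obj.(*discoveryv1.EndpointSlice)
			fmt.Println("slice added for", es.Labels[discoveryv1.LabelServiceName])
		},
		UpdateFunc: func(_, obj interface{}) {
			es := obj.(*discoveryv1.EndpointSlice)
			fmt.Println("slice updated for", es.Labels[discoveryv1.LabelServiceName])
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real control plane would push this state to clients over xDS
}
```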
That seems like the tail wagging the dog.
It feels like it would solve all the requirements they laid out, is fully client-side, and doesn't require real-time updates to the host list via discovery.
[0] https://en.wikipedia.org/wiki/Rendezvous_hashing
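A compact illustration of rendezvous (highest-random-weight) hashing in Go; the hash function and node list are arbitrary placeholders:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pick implements rendezvous hashing: every client independently scores each
// (key, node) pair and chooses the node with the highest score. When nodes
// join or leave, only the keys that mapped to them move.
func pick(key string, nodes []string) string {
	var best string
	var bestScore uint64
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(key))
		h.Write([]byte{0}) // separator between key and node
		h.Write([]byte(n))
		if s := h.Sum64(); s >= bestScore {
			bestScore, best = s, n
		}
	}
	return best
}

func main() {
	pods := []string{"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"}
	fmt.Println(pick("tenant-42", pods)) // same key always lands on the same pod
}
```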
this is "partially" true.
if you're using ipvs, you can configure the scheduler to just about anything ipvs supports (including wrr). they removed the validation for the scheduler name quite a while back.
kubernetes itself though doesn't "understand" (i.e., can NOT represent) the nuances (e.g., weights per endpoint with wrr), which is the problem.