Generalized K-Means Clustering
Key topics
Machine Learning
Clustering Algorithms
Data Analysis
The post shares an open-source implementation of generalized K-Means clustering on GitHub, sparking discussion around its applications and comparisons to other clustering methods.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement: 12 comments overall, averaging 2.4 per active period, with a peak of 6 comments in the 132-144h window after posting.
Key moments
- Story posted: Oct 20, 2025 at 12:49 PM EDT (3 months ago)
- First comment: Oct 20, 2025 at 12:49 PM EDT (0s after posting)
- Peak activity: 6 comments in the 132-144h window, the hottest stretch of the conversation
- Latest activity: Oct 28, 2025 at 5:28 AM EDT (2 months ago)
ID: 45646059 · Type: story · Last synced: 11/20/2025, 1:42:01 PM
And so that raises the question of what "nearest" means, and here this allows you to replace Euclidean distance with things like Kullback-Leibler divergence (that's the KL below), which makes more sense than Euclidean distance if you're trying to measure how close two probability distributions are to each other.
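To make the commenter's point concrete, here is a minimal sketch in plain Scala (illustrative only; it is not the library's code, and the vectors and names are made up): the query point is equidistant from both candidate centers under squared Euclidean distance, yet KL divergence clearly prefers one of them.

```scala
// A minimal sketch of what swapping the notion of "nearest" changes.
object NearestSketch {
  // Squared Euclidean distance between two vectors of equal length.
  def squaredEuclidean(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // KL divergence D(p || q) = sum_i p_i * log(p_i / q_i).
  // Assumes p and q are probability vectors with strictly positive entries.
  def kl(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => a * math.log(a / b) }.sum

  def main(args: Array[String]): Unit = {
    val x  = Array(0.6, 0.3, 0.1)        // query distribution
    val c1 = Array(0.5, 0.4, 0.1)        // hypothetical cluster center 1
    val c2 = Array(0.6, 0.2, 0.2)        // hypothetical cluster center 2
    // Squared Euclidean ties (0.02 vs 0.02); KL says c1 is clearly closer.
    println(f"SE: d(x,c1)=${squaredEuclidean(x, c1)}%.4f  d(x,c2)=${squaredEuclidean(x, c2)}%.4f")
    println(f"KL: d(x,c1)=${kl(x, c1)}%.4f  d(x,c2)=${kl(x, c2)}%.4f")
  }
}
```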
To me, the definition of "nearest" is just a technicality.
The real question is: what is K?
For any given clustering task of interest, there is no single value of K.
Clustering & unsupervised machine learning is as much about creating meaning and structure as it is about discovering or revealing it.
Take the case of biological taxonomy: what K will best segment the animal kingdom?
There is no true value of K. If your answer is for a child, maybe it's the six we're taught in school: mammals, birds, reptiles, amphibians, fish, and invertebrates.
If your answer is for a zoologist, obviously this won’t do.
Every clustering task of interest is like this. And I say "of interest" because clustering things like the digits in the classic MNIST dataset is better posed as a classification problem: the categories are defined analytically.
You find that many clusters and shoehorn the consultant-provided categories onto the k clusters you obtain.
For the data I work with at $dayjob, I've found the silhouette method to perform best for choosing K, but I assume that is extremely field-specific. Clustering your data and taking a representative sample of each cluster is such a powerful trick for making big data small, but finding an appropriate K is more art than science.
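As a rough illustration of that approach, here is a sketch of a silhouette sweep using Spark MLlib's stock ClusteringEvaluator rather than the posted library; `df` and the column names are assumptions standing in for whatever DataFrame and feature column you actually have.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Assumes `df` is a DataFrame with a "features" vector column.
val evaluator = new ClusteringEvaluator()   // silhouette is the default metric
  .setFeaturesCol("features")
  .setPredictionCol("prediction")

// Sweep a range of k and score each clustering by mean silhouette.
val scores = (2 to 12).map { k =>
  val model = new KMeans().setK(k).setSeed(1L).fit(df)
  k -> evaluator.evaluate(model.transform(df))
}

// Higher silhouette is better; the winner still deserves a human sanity check.
val (bestK, bestScore) = scores.maxBy(_._2)
```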
I do a bit of work in geospatial analysis, and hotspots are better represented by DBSCAN (you don't need to assign every point to a cluster). I don't use clustering very often in that gig, though; supervised ML and anomaly detection are much more prevalent in the rest of my work.
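For readers who haven't used it, here is a small self-contained DBSCAN sketch in plain Scala (not tied to any geospatial library; `eps` and `minPts` are the usual DBSCAN parameters). It illustrates the property the comment points at: low-density points keep the label -1 and are never forced into a cluster.

```scala
import scala.collection.mutable

object DbscanSketch {
  type Point = (Double, Double)

  def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per point: 0, 1, 2, ... for clusters, -1 for noise.
  def dbscan(points: Vector[Point], eps: Double, minPts: Int): Array[Int] = {
    val labels = Array.fill(points.length)(-2)             // -2 = not yet visited
    def neighbors(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var cluster = -1
    for (i <- points.indices if labels(i) == -2) {
      val seeds = neighbors(i)
      if (seeds.size < minPts) labels(i) = -1               // not a core point: noise
      else {
        cluster += 1
        labels(i) = cluster
        val queue = mutable.Queue(seeds: _*)
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == -1) labels(j) = cluster           // border point reclaimed from noise
          if (labels(j) == -2) {
            labels(j) = cluster
            val jn = neighbors(j)
            if (jn.size >= minPts) jn.foreach(queue.enqueue(_))  // expand only from core points
          }
        }
      }
    }
    labels                                                   // -1 entries stay unclustered
  }
}
```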
## Title
Generalized K-Means Clustering for Apache Spark with Bregman Divergences
## Body (3,982 characters)
I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.
*Why use this instead of Spark MLlib?*
MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:
- *Probability distributions* (topic models, histograms): KL divergence is the natural metric. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions.
- *Audio/spectral data*: Itakura-Saito respects multiplicative power spectra. Euclidean incorrectly treats -20dB and -10dB as closer than -10dB and 0dB (illustrated in the sketch after this list).
- *Count data* (traffic, sales): Generalized-I divergence for Poisson-distributed data.
- *Outlier robustness*: L1/Manhattan gives median-based clustering vs mean-based (L2).
Using the wrong divergence yields mathematically valid but semantically meaningless clusters.
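To unpack the audio/spectral bullet, here is a minimal sketch in plain Scala (illustrative only, not the library's implementation). On linear power values the two 10 dB gaps look wildly different to squared Euclidean distance, while the scalar Itakura-Saito divergence x/y - log(x/y) - 1 scores them identically because it only cares about the ratio.

```scala
object ItakuraSaitoSketch {
  // Convert a dB level to linear power: -20 dB -> 0.01, -10 dB -> 0.1, 0 dB -> 1.0.
  def dbToPower(db: Double): Double = math.pow(10.0, db / 10.0)

  def squaredEuclidean(x: Double, y: Double): Double = (x - y) * (x - y)

  // Itakura-Saito divergence for positive scalars: x/y - log(x/y) - 1.
  def itakuraSaito(x: Double, y: Double): Double =
    x / y - math.log(x / y) - 1.0

  def main(args: Array[String]): Unit = {
    val (a, b, c) = (dbToPower(-20), dbToPower(-10), dbToPower(0))
    // Squared Euclidean: 0.0081 vs 0.81 -- two orders of magnitude apart.
    println(f"SE(-20dB,-10dB)=${squaredEuclidean(a, b)}%.4f  SE(-10dB,0dB)=${squaredEuclidean(b, c)}%.4f")
    // Itakura-Saito: both pairs score ~1.4026, since both are 10x power ratios.
    println(f"IS(-20dB,-10dB)=${itakuraSaito(a, b)}%.4f  IS(-10dB,0dB)=${itakuraSaito(b, c)}%.4f")
  }
}
```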
*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean
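For context, the divergences listed above are all Bregman divergences, each generated by a different convex function φ; the definition below is standard textbook material rather than something quoted from the post.

```latex
% Bregman divergence generated by a differentiable, strictly convex \varphi:
\[
  D_{\varphi}(x, y) \;=\; \varphi(x) - \varphi(y) - \langle \nabla \varphi(y),\, x - y \rangle .
\]
% Example generators (assumed, not from the post):
%   \varphi(x) = \lVert x \rVert^2       -> squared Euclidean distance
%   \varphi(x) = \sum_i x_i \log x_i     -> KL divergence on the probability simplex
%                                           (generalized-I divergence in general)
%   \varphi(x) = -\sum_i \log x_i        -> Itakura-Saito divergence
```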
*What's included:*
- 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids
- Drop-in MLlib replacement (same DataFrame API)
- 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13)
- Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM)
- Python and Scala APIs
*Example:*
```scala
// Clustering topic distributions from LDA
val topics: DataFrame = // probability vectors

// WRONG: MLlib with Euclidean
new org.apache.spark.ml.clustering.KMeans()
  .setK(10)
  .fit(topics)

// CORRECT: KL divergence for probabilities
new GeneralizedKMeans()
  .setK(10)
  .setDivergence("kl")
  .fit(topics)

// For standard data, drop-in replacement:
new GeneralizedKMeans()
  .setDivergence("squaredEuclidean")
  .fit(numericData)
```
*Quick comparison:*
| Use Case | MLlib | This Library |
|----------|-------|--------------|
| General numeric | L2 | L2 (compatible) |
| Probability distributions | Wrong | KL divergence |
| Outlier-robust | | L1 or KMedoids |
| Auto k selection | | XMeans (BIC/AIC) |
| Fuzzy clustering | | SoftKMeans |
*Performance:* ~870 pts/sec (SE), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.
*Production-ready:*
- Cross-version model persistence
- Scalability guardrails (chunked assignment)
- Determinism tests (same seed → identical results)
- Performance regression detection
- Executable documentation
GitHub: https://github.com/derrickburns/generalized-kmeans-clusterin...
This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!