Diverse LLM subsets via k-means (100K-1M) [Pretraining, IF, Reasoning] | Not Hacker News!