Faster Argmin on Floats
Posted 4 months ago · Active 3 months ago
algorithmiker.github.io · Tech · story
Tone: calm, positive
Debate: 20/100
Key topics
Algorithm Optimization
Float Operations
Performance Tuning
The post discusses optimizing the argmin operation on floats by reinterpreting them as integers, and the discussion explores related performance optimization techniques and trade-offs.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 2d after posting
Peak period: 8 comments in 36-42h
Avg / period: 3.7
Comment distribution: 11 data points (based on 11 loaded comments)
Key moments
- 01 Story posted: Sep 18, 2025 at 12:20 PM EDT (4 months ago)
- 02 First comment: Sep 20, 2025 at 12:25 AM EDT (2d after posting)
- 03 Peak activity: 8 comments in 36-42h (the hottest window of the conversation)
- 04 Latest activity: Sep 20, 2025 at 6:32 PM EDT (3 months ago)
ID: 45291538 · Type: story · Last synced: 11/20/2025, 5:27:03 PM
I wonder whether that could be made faster by using AVX instructions; they allow finding the minimum value among several u32 values, but not immediately its index.
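A minimal sketch of that two-pass idea in C, assuming AVX2 (compile with -mavx2) and an input length that is a nonzero multiple of 8; argmin_u32_avx2 is an illustrative name, not code from the post. The vector pass finds the minimum value, and a scalar rescan recovers its first index:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch, assuming AVX2 and n a nonzero multiple of 8: pass 1 finds
       the minimum value 8 lanes at a time; pass 2 rescans for its index. */
    size_t argmin_u32_avx2(const uint32_t *a, size_t n) {
        __m256i vmin = _mm256_loadu_si256((const __m256i *)a);
        for (size_t i = 8; i < n; i += 8)
            vmin = _mm256_min_epu32(vmin, _mm256_loadu_si256((const __m256i *)(a + i)));

        /* Reduce the 8 lanes to a single scalar minimum. */
        uint32_t lanes[8];
        _mm256_storeu_si256((__m256i *)lanes, vmin);
        uint32_t m = lanes[0];
        for (int k = 1; k < 8; k++)
            if (lanes[k] < m) m = lanes[k];

        /* Scalar rescan: return the first index holding the minimum. */
        size_t idx = 0;
        while (a[idx] != m) idx++;
        return idx;
    }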
e.g. using 4 accumulators instead of 1 accumulator in the naive for loop gives me around a 15%-20% speedup (not using Rust; extremely scalar, terribly naive C code via g++ with -funroll-all-loops -march=native -O3).
If we express argmin via the obvious C-style naive for loop, or via a functional reduce with a single accumulator, we're forcing a chain dependency that isn't really part of the problem. But if we don't care which minimizing index we get (when there are multiple minimal elements in the array), then instead of evaluating the reduction in a single rigid chain bound to one accumulator, we can break the chain and get our hardware to do more work in parallel, even while single-threaded.
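A minimal sketch of the broken-chain idea in plain C (argmin4 is an illustrative name; assumes n is a nonzero multiple of 4 and that any minimal index is acceptable). Four independent (value, index) accumulators walk stride-4 lanes, so the comparisons no longer form one serial chain:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: four independent accumulators, merged once at the end.
       Assumes n is a nonzero multiple of 4. */
    size_t argmin4(const uint32_t *a, size_t n) {
        uint32_t v0 = a[0], v1 = a[1], v2 = a[2], v3 = a[3];
        size_t   i0 = 0,    i1 = 1,    i2 = 2,    i3 = 3;
        for (size_t i = 4; i < n; i += 4) {
            if (a[i]     < v0) { v0 = a[i];     i0 = i;     }
            if (a[i + 1] < v1) { v1 = a[i + 1]; i1 = i + 1; }
            if (a[i + 2] < v2) { v2 = a[i + 2]; i2 = i + 2; }
            if (a[i + 3] < v3) { v3 = a[i + 3]; i3 = i + 3; }
        }
        /* Merge the four partial winners. */
        if (v1 < v0) { v0 = v1; i0 = i1; }
        if (v2 < v0) { v0 = v2; i0 = i2; }
        if (v3 < v0) { v0 = v3; i0 = i3; }
        return i0;
    }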
anonymoushn is doing something much cleverer again using intrinsics, but there's still that same idea of "how do we break the dependency chain between different operations so the CPU can kick them off in parallel?"
you can consult the list of toys here: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
This reminds me of a trick to sort floats faster, even when they include negatives, NaNs, and infinities: map each float to a sortable integer version of itself, so that comparing the ints gives the desired float order (the precise mapping depends on how you want to order things like NaN). The one-time conversion is fast and pays off over the ~lg(n) comparisons each element participates in. Then, after sorting, map them back.
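A minimal sketch of one such mapping for IEEE-754 single-precision floats (the helper names are illustrative): negative floats get all their bits flipped, non-negative ones get the sign bit set, so plain unsigned comparison agrees with numeric order, with NaNs landing beyond the infinities on either side:

    #include <stdint.h>
    #include <string.h>

    /* Forward map: unsigned order of the result matches float order. */
    static uint32_t float_to_ordered(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);   /* bit-cast without aliasing UB */
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }

    /* Inverse map, applied after sorting. */
    static float ordered_to_float(uint32_t u) {
        u = (u & 0x80000000u) ? (u ^ 0x80000000u) : ~u;
        float f;
        memcpy(&f, &u, sizeof f);
        return f;
    }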