Continuous Nvidia Cuda Profiling in Production
Posted 3 months ago · Active 2 months ago
polarsignals.com · Tech · story
Sentiment: supportive, positive
Debate: 20/100
Key topics: GPU Profiling, Nvidia Cuda, Performance Optimization
The article discusses a new approach to continuous Nvidia CUDA profiling in production environments, with the author engaging with the community to gather feedback and questions.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 7d
Peak period: 7 comments in 168-180h
Avg / period: 5
Key moments
- 01 Story posted: Oct 22, 2025 at 10:05 AM EDT (3 months ago)
- 02 First comment: Oct 29, 2025 at 9:07 AM EDT (7d after posting)
- 03 Peak activity: 7 comments in 168-180h (hottest window of the conversation)
- 04 Latest activity: Oct 29, 2025 at 2:39 PM EDT (2 months ago)
ID: 45669377 · Type: story · Last synced: 11/20/2025, 12:41:39 PM
[ All from my experience on home GPUs, and in a lab with 2 nodes, each with 2× 80GB H100. Not extensively benchmarked ]
Events like kernel launches, which this profiler reads right now, carry a very small overhead (1-2%). Kernel-level metrics like DRAM utilisation, cache hit rate, SM occupancy, etc. usually add a 5-10% overhead. If you want to plot a flame graph at the instruction level (mostly useful for learning purposes), then things go off the rails; I have seen even 25% overhead. And finally, full traces add tons of overhead, but that's pretty much expected: they produce GBs of profiling data anyway.
[0] https://en.wikipedia.org/wiki/Hardware_performance_counter
https://www.parca.dev/docs/quickstart/
I feel like I've seen CUPTI have fairly high overhead depending on the CUDA version, but I'm not very confident -- did you happen to benchmark different workloads with CUPTI on/off?
---
If you're taking feature requests: a way to subscribe to -- and get tracebacks for -- CUDA context creation would be very useful; I've definitely been surprised by finding processes on the wrong GPU, and being able to easily figure out where they came from would be great.
I did a hack using LD_PRELOAD to subscribe to and publish the event, but never really followed through on getting the Python stack trace.