H100 PCIe – 1.86 TB/s Memcpy Roofline and 8× Uplift
Key topics
Using a simple LLM decode cost model (bytes per token, BPT = 1.13 MB/token), throughput improved from ~161.9k tok/s to ~225.1k tok/s (≈1.39×), as sketched below. This suggests memory-bound operations such as KV-cache reads and strided loads can be lifted closer to the roofline bandwidth, with a direct impact on decode throughput.
I’m interested in feedback on how such memory-bound optimizations might affect LLM training versus inference, and on which good public long-context (8k–32k) benchmarks would be worth testing next?
The author reports significant performance improvements in memory-bound operations on H100 PCIe GPUs, with potential applications in LLM inference.
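For concreteness, here is a minimal sketch of the decode cost model referenced above, assuming decode is purely bandwidth-bound so that tokens/s ≈ effective memory bandwidth ÷ bytes per token (BPT). The BPT value and the two token rates come from the post; the intermediate bandwidth figures are back-solved from those rates, and the roofline line is an illustrative upper bound at the quoted 1.86 TB/s, not a measured result.

```python
# Back-of-the-envelope decode cost model (sketch, not the author's code).
# For a bandwidth-bound decode step: tokens/s ~= effective_bandwidth / BPT.

BPT = 1.13e6  # bytes moved per decoded token (from the post)

def decode_tokens_per_sec(effective_bw_bytes_per_s: float, bpt: float = BPT) -> float:
    """Tokens/s for a purely bandwidth-bound decode step."""
    return effective_bw_bytes_per_s / bpt

# Effective bandwidths implied by the quoted throughputs (back-solved, hypothetical):
baseline_bw = 161.9e3 * BPT    # ~183 GB/s
optimized_bw = 225.1e3 * BPT   # ~254 GB/s

print(f"baseline : {decode_tokens_per_sec(baseline_bw):,.0f} tok/s")
print(f"optimized: {decode_tokens_per_sec(optimized_bw):,.0f} tok/s")
print(f"speedup  : {optimized_bw / baseline_bw:.2f}x")            # ~1.39x

# Upper bound if decode ran at the 1.86 TB/s memcpy roofline:
print(f"roofline : {decode_tokens_per_sec(1.86e12):,.0f} tok/s")  # ~1.65M tok/s
```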
Snapshot generated from the HN discussion