Necessary Tool? Async LoRA for Distributed Systems
Key topics
• Aggregator/worker setup where the aggregator hands out small “leases” of work, sized in tokens rather than time slices (see the lease sketch after this list).
• Async checkpointing so progress gets saved continuously without pausing training (sketched below).
• Preemption handling: when a worker dies, whatever it managed to complete still counts, and only the remaining work gets reassigned (the lease sketch below covers this).
• Training-aware logic (steps, tokens, loss) instead of treating jobs like black-box containers (see the training-loop sketch below).
• Out-of-the-box hooks for PyTorch/DeepSpeed so you don’t have to glue it all together yourself.

My goal is to make sketchy clusters behave more like reliable ones.
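To make the lease idea concrete, here’s a rough sketch of what I mean by token-sized leases and preemption-driven reassignment. The names (`Lease`, `Aggregator`, `grant_lease`, `on_worker_lost`) are illustrative only, not the actual API:

```python
import uuid
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only -- Lease/Aggregator are not the real Async LoRA API.

@dataclass
class Lease:
    lease_id: str
    start_token: int      # first token index covered by this lease
    num_tokens: int       # lease size is a token budget, not a time slice
    tokens_done: int = 0  # progress the worker has reported so far

class Aggregator:
    def __init__(self, total_tokens: int, lease_tokens: int = 1_000_000):
        self.lease_tokens = lease_tokens
        self.total_tokens = total_tokens
        self.next_token = 0
        self.pending: deque[Lease] = deque()   # reissued remainders wait here
        self.outstanding: dict[str, Lease] = {}

    def grant_lease(self) -> Optional[Lease]:
        """Hand an idle worker the next token-sized slice of work."""
        if self.pending:
            lease = self.pending.popleft()
        elif self.next_token < self.total_tokens:
            size = min(self.lease_tokens, self.total_tokens - self.next_token)
            lease = Lease(str(uuid.uuid4()), self.next_token, size)
            self.next_token += size
        else:
            return None
        self.outstanding[lease.lease_id] = lease
        return lease

    def report_progress(self, lease_id: str, tokens_done: int) -> None:
        """Workers report partial progress; it counts even if they die later."""
        self.outstanding[lease_id].tokens_done = tokens_done

    def on_worker_lost(self, lease_id: str) -> None:
        """Preemption: keep what finished, requeue only the remainder."""
        dead = self.outstanding.pop(lease_id)
        remaining = dead.num_tokens - dead.tokens_done
        if remaining > 0:
            self.pending.append(
                Lease(str(uuid.uuid4()), dead.start_token + dead.tokens_done, remaining)
            )
```

The point is that a lease is a token budget, so partial progress is meaningful: when a worker disappears, only the unfinished tokens go back in the queue.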
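Async checkpointing follows the usual snapshot-then-write pattern: copy the state quickly on the training thread, then do the slow disk write in the background. This is a generic sketch of that pattern, not the library’s actual implementation:

```python
import copy
import threading
import torch

def async_checkpoint(model, optimizer, step, path):
    """Snapshot state on the training thread, write it out on a background thread.

    Generic sketch of the pattern: training only pauses for the cheap in-memory
    copy, not for the disk I/O.
    """
    state = {
        "step": step,
        "model": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        "optimizer": copy.deepcopy(optimizer.state_dict()),
    }

    def _write():
        torch.save(state, path)

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # join() before exiting if the write must be durable
```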
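And here’s the kind of training-aware reporting I mean, as a sketch: a plain PyTorch loop that runs until the lease’s token budget is spent and reports steps/tokens/loss instead of just “container alive”. The `report` callback is hypothetical, and I’m assuming a HuggingFace-style model whose forward pass returns `.loss`:

```python
from torch.utils.data import DataLoader

def train_lease(model, optimizer, loader: DataLoader, lease_tokens: int, report):
    """Run until the lease's token budget is spent, reporting training-aware progress.

    `report` is a hypothetical callback standing in for whatever the aggregator
    exposes; batches are assumed to be (input_ids, labels) pairs.
    """
    tokens_seen, step = 0, 0
    for input_ids, labels in loader:
        optimizer.zero_grad()
        loss = model(input_ids, labels=labels).loss  # HF-style causal-LM forward
        loss.backward()
        optimizer.step()

        step += 1
        tokens_seen += input_ids.numel()
        report(step=step, tokens=tokens_seen, loss=loss.item())

        if tokens_seen >= lease_tokens:  # lease exhausted: hand back to aggregator
            break
    return tokens_seen
```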
I’d love feedback from people here:
• If you run training on spot/preemptible GPUs, how do you usually handle checkpoints/failures?
• What would make this easier to drop into an existing pipeline (Airflow, K8s, Ray, etc.)?
• For monitoring, would you rather see native training metrics (loss, tokens, staleness), or should it just surface logs/events and let you plug them into your own stack?