GPU allocation · Kaggle · Torch XLA · Large Language Models
Kaggle announced that they are replacing their TPU v3-8s with v5e-8s, but for some reason I get an OOM when running my code on a v5e-8 and not on a v3-8. Does anybody know why this might be happening? For reference, I am training a 1.5B GPT model using Torch XLA.
Synthesized Answer
Based on 0 community responses
The difference in behavior between TPU v3-8 and v5e-8 when training a 1.5B GPT model with Torch XLA is most plausibly explained by differences in memory architecture and allocation between the two generations. Although both configurations expose roughly 16 GiB of HBM per device, v5e has a different memory layout and can exhibit different fragmentation behavior than v3, so a model that just barely fit on v3-8 may no longer fit once compiler padding and allocator overhead are accounted for. Torch XLA's runtime defaults and optimization strategies may also behave differently on the v5e architecture. A practical path forward is to inspect per-device memory usage on both platforms, then shrink the model's activation footprint (for example with a smaller per-device batch size or gradient checkpointing) until the OOM disappears.
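As a first debugging step, it can help to dump what each device actually reports before and after a training step. Below is a minimal sketch, assuming a reasonably recent torch_xla install; note that the exact keys returned by get_memory_info() vary across torch_xla versions, so the dict is printed wholesale rather than relying on specific field names:

```python
# Minimal sketch for inspecting per-device HBM usage with Torch XLA.
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()

# Per-device HBM stats (used vs. available); key names differ
# between torch_xla versions, so print the whole dict.
print(xm.get_memory_info(device))

# Full XLA metrics report: compile counts, host-to-device transfer
# counts, and allocator statistics. Useful for spotting unexpected
# recompilations or fragmentation that inflate memory use.
print(met.metrics_report())
```

Running this at the same point in training on both a v3-8 and a v5e-8 should make it clear whether the v5e run is genuinely closer to its per-device limit or whether something like repeated recompilation is the culprit.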
Key Takeaways
Check per-device memory usage patterns on v5e-8 (see the inspection sketch above)
Revisit Torch XLA's runtime and memory settings for v5e
Reduce the model's memory footprint or per-device batch size, e.g. via gradient checkpointing (see the sketch below)
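If inspection confirms that activations are what push the device over its HBM limit, gradient checkpointing is the usual lever. Below is a hedged sketch using the stock torch.utils.checkpoint API (torch_xla also ships its own wrapper in torch_xla.utils.checkpoint, which may be preferable on TPU); the block structure is a hypothetical stand-in for the poster's actual 1.5B model:

```python
# Sketch of activation (gradient) checkpointing for a GPT-style model.
# Trades recompute for HBM: activations inside each wrapped block are
# not stored for backward but recomputed on demand.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Runs each transformer block under activation checkpointing."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended mode in recent
            # PyTorch releases and interacts more predictably with XLA.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Pairing this with a smaller per-device batch size, while increasing gradient accumulation steps to keep the effective batch constant, is often enough to recover the needed headroom on its own.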