GPU allocation · Kaggle · Torch XLA · Large Language Models
Kaggle announced that they are replacing their TPU v3-8s with v5e-8s, but for some reason I get an OOM when running my code on a v5e-8 and not on a v3-8. Does anybody know why this might be happening? For reference, I am training a 1.5B GPT model using Torch XLA.
Synthesized Answer
Based on 0 community responses
The difference in behavior between TPU v3-8 and v5e-8 when training a 1.5B GPT model with Torch XLA is most plausibly explained by differences in memory architecture and allocation between the two generations. Although both configurations expose roughly 16 GiB of HBM per device, v5e has a different memory layout and can exhibit different fragmentation behavior than v3, so a model that just barely fit on v3-8 may no longer fit once compiler padding and allocator overhead are accounted for. Torch XLA's runtime defaults and optimization strategies may also behave differently on the v5e architecture. A practical path forward is to inspect per-device memory usage on both platforms, then shrink the model's activation footprint (for example with a smaller per-device batch size or gradient checkpointing) until the OOM disappears.
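As a first debugging step, it can help to dump what each device actually reports before and after a training step. Below is a minimal sketch, assuming a reasonably recent torch_xla install; note that the exact keys returned by get_memory_info() vary across torch_xla versions, so the dict is printed wholesale rather than relying on specific field names:

```python
# Minimal sketch for inspecting per-device HBM usage with Torch XLA.
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()

# Per-device HBM stats (used vs. available); key names differ
# between torch_xla versions, so print the whole dict.
print(xm.get_memory_info(device))

# Full XLA metrics report: compile counts, host-to-device transfer
# counts, and allocator statistics. Useful for spotting unexpected
# recompilations or fragmentation that inflate memory use.
print(met.metrics_report())
```

Running this at the same point in training on both a v3-8 and a v5e-8 should make it clear whether the v5e run is genuinely closer to its per-device limit or whether something like repeated recompilation is the culprit.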
Key Takeaways
Check per-device memory usage patterns on v5e-8 (see the inspection sketch above)
Revisit Torch XLA's runtime and memory settings for v5e
Reduce the model's memory footprint or per-device batch size, e.g. via gradient checkpointing (see the sketch below)
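If inspection confirms that activations are what push the device over its HBM limit, gradient checkpointing is the usual lever. Below is a hedged sketch using the stock torch.utils.checkpoint API (torch_xla also ships its own wrapper in torch_xla.utils.checkpoint, which may be preferable on TPU); the block structure is a hypothetical stand-in for the poster's actual 1.5B model:

```python
# Sketch of activation (gradient) checkpointing for a GPT-style model.
# Trades recompute for HBM: activations inside each wrapped block are
# not stored for backward but recomputed on demand.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Runs each transformer block under activation checkpointing."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended mode in recent
            # PyTorch releases and interacts more predictably with XLA.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Pairing this with a smaller per-device batch size, while increasing gradient accumulation steps to keep the effective batch constant, is often enough to recover the needed headroom on its own.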