Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput (github.com)

And it'll run at like 40 t/s, depending on which one you have.
Prompt: Create a solar system simulation in a single self-contained HTML file.
qwen3-next-80b, 4-bit (MLX format, 44.86 GB): 42.56 tok/sec, 2523 tokens, 12.79s to first token
- note: looked like ass, simulation broken, didn't work at all.
Then, as a comparison against a similarly sized model, I tried GLM.
GLM-4-32B-0414, 8-bit (MLX format, 36.66 GB): 9.31 tok/sec, 2936 tokens, 4.77s to first token
- note: looked fantastic for a first try, everything worked as expected.
Not a fair comparison (4-bit vs 8-bit), but it's some data. The tok/sec on a Mac is pretty good depending on which model you use. (The 80B being so much faster than a dense 32B isn't a typo: Qwen3-Next-80B-A3B is a mixture-of-experts model with only ~3B parameters active per token.)
mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit --trust-remote-code --port 4444
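The server speaks an OpenAI-compatible API, so once it's up you can poke it with curl. A minimal sketch, assuming the default /v1/chat/completions route and the port above:

curl http://localhost:4444/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Create a solar system simulation in a single self-contained HTML file."}], "max_tokens": 4096}'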
I'm not sure if there's support for Qwen3-Next in any mlx-lm release yet; when I set up the Python environment I had to install mlx_lm from source.
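If you need to do the same, a source install is roughly this (a sketch, assuming the project still lives at ml-explore/mlx-lm):

git clone https://github.com/ml-explore/mlx-lm.git
cd mlx-lm
pip install -e .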