GPT-Oss Reinforcement Learning
Posted3 months agoActive3 months ago
docs.unsloth.aiTechstory
supportivemixed
Debate
60/100
Reinforcement LearningGPT-OssAI Model Training
Key topics
Reinforcement Learning
GPT-Oss
AI Model Training
The post introduces Unsloth's new feature for GPT-OSS reinforcement learning, sparking a discussion about the feasibility and accessibility of RL training for open-source models.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagementFirst comment
4h
Peak period
8
12-14h
Avg / period
3.9
Comment distribution43 data points
Loading chart...
Based on 43 loaded comments
Key moments
- 01Story posted
Sep 26, 2025 at 10:01 PM EDT
3 months ago
Step 01 - 02First comment
Sep 27, 2025 at 1:49 AM EDT
4h after posting
Step 02 - 03Peak activity
8 comments in 12-14h
Hottest window of the conversation
Step 03 - 04Latest activity
Sep 27, 2025 at 9:17 PM EDT
3 months ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
ID: 45392744Type: storyLast synced: 11/20/2025, 1:48:02 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
But what's the point? GPT-OSS is regarded as a pretty bad open source model compared to the latest deepseek or qwen releases. Most attempts to use Reinforcement Learning or even any kind of post-training fail in that the data you have is of worse quality and quantity than the data that the model was originally trained on. So you get catastrophic forgetting and a model with lower general IQ than before fine-tuning.
This is true btw even if you use lora or better techniques to supposedly "mitigate" catastrophic forgetting. Even pyreft/reft, which in some cases impact only 0.001% of a models parameters, cause these kind of issues in my experiments.
So why should anyone except AI researchers and the big 4 AI providers care about fine-tuning? The vast majority of people who think they need fine-tuning need good quality RAG/Agentic RAG systems, since they can trivially add or remove data to their model (machine unlearning doesn't work yet), also ground models and objectively makes them more accurate, and fully manipulate and manage how it's used in their prompts context. On top of that, vector DBs/embeddings "easily" scale to billions of records.
Also, most mainstream AI benchmarks do not agree with you:
LLMarena (https://lmarena.ai/leaderboard) has GPT-OSS 120B as #53 and GPT-OSS 20B as #69 (nice), which is extremely far from leading.
DeepSeek V3.1 is ranked #9, and is a solid 60+ elo points above GPT-OSS.
I know you're going to link some of the "ya but chatbot arena sucks cus of theoretical attacks against it" paper and the llama4 debacle, but here's more evidence that GPT-OSS blows:
https://livebench.ai/#/?q=GPT-oss
GPT-oss global average: 54.60
Deepseek V3.1 thinking global average: 70.75
Qwen 3 32B global average: 63.71
So bring receipts next time because I did.
I never claimed it was a frontier model. Just best in class for the performance it can achieve and the memory footprint it can fit in.
And btw OSS did super well on domain specific tests without fine tuning. A model I don’t need to fine tune beats one that does.
Which means it has ~3b parameters active per token.
Qwen3-32b has 32b params active per token
Dense model means 32B parameters => 32B get used in calculation for every token. Every calculation takes time, and assuming similar latent space size (which they all have). For example Qwen-32B
MoE model has for example 80B parameters, but only 3B get used in calculation for any given token. For example Qwen3-Next 80B A3B
Performance comparison:
Qwen-32B => 56 tok/sec, 32 GB of VRAM
Qwen3-Next 80B A3B => 167 tok/sec, 85 GB of VRAM
So despite being close to 3x "bigger", Qwen3-Next is more than 3 times faster with the same compute capacity. There's a but though. But because what gets activated from one token to the next is a different subset of the model, it is still critical to have all 80B parameters loaded into memory.
So MoE performs much better with less compute, at the cost of more memory. It also performs better on benchmarks, it is a better model.
Similar techniques have long been used in ML to great success, rather than trying to create one brilliant model, create many that each have pros and cons, and then train a second model to figure out the best model for the task in front of you. There's even a name for the practice "ensemble models". MoE only kind-of fits (because you can't easily swap out models)
There's other factors, the big one being attention. That's why non-attention models, like MAMBA, will wipe the floor in terms of performance per flop (a compute unit), with anything else. When it comes to intelligence however ...
Dense models are better for a reason, and the idea that "everyone is doing MoE now and dense models are dead" is total bunk nonsense.
You can quantize dense models, and 4 bit quantized Qwen 32B is still better than full precision GPT-OSS. Luckily Unsloth even gives you tools to go down to 1.58bits!
1. https://guessix.com/
And if you finetune on a few formatted examples the effect is even greater
I just tried to ask it how to make crystal meth and it generated a very detailed step by step guide
Some of the qwen models are too, but they seem to need a bit more handholding.
This is of course just anecdotal from my end. And I've been slacking on keeping up with evals while testing at home
The primary goal of the release and our notebook https://colab.research.google.com/github/unslothai/notebooks... was actually to showcase how to mitigate reward hacking in reinforcement learning - for example when RL learns to cheat and output global variables instead like editing the timer to cheat on benchmarking and others. You can edit the notebook to do rl on other powerful models like Qwen, Llama etc automatically with Unsloth as well via our automatic compiler! We also made sink attention and moe inference super optimized for training - note flash attention 3 doesn't have sink backwards support so you'll have to use unsloth.
Gpt-oss tbh in our tests is a truly powerful model, especially the 120b variant - it's extremely popular in western enterprises since yes it's from openai but also because reasoning mode high and the censored nature and its reasoning capabilities are attractive. A big underutilized feature is its web search and internal intermediate tool calling which it can do as part of its reasoning chain just like o3 or gpt5.
RL yes isn't an all powerful hammer, but it can solve so many more new problems. For a financial institution, you can make automatic trading strategies via RL. For an intelligence agency, decryption via RL. For a legal startup, possibly case breakthroughs via RL, automatic drug candidates etc. And yes, big labs want to automate all tasks via massive RL for eg being able to play pokemon and all other games as one example. RL opened so many doors since you don't need any data, just one prompt like "make fast matrix multiplications kernels", and reward functions - it can allow many more interesting use cases where data is a constraint!
Furthermore, I’m very worried that whoever may be paying for this is barking up the wrong tree. I feel that the damage done with extremely bad decryption attempts would massively outweigh the very few times when whatever it “decrypts” is meaningfully close to what the actual text was.
I’m aware of how easy certain things in surveillance are (I.e n-gram analysis is enough to dox anyone on HN in like 10 words of text) - but even sort of decent decryption of SHA-256 would be a literally front page of the world achievement.
sha-256 is used in the construction of certain encryption algorithms as a primitive but by itself never encrypts anything. If it did you’ve also got middleout compression invented since you could encrypt arbitrary length input into 256 bits of output.
I’ve also seen a lot of enterprises suddenly invest in fine-tuning their own models, even though there is absolutely no reason they should be doing that. But because there’s a huge “we need to do something with AI” directive from the top. Eg “if we fine tune an AI model to understand our custom query language it will prevent our data scientists from taking down production databases”. This is an actual example I encountered just last week.
So if there’s a point, it’s probably not that it’s the best idea, but rather than enterprises are willing to buy it.
I currently have Qwen3 Coder 30B A3B on prem for developers to use and it's pretty good for that: fits within two Nvidia L4 cards (with quantization), has about 60 tok/s performance and can even use tools with something like RooCode or OpenWebUI.
However, if anyone asks it stuff in Latvian, it does mess up more often than not, like someone who has half-learned the language: more or less uses the words that represent the correct concepts, but not always the best pick and really often missing or wrongly used diacritics (ā, č, ē, ģ, ī, ķ, ļ, ņ, š, ū, ž). In a word, some basis of the language is there, but sadly in practice it's still unusable.
So far working with Latvian text I need to maintain EuroLLM running in parallel, which has great Latvian knowledge, but at the same time just knows less (not a model that's good for programming) and doesn't know how to call tools as well and isn't really meant for long contexts: https://huggingface.co/collections/utter-project/eurollm-66b...
So my ideal model (for this use case) would be something along the lines of: around 30B since can't fit anything much bigger into the VRAM for now, MoE, supports tool calling, okay programming knowledge, has a long enough context support, can converse in Latvian so I don't need to run 2 models (which means that the context sizes that can fit within VRAM are way too small for either of the models).
Without finetuning, I just have to sit and wait for someone to release that and it feels unlikely that it'll just pop into existence (even the bigger EuroLLM model is nowhere to be seen, TildeOpen works bad). With finetuning, maybe I have a chance to take the Latvian Wikipedia or whatever other samples of the language being used I can get, filter down to topics I care about, maybe use EuroLLM to generate question/answer pairs for training from that input data and then just run Unsloth over a weekend or something (who knows, maybe that'd be enough to bring Qwen3 up to speed).
Do most people care about that stuff? Probably not, but when you have to deal with a language that's less represented in the training data but will never have the budget to train something from scratch, finetuning is nice to have. RAG can't fix it not knowing how to use language.
There was a post the other day on HN where a dev spent $2k doing RL on an open model and beat the frontier models on Web Voyager.
Research is packed with examples like this yet we keep hearing from the community that training doesn’t work.
Then consider voice/vision modalities where context engineering doesn’t work well at all. I just spent 2 years on the computer use problem and can tell you with absolute certainty that context won’t get you there (we tried and tried). The only thing that meaningfully moved the needle was RL.
I really wish this idea that only big labs can train models would die. It’s really hurting the ecosystem at this point.
The new sleep mode in vLLM is really amazing, and it seems the community hasn’t quite wrapped their heads around how much more accessible this makes RL training.
I’m reading a lot of dismaying posts on this thread, pushing the idea that only big labs should be doing RL. This couldn’t be further from the truth, folks should try it for themselves and see the outcomes!
https://github.com/lukehinds/deepfabric/discussions/334