Launch HN: RunRL (YC X25) – Reinforcement learning as a service
runrl.com
There are also lots of interesting possibilities, such as RLing a model on a bunch of environments and then prompt-optimizing it for each specific one, which seems much better than training and hot-swapping many LoRAs. In any case, _someone_ ought to provide a full RL API, and we're here to do that well!
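For a rough sense of what I mean by "RL once, prompt-optimize per environment", here's a sketch. All of the names here (train_rl, score_prompt, optimize_prompt) are purely illustrative stand-ins, not our actual API:

    # Illustrative sketch only; these are not real RunRL API calls.
    def train_rl(base_model: str, environments: list[str]) -> str:
        # Stand-in: launch one RL job across all environments and
        # return a handle to the updated weights.
        return f"{base_model}-rl"

    def score_prompt(model: str, env: str, prompt: str) -> float:
        # Stand-in: roll the model out in `env` with this system prompt
        # and return the mean reward.
        return float(len(prompt + env) % 7)

    def optimize_prompt(model: str, env: str, candidates: list[str]) -> str:
        # Cheap per-environment specialization: keep one set of weights
        # and pick the best-scoring prompt for each environment.
        return max(candidates, key=lambda p: score_prompt(model, env, p))

    envs = ["browser-tasks", "sql-agent", "code-review"]
    model = train_rl("llama-3.1-8b", envs)        # one shared RL run
    candidates = ["Be concise.", "Think step by step.", "Call tools before answering."]
    prompts = {env: optimize_prompt(model, env, candidates) for env in envs}

The point is that the expensive RL run is shared, and the per-environment adaptation is just a cheap search over prompts rather than separate adapters.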
Curious to see your thoughts and results whenever you get something out.
The way tools currently work in the beta is that you add tools via MCP to the configuration, and they get passed in as additional context for the model. The model might then choose to use a tool during inference; the tool is automatically called and its output is returned as a tool message. If you really want to, you can parse the tool output as part of the reward calculation, but I expect you'd usually base the reward just on the model's completion. I can give more details if there's a specific tool setup you're envisioning!
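As a rough illustration of that split, a reward function for the common case would just grade the final completion and ignore the tool traffic. The (messages, completion) signature below is only for illustration, not a documented interface:

    # Illustrative reward sketch: score the final completion, optionally
    # peek at tool outputs. The signature here is not a documented interface.
    def reward_fn(messages: list[dict], completion: str) -> float:
        # messages holds the full trace: user turns, assistant tool calls,
        # and "tool" role messages containing the MCP tool outputs.
        tool_outputs = [m["content"] for m in messages if m.get("role") == "tool"]

        # Usually you'd just grade the model's completion...
        score = 1.0 if "ANSWER:" in completion else 0.0

        # ...but nothing stops you from also checking what the tools returned.
        if tool_outputs and "error" in tool_outputs[-1].lower():
            score -= 0.25
        return score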
I.e., it's not at all like a typical game, because at no point does "success rate without relying on rollback/savestate-reloading" actually matter. An agent that spends as much compute on abandoned (exploratory) branches as on the path that ends up in the solution the formal verifier confirms, while having a near-100% solve rate on the problems it's fed, is a VERY GOOD agent.
That's because this task, unlike most RL tasks, is one where the agent uses exploration to log an interaction trace that can be trivially, mechanically trimmed down to a verifiable proof of the supplied problem. I.e., the hard part is finding ANY path that solves it without spending exponential amounts of compute brute-forcing over the bounded state space of practical relevance; brute force would take longer than the heat death of the universe, i.e. it's impractical even in theory.
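To make "mechanically trimmed" concrete, here's a toy version of that post-processing. The trace format (each step recording its parent step) is just an assumption for the sketch:

    # Toy trimming of an exploration trace down to the verified proof path.
    # The trace format (steps recording their parent) is an assumption.
    def trim_trace(steps: list[dict], goal_id: int) -> list[dict]:
        by_id = {s["id"]: s for s in steps}
        path = []
        cur = by_id.get(goal_id)
        while cur is not None:                     # walk back from the verified
            path.append(cur)                       # goal to the root, dropping
            cur = by_id.get(cur["parent"])         # all abandoned branches
        return list(reversed(path))                # root-to-goal proof skeleton

    trace = [
        {"id": 0, "parent": None, "action": "intro h"},
        {"id": 1, "parent": 0,    "action": "try simp"},      # abandoned branch
        {"id": 2, "parent": 0,    "action": "apply lemma_a"},
        {"id": 3, "parent": 2,    "action": "exact h"},       # verifier accepts here
    ]
    proof = [s["action"] for s in trim_trace(trace, goal_id=3)]
    # -> ["intro h", "apply lemma_a", "exact h"]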
Most RL tasks want an agent that is particularly good at its task; and while the effort spent to find a proof certainly matters (if only because lower cost means the agent can train on more instances within the same training budget), it's much less relevant than the solve rate itself: the fraction of problems for which any verifiably correct proof sequence can be found at some definable level of effort, expressed as e.g. number of shots, total compute budget per instance, or the ratio of exploration nodes to nodes that end up in the final proof sequence.
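In code, the headline metric I mean is something like the following; the field names are placeholders for whatever you actually log, not a real schema:

    # Placeholder metric sketch: solve rate within an effort budget,
    # with exploration cost tracked separately as a secondary metric.
    def solve_rate(results: list[dict], max_shots: int = 8) -> float:
        # results: one dict per problem, e.g.
        # {"solved": True, "shots_used": 3, "explored_nodes": 412, "proof_nodes": 57}
        solved = [r for r in results if r["solved"] and r["shots_used"] <= max_shots]
        return len(solved) / len(results) if results else 0.0

    def exploration_ratio(r: dict) -> float:
        # How much was spent on dead ends per node that made it into the proof.
        return r["explored_nodes"] / max(r["proof_nodes"], 1)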
Non-benchmark usage would mostly entail semi-synthetic, crowd-sourced datasets: open sub-instances from practical applications of formal verification, plus more-synthetic instances derived from very coarse high-level questions (mechanically broken down into more manageable chunks before the RL agent gets to work), like "given these more specific rules of what is _actually_ UB and what is only UB in ANSI but actually defined in the specific toolchain we use: does that C program over there contain ANY UB?" or "is there ANY way that input to that file/network socket over there could get that program over here to execute arbitrary code?". Given that, there'd be no economic incentive to solve any given instance more than once, beyond what's necessary to keep the RL training process itself stable.
That task also lends itself to semi-online learning, since every supplied instance essentially pays once for a verified solution, so the overall process should deliver solid ROI. Running a single GPU cluster/pile for both training and inference would allow higher utilization, at the cost of some variable latency between rolling out an episode and training on the completed episode's oracle-verified rewards.
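Roughly the loop I have in mind, as a queue-based sketch; all the function names (run_episode, oracle_verify, update_policy) are stand-ins, and real scheduling on shared hardware would be more involved:

    # Semi-online sketch: rollouts and training share one GPU pool,
    # decoupled by a queue, so episodes are trained on with variable lag.
    import queue
    import random
    import threading
    import time

    episodes = queue.Queue()                          # completed rollouts awaiting training

    def run_episode(policy: str) -> dict:
        time.sleep(random.random())                   # stand-in for inference latency
        return {"policy": policy, "solved": random.random() < 0.5}

    def oracle_verify(ep: dict) -> float:
        return 1.0 if ep["solved"] else 0.0           # the formal verifier is the reward oracle

    def update_policy(batch: list) -> None:
        print(f"training step on {len(batch)} episodes")  # stand-in for the actual update

    def rollout_worker(policy: str) -> None:
        while True:
            ep = run_episode(policy)
            ep["reward"] = oracle_verify(ep)
            episodes.put(ep)                          # hand off; the trainer consumes it later

    def trainer(batch_size: int = 8) -> None:
        batch = []
        while True:
            batch.append(episodes.get())              # episodes may be slightly stale;
            if len(batch) >= batch_size:              # that's the variable rollout-to-train lag
                update_policy(batch)
                batch.clear()

    threading.Thread(target=rollout_worker, args=("policy-v0",), daemon=True).start()
    threading.Thread(target=trainer, daemon=True).start()
    time.sleep(10)                                    # let the demo run briefly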
B) Currently the deployed API is free, but startup time is a few minutes and it runs on a small GPU node, so it's not awfully fast. If you'd like more production-level inference, email us at founders@runrl.com and we can set you up with something much faster (where we'd charge per token depending on model size).