Private LLM
synthetic.new
That's funny, that's my favorite coding model as well!
> the rumored sizes of Anthropic's models
Yeah. I've long had a hypothesis that their models are average-sized for a SOTA model but fully dense, like the old Llama 3.1 405B, and that's why their per-token inference costs are insane compared to the competition.
> it's kind of nice to launch without prompt caching, since it means if we're flying too close to the sun on tokens we still have some pretty large levers left to pull on the infra side before needing to do anything drastic with rate limits.
That makes sense.
I'm poor as dirt, and my job actually forbids AI code in the main codebase, so I can't justify even a $20 per month subscription right now (especially when, for experimenting with agentic coding, Qwen Code is currently free, if shitty), but when or if it becomes financially responsible, you will be at the very top of my list.
Then just plug in your Synthetic API key, and you should be able to use any supported model. For example, to use GLM-4.5, you'd pass the following model string: "hf:zai-org/GLM-4.5"
The AI SDK docs are here for using custom base URLs: https://ai-sdk.dev/docs/ai-sdk-core/provider-management
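Roughly, the wiring looks like this (the base URL here is illustrative; use the one from our docs):

```typescript
// Sketch: point the AI SDK's OpenAI provider at an
// OpenAI-compatible endpoint and pass the model string.
import { createOpenAI } from '@ai-sdk/openai';
import { generateText } from 'ai';

const synthetic = createOpenAI({
  baseURL: 'https://api.synthetic.new/v1', // illustrative base URL; check the docs
  apiKey: process.env.SYNTHETIC_API_KEY,
});

const { text } = await generateText({
  model: synthetic('hf:zai-org/GLM-4.5'), // any supported model string
  prompt: 'Say hello.',
});
console.log(text);
```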
You can also join our Discord if you need help! https://synthetic.new/discord should redirect you to our Discord server :)
How are messages counted? For example, in Cursor, one request is 25 tool calls. Does 100 messages in a subscription here mean 100 tool calls or 100 requests each with 25 tool calls?
On privacy, I also have some questions. The policy says requests can only be used for debugging purposes, but it later mentions a license to use requests to improve the platform, which suggests they can be used for more than just debugging.
For requests, it depends on the agent framework to a certain extent. We just count API requests. For frameworks that support parallel tool calls, assuming they're using the standard OpenAI parallel tool call API, the entire parallel batch counts as one request, since it only generates a single API request. I don't know exactly how Cursor structures it, but I'd be surprised if they were making a separate API request for each tool call; I assume they're using the normal parallel tool call API to send all tools in a single batch, which counts as just one request against your quota.
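Concretely, with the OpenAI SDK it looks something like this (model and tool definition are illustrative): one chat completion call can return a whole batch of parallel tool calls, and since it's a single API request, it's a single unit of quota.

```typescript
// Illustrative: one request, potentially many parallel tool calls.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.synthetic.new/v1', // illustrative base URL
  apiKey: process.env.SYNTHETIC_API_KEY,
});

const res = await client.chat.completions.create({
  model: 'hf:zai-org/GLM-4.5',
  messages: [{ role: 'user', content: 'Read README.md and LICENSE.' }],
  tools: [
    {
      type: 'function',
      function: {
        name: 'read_file', // hypothetical tool
        parameters: {
          type: 'object',
          properties: { path: { type: 'string' } },
          required: ['path'],
        },
      },
    },
  ],
});

// The model may return several tool calls in this one response;
// the whole batch still counts as a single API request.
const calls = res.choices[0].message.tool_calls ?? [];
console.log(`${calls.length} tool calls from 1 API request`);
```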
To run the TPS benchmark, just run:
octo bench tps
All it does is ask the model to write a long story without making tool calls (although we do send the tool definitions over, to accurately benchmark differences in tool call serialization/parsing). It usually consumes a little over 1k tokens, so it's fairly cheap to run against different usage-based APIs (and only consumes a single request for subscription APIs that rate limit by request).

Edit: forgot to add that for Qwen3, everything should be running in FP8.
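If you want to eyeball it yourself without octo, the core of the measurement is just this (a sketch, not octo's implementation; base URL and model string are illustrative):

```typescript
// Rough TPS measurement: time one long completion and divide
// output tokens (from the API's usage stats) by wall-clock seconds.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.synthetic.new/v1', // illustrative base URL
  apiKey: process.env.SYNTHETIC_API_KEY,
});

const start = Date.now();
const res = await client.chat.completions.create({
  model: 'hf:zai-org/GLM-4.5',
  messages: [{ role: 'user', content: 'Write a long story.' }],
  max_tokens: 1024,
});
const seconds = (Date.now() - start) / 1000;
const tokens = res.usage?.completion_tokens ?? 0;
console.log(`~${(tokens / seconds).toFixed(1)} output tokens/sec`);
```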
Your privacy policy isn't good for a privacy-focused provider, though. You shouldn't have the rights to use my personal information. The use of Google Tag Manager also doesn't inspire confidence, especially on LLM pages, where you might "accidentally" install a user-monitoring script and have the prompts get logged. I'd suggest looking at how Kagi markets to privacy-conscious customers.