Building More with GPT-5.1-Codex-Max
openai.com · Tech story · High profile
Posted about 2 months ago · Active about 2 months ago
Key topics
AI
Code Generation
OpenAI
OpenAI announces GPT-5.1-Codex-Max, a new AI model for code generation, sparking discussion about its capabilities, limitations, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 7m after posting
Peak period: 130 comments in 0-6h
Average per period: 26.7 comments
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
1. Story posted: Nov 19, 2025 at 1:01 PM EST (about 2 months ago)
2. First comment: Nov 19, 2025 at 1:09 PM EST (7m after posting)
3. Peak activity: 130 comments in 0-6h, the hottest window of the conversation
4. Latest activity: Nov 23, 2025 at 9:57 AM EST (about 2 months ago)
ID: 45982649 · Type: story · Last synced: 11/22/2025, 10:35:00 AM
Want the full context? Read the primary article or dive into the live Hacker News thread.
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
For Google, bleeding off users is as easy as placing ads against the search term "ChatGPT." They own every pane of glass, and the URL bar is now a search product that Google owns.
I do not envy folks with OpenAI golden handcuffs.
This might ultimately only be a game that Google can win.
OpenAI better hope its users install its software, native apps, and browsers. Otherwise Google stands in the way and can intrude at any point.
[0]: https://news.ycombinator.com/item?id=45970668
It's been essential to my workflow as well.
I use both jj and git; jj is great for creating a snapshot that I can revert to in case the agent fails.
I'm still exploring what else I can do with it for agentic use.
http://github.com/agentify-sh/10x
It does minimal-overhead agent orchestration (it's just bash/TypeScript). Its main focus is adding enhancements to Codex: double-redundant checkpoints via git and jj (lessons learned from Codex being git reset --hard happy), something like Claude skills (just a bunch of MDs that steer it toward a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if Codex waits a long time), and blacklisted commands during YOLO mode (rm -rf and git reset are banned even if there's a small chance it would run them). MIT licensed.
You can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what is going on; parallel might be best for dealing with tests and UI.
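For illustration, a minimal sketch of the "blacklist commands during YOLO" idea described above; this is not the 10x implementation (which is bash/TypeScript), and the banned patterns here are just examples:

```python
import re
import subprocess

# Commands the wrapper refuses to run, no matter what the agent proposes.
BANNED_PATTERNS = [r"\brm\s+-rf\b", r"\bgit\s+reset\s+--hard\b"]

def run_agent_command(cmd: str) -> subprocess.CompletedProcess:
    """Execute an agent-proposed shell command unless it matches a banned pattern."""
    if any(re.search(p, cmd) for p in BANNED_PATTERNS):
        raise PermissionError(f"Blocked destructive command: {cmd!r}")
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

# run_agent_command("git status")             # allowed
# run_agent_command("git reset --hard HEAD")  # raises PermissionError
```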
Going to wait and see before I upgrade back to 0.58, after being burned by 5.1.
Gemini 3 has been a letdown, tbh; it's disappointing that agentic coding wasn't a top priority. I'm sticking with Codex for now and using Gemini 3 for frontend.
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
Codex feels like a tool designed to run after all the humans are gone.
This is definitely one of the biggest issues with coding agents at the moment.
That said, from my experience, Codex so often does things that are so useful and save me so much time that the occasional "oh god what the hell did it just go off and do" are an acceptable cost for me.
I regularly get great results with open-ended prompts and agents that spend 15+ minutes working on the task. I'm sure they'll eventually get better at common sense understanding of what kind of work is wasteful/absurd.
The "# of model-generated tokens per response" chart in [the blog introducing gpt-5-codex](https://openai.com/index/introducing-upgrades-to-codex/) shows an example of how we're improving the model good at both.
As a startup founder and engineer, I'm not constrained by the number of 10000+ line diff, 0->1 demos I can ship. I'm constrained by quality of the 100 -> 101, tight 150 line feature additions / code cleanups I can write.
It feels like the demos, funding, and hype all want to sell me entire PR rewrites, but what I need is the best possible iterative work model that will keep me in the loop.
I still use codex - but I use codex incredibly iteratively (give it very narrowly scoped tasks, and I watch it like a hawk, giving tons of feedback). I don't use it because of its ability to code for 24 hours. I use it because when I give it those narrowly scoped tasks, it is better at writing good code than any other model. (Because of its latency, I have 2-4 of these conversations going on at the same time).
But there is a lot of friction the codex product + model adds to this process. I have to prompt aggressively to override whatever "be extremely precise" prompting the model gets natively so that it doesn't send me 20+ bullet points of extraordinarily dense prose on every message. I have to carefully manage its handling of testing; it will widen any DI + keep massive amounts of legacy code to make sure functionality changes don't break old tests (rather than updating them) and to make sure any difficult tests can have their primary challenges mocked away.
In general, codex doesn't feel like an amazing tool that I have sitting at my right hand. It feels like a teenage genius who has been designed to do tasks autonomously, and who I constantly have to monitor and rein in.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.
Ask a frontier LLM what a context window is, it will tell you.
With that out of the way, parent was wondering why compaction is necessary arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3 and you people are sitting in the back going "well, actually, not all groups are abelian".
For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).
[1] It still uses a quadratic router, but it's small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929
In practice, when training a model, people select a context window so that during inference, you know how much GPU memory to allocate for a prompt and reject the prompt if it exceeds the memory limit.
Of course there's also degrading performance as context gets longer, but I suspect memory limit is the primary factor of why we have context window limits.
Exactly. Standard multi-head attention materializes a score matrix that grows to roughly 4 billion entries for a 64K sequence as a starting place. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1TB/s of memory bandwidth to stay compute-bound in practice, even with this optimization.
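Checking that figure (these are attention score entries per head, not learned weights):

```python
seq = 64 * 1024                       # 64K-token sequence
entries = seq * seq                   # full attention matrix for one head
print(entries)                        # 4,294,967,296 -> the ~4B figure above
print(entries * 2 / 1e9)              # ~8.6 GB if materialized in fp16
print((128 * 1024) ** 2 * 2 / 1e9)    # ~34.4 GB at 128K context
```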
So there has been a lot of research in this area and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity and if you go fully linear, you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good alternative compromise.
MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation.
DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-linear by only computing "interesting" pairs. For example, in 128K context lengths this requires only 10-20% of attention pairs to be materialized.
Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention. The speedup is from a delta rule, which basically amounts to caching in a hand-wavy way.
There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well.
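As a toy illustration of the sparse direction (not DeepSeek's actual DSA kernel, which uses a cheap indexer so unselected pairs are never materialized), here is what keeping only the top-k keys per query before the softmax looks like:

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=32):
    """Toy sparse attention: keep only the k highest-scoring keys per query."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n); a real kernel avoids building this
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    scores = np.where(scores >= kth, scores, -np.inf)  # drop everything below the k-th largest
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 256, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = topk_sparse_attention(Q, K, V, k=32)             # shape (256, 64)
```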
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
Is this saying that said summarization now happens at the model level? Or are there other differences?
But it's the same concept: taking tokens in context and removing irrelevant ones by summarizing, etc.
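A minimal sketch of what client-side compaction can look like (an assumption about the general technique, not OpenAI's native implementation; `count_tokens` and `summarize` are hypothetical helpers):

```python
def compact(history, budget, count_tokens, summarize, keep_recent=10):
    """Fold older turns into a summary once the conversation exceeds a token budget."""
    if sum(count_tokens(m) for m in history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(old)  # e.g. one LLM call: "summarize the work done so far"
    return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + recent
```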
What does it even mean?
But they're claiming it's more token efficient, so me switching my usage to the new model should _free up_ capacity.
How much more token-efficient is this compared to 5.0?
I had to use 5.0 because 5.1 was eating tokens like crazy and seemed like a barely noticeable incremental improvement.
I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:
I have multiple subagents that I use for each phase and that (based on subjective judgement) improve the output quality (vs. keeping everything, every tool use, etc. in the "main" context window). Codex CLI is great and I use it often, but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available; hopefully we'll get more features for managing context.
I found Gemini to be horribly slow for anything.
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems to be really weird. I’m not sure how exactly it works, but - 1. It uses very little tokens / fills up the context slowly (good I guess) 2. Doesn’t seem to actually internalize the contents of files you mention to it, or it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to codebase patterns and its architecture, while codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
With Claude I'm constantly hitting rate limits; with Codex I get substantially more, and "slow" isn't really a problem for me as long as it keeps working.
The only complaint I have is that Codex itself is usage-limited now (either due to outstanding git issues around tools or throttling on their end) compared to a few months ago.
The true magical moment was Codex Pro letting me run swarms of agents day in, day out without any worries about rate limits; it truly felt unlimited.
If Claude manages to release a smaller model or some way to deal with the rapidly depleting usage limits (this is the top complaint on Reddit, and they eventually just stopped allowing threads about it), it would definitely be used more.
But for now Codex is clearly the workhorse, with Claude used side by side.
But the subscription thing is a non-issue for me as I use the API, and mostly use Claude Code synchronously, with the occasional rare background agent.
have you tried Haiku?
Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
I just won’t even waste my time with the google stuff cuz I can’t figure out how to pay with it.
And that’s a problem everywhere at google. Our google play account is suspended cuz I can’t verify the company. It won’t let me cuz it says I’m not the owner. I’ve always been the owner of my company. For 18 years. There is no one else.
Once some error said make sure the owner email matches your profile in google payments and I was like, what is google payments and where do I even begin with that? I’ve never paid for google play so what does payments have to do with anything?
It’s totally random stuff. Get your shit together, google. Make your products and payment systems coherent, rather than it obviously looking like it was designed by a fiefdom full of territorial managers.
Utterly ridiculous.
Also, re "Google Payments", I tried to transfer an app from my personal/solo Google Play account to a new business one I set up for my LLC and it was like pulling teeth. They wanted me to find some payment id from the original $20 purchase I made to get access to Google Play, something I did right around when they first launched and while I still have/use the same email, Google came out with approximately 1 googol different "payment solutions" in the interim and their engineers don't care about data migrations. Finally, after many support emails, they just transferred it without me giving that code which just shows how silly the whole thing was from the start.
What's harder than herding cats? Herding cats with MBAs and OKRs.
YES I had this and eventually fixed it. I really don't know what I did but lots of clicking on random links and signing into things in different orders and then one day it somehow worked.
So frustrating.
The sad part is that Google does offer a ChatML/OpenAI-compliant endpoint for LLM calls, and I believe they also ran an experiment to reduce the friction of getting an API key to start making calls right away, but discoverability remains a challenge with Google services.
This part is very easy now: you sign into https://aistudio.google.com/ and then click "Get API key" in the lower left corner.
The problem is that features and docs are still scattered all over. Some thing can only be done via Vertex, for example.
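For reference, the OpenAI-compatible endpoint mentioned above can be called with an AI Studio key roughly like this (a sketch; the base URL and model name should be checked against Google's current docs):

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AI_STUDIO_KEY",  # from the "Get API key" button in AI Studio
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
resp = client.chat.completions.create(
    model="gemini-2.5-flash",      # model name is an assumption; pick one from the docs
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```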
I'd love to see the Gemini models being available by other providers :) or if they just build a simple prepaid wallet like OpenAI and Anthropic.
Now you CAN NOT get the Google One stuff if your account is part of a workspace. I thought: how awful. I want to pay, but I simply can't?
Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license via the Google Workspace Admin area, great!
Turns out: you fucking can't. That's _Google AI Ultra FOR BUSINESS_ and that IS NOT supported.
So I had to get the Google One subscription on my personal account after all.
Combine that with the _pathetic_ usage limits: somehow not token-based, but amount of requests per 24 hour window (which is 500 for Gemini 3) and Gemini 3's incredible chattiness (it uses A LOT more requests to get something done compared to Claude) and you hit the usage limits in just 2 hours.
Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra personal subscription? I already pay for OpenAI and Anthropic’s pro/max plans and would happily pay Google too. But the only obvious option is a $250/month tier, and its documentation indicates Google can train on your code unless you find and enable the correct opt-out. If that opt-out exists in all the products, it’s not obvious where it lives or what products it applies to.
Workspace complicates it further. Google advertises that with business workspace accounts your data isn’t used for training. So, I was going to try Antigravity on our codebase. At this point I know I can't trust Google, so I read the ToS carefully. They train on your prompts and source code, and there doesn't appear to be a way to pay them and opt out right now. Be careful, paying for Google Workspace does not protect you, always read the ToS.
Be careful with AI-studio and your Google Workspace accounts. They train on your prompts unless you switch it to API mode.
The result is a lot of uncertainty. I genuinely have no idea how to pay Google for Gemini without risking my code being used for training. And if I do pay, I can’t tell whether they’ll train on my prompts anyway.
The marketing for their coding products does not clearly state when they do or do not train on your prompts and code.
I had to run deep research to understand the risks with using Gemini 3 for agentic work, and I still don't feel confident that I understand the risks. I might have said some incorrect things above, but I am just so confused. I feel like I have a <75% grasp on the situation.
I don't have a lot of trust. And honestly, this feels confusing and deceptive. One could easily mistake it for a deliberate strategy to gather training data through ambiguity and dark patterns; it certainly looks like this could be Google's strategy to win the AI race. I assume this is just how it looks, and that they aren't being evil on purpose.
OpenAI in particular has my trust. They get it. They are carefully building the customer experience, they are product and customer driven from the top.
I wouldn't trust Sam Altman. Or any of the big players really.
Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
Peering into my crystal ball: once all "workers" have been replaced, all humans will spend all of their working hours on nothing but office politics.
Please give me an option for a password (or passkey) or literally anything else that doesn't require either linking with google or going through an email flow for every login
https://github.com/google-gemini/gemini-cli/issues/12121
It is far too easy to accidentally end up under the wrong privacy agreement, to the point of where some workplaces are banning use of the Gemini CLI!
Though that does bring up an interesting point. Anecdotally, Sonnet does a lot more grep-ing while Codex reads files straight up. Might be the difference in speed and maybe smarter models will do better. Once this model is on copilot, I can test it out.
There's an option to "get a quick answer," and I hoped clicking that would revert to the previous performance; instead, it ignores that I uploaded two files and asks me to upload the files.
Literally the only real good task I've found for these dumb things and they still found a way to fuck it up because they need to keep the weirdos and whales addicted. It's now almost easier to go back to comparing these files by eye, or just bite the bullet and finally write a few lines of python to actually do it right and reliably.
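The "few lines of Python" route the comment alludes to is basically a stock difflib call (file names here are placeholders):

```python
import difflib
from pathlib import Path

a = Path("file_a.txt").read_text().splitlines(keepends=True)
b = Path("file_b.txt").read_text().splitlines(keepends=True)

# Print a unified diff of the two files: deterministic, no model involved.
for line in difflib.unified_diff(a, b, fromfile="file_a.txt", tofile="file_b.txt"):
    print(line, end="")
```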
It seems like they might still be heavily nerfing / quantizing the models in production a couple weeks before a new release, like they have always (unofficially) done.
One huge difference I notice between Codex and Claude code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that i've seen it work for 30 minutes to convolute some solution that was only convoluted because of some sentence I threw in the instructions I had completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles and want a fast iteration loop. Codex is much worse at that because it takes like 5 minutes to validate that everything is correct. Codex is much better for longer, harder tasks that have to be correct: I can just write some script to verify that what it did works, and let it spin for 30-40 minutes.
A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell when Claude is not paying attention to the instructions on CLAUDE.md when Claude stops calling him “Mr Tinkleberry” consistently
This guy has a good write up on the topic
I'd be wary of using any canary material that wouldn't be at home in the sort of work you're doing.
The chat is a simulation, and if you act silly, the model will simulate an appropriate response.
[0]: https://en.wikipedia.org/wiki/A_Void
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
https://github.com/openai/codex/releases/tag/rust-v0.59.0
152 more comments available on Hacker News