GPT-5.2-Codex
Key topics
The AI coding wars are heating up with the release of GPT-5.2-Codex, sparking debate over its performance compared to rivals Gemini and Claude. While some commenters, like koakuma-chan, swear by GPT-5.2's superiority, others, such as nunodonato, call for more nuanced comparisons, pointing out that "better" is meaningless without context. Mkengin drops a bombshell by citing the SWE-Rebench benchmark, which shows OpenAI and Anthropic neck-and-neck in performance, with GPT-5.2 boasting a significant cost advantage. As the discussion unfolds, it becomes clear that the real test lies not just in raw performance, but in real-world applications and usability.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 8m after posting
- Peak period: 123 comments in 0-12h
- Avg / period: 22.9 comments
- Based on 160 loaded comments
Key moments
- Story posted: Dec 18, 2025 at 1:14 PM EST (15 days ago)
- First comment: Dec 18, 2025 at 1:22 PM EST (8m after posting)
- Peak activity: 123 comments in the 0-12h window, the hottest stretch of the conversation
- Latest activity: Dec 24, 2025 at 3:52 PM EST (9 days ago)
"The most advanced agentic coding model for professional software engineers"
If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI since they announced "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.
Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.
Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
Just yesterday, in Antigravity, it deleted 500 lines of code and replaced it with a `<rest of code goes here>` comment.
https://cursor.com/pricing
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
> Such as?
changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...
glhf
> “prompting”/harness that improves how it actually performs
Is an abstract statement without any meaningful details.
https://swe-rebench.com/
But Opus 4.5 exists.
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
Just safety nerds being gatekeepers.
They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.
I find it to be incorrectly pattern matching with a very narrow focus and will ignore real documented differences even when explicitly highlighted in the prompt text (this is X crdt algo not Y crdt algo.)
I've canceled my subscription. On any larger edits it will just start wrecking nuance and then refuse to accept prompts that point this out, which is an extremely dangerous form of target fixation.
The team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.
I really like the GPT 5.2 family of models and I'm interested to see how the codex version behaves. It's presumably based on a completely different pre-train than the 5.1 models so this is exciting.
(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work)
Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.
But if you want that last 10%, codex is vital.
It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.
They just have different strengths and weaknesses.
which is analogous to taking your problem to another model and ideally feeding it some sorta lesson
i guess this is a specific example, but one i play out a lot. starting fresh with the same problem is unusual for me; there's usually a lesson i'm feeding it from the start
- Planning mode
- Better terminal rendering (Codex seems to go for a "clean" look at the cost of clearly distinguished output)
- It prompts you for questions
- Sub-agents don't pollute your context
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly
Aggressively recreating your context is still the best way to get the best results from these tools too, so it has a secondary benefit.
My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.
None of that knowledge will become useless, only working around current limitations of agents will.
The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.
I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.
If I want to start a new task, I /clear and then tell it to re-read the CLAUDE.md document where I put all of the quick context: Description of the project, key goals, where to find key code, reminders for tools to use, and so on. I aggressively update this file as I notice things that it’s always forgetting or looking up. I know some people have the LLM update their context file but I just do it myself with seemingly better results.
Using /compact burns through a lot of your usage quota and retains a lot of things you may not need. Giving it directions like “starting a new task doing ____, only keep necessary context for that” can help, but hitting /clear and having it re-read a short context primer is faster and uses less quota.
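To make that workflow concrete, here is a minimal sketch of what such a context primer might look like; the project details below are invented for illustration, not taken from the thread.

```markdown
# CLAUDE.md (re-read this after every /clear)

## What this project is
Hypothetical example: a billing API in Python 3.12 (FastAPI + Postgres).

## Key goals
- Keep API responses backward compatible
- Prefer small, reviewable diffs

## Where to find key code
- Routes: app/api/
- DB models: app/models/
- Tests: tests/ (run with `pytest -q`)

## Reminders
- Run the linter before declaring a task done
- Never hand-edit files under app/generated/
```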
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
https://artificialanalysis.ai/
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
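As a hedged illustration of that price-per-token vs. tokens-used trade-off, the toy numbers below are made up; only the shape of the arithmetic reflects the commenter's point.

```python
# Toy numbers only: a model that costs more per token but finishes a task in
# fewer tokens can come out at a similar total cost for the benchmark run.
models = {
    # name: (USD per 1M output tokens, output tokens needed to finish the task)
    "cheap_but_chatty":   (10.00, 1_200_000),
    "pricey_but_concise": (17.50,   700_000),
}

for name, (usd_per_million, tokens) in models.items():
    total = usd_per_million * tokens / 1_000_000
    print(f"{name}: ${total:.2f} for the task")
# cheap_but_chatty:   $12.00
# pricey_but_concise: $12.25
```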
In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to be able to reduce their costs but are pursuing user growth and engagement almost fully -- the more queries they get, the more data they get, the bigger a data moat they can build.
All the money they keep raising goes to R&D for the next model. But I don't see how they ever get off that treadmill.
It almost certainly is not. Until we know what the useful life of NVIDIA GPUs is, it's impossible to determine whether this is profitable or not.
The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high. You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
And datacenter GPUs have been running inference workloads for years now, so companies have a good idea of rates of failure and obsolescence. They're not throwing away two-year-old chips.
How do you know this?
> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
You can't even speculate about this spread without at least a rough idea of cost-per-token. Currently, any cost-per-token figure is pure paper math.
> And datacenter GPUs have been running inference workloads for years now,
And inference resource intensity is a moving target: what happens if a new model comes out that requires 2x the resources?
> They're not throwing away two-year-old chips.
Maybe, but they'll be replaced by either (a) a higher performance GPU that can deliver the same results with less energy, less physical density, and less cooling or (b) the extended support costs becomes financially untenable.
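To see why the useful-life assumption dominates this argument, here is a purely hypothetical back-of-the-envelope sketch; every figure is a placeholder and only the structure of the calculation matters.

```python
# All numbers are hypothetical placeholders, not real provider economics.
gpu_capex_usd = 30_000             # purchase price of one accelerator
useful_life_years = 3              # the contested assumption
power_and_hosting_per_hour = 2.00  # USD
tokens_served_per_hour = 5_000_000

hours_of_life = useful_life_years * 365 * 24
capex_per_hour = gpu_capex_usd / hours_of_life
cost_per_million_tokens = (capex_per_hour + power_and_hosting_per_hour) \
    / tokens_served_per_hour * 1_000_000

print(f"~${cost_per_million_tokens:.2f} per 1M tokens at a {useful_life_years}-year life")
# Halving the assumed useful life roughly doubles the capex share, which is
# why cost-per-token claims hinge on the depreciation schedule.
```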
Or is it already saturated?
I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
I had it whip this up to try and avoid this, while still running it in yolo mode (which is still not recommended).
https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...
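The gist itself is truncated above and isn't reproduced here; as a rough illustration of the general idea (not the linked script), one common guardrail is to commit a checkpoint before each unattended agent run so destructive edits stay recoverable:

```python
# Hypothetical guardrail, not the linked gist: snapshot the working tree with
# git before letting an agent run in yolo mode. Assumes you're inside a repo.
import subprocess

def checkpoint(label: str = "pre-agent checkpoint") -> None:
    # Stage everything, including untracked files, then record a snapshot.
    subprocess.run(["git", "add", "-A"], check=True)
    # --allow-empty keeps the checkpoint even when the tree is already clean.
    subprocess.run(["git", "commit", "--allow-empty", "-m", label], check=True)

if __name__ == "__main__":
    checkpoint()
    # Recover with `git reset --hard` (plus `git clean -fd` for new files)
    # if the agent wrecks something.
```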
I asked codex to take a look. It took a couple minutes, but it used some debugging tricks I didn't know about and tracked the issue down. I was blown away. It reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!
It's tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.
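For anyone who, like the commenter, hasn't seen the trick before: the standard tool for mapping a faulting instruction pointer back to source is binutils' addr2line, run against an ELF built with debug symbols. A minimal sketch, with a hypothetical kernel image path and fault address:

```python
# Map an instruction-pointer address to function and file:line via addr2line.
# The ELF path and address are hypothetical placeholders.
import subprocess

def resolve(elf_path: str, address: str) -> str:
    # -e: binary with debug info, -f: also print the function name, -C: demangle
    result = subprocess.run(
        ["addr2line", "-e", elf_path, "-f", "-C", address],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(resolve("build/kernel.elf", "0xffffffff80104a2b"))
# Prints the function plus source file:line that the faulting RIP points at.
```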
The OpenAI token limits seem more generous than the Anthropic ones too.
I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.
In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.
OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.
I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
So much so that I rely completely on Codex for code reviews and actual coding. Claude is there to do lower risk tasks.
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
I did find some slip-ups in 5.2: in a refactor of a client header I removed two header properties, but 5.2 forgot to remove those from the toArray method of the class. Was using 5.2 on medium (default).
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it) :
https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...
https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U
Google fails the tests currently, but can probably easily catch up :
https://imgur.com/a/previz-to-image-nano-banana-pro-Q2B8psd
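For readers unfamiliar with the workflow, here is a hedged sketch of what a previz-to-render call might look like through OpenAI's image edit endpoint; the file name, prompt, and size are hypothetical, and the exact API details may differ from what the commenter actually used.

```python
# Hypothetical previz-to-render call: hand the model a low-fidelity blocking
# frame and ask it to keep pose/composition while raising the fidelity.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("previz_blocking.png", "rb") as previz:
    result = client.images.edit(
        model="gpt-image-1",
        image=previz,
        prompt=(
            "Render this previz frame as a finished shot. Keep the camera, "
            "pose, and blocking exactly as laid out; replace the gray "
            "mannequins and cardboard plates with detailed characters and sets."
        ),
        size="1536x1024",
    )
# The response carries base64 image data; decoding and saving are omitted here.
```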
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.
Unsure where that could be in Windows.
You know what would be fun to try? Give Code full access and then ask it to delete that folder, lol.
Then again, I wouldn't put much trust into OpenAI's handling of information either way.
My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm. Us macOS users shouldn't have to wait an extra day :)
Claude still tends to add "fluff" around the solution and over-engineer; not that the code doesn't work, it's just ugly.
Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (i.e. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (i.e. our gpt666-pro-ultra-krypto-sec found a CVE in openBSD stable release), while not being exposed to tabloid style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
153 more comments available on Hacker News