GPT-5.2-Codex
Key topics
The AI coding wars are heating up with the release of GPT-5.2-Codex, sparking debate over its performance compared to rivals Gemini and Claude. While some commenters, like koakuma-chan, swear by GPT-5.2's superiority, others, such as nunodonato, call for more nuanced comparisons, pointing out that "better" is meaningless without context. Mkengin drops a bombshell by citing the SWE-Rebench benchmark, which shows OpenAI and Anthropic neck-and-neck in performance, with GPT-5.2 boasting a significant cost advantage. As the discussion unfolds, it becomes clear that the real test lies not just in raw performance, but in real-world applications and usability.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 8m after posting
- Peak period: 123 comments in 0-12h
- Avg / period: 22.9 comments
- Based on 160 loaded comments
Key moments
- Story posted: Dec 18, 2025 at 1:14 PM EST (15 days ago)
- First comment: Dec 18, 2025 at 1:22 PM EST (8m after posting)
- Peak activity: 123 comments in the 0-12h window, the hottest stretch of the conversation
- Latest activity: Dec 24, 2025 at 3:52 PM EST (9 days ago)
"The most advanced agentic coding model for professional software engineers"
If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI since they announced "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.
Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.
Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
Just yesterday, in Antigravity, it deleted 500 lines of code and replaced it with a `<rest of code goes here>` comment.
https://cursor.com/pricing
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
> Such as?
changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...
glhf
> “prompting”/harness that improves how it actually performs
Is an abstract statement without any meaningful details.
https://swe-rebench.com/
But Opus 4.5 exists.
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
Just safety nerds being gatekeepers.
They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.
I find it to be incorrectly pattern matching with a very narrow focus and will ignore real documented differences even when explicitly highlighted in the prompt text (this is X crdt algo not Y crdt algo.)
I've canceled my subscription. On any larger edits it will just start wrecking nuance and then refuse to accept prompts that point this out, which is an extremely dangerous form of target fixation.
The team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.
I really like the GPT 5.2 family of models and I'm interested to see how the codex version behaves. It's presumably based on a completely different pre-train than the 5.1 models so this is exciting.
(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work)
Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.
But if you want that last 10%, codex is vital.
It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.
They just have different strengths and weaknesses.
which is analogous to taking your problem to another model and ideally feeding it some sorta lesson
i guess this is a specific example, but one i play out a lot. starting fresh with the same problem is unusual for me; there's usually a lesson i'm feeding it from the start
- Planning mode
- Better terminal rendering (Codex seems to go for a "clean" look at the cost of clearly distinguished output)
- It prompts you for questions
- Sub-agents don't pollute your context
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly
Aggressively recreating your context is still the best way to get the best results from these tools too, so it has a secondary benefit.
My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.
None of that knowledge will become useless, only working around current limitations of agents will.
The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.
I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.
If I want to start a new task, I /clear and then tell it to re-read the CLAUDE.md document where I put all of the quick context: Description of the project, key goals, where to find key code, reminders for tools to use, and so on. I aggressively update this file as I notice things that it’s always forgetting or looking up. I know some people have the LLM update their context file but I just do it myself with seemingly better results.
Using /compact burns through a lot of your usage quota and retains a lot of things you may not need. Giving it directions like “starting a new task doing ____, only keep necessary context for that” can help, but hitting /clear and having it re-read a short context primer is faster and uses less quota.
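To make that workflow concrete, here is a minimal sketch of what such a context primer might look like; the project details below are invented for illustration, not taken from the thread.

```markdown
# CLAUDE.md (re-read this after every /clear)

## What this project is
Hypothetical example: a billing API in Python 3.12 (FastAPI + Postgres).

## Key goals
- Keep API responses backward compatible
- Prefer small, reviewable diffs

## Where to find key code
- Routes: app/api/
- DB models: app/models/
- Tests: tests/ (run with `pytest -q`)

## Reminders
- Run the linter before declaring a task done
- Never hand-edit files under app/generated/
```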
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
https://artificialanalysis.ai/
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
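As a hedged illustration of that price-per-token vs. tokens-used trade-off, the toy numbers below are made up; only the shape of the arithmetic reflects the commenter's point.

```python
# Toy numbers only: a model that costs more per token but finishes a task in
# fewer tokens can come out at a similar total cost for the benchmark run.
models = {
    # name: (USD per 1M output tokens, output tokens needed to finish the task)
    "cheap_but_chatty":   (10.00, 1_200_000),
    "pricey_but_concise": (17.50,   700_000),
}

for name, (usd_per_million, tokens) in models.items():
    total = usd_per_million * tokens / 1_000_000
    print(f"{name}: ${total:.2f} for the task")
# cheap_but_chatty:   $12.00
# pricey_but_concise: $12.25
```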
In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to be able to reduce their costs but are pursuing user growth and engagement almost fully -- the more queries they get, the more data they get, the bigger a data moat they can build.
All the money they keep raising goes to R&D for the next model. But I don't see how they ever get off that treadmill.
It almost certainly is not. Until we know what the useful life of NVIDIA GPUs is, it's impossible to determine whether this is profitable or not.
The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high. You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
And datacenter GPUs have been running inference workloads for years now, so companies have a good idea of rates of failure and obsolescence. They're not throwing away two-year-old chips.
How do you know this?
> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
You can't even speculate about this spread without at least a rough idea of cost-per-token. Currently, any cost-per-token figure is pure paper math.
> And datacenter GPUs have been running inference workloads for years now,
And inference resource intensity is a moving target: what happens if a new model comes out that requires 2x the resources?
> They're not throwing away two-year-old chips.
Maybe, but they'll be replaced by either (a) a higher performance GPU that can deliver the same results with less energy, less physical density, and less cooling or (b) the extended support costs becomes financially untenable.
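To see why the useful-life assumption dominates this argument, here is a purely hypothetical back-of-the-envelope sketch; every figure is a placeholder and only the structure of the calculation matters.

```python
# All numbers are hypothetical placeholders, not real provider economics.
gpu_capex_usd = 30_000             # purchase price of one accelerator
useful_life_years = 3              # the contested assumption
power_and_hosting_per_hour = 2.00  # USD
tokens_served_per_hour = 5_000_000

hours_of_life = useful_life_years * 365 * 24
capex_per_hour = gpu_capex_usd / hours_of_life
cost_per_million_tokens = (capex_per_hour + power_and_hosting_per_hour) \
    / tokens_served_per_hour * 1_000_000

print(f"~${cost_per_million_tokens:.2f} per 1M tokens at a {useful_life_years}-year life")
# Halving the assumed useful life roughly doubles the capex share, which is
# why cost-per-token claims hinge on the depreciation schedule.
```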
Or is it already saturated?
I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
I had it whip this up to try and avoid this, while still running it in yolo mode (which is still not recommended).
https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...
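The gist itself is truncated above and isn't reproduced here; as a rough illustration of the general idea (not the linked script), one common guardrail is to commit a checkpoint before each unattended agent run so destructive edits stay recoverable:

```python
# Hypothetical guardrail, not the linked gist: snapshot the working tree with
# git before letting an agent run in yolo mode. Assumes you're inside a repo.
import subprocess

def checkpoint(label: str = "pre-agent checkpoint") -> None:
    # Stage everything, including untracked files, then record a snapshot.
    subprocess.run(["git", "add", "-A"], check=True)
    # --allow-empty keeps the checkpoint even when the tree is already clean.
    subprocess.run(["git", "commit", "--allow-empty", "-m", label], check=True)

if __name__ == "__main__":
    checkpoint()
    # Recover with `git reset --hard` (plus `git clean -fd` for new files)
    # if the agent wrecks something.
```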
I asked codex to take a look. It took a couple minutes, but it used some debugging tricks I didn't know about and tracked the issue down. I was blown away. It reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!
It's tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.
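For anyone who, like the commenter, hasn't seen the trick before: the standard tool for mapping a faulting instruction pointer back to source is binutils' addr2line, run against an ELF built with debug symbols. A minimal sketch, with a hypothetical kernel image path and fault address:

```python
# Map an instruction-pointer address to function and file:line via addr2line.
# The ELF path and address are hypothetical placeholders.
import subprocess

def resolve(elf_path: str, address: str) -> str:
    # -e: binary with debug info, -f: also print the function name, -C: demangle
    result = subprocess.run(
        ["addr2line", "-e", elf_path, "-f", "-C", address],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(resolve("build/kernel.elf", "0xffffffff80104a2b"))
# Prints the function plus source file:line that the faulting RIP points at.
```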
The OpenAI token limits seem more generous than the Anthropic ones too.
I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.
In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.
OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.
I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
So much so that I rely completely on Codex for code reviews and actual coding. Claude is there to do lower risk tasks.
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
I did find some slip-ups in 5.2: in a refactor of a client header I removed two header properties, but 5.2 forgot to remove those from the toArray method of the class. Was using 5.2 on medium (default).
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it) :
https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...
https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U
Google fails the tests currently, but can probably easily catch up :
https://imgur.com/a/previz-to-image-nano-banana-pro-Q2B8psd
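For readers unfamiliar with the workflow, here is a hedged sketch of what a previz-to-render call might look like through OpenAI's image edit endpoint; the file name, prompt, and size are hypothetical, and the exact API details may differ from what the commenter actually used.

```python
# Hypothetical previz-to-render call: hand the model a low-fidelity blocking
# frame and ask it to keep pose/composition while raising the fidelity.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("previz_blocking.png", "rb") as previz:
    result = client.images.edit(
        model="gpt-image-1",
        image=previz,
        prompt=(
            "Render this previz frame as a finished shot. Keep the camera, "
            "pose, and blocking exactly as laid out; replace the gray "
            "mannequins and cardboard plates with detailed characters and sets."
        ),
        size="1536x1024",
    )
# The response carries base64 image data; decoding and saving are omitted here.
```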
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.
Unsure where that could be in Windows.
You know what would be fun to try? Give Code full access and then ask it to delete that folder, lol.
Then again, I wouldn't put much trust into OpenAI's handling of information either way.
My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm. Us macOS users shouldn't have to wait an extra day :)
Claude still tends to add "fluff" around the solution and over-engineer; not that the code doesn't work, it's just ugly.
Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (i.e. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (i.e. our gpt666-pro-ultra-krypto-sec found a CVE in openBSD stable release), while not being exposed to tabloid style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
153 more comments available on Hacker News