Building More with GPT-5.1-Codex-Max
openai.com · Tech story · High profile
Posted about 2 months ago · Active about 2 months ago
Key topics
AI
Code Generation
OpenAI
OpenAI announces GPT-5.1-Codex-Max, a new AI model for code generation, sparking discussion about its capabilities, limitations, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 7m after posting
Peak period: 130 comments in 0-6h
Average per period: 26.7 comments
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
1. Story posted: Nov 19, 2025 at 1:01 PM EST (about 2 months ago)
2. First comment: Nov 19, 2025 at 1:09 PM EST (7m after posting)
3. Peak activity: 130 comments in 0-6h, the hottest window of the conversation
4. Latest activity: Nov 23, 2025 at 9:57 AM EST (about 2 months ago)
ID: 45982649 · Type: story · Last synced: 11/22/2025, 10:35:00 AM
Want the full context? Read the primary article or dive into the live Hacker News thread.
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
For Google, bleeding off users is as easy as placing ads against the search term "ChatGPT." They own every pane of glass, and the URL bar is now a search product that Google owns.
I do not envy folks with OpenAI golden handcuffs.
This might ultimately only be a game that Google can win.
OpenAI better hope its users install its software, native apps, and browsers. Otherwise Google stands in the way and can intrude at any point.
[0]: https://news.ycombinator.com/item?id=45970668
It's been essential to my workflow as well.
I use both jj and git; jj is great for creating a snapshot that I can revert to in case the agent fails.
I'm still exploring what else I can do with it for agentic use.
http://github.com/agentify-sh/10x
It does minimal-overhead agent orchestration (it's just bash/TypeScript). Its main focus is adding enhancements to Codex: double-redundant checkpoints via git and jj (lessons learned from Codex being git reset --hard happy), something like Claude skills (just a bunch of MDs that steer it toward a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if Codex waits a long time), and blacklisted commands during YOLO mode (rm -rf and git reset are banned even if there's a small chance it would run them). MIT licensed.
You can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what is going on; parallel might be best for dealing with tests and UI.
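For illustration, a minimal sketch of the "blacklist commands during YOLO" idea described above; this is not the 10x implementation (which is bash/TypeScript), and the banned patterns here are just examples:

```python
import re
import subprocess

# Commands the wrapper refuses to run, no matter what the agent proposes.
BANNED_PATTERNS = [r"\brm\s+-rf\b", r"\bgit\s+reset\s+--hard\b"]

def run_agent_command(cmd: str) -> subprocess.CompletedProcess:
    """Execute an agent-proposed shell command unless it matches a banned pattern."""
    if any(re.search(p, cmd) for p in BANNED_PATTERNS):
        raise PermissionError(f"Blocked destructive command: {cmd!r}")
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

# run_agent_command("git status")             # allowed
# run_agent_command("git reset --hard HEAD")  # raises PermissionError
```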
Going to wait and see before I upgrade back to 0.58, after being burned by 5.1.
Gemini 3 has been a letdown, tbh; it's disappointing that agentic coding wasn't a top priority. I'm sticking with Codex for now and using Gemini 3 for frontend.
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
Codex feels like a tool designed to run after all the humans are gone.
This is definitely one of the biggest issues with coding agents at the moment.
That said, from my experience, Codex so often does things that are so useful and save me so much time that the occasional "oh god what the hell did it just go off and do" are an acceptable cost for me.
I regularly get great results with open-ended prompts and agents that spend 15+ minutes working on the task. I'm sure they'll eventually get better at common sense understanding of what kind of work is wasteful/absurd.
The "# of model-generated tokens per response" chart in [the blog introducing gpt-5-codex](https://openai.com/index/introducing-upgrades-to-codex/) shows an example of how we're improving the model good at both.
As a startup founder and engineer, I'm not constrained by the number of 10000+ line diff, 0->1 demos I can ship. I'm constrained by quality of the 100 -> 101, tight 150 line feature additions / code cleanups I can write.
It feels like the demos, funding, and hype all want to sell me entire PR rewrites, but what I need is the best possible iterative work model that will keep me in the loop.
I still use codex - but I use codex incredibly iteratively (give it very narrowly scoped tasks, and I watch it like a hawk, giving tons of feedback). I don't use it because of its ability to code for 24 hours. I use it because when I give it those narrowly scoped tasks, it is better at writing good code than any other model. (Because of its latency, I have 2-4 of these conversations going on at the same time).
But there is a lot of friction the codex product + model adds to this process. I have to prompt aggressively to override whatever "be extremely precise" prompting the model gets natively so that it doesn't send me 20+ bullet points of extraordinarily dense prose on every message. I have to carefully manage its handling of testing; it will widen any DI + keep massive amounts of legacy code to make sure functionality changes don't break old tests (rather than updating them) and to make sure any difficult tests can have their primary challenges mocked away.
In general, codex doesn't feel like an amazing tool that I have sitting at my right hand. It feels like a teenage genius who has been designed to do tasks autonomously, and who I constantly have to monitor and rein in.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.
Ask a frontier LLM what a context window is, it will tell you.
With that out of the way, parent was wondering why compaction is necessary arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3 and you people are sitting in the back going "well, actually, not all groups are abelian".
For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).
[1] It still uses a quadratic router, but it's small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929
In practice, when training a model, people select a context window so that during inference, you know how much GPU memory to allocate for a prompt and reject the prompt if it exceeds the memory limit.
Of course there's also degrading performance as context gets longer, but I suspect memory limit is the primary factor of why we have context window limits.
Exactly. Standard multi-head attention materializes a score matrix that grows to roughly 4 billion entries for a 64K sequence as a starting place. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1TB/s of memory bandwidth to stay compute-bound in practice, even with this optimization.
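Checking that figure (these are attention score entries per head, not learned weights):

```python
seq = 64 * 1024                       # 64K-token sequence
entries = seq * seq                   # full attention matrix for one head
print(entries)                        # 4,294,967,296 -> the ~4B figure above
print(entries * 2 / 1e9)              # ~8.6 GB if materialized in fp16
print((128 * 1024) ** 2 * 2 / 1e9)    # ~34.4 GB at 128K context
```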
So there has been a lot of research in this area and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity and if you go fully linear, you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good alternative compromise.
MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation.
DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-linear by only computing "interesting" pairs. For example, in 128K context lengths this requires only 10-20% of attention pairs to be materialized.
Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention. The speedup is from a delta rule, which basically amounts to caching in a hand-wavy way.
There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well.
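As a toy illustration of the sparse direction (not DeepSeek's actual DSA kernel, which uses a cheap indexer so unselected pairs are never materialized), here is what keeping only the top-k keys per query before the softmax looks like:

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=32):
    """Toy sparse attention: keep only the k highest-scoring keys per query."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n); a real kernel avoids building this
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    scores = np.where(scores >= kth, scores, -np.inf)  # drop everything below the k-th largest
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 256, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = topk_sparse_attention(Q, K, V, k=32)             # shape (256, 64)
```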
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
Is this saying that said summarization now happens at the model level? Or are there other differences?
But it's the same concept: taking tokens in context and removing irrelevant ones by summarizing, etc.
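A minimal sketch of what client-side compaction can look like (an assumption about the general technique, not OpenAI's native implementation; `count_tokens` and `summarize` are hypothetical helpers):

```python
def compact(history, budget, count_tokens, summarize, keep_recent=10):
    """Fold older turns into a summary once the conversation exceeds a token budget."""
    if sum(count_tokens(m) for m in history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(old)  # e.g. one LLM call: "summarize the work done so far"
    return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + recent
```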
What does it even mean?
But they're claiming it's more token efficient, so me switching my usage to the new model should _free up_ capacity.
How much more token-efficient is this compared to 5.0?
I had to use 5.0 because 5.1 was eating tokens like crazy and seemed like a barely noticeable incremental improvement.
I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:
I have multiple subagents that I use for each phase and that (based on subjective judgement) improve the output quality (vs. keeping everything, every tool use, etc. in the "main" context window). Codex CLI is great and I use it often, but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available; hopefully we'll get more features for managing context.
I found Gemini to be horribly slow for anything.
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems to be really weird. I’m not sure how exactly it works, but - 1. It uses very little tokens / fills up the context slowly (good I guess) 2. Doesn’t seem to actually internalize the contents of files you mention to it, or it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to codebase patterns and its architecture, while codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
With Claude I'm constantly hitting rate limits; with Codex I get substantially more, and "slow" isn't really a problem for me as long as it keeps working.
The only complaint I have is that Codex itself is usage-limited now (either due to outstanding git issues around tools or throttling on their end) compared to a few months ago.
The true magical moment was Codex Pro letting me run swarms of agents day in, day out without any worries about rate limits; it truly felt unlimited.
If Claude manages to release a smaller model or some way to deal with the rapidly depleting usage limits (this is the top complaint on Reddit, and they eventually just stopped allowing threads about it), it would definitely be used more.
But for now Codex is clearly the workhorse, with Claude used side by side.
But the subscription thing is a non-issue for me as I use the API, and mostly use Claude Code synchronously, with the occasional rare background agent.
have you tried Haiku?
Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
I just won’t even waste my time with the google stuff cuz I can’t figure out how to pay with it.
And that’s a problem everywhere at google. Our google play account is suspended cuz I can’t verify the company. It won’t let me cuz it says I’m not the owner. I’ve always been the owner of my company. For 18 years. There is no one else.
Once some error said make sure the owner email matches your profile in google payments and I was like, what is google payments and where do I even begin with that? I’ve never paid for google play so what does payments have to do with anything?
It’s totally random stuff. Get your shit together, google. Make your products and payment systems coherent, rather than it obviously looking like it was designed by a fiefdom full of territorial managers.
Utterly ridiculous.
Also, re "Google Payments", I tried to transfer an app from my personal/solo Google Play account to a new business one I set up for my LLC and it was like pulling teeth. They wanted me to find some payment id from the original $20 purchase I made to get access to Google Play, something I did right around when they first launched and while I still have/use the same email, Google came out with approximately 1 googol different "payment solutions" in the interim and their engineers don't care about data migrations. Finally, after many support emails, they just transferred it without me giving that code which just shows how silly the whole thing was from the start.
What's harder than herding cats? Herding cats with MBAs and OKRs.
YES I had this and eventually fixed it. I really don't know what I did but lots of clicking on random links and signing into things in different orders and then one day it somehow worked.
So frustrating.
The sad part is that Google does offer a ChatML/OpenAI-compliant endpoint for LLM calls, and I believe they also ran an experiment to reduce the friction of getting an API key to start making calls right away, but discoverability remains a challenge with Google services.
This part is very easy now: you sign into https://aistudio.google.com/ and then click "Get API key" in the lower left corner.
The problem is that features and docs are still scattered all over. Some thing can only be done via Vertex, for example.
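For reference, the OpenAI-compatible endpoint mentioned above can be called with an AI Studio key roughly like this (a sketch; the base URL and model name should be checked against Google's current docs):

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AI_STUDIO_KEY",  # from the "Get API key" button in AI Studio
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
resp = client.chat.completions.create(
    model="gemini-2.5-flash",      # model name is an assumption; pick one from the docs
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```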
I'd love to see the Gemini models being available by other providers :) or if they just build a simple prepaid wallet like OpenAI and Anthropic.
Now you CAN NOT get the Google One stuff if your account is part of a workspace. I thought: how awful. I want to pay, but I simply can't?
Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license via the Google Workspace Admin area, great!
Turns out: you fucking can't. That's _Google AI Ultra FOR BUSINESS_ and that IS NOT supported.
So I had to get the Google One subscription on my personal account after all.
Combine that with the _pathetic_ usage limits: somehow not token-based, but amount of requests per 24 hour window (which is 500 for Gemini 3) and Gemini 3's incredible chattiness (it uses A LOT more requests to get something done compared to Claude) and you hit the usage limits in just 2 hours.
Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra personal subscription? I already pay for OpenAI and Anthropic’s pro/max plans and would happily pay Google too. But the only obvious option is a $250/month tier, and its documentation indicates Google can train on your code unless you find and enable the correct opt-out. If that opt-out exists in all the products, it’s not obvious where it lives or what products it applies to.
Workspace complicates it further. Google advertises that with business workspace accounts your data isn’t used for training. So, I was going to try Antigravity on our codebase. At this point I know I can't trust Google, so I read the ToS carefully. They train on your prompts and source code, and there doesn't appear to be a way to pay them and opt out right now. Be careful, paying for Google Workspace does not protect you, always read the ToS.
Be careful with AI-studio and your Google Workspace accounts. They train on your prompts unless you switch it to API mode.
The result is a lot of uncertainty. I genuinely have no idea how to pay Google for Gemini without risking my code being used for training. And if I do pay, I can’t tell whether they’ll train on my prompts anyway.
The marketing for their coding products does not clearly state when they do or do not train on your prompts and code.
I had to run deep research to understand the risks with using Gemini 3 for agentic work, and I still don't feel confident that I understand the risks. I might have said some incorrect things above, but I am just so confused. I feel like I have a <75% grasp on the situation.
I don't have a lot of trust. And honestly, this feels confusing and deceptive. One could easily mistake it for a deliberate strategy to gather training data through ambiguity and dark patterns; it certainly looks like this could be Google's strategy to win the AI race. I assume this is just how it looks, and that they aren't being evil on purpose.
OpenAI in particular has my trust. They get it. They are carefully building the customer experience, they are product and customer driven from the top.
I wouldn't trust Sam Altman. Or any of the big players really.
Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
Peering into my crystal ball: once all "workers" have been replaced, all humans will spend all of their working hours on nothing but office politics.
Please give me an option for a password (or passkey) or literally anything else that doesn't require either linking with google or going through an email flow for every login
https://github.com/google-gemini/gemini-cli/issues/12121
It is far too easy to accidentally end up under the wrong privacy agreement, to the point of where some workplaces are banning use of the Gemini CLI!
Though that does bring up an interesting point. Anecdotally, Sonnet does a lot more grep-ing while Codex reads files straight up. Might be the difference in speed and maybe smarter models will do better. Once this model is on copilot, I can test it out.
There's an option to "get a quick answer," and I hoped clicking that would revert to the previous performance; instead, it ignores that I uploaded two files and asks me to upload the files.
Literally the only real good task I've found for these dumb things and they still found a way to fuck it up because they need to keep the weirdos and whales addicted. It's now almost easier to go back to comparing these files by eye, or just bite the bullet and finally write a few lines of python to actually do it right and reliably.
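The "few lines of Python" route the comment alludes to is basically a stock difflib call (file names here are placeholders):

```python
import difflib
from pathlib import Path

a = Path("file_a.txt").read_text().splitlines(keepends=True)
b = Path("file_b.txt").read_text().splitlines(keepends=True)

# Print a unified diff of the two files: deterministic, no model involved.
for line in difflib.unified_diff(a, b, fromfile="file_a.txt", tofile="file_b.txt"):
    print(line, end="")
```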
It seems like they might still be heavily nerfing / quantizing the models in production a couple weeks before a new release, like they have always (unofficially) done.
One huge difference I notice between Codex and Claude code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that i've seen it work for 30 minutes to convolute some solution that was only convoluted because of some sentence I threw in the instructions I had completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles and want a fast iteration loop. Codex is much worse at that because it takes like 5 minutes to validate that everything is correct. Codex is much better for longer, harder tasks that have to be correct: I can just write some script to verify that what it did works, and let it spin for 30-40 minutes.
A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell when Claude is not paying attention to the instructions on CLAUDE.md when Claude stops calling him “Mr Tinkleberry” consistently
This guy has a good write up on the topic
I'd be wary of using any canary material that wouldn't be at home in the sort of work you're doing.
The chat is a simulation, and if you act silly, the model will simulate an appropriate response.
[0]: https://en.wikipedia.org/wiki/A_Void
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
https://github.com/openai/codex/releases/tag/rust-v0.59.0
152 more comments available on Hacker News