GPT-5.2 | Not Hacker News!

Discussion (1057 comments)

Showing 160 comments of 1057

23 days ago

4 replies

Everything is still based on 4 4o still right? is a new model training just too expensive? They can consult deepseek team maybe for cost constrained new models.

verdverm

23 days ago

2 replies

Apparently they have not had a successful pre training run in 1.5 years

fouronnes3

23 days ago

6 replies

I want to read a short scify story set in 2150 about how, mysteriously, no one has been able to train a better LLM for 125 years. The binary weights are studied with unbelievably advanced quantum computers but no one can really train a new AI from scratch. This starts cults, wars and legends and ultimately (by the third book) leads to the main protagonist learning to code by hand, something that no human left alive still knows how to do. Could this be the secret to making a new AI from scratch, a century later?

WhyOhWhyQ

23 days ago

1 reply

There's a scifi short story about a janitor who knows how to do basic arithmetic and becomes the most important person in the world when some disaster happens. Of course after things get set up again due to his expertise, he becomes low status again.

bradfitz

23 days ago

1 reply

I had to go look that up! I assume that's https://en.wikipedia.org/wiki/The_Feeling_of_Power ? (Not a janitor, but "a low grade Technician"?)

WhyOhWhyQ

23 days ago

Hmm it could be a false memory, since this was almost 15 years ago, but I really do remember it differently than the text of 'Feeling of Power'.

ssl-3

23 days ago

2 replies

Sounds good.

Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.

(That's the end. Zombies are always the end.)

wafflemaker

23 days ago

1 reply

Sorry, but compared with the parent, my money is in you ssl-3. Do you get better results from prompting by being more poetic?

ssl-3

22 days ago

1 reply

> Do you get better results from prompting by being more poetic?

Is that yet-another accusation of having used the bot?

I don't use the bot to write English prose. If something I write seems particularly great or poetic or something, then that's just me: I was in the right mood, at the right time, with the right idea -- and with the right audience.

They can't all be zingers.

When it's bad or fucked-up, then that's also just me. I most-assuredly fuck up plenty.

I'm fine with that.

---

I do use the hell out of the bot for translating my ideas (and the words that I use to express them) into languages that I can't speak well, like Python, C, and C++. But that's very different. (And at least so far I haven't shared any of those bot outputs with the world at all.)

So to take your question very literally: No, I don't get better results from prompting by being more poetic. The responses to my prompts don't improve by those prompts being articulate or poetic.

Instead, I've found that I get the best results from the bot fastest by carrying a big stick, and using that stick to hammer it into compliance.

Things can get rather irreverent in my interactions with the bot, and poeticism is pretty far removed from any of that business.

wafflemaker

22 days ago

1 reply

No. I just genuinely liked your style, and didn't notice previous posts by you. I haven't yet learned to look at names on hn, it's mostly anonymous posts for me. No snark here. And was also genuinely curious if better writing style yields better results.

I've observed that using proper grammar gives slightly better answers. And using more "literacy"(?) kind of language in prompts sometimes gives better answers and sometimes just more interesting ones, when bots try to follow my style.

Sorry for using the word poetic, I'm travelling and sleep deprived and couldn't find the proper word, but didn't want to just use "nice" instead either.

ssl-3

22 days ago

It's all good. I'm largely "face-blind", myself, in that I don't often recognize others in person or online -- which is certainly not to say that I think I'm particularly memorable myself.

As to the bot: Man, I beat the bot to death. It's pretty brutal.

I'm profane and demanding because that's the most terse language I know how to construct in English.

When I set forth to have the bot do a thing for me, the slowest part of the process that I can improve on my part is the quantity of the words that I use.

I can type fast and think fast, but my one-letter-at-a-time response to the bot is usually the only part that that I can make a difference with. So I tend to be very terse.

"a+b=c, you fuck!" is certainly terse, unambiguous, and fast to type, so that's my usual style.

Including the "you fuck!" appendage seems to stir up the context more than without. Its inclusion or omission is a dial that can be turned.

Meanwhile: "I have some reservations about the proposed implementation. Might it be possible for you to revise it so as to be in a different form? As previously discussed, it is my understanding that a+b=c. Would you like to try again to implement a solution that incorporates this understanding?" is very slow to write.

They both get similar results. One method is faster for me than the other, just because I can only type so fast. The operative function of the statement is ~the same either way.

(I don't owe the bot anything. It isn't alive. It is just a computer running a program. I could work harder to be more polite, empathetic, or cordial, but: It's just code running on a box somewhere in a datacenter that is raising my electric rate and making the RAM for my next system upgrade very expensive. I don't owe it anything, much less politeness or poeticism.

Relatedly, my inputs at the bash prompt on my home computer are also very terse. For instance I don't have any desire or ability to be polite to bash; I just issue commands like ls and awk and grep without any filler-words or pleasantries. The bot is no different to me.

When I want something particularly poetic or verbose as output from the bot, I simply command it to be that way.

It's just a program.)

astrange

23 days ago

This is somewhat similar to a Piers Anthony series that I suspect noone has ever read except for me.

What was with that guy anyway.

armenarmen

23 days ago

I’d read it!

verdverm

23 days ago

You can ask 2025 Ai to write such a book, it's happy to comply and may or may not actually write the book

https://www.pcgamer.com/software/ai/i-have-been-fooled-reddi...

georgefrowny

23 days ago

An software version of the Holmes-Ginsbook device? https://sfwritersworkshop.org/node/1232

barrenko

23 days ago

Monsieur, if I may offer a vaaaguely similar story on how things may progress https://www.owlposting.com/p/a-body-most-amenable-to-experim...

ijl

23 days ago

1 reply

What kind of issues could prevent a company with such resources from that?

verdverm

23 days ago

Drama if I had to pick the symptom most visible from the outside.

A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time the Ilya and the board ousted Sam only to reverse it almost immediately?

https://arstechnica.com/information-technology/2024/05/chief...

Wowfunhappy

23 days ago

2 replies

I thought whenever the knowledge cutoff increased that meant they’d trained a new model, I guess that’s completely wrong?

rockinghigh

23 days ago

1 reply

They add new data to the existing base model via continuous pre-training. You save on pre-training, the next token prediction task, but still have to re-run mid and post training stages like context length extension, supervised fine tuning, reinforcement learning, safety alignment ...

astrange

23 days ago

Continuous pretraining has issues because it starts forgetting the older stuff. There is some research into other approaches.

brokencode

23 days ago

Typically I think, but you could pre-train your previous model on new data too.

I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.

elgatolopez

23 days ago

2 replies

Where did you get that from? Cutoff date says august 2025. Looks like a newly pretrained model

FergusArgyll

23 days ago

1 reply

> This stands in sharp contrast to rivals: OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome.

- https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

It's also plainly obvious from using it. The "Broadly deployed" qualifier is presumably referring to 4.5

ric2b

21 days ago

How is that a technical hurdle if they obviously were able to do it before?

It's probably just a question of cost/benefit analysis, it's very expensive to do, so the benefits need to be significant.

SparkyMcUnicorn

23 days ago

[delayed]

catigula

23 days ago

1 reply

The irony is that Deepseek is still running with a distilled 4o model.

blovescoffee

23 days ago

Source?

system2

23 days ago

1 reply

"Investors are putting pressure, change the version number now!!!"

exe34

23 days ago

4 replies

I'm quite sad about the S-curve hitting us hard in the transformers. For a short period, we had the excitement of "ooh if GPT-3.5 is so good, GPT-4 is going to be amazing! ooh GPT-4 has sparks of AGI!" But now we're back to version inflation for inconsequential gains.

ToValueFunfetti

23 days ago

2 replies

Take this all with a grain of salt as it's hearsay:

From what I understand, nobody has done any real scaling since the GPT-4 era. 4.5 was a bit larger than 4, but not as much as the orders of magnitude difference between 3 and 4, and 5 is smaller than 4.5. Google and Anthropic haven't gone substantially bigger than GPT-4 either. Improvements since 4 are almost entirely from reasoning and RL. In 2026 or 2027, we should see a model that uses the current datacenter buildout and actually scales up.

Leynos

23 days ago

4.5 is widely believed to be an order of magnitude larger than GPT-4, as reflected in the API inference cost. The problem is the quantity of parameters you can fit in the memory of one GPU. Pretty much every large GPT model from 4 onwards has been mixture of experts, but for a 10 trillion parameter scale model, you'd be talking a lot of experts and a lot of inter-GPU communication.

With FP4 in the Blackwell GPUs, it should become much more practical to run a model of that size at the deployment roll-out of GPT-5.x. We're just going to have to wait for the GBx00 systems to be physically deployed at scale.

snovv_crash

23 days ago

Datacenter capacity is being snapped up for inference too though.

JanSt

23 days ago

1 reply

I don't feel the S-curve at all yet. Still an exponential for me

exe34

23 days ago

With a very long doubling time?

gessha

23 days ago

1 reply

Because it will take thousands of underpaid researchers random searching through solution space to get to the next improvement, not 2-3 companies pressed to monetize and enshittify their product before money runs out. That and winning more hardware lotteries.

astrange

23 days ago

1 reply

Underpaid? OpenAI!? It's pretty good I think.

https://www.levels.fyi/companies/openai/salaries/software-en...

gessha

23 days ago

I’m talking about grad students, not OpenAI researchers.

verdverm

23 days ago

2025 is the year most Big AI released their first real thinking models

Now we can create new samples and evals for more complex tasks to train up the next gen, more planning, decomp, context, agentic oriented

OpenAI has largely fumbled their early lead, exciting stuff is happening elsewhere

josalhor

23 days ago

7 replies

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

verdverm

23 days ago

5 replies

We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day

quantumHazer

23 days ago

1 reply

Seems pretty false if you look at the model card and web site of Opus 4.5 that is… (check notes) their latest model.

verdverm

23 days ago

1 reply

Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing which is why they have models people like to use for their day-to-day.

I use Gemini, Anthropic stole $50 from me (expired and kept my prepaid credits) and I have not forgiven them yet for it, but people rave about claude for coding so I may try the model again through Vertex Ai...

elcritch

23 days ago

1 reply

You could try Codex cli. I prefer it over Claude code now, but only slightly.

verdverm

23 days ago

No thanks, not touching anything Oligarchy Altman is behind

Mistletoe

23 days ago

3 replies

How do you measure whether it works better day to day without benchmarks?

standardUser

23 days ago

1 reply

Subscriptions.

mrguyorama

23 days ago

1 reply

Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.

standardUser

19 days ago

It's all the shareholders care about. These are not research institutions.

verdverm

23 days ago

1 reply

Internal evals, Big AI certainly has good, proprietary training and eval data, it's one reason why their models are better

aydyn

23 days ago

1 reply

Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.

verdverm

23 days ago

1 reply

How would published numbers be useful without knowing what the underlying data being used to test and evaluate them are? They are proprietary for a reason

To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees

aydyn

23 days ago

1 reply

Do you know everything that exists in public benchmarks?

They can give a description of what their metrics are without giving away anything proprietary.

verdverm

23 days ago

1 reply

I'd recommend watching Nathan Lambert's video he dropped yesterday on Olmo 3 Thinking. You'll learn there's a lot of places where even descriptions of proprietary testing regimes would give away some secret sauce

Nathan is at Ai2 which is all about open sourcing the process, experience, and learnings along the way

aydyn

22 days ago

1 reply

Thanks for the reference I'll check it out. But it doesnt really take away from the point I am making. If a level of description would give away proprietary information, then go one level up to a more vague description. How to describe things to a proper level is more of a social problem than a technical one.

verdverm

21 days ago

You seem stuck on the idea that they should have to share information when they don't have to. That they share any is a welcome change. Push too hard and they may stop sharing as much

bulbar

23 days ago

Manually labeling answers maybe? There exist a lot of infrastructure built around and as it's heavily used for 2 decades and it's relatively cheap.

That's still benchmarking of course, but not utilizing any of the well known / public ones.

brokensegue

23 days ago

1 reply

how do you quantitatively measure day-to-day quality? only thing i can think is A/B tests which take a while to evaluate

verdverm

23 days ago

more or less this, but also synthetic

if you think about GANs, it's all the same concept

1. train model (agent)

2. train another model (agent) to do something interesting with/to the main model

3. gain new capabilities

4. iterate

Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single model / LLM training. I still need to get into what all is de jour in the RL / late training, that's where a lot of opportunity lies from my understanding so far

HDThoreaun

23 days ago

3 replies

Arc-AGI is just an iq test. I don’t see the problem with training it to be good at iq tests because that’s a skill that translates well.

CamperBob2

23 days ago

5 replies

Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

npinsker

23 days ago

2 replies

Completely false. This is like saying being good at chess is equivalent to being smart.

Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

CamperBob2

23 days ago

1 reply

Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is.

fwip

23 days ago

1 reply

[delayed]

CamperBob2

23 days ago

1 reply

Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.

What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.

fwip

23 days ago

[delayed]

mrandish

23 days ago

1 reply

ARC-AGI was designed specifically for evaluating deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.

npinsker

23 days ago

2 replies

I agree with you. I agree it's valuable work. I totally disagree with GP's claim.

A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but high school math just isn't infinite.

Similarly, ARC-AGI is much more bounded than GP seems to think. It correlates with intelligence, but it doesn't imply it.

keeda

23 days ago

Maybe I'm misinterpreting your point, but this makes it seem that your standard for "intelligence" is "inventing entirely new techniques"? If so, it's a bit extreme, because to a first approximation, all problem solving is combining and applying existing techniques in novel ways to new situations.

At the point that you are inventing entirely new techniques, you are usually doing groundbreaking work. Even groundbreaking work in one field is often inspired by techniques from other fields. In the limit, discovering truly new techniques often requires discovering new principles of reality to exploit, i.e. research.

As you can imagine, this is very difficult and hence rather uncommon, typically only accomplished by a handful of people in any given discipline, i.e way above the standards of the general population.

I feel like if we are holding AI to those standards, we are talking about not just AGI, but artificial super-intelligence.

yovaer

23 days ago

> but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems

IMO/AIME problems perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, then there would be a lot fewer open problems around than what's the case.

esafak

23 days ago

1 reply

I would not be so sure. You can always prep to the test.

HDThoreaun

23 days ago

1 reply

How do you prep for arc agi? If the answer is just "get really good at pattern recognition" I do not see that as a negative at all.

ben_w

23 days ago

1 reply

It can be not-negative without being sufficient.

Imagine that pattern recognition is 10% of the problem, and we just don't know what the other 90% is yet.

Streetlight effect for "what is intelligence" leads to all the things that LLMs are now demonstrably good at… and yet, the LLMs are somehow missing a lot of stuff and we have to keep inventing new street lights to search underneath: https://en.wikipedia.org/wiki/Streetlight_effect

HDThoreaun

23 days ago

I dont think many people are saying 100% arc-agi 2 is equivalent to AGI(names are dumb as usual). Its just the best metric I have found, not the final answer. Spatial reasoning is an important part of intelligence even if it doesnt encompass all of it.

jimbokun

23 days ago

1 reply

Is it different every time? Otherwise the training could just memorize the answers.

CamperBob2

23 days ago

The models never have access to the answers for the private set -- again, at least in principle. Whether that's actually true, I have no idea.

ACCount37

23 days ago

With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.

FergusArgyll

23 days ago

It's very much a vision test. The reason all the models don't pass it easily is only because of the vision component. It doesn't have much to do with reasoning at all

mrandish

23 days ago

It is similar to an IQ test in that it evaluates abstract reasoning, however unlike IQ tests, it's been designed specifically for evaluating deeper reasoning in LLMs, including being and resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.

Dismissing ARC-AGI as 'just an IQ test' doesn't seem justified because most SOTA LLMs have been performing well on traditional IQ tests for a while, yet SOTA LLMs did terribly on ARC-AGI-1 and showed almost no improvement for ~4 years - while knocking down virtually every other benchmark test. I suspect this was due to ARC-AGI's design uniquely trying to resist not only 'training on the answers' but also 'train on the concepts', is working to some degree.

fwip

23 days ago

It is very similar to an IQ test, with all the attendant problems that entails. Looking at the Arc-AGI problems, it seems like visual/spatial reasoning is just about the only thing they are testing.

stego-tech

23 days ago

2 replies

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).

verdverm

23 days ago

1 reply

I'm not sure, here's my anecdotal counter example, was able to get gemini-2.5-flash, in two turns, to understand and implement something I had done separately first, and it found another bug (also that I had fixed, but forgot was in this path)

That I was able to have a flash model replicate the same solution I had, to two problems in two turns, it's just the opposite experience of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). They are able to do more of them today then they were even 6-12 months ago (pre-thinking models).

https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v

stego-tech

23 days ago

1 reply

And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.

I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.

Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.

verdverm

23 days ago

1 reply

I mean more the guts of the agentic systems. Prompts, tool design, state and session management, agent transfer and escalation. I come from devops and backend dev, so getting in at this level, where LLMs are tasked and composed, is more interesting.

If you are only using provider LLM experiences, and not something specific to coding like copilot or Claude code, that would be the first step to getting the magic as you say. It is also not instant. It takes time to learn any new tech, this one has a above average learning curve, despite the facade and hype of how it should just be magic

remich

23 days ago

1 reply

Any particular papers or articles you've been reading that helped you devise this? Your experiments sound interesting and possibly relevant to what I'm doing.

verdverm

21 days ago

Conversations among practitioners on Bluesky (there is an Ai subcommunity)

snet0

23 days ago

1 reply

To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

jacquesm

23 days ago

4 replies

That means you're probably asking it to do very simple things.

camdenreslink

23 days ago

1 reply

Sometimes you do need to (as a human) break down a complex thing into smaller simple things, and then ask the LLM to do those simple things. I find it still saves some time.

ragequittah

23 days ago

1 reply

Or what will often work is having the LLM break it down into simpler steps and then running them 1 by 1. They know how to break down problems fairly well they just don't often do it properly sometimes unless you explicitly prompt them to.

jacquesm

23 days ago

Yes, but for that you have to know that the output it gave you is wrong in the first place and if that is so you didn't need AI to begin with...

baq

23 days ago

2 replies

I can confidently say that anecdotally you’re completely wrong, but I’ll also allow a very different definition of ‘simple’ and/or attempting to use an unpopular environment as a valid anecdotal counterpoint.

jacquesm

23 days ago

3 replies

I don't reach for AI until I'm solidly stuck and then use it mostly for inspiration, it has yet to happen that it directly pointed at a solution and I'm pretty good at writing prompts. But when I throw a bunch of elementary stuff at it then it is super good at identifying issues and solving them. But here is a nice example for a slightly more complex test:

Prompt:

"I have a green LED, a 12V powersupply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 micro farads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"

Answer:

ASCII schematic (fully functional):

           +12V
            |
            |
           .-.
           | |
           | | 2.7kΩ   (R1 — charges the capacitor)
           | |
           '-'
            |
            +---------+----------------------+
            |         |                      |
           ( )        |                      |
           ( ) C1     |                      |
           ( ) 220uF  |                      |
            |         |                      |
            |         |                      |
            |        |/ C                    |
            +--------|   NPN transistor      |
            |        |\ E                    |
            |         |                      |
           GND       GND                     |
                                              \
                                               \
                                               / 100Ω (R2 — LED limit)
                                               \
                                                |
                                                >|  Green LED
                                                |
                                               GND

Follow by:

"How it works

The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.

The capacitor voltage rises slowly.

When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.

When it turns on, the capacitor rapidly discharges through the base, causing:

A brief pulse of current through the transistor

The LED lights up through the 100 Ω resistor

After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.

This repeats automatically → LED flasher."

The number of errors in the circuit and the utterly bogus explanation as well as the over confident remark that this is 'working' is so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.

emporas

23 days ago

1 reply

I have used Gemini for reading and solving electronic schematics exercises, and it's results were good enough for me. Roughly 50% of the exercises managed to solve correctly, 50% wrong. Simple R circuits.

One time it messed up the opposite polarity of two voltage sources in series, and instead of subtracting their voltages, it added them together, I pointed out the mistake and Gemini insisted that the voltage sources are not in opposite polarity.

Schematics in general are not AIs strongest point. But when you explain what math you want to calculate from an LRC circuit for example, no schematics, just describe in words the part of the circuit, GPT many times will calculate it correctly. It still makes mistakes here and there, always verify the calculation.

jacquesm

23 days ago

2 replies

I guess I'm just more critical than you are. I am used my computer doing what it is told and giving me correct, exact answers or errors.

emporas

23 days ago

There is also Mercury LLM, which computes the answer directly as a 2D text representation. I don't know if you are familiar with Mercury LLM, but you read correctly, 2D text output.

This might work better getting input as an ASCII diagram, or generating an output as an ASCII diagram, not sure if both input and output work 2D.

Plumbing/electrical/electronic schematics are pretty important for AIs to understand and assist us, but for the moment the success rate is pretty low. 50% success rate for simple problems is very low, 80-90% success rate for medium difficulty problems is where they start being really useful.

dagss

22 days ago

I think most people treat them like humans not computers, and I think that is actually a much more correct way to treat them. Not saying they are like humans, but certainly a lot more like humans than whatever you seem to be expecting in your posts.

Humans make errors all the time. That doesn't mean having colleagues is useless, does it?

An AI is a colleague that can code very very fast and has a very wide knowledge base and versatility. You may still know better than it in many cases and feel more experienced that in. Just like you might with your colleagues.

And it needs the same kind of support that humans need. Complex problem? Need to plan ahead first. Tricky logic? Need unit tests. Research grade problem? Need to discuss through the solution with someone else before jumping to code and get some feedback and iterate for 100 messages before we're ready to code. And so on.

manmal

22 days ago

1 reply

I have this mental model of LLMs and their capabilities, formed after months of way too much coding with CC and Codex, with 4 recursive problem categories:

1. Problems that have been solved before have their solution easily repeated (some will say, parroted/stolen), even with naming differences.

2. Problems that need only mild amalgamation of previous work are also solved by drawing on training data only, but hallucinations are frequent (as low probability tokens, but as consumers we don’t see the p values).

3. Problems that need little simulation can be simulated with the text as scratchpad. If evaluation criteria are not in training data -> hallucination.

4. Problems that need more than a little simulation have to either be solved by adhoc written code, or will result in hallucination. The code written to simulate is again a fractal of problems 1-4.

Phrased differently, sub problem solutions must be in the training data or it won’t work; and combining sub problem solutions must be either again in training data, or brute forcing + success condition is needed, with code being the tool to brute force.

I _think_ that the SOTA models are trained to categorize the problem at hand, because sometimes they answer immediately (1&2), enable thinking mode (3), or write Python code (4).

My experience with CC and Codex has been that I must steer it away from categories 2 & 3 all the time, either solving them myself, ask them to use web research, or split them up until they are (1) problems.

Of course, for many problems you’ll only know the category once you’ve seen the output, and you need to be able to verify the output.

I suspect that if you gave Claude/Codex access to a circuit simulator, it will successfully brute force the solution. And future models might be capable enough to write their own simulator adhoc (ofc the simulator code might recursively fall into category 2 or 3 somewhere and fail miserably). But without strong verification I wouldn’t put any trust in the outcome.

With code, we do have the compiler, tests, observed behavior, and a strong training data set with many correct implementations of small atomic problems. That’s a lot of out of the box verification to correct hallucinations. I view them as messy code generators I have to clean up after. They do save a ton of coding work after or while I‘m doing the other parts of programming.

jacquesm

22 days ago

This parallels my own experience so far, the problem for me is that (1) and (2) I can quickly and easily do myself and I'll do it in a way that respects the original author's copyright by including their work - and license - verbatim.

(3) and (4) level problems are the ones where I struggle tremendously to make any headway even without AI, usually this requires the learning of new domain knowledge and exploratory code (currently: sensor fusion) and these tools will just generate very plausible nonsense which is more of a time waster than a productivity aid. My middle-of-the-road solution is to get as far as I can by reading about the problem so I am at least able to define it properly and to define test cases and useful ranges for inputs and so on, then to write a high level overview document about what I want to achieve and what the big moving parts are and then only to resort to using AI tools to get me unstuck or to serve as a knowledge reservoir for gaps in domain knowledge.

Anybody that is using the output of these tools to produce work that they do not sufficiently understand is going to see a massive gain in productivity, but the underlying issues will only surface a long way down the line.

dagss

22 days ago

1 reply

I am right now implementing an imagining pipeline using OpenCV and TypeScript.

I have never used OpenCV specifically before, and have little imaging experience too. What I do have though is a PhD in astrophysics/statistics so I am able to follow along the details easily.

Results are amazing. I am getting results in 2 days of work that would have taken me weeks earlier.

ChatGPT acts like a research partner. I give it images and it explains why current scoring functions fails and throws out 10 new directions to go in.

And if I want to try something, the code is usually bug free. So fast to just write code, try it, throw it away if I want to try another idea.

I think a) OpenCV probably has more training data than circuits? and b) I do not treat it as a desperate student with no knowlegde.

I expect to have to guide it.

There are several hundred messages back and forth.

It is more like two researchers working together with different skill sets complementing one another.

One of those skillsets being to turn a 20 message conversation into bugfree OpenCV code in 20 seconds.

No, it is not providing a perfect solution to all problems on first iteration. But it IS allowing me to both learn very quickly and build very quickly. Good enough for me..

jacquesm

22 days ago

That's a good use case, and I can easily imagine that you get good results from it because (1) it is for a domain that you are already familiar with and (2) you are able to check that the results that you are getting are correct and (3) the domain that you are leveraging (coding expertise) is one that chatgpt has ample input for.

Now imagine you are using it for a domain that you are not familiar with, or one for which you can't check the output or that chatgpt has little input for.

If either of those is true the output will be just as good looking and you would be in a much more difficult situation to make good use of it, but you might be tempted to use it anyway. A very large fraction of the use cases for these tools that I have come across professionally so far are of the latter variety, the minority of the former.

And taking all of the considerations into account:

- how sure are you that that code is bug free?

- Do you mean that it seems to work?

- Do you mean that it compiles?

- How broad is the range of inputs that you have given it to ascertain this?

- Have you had the code reviewed by a competent programmer (assuming code review is a requirement)?

- Does it pass a set of pre-defined tests (part of requirement analysis)?

- Is the code quality such that it is long term maintainable?

verdverm

23 days ago

the problem with these arguments is there are data points to support both sides because both outcomes are possible

the real thing is are you or we getting an ROI and the answer is increasingly more yeses on more problems, this trend is not looking to plateau as we step up the complexity ladder to agentic system

snet0

23 days ago

1 reply

If you define "simple thing" as "thing an AI can't do", then yes. Everyone just shifts the goalposts in these conversations, it's infuriating.

ACCount37

23 days ago

1 reply

Come on. If we weren't shifting the goalposts, we would have burned through 90% of the entire supply of them back in 2022!

baq

23 days ago

It’s less shifting goalposts and more of a very jagged frontier of capabilities problem.

djeastm

22 days ago

1 reply

Possibly, but a lot of value comes from doing very simple things faster.

jacquesm

22 days ago

That is a good point. A lot of work really is mostly simple things.

poormathskills

23 days ago

1 reply

For a minor version update (5.1 -> 5.2) that's a way bigger improvement than I would have guessed.

beering

23 days ago

Model capability improvements are very uneven. Changes between one model and the next tend to benefit certain areas substantially without moving the needle on others. You see this across all frontier labs’ model releases. Also the version numbering is BS (remember GPT-4.5 followed by GPT-4.1?).

catigula

23 days ago

1 reply

Yes, but it's not good enough. They needed to surpass Opus 4.5.

mikairpods

23 days ago

that is better...?

minimaxir

23 days ago

2 replies

Note that GPT 5.2 newly supports a "xhigh" reasoning level, which could explain the better benchmarks.

It'll be noteworthy to see the cost-per-task on ARC AGI v2.

granzymes

23 days ago

1 reply

> It'll be noteworthy to see the cost-per-task on ARC AGI v2.

Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).

The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).

https://arcprize.org/leaderboard

minimaxir

23 days ago

Huh, that is indeed up and to left of Opus.

walletdrainer

23 days ago

5.1-codex supports that too, no? Pretty sure I’ve been using xhigh for at least a week now

causal

23 days ago

2 replies

That ARC AGI score is a little suspicious. That's a really tough for AI benchmark. Curious if there were improvements to the test harness because that's a wild jump in general problem solving ability for an incremental update.

taurath

23 days ago

I don’t think their words mean just about anything, only the behavior of the models.

Still waiting of Full Self Driving myself.

woeirua

23 days ago

They're clearly building better training datasets and doing extensive RL on these benchmarks over time. The out of distribution performance is still awful.

fuddle

23 days ago

1 reply

I don't think SWE Verified is an ideal benchmark, as the solutions are in the training dataset.

joshuahedlund

23 days ago

I would love for SWE Verified to put out a set of fresh but comparable problems and see how the top performing models do, to test against overfitting.

thinkingtoilet

23 days ago

Open AI has already been busted for getting benchmark information and training the models on that. At this point if you believe Sam Altman, I have a bridge to sell you.

egeres

23 days ago

1 reply

It baffles me to see these last 2 announcements (GPT 5.1 as well) devoid of any metrics, benchmarks or quantitative analyses. Could it be because they are behind Google/Anthropic and they don't want to admit it?

(edit: I'm sorry I didn't read enough on the topic, my apologies)

zamadatix

23 days ago

This isn't the announcement, it's the developer docs intro page to the model - https://openai.com/index/introducing-gpt-5-2/. Still doesn't answer cross-comparison, but at least has benchmarks they want to show off.

mattas

23 days ago

2 replies

Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!

olliepro

23 days ago

1 reply

Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.

Wowfunhappy

23 days ago

1 reply

I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.

olliepro

22 days ago

It feels like this should work, but the breadth of knowledge in these models is so vast. Everyone knows how to taste, but not everyone knows physics, biology, math, every language… poetry, etc. Enumerating the breadth of valuable human tasks is hard, so both approaches suffer from the scale of the models’ surface area.

An interesting problem since the creators of OLMO have mentioned that throughout training, they use 1/3 or their compute just doing evaluations.

Sir_Twist

23 days ago

[delayed]

zug_zug

23 days ago

5 replies

For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?

FrasiertheLion

23 days ago

1 reply

Try elevenlabs

sosodev

23 days ago

2 replies

Does elevenlabs have a real-time conversational voice model? It seems like like their focus is largely on text to speech and speech to text. Which can approximate that type of thing but it's not at all the same as the native voice to voice that 4o does.

dragonwriter

23 days ago

1 reply

> Does elevenlabs have a real-time conversational voice model?

Yes.

> It seems like like their focus is largely on text to speech and speech to text.

They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.

sosodev

23 days ago

It specifically says in the architecture docs for the agents platform that it's STT (ASR) -> LLM -> TTS

https://elevenlabs.io/docs/agents-platform/overview#architec...

hi_im_vijay

23 days ago

[disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.

a true speech to speech conversational model will perform better on things like capturing tone, pronouncations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.

bigyabai

23 days ago

1 reply

Qwen does.

sosodev

23 days ago

Qwen's voice chat is nowhere near as good as ChatGPT's.

Robdel12

23 days ago

1 reply

I have found Claude‘s voice chat to be better. I only recently tried it because I liked ChatGPTs enough, but I think I’m going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.

lxgr

23 days ago

2 replies

Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.

sosodev

23 days ago

2 replies

You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (record and record), etc.

lxgr

23 days ago

3 replies

Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?

sosodev

23 days ago

1 reply

Yes, a sufficiently advanced marrying of TTS and LLM could pass a lot of these tests. That kind of blurs the line between native voice model and not though.

You would need:

* A STT (ASR) model that outputs phonetics not just words

* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc

* A TTS model that understands those tokens and properly generate the matching voice

At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o.

BoxOfRain

22 days ago

I've been experimenting with something similar to this approach recently. IndexTTS2 gives you emotion vectors as an input, I used an external emotion classification model on the LLM output to modulate the TTS emotion vectors. You need to manage the state of the current affect with a bit of care or it sounds unhinged, but it's worked surprisingly well so far. I wired it together using Cats Effect.

As you'd expect latency isn't great, but I think it can be improved.

jablongo

23 days ago

I tried to make ChatGPT sing Mary had a little lamb recently and it's atonal but vaguely resembles the melody, which is interesting.

barrkel

23 days ago

The model giving it text to speak would have to annotate the text in order for the TTS to add the affect. The TTS wouldn't "remember" such instructions from a speech to text stage previously.

nomel

23 days ago

> make specific sounds (like laughter)

I’ve jumped a few times when it makes a full on shrieking scream rather than just reading it, in some of the audiobooks I’ve made with the TTS service (I assume related?).

causalmodels

23 days ago

1 reply

I just asked it and it said that it uses the on device TTS capabilities.

furyofantares

23 days ago

1 reply

I find it very unlikely that it would be trained on that information or that anthropic would put that in its context window, so it's very likely that it just made that answer up.

causalmodels

23 days ago

1 reply

No, it did not make it up. I was curious so I asked it asked it to imitate a posh British accent imitating a South Brooklyn accent while having a head cold and it explained that it didn't have have fine grained control over the audio output because it was using a TTS. I asked it how it knew that and it pointed me towards [1] and highlighted the following.

> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.

Tracked down the original source [2] and looked for additional updates but couldn't find anything.

[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...

[2] https://trust.anthropic.com/updates

furyofantares

23 days ago

If it does a web search that's fine, I assumed it hadn't since you hadn't linked to anything.

websiteapi

23 days ago

2 replies

gemini live is a thing - never tried chaptgpt, are they not similar?

jeanlucas

23 days ago

2 replies

no.

leaK_u

23 days ago

1 reply

how.

CamelCaseName

23 days ago

I find ChatGPT's voice to text to be the absolute best in the world, nearly perfect.

I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.

nickvec

23 days ago

What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.

spudlyo

23 days ago

Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.

Gemini responds in what I think is Spanish, or perhaps Portuguese.

codybontecou

23 days ago

Their voice agent is handy. Currently trying to build around it.

Xiol

23 days ago

Yawn.

k2xl

23 days ago

The ARC AGI 2 bump to 52.9% is huge. Shockingly GPT 5.2 Pro does not add too much more (54.2%) for the increase cost.

fulafel

23 days ago