AI Will Make Formal Verification Go Mainstream
Key topics
The debate around AI-assisted coding and formal verification has sparked a lively discussion, with some commenters convinced that AI will revolutionize the field, while others express skepticism. Proponents argue that AI coding agents, like Claude Code, can produce high-quality code when paired with robust testing and validation mechanisms, with some noting that agents can even conform to specific coding styles like Black. However, others point out that AI agents still struggle with complex issues, such as ownership and trait problems in Rust, and can produce non-conforming code despite explicit instructions. As the cost of using these AI tools is discussed, estimates range from $200 to $600 per month for heavy users, sparking curiosity about the financial investment required to leverage AI-assisted coding effectively.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 21m after posting
- Peak period: Day 1 (150 comments)
- Avg / period: 26.7 comments
Based on 160 loaded comments
Key moments
- Story posted: Dec 16, 2025 at 4:14 PM EST (17 days ago)
- First comment: Dec 16, 2025 at 4:35 PM EST (21m after posting)
- Peak activity: 150 comments on Day 1 (the hottest window of the conversation)
- Latest activity: Dec 29, 2025 at 3:41 PM EST (4d ago)
At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.
The next step up from that is a good automated test suite.
Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.
Debuggers are good too. They tend to be less coding-agent friendly because they often have directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.
I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.
If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.
> automatic code formatters
I haven't tried this because I assumed it would destroy agent productivity and massively increase the number of tokens needed: you're changing the file out from under the LLM, so it ends up constantly re-reading the changed bits to generate the correct str_replace JSON. Or are they smart enough that this quickly trains them to generate code with zero diff under autoformatting?
I haven't had this problem in a while, but I expect current LLMs would follow those formatting instructions more closely than the 3.5-era models did.
I don't go into Claude without everything already set up. Codex helps me curate the plan and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue, and implement it; then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge between “vanilla agent” and something fancy with just raw agents.
Going full speed ahead building a Rails app from scratch, it seemed like I was spending $50/hour, but it was worth it because the app was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers "Well, here's the codebase, that's all I'll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try, buddy! Now good luck getting those requirements implemented!"
I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.
Horses for courses, I suppose. Back in the day, when I wanted to play with some C(++) library, I'd quite often write a Python C-API extension so I could do the same thing using Python's repl.
The vision models (Claude Opus 4.5, Gemini 3 Pro, GPT-5.2) can even take screenshots via Playwright and then "look at them" with their vision capabilities.
It's a lot of fun to watch. You can even tell them to run Playwright not in headless mode at which point a Chrome window will pop up on your computer and you can see them interact with the site via it.
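For concreteness, here is a minimal sketch of that kind of check using Playwright's Python sync API (the URL, screenshot file name, and assertion are placeholder assumptions, not anything from the thread):

    # Hedged sketch: let an agent drive a real browser and capture evidence it can inspect.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False makes a visible Chrome window pop up so you can watch the agent work.
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("http://localhost:8000/")        # assumed local dev server
        page.screenshot(path="homepage.png")       # a vision-capable model can then "look" at this
        assert "Welcome" in page.content()         # placeholder check for expected page content
        browser.close()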
I've also found that LLMs typically just partially implement a given task/story/spec/whatever. The reviewer stage will also notice a mismatch between the spec and the implementation.
I have an orchestrator bounce the flow back and forth between developing and reviewing until the review comes back clean, and only then do I bother to review its work. It saves so much time and frustration.
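A minimal sketch of that develop/review loop, assuming nothing about any particular agent product (the Developer and Reviewer callables are hypothetical stand-ins for whatever CLI or API you drive):

    # Hedged sketch of an orchestrator that bounces work between a developing
    # agent and an independent reviewing agent until the review comes back clean.
    from typing import Callable, Optional, Tuple

    Developer = Callable[[str, Optional[str]], str]    # (spec, feedback) -> diff
    Reviewer = Callable[[str, str], Tuple[bool, str]]  # (spec, diff) -> (clean?, feedback)

    def orchestrate(spec: str, develop: Developer, review: Reviewer, max_rounds: int = 5) -> str:
        feedback: Optional[str] = None
        for _ in range(max_rounds):
            diff = develop(spec, feedback)
            clean, feedback = review(spec, diff)
            if clean:
                return diff  # only now does a human bother to review the work
        raise RuntimeError("review never converged; escalate to a human")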
I use cookiecutter for this, here's my latest Python library template: https://github.com/simonw/python-lib
We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.
Source code generation is possible due to the large training set and the effort put into reinforcing better outcomes.
I suspect debugging is not that straightforward to LLM-ize.
It's a non-sequential interaction: when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or in logs and gets associated with another abstraction.
I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried ChatGPT, and it was hopeless - try this, add this print, etc. That's not how you debug multi-threaded and async code. When I found the root cause, I analyzed how I did it and where I had learned that specific combination of techniques - each individually well documented, but never in combination - and it came from other people and my own experience.
I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.
They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.
Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.
I suspect that as it becomes more economical to play with training your own models, people will get better at including obscured malicious content in data that will be used during training, which could cause the LLM to intrinsically carry a trigger/path that would cause malicious content to be output by the LLM under certain conditions.
And of course we have to worry about malicious content being added to sources that we trust, but that already exists - we as an industry typically pull in public repositories without a complete review of what we're pulling. We outsource the verification to the owners of the repository. Just as we currently have cases of malicious code sneaking into common libraries, we'll have malicious content targeted at LLMs.
> The next step up from that is a good automated test suite.
And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
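A small sketch of why that pairing is attractive, using Python's Hypothesis library to stand in for QuickCheck-style property tests (the run-length codec is just an illustrative function, not something from the thread):

    # One round-trip property covers a very wide surface of possible inputs.
    from hypothesis import given, strategies as st

    def rle_encode(s: str) -> list[tuple[str, int]]:
        out: list[tuple[str, int]] = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)
            else:
                out.append((ch, 1))
        return out

    def rle_decode(pairs: list[tuple[str, int]]) -> str:
        return "".join(ch * n for ch, n in pairs)

    @given(st.text())
    def test_roundtrip(s: str) -> None:
        assert rle_decode(rle_encode(s)) == s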
The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.
Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell's. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.
It even lets you separate implementation from specification.
I think there are some interesting things going on if you can really tightly lock down the syntax to some simple subset with extremely straightforward, powerful, and expressive typing mechanisms.
I've noticed real advantages of functional languages for agents, for disposable code. Which is great, because we can leverage those without dictating the human's experience.
I think the correct way forward is to choose whatever language the humans on your team agree is most useful. For my personal projects, that means a beautiful language for the bits I'll be touching, and whatever gets the job done elsewhere.
My bet is on refinement types. There are already serious industrial efforts to generate Dafny using LLMs.
Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.
Some of the largest verification efforts have been achieved with this language [1].
[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf
Elevator pitch: "Blocks is a "semantic linter" for human-AI collaboration. Define your domain in YAML, let anyone (humans or AI) write code freely, then validate for drift."
The gist being you define a whole bunch of validators for a collection of modules you are building (with agentic coding), with a focus on qualifying semantic things:
- domain / business rules/measures
- branding
- data flow invariants — "user data never touches analytics without anonymization"
- accessibility
- anything you can think of
And you just tell your agentic coder to use the cli tool before committing, so it keeps the code in line with your engineering/business/philosophical values.
A (boring) example of it detecting whether blog posts have humour in them, running in Claude Code -> https://imgur.com/diKDZ8W
What you don't specify, it must assume. And therein lies a huge landscape of possibilities. And since the AIs can't read your mind (yet), their assumptions will probably not precisely match yours unless the task is very limited in scope.
It's not odd, they've just been trained to give helpful answers straight away.
If you tell them not to make assumptions and to rather first ask you all their questions together with the assumptions they would make because you want to confirm before they write the code, they'll do that too. I do that all the time, and I'll get a list of like 12 questions to confirm/change.
That's the great thing about LLMs -- if you want them to change their behavior, all you need to do is ask.
So I've been exploring the idea of going all-in on this "basic level" of validation. I'm assembling systems out of really small "services" (written in Go) that Claude Code can immediately run and interact with using curl, jq, etc. Plus when building a particular service I already have all of the downstream services (the dependencies) built and running so a lot of dependency management and integration challenges disappear. Only trying this out at a small scale as yet, but it's fascinating how the LLMs can potentially invert a lot of the economics that inform the current conventional wisdom.
(Shameless plug: I write about this here: https://twilightworld.ai/thoughts/atomic-programming/)
My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.
There is still a cost to having bugs, even if deploying fixes becomes much cheaper. Especially if your plan is to wait until they actually occur in practice to discover that you have a bug in the first place.
Put differently: would you want the app responsible for your payroll to be developed in this manner? Especially considering that the bug in question would be "oops, you didn't get paid."
Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...
For the front end, the verification is making sure that the UI looks as expected (which can be verified by an image model) and that clicking certain buttons results in certain things (which can be verified by a ChatGPT agent, but that's not public, I guess).
For back end it will involve firing API requests one by one and verifying the results.
To make this easier, we need to somehow give an environment for Claude or whatever agent to run these verifications in, and this is the gap that is missing. Claude Code and Codex should now start shipping verification environments that make it easy for them to verify frontend and backend tasks, and I think Antigravity already helps a bit here.
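As a sketch of the backend half of that loop (the endpoint paths and fields here are placeholders, not a real API), an agent only needs something it can run and assert against:

    # Hedged sketch: fire API requests one by one and verify the results.
    import requests

    def verify_item_roundtrip(base_url: str) -> None:
        created = requests.post(f"{base_url}/items", json={"name": "test"}, timeout=10).json()
        fetched = requests.get(f"{base_url}/items/{created['id']}", timeout=10).json()
        assert fetched["name"] == "test", "round-trip through the API lost data"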
Why not something like a dependently typed language with an effect system?
The type system can say "this function returns valid Sudoku puzzle solution without accessing the internet or hard drive", and then you just let AI go wild trying to fill in the definition. Once it compiles, it works. The type system can check all of this at compile time.
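One hedged way to picture the correctness-carrying half of that idea is a Lean 4 sketch (Puzzle, Grid, and IsSolution are hypothetical placeholders; tracking the "no internet, no disk" effects would need an effect system on top, which Lean does not provide out of the box):

    -- Placeholder types, introduced only for illustration.
    axiom Puzzle : Type
    axiom Grid : Type
    axiom IsSolution : Puzzle → Grid → Prop

    -- A value of this type is a grid together with a machine-checked proof that
    -- it solves the puzzle, so for this property "once it compiles, it works".
    structure SolvedSudoku (p : Puzzle) where
      grid : Grid
      valid : IsSolution p grid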
So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.
Longer term, I think it's more important to understand what's hard and what's easy for agents.
I like the idea that even languages like Rust and Haskell may become more accessible. Learn them, of course, but an LLM can steer you out of getting stuck.
Some students' work that I had to fix (pre-AI) was crashing all over the place, due to !'s instead of ?'s followed by guard let … {} and if let … {}.
Much better to have AI write deterministic test suites for your project.
AI can help write test code and suggest edge cases, but it shouldn’t be trusted to decide whether behavior is correct.
When software is hard to test, that’s usually a sign the design is too tightly coupled or full of side effects, or that the architecture is unnecessarily complicated. Not that the testing tools are bad.
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover the built in guarantees of rust seem like they are a great start for creating more ergonomic proof systems. Here I am both in awe of Kani, and disappointed by it. The awe is putting in good work to make things more ergonomic. The disappointment is using bounded model checking for formal analysis. That can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning, that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to exclude worries of mutable aliasing. Now you can lean on the borrow checker for that!
This makes me wonder if LLMs works better in Chinese.
It's both. A colorblind person will admit their shortcoming and, if compelled to be helpful like an LLM is, will reason their way to finding a solution that works around their limitations.
But as LLMs lack a way to reason, you get nonsense instead.
This assumes the colorblind person both believes it is true that they are colorblind, in a world where that can be verified, and possesses tools to overcome these limitations.
You have to be much more clever to 'see' an atom before the invention of the microscope; if the tool doesn't exist, most of the time you are SOL.
I suppose we should conclude humans don't reason either, since they do this all the time. Reasoning over blind spots without realizing it is endemic in humans.
LLMs reason differently because their architecture is different, the way they learn is different, and their mode of inference is different. They will be bad at some things we are good at.
And yet their reasoning is far wider than ours (able to weave together virtually any disparate topics sensibly), and almost always better than humans at moderate depth one-shot responses.
Imperfect, not always correct, sometimes dumb. But I have yet to meet this mythical super reliable reasoner, who by their existence shames LLM reasoning abilities so much.
Humans are truly awful at one-shot reasoning.
https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...
It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?
If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
I added LSP support for images to get better feedback loops, and Opus was able to debug github.com/alok/LeanPlot. The entire library was vibe coded by older AI. It also wrote GitHub.com/alok/hexluthor (a hex color syntax highlighting extension that uses Lean's metaprogramming and LSP to show you what color a hex literal is) by using feedback and me saying "keep goign" (yes, I misspelled it).
It has serious issues with slop and the limitations of small data, but the rate of progress is really really fast. Opus 4.5 and Gemini were a huge step change.
The language is also improving very fast, though not as fast as AI.
The feedback loop is very real even for ordinary programming. The model really resists it though because it’s super hard, but again this is rapidly improving.
I started vibe coding Lean about 3 years ago and I’ve used Lean 3 (which was far worse). It’s my favorite language after churning through idk 30?
A big aspect of being successful with them is not being all or nothing with proofs. It’s very useful to write down properties as executable code and then just not prove them because they still have to type check and fit together and make sense. github.com/lecopivo/scilean is a good example (search “sorry_proof”).
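A tiny Lean 4 sketch of that style (the statement is only an illustration): the property is written down as code that must elaborate and type-check, while the proof itself is deferred with sorry.

    -- The statement has to type-check and fit the rest of the development,
    -- even though the proof is postponed (Lean emits a warning for `sorry`).
    theorem reverse_twice (xs : List Nat) :
        xs.reverse.reverse = xs := by
      sorry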
Even the documentation system for lean (verso) is (dependently!) typed.
Check out my Lean vibe codes at https://github.com/alok?tab=repositories&q=Lean&type=&langua...
The frameworks are to improve lean’s programming ecosystem and not just its proving. Metaprogramming is pretty well covered already too, but not ordinary programs.
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
As for those who use LLMs to impersonate humans, which is the kind of verification at issue (verify that a solution purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
So far so good, though the smaller amount of training data is noticeable.
Verification is substantially more challenging.
Currently, even for an expert in the domains of the software to be verified and the process of verification, defining a specification (even partial) is both difficult and tedious. Try reading/comparing the specifications of e.g. a pure crypto function, then a storage or clustering algorithm, then seL4.
(It's possible that brute force specification generation, iteration, and simplification by an LLM might help. It's possible an LLM could help eat away complexity from the other direction, unifying methods and languages, optimising provers, etc.)
The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps that can be taken are just basic logic rules. People can gain more confidence in its validity by:
- Reading it very carefully (doable since it's very small)
- Having multiple independent implementations and compare the results
- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
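To make the "one step at a time, basic rules only" point concrete, here is a hedged toy sketch in Python of a kernel for a propositional fragment with a single rule, modus ponens:

    # Formulas are atoms ("p") or implications ("->", p, q).
    # A proof is a list of steps: ("premise", f) or ("mp", i, j), where step i
    # proved the antecedent and step j proved the implication.
    def check(premises, proof):
        proved = []
        for step in proof:
            if step[0] == "premise":
                if step[1] not in premises:
                    raise ValueError(f"{step[1]!r} is not an assumed premise")
                proved.append(step[1])
            elif step[0] == "mp":
                antecedent, implication = proved[step[1]], proved[step[2]]
                if implication[0] != "->" or implication[1] != antecedent:
                    raise ValueError("modus ponens does not apply")
                proved.append(implication[2])
            else:
                raise ValueError(f"unknown rule {step[0]!r}")
        return proved[-1]  # the formula established by the whole proof

    # From p and p -> q, conclude q.
    assert check(
        premises=["p", ("->", "p", "q")],
        proof=[("premise", "p"), ("premise", ("->", "p", "q")), ("mp", 0, 1)],
    ) == "q"

Real kernels are of course richer, but this is the shape that keeps them small enough to read very carefully.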
Kernels are usually verified by humans. A good design can make them very small: 500 to 5000 lines of code. Systems designers brag about how small their kernels are!
The kernel could be proved correct by another system, which introduces another turtle below. Or the kernel can be reflected upwards and proved by the system itself. That is virtually putting the bottom turtle on top of a turtle higher in the stack. It will find some problems, but it still leaves the possibility that the kernel has a flaw that accepts bad proofs, including the kernel itself.
There is a problem with this argument, similar to one made about imagining the future possibilities of vibe coding: once we imagine AI doing this task, i.e. fully automatically proving software correct and/or reliably writing reliable software (and the two aren't all that different), we can just as easily imagine it not having to do these things in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better.
In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place.
Of course, we can imagine any conceivable capability or limitation we like, but then we are literally entering the realm of science fiction.
TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.
But that is much simpler to understand: eventually, finding a proof using guided search (machines searching for proofs, multiple inference attempts) takes more effort than verifying a proof. Formal verification does not disappear, because communicating a valuable succinct proof is much cheaper than having to search for the proof anew. Proofs will become the inevitable lingua franca for computers as well, as they already are among capable humans. Basic economics will result in the adoption of formal verification.
Whenever humans found an original proof, their notes contained a lot of deductions that were ultimately not used; they were searching for a proof, using intuition gained from reading and finding proofs of other theorems. LLMs are similarly gaining intuition, and at some point they will become better than humans at finding proofs. They are currently already much better than the average human at finding proofs. The question is how long it takes until they get better than any human being at finding proofs.
The future you see, where the whole proving exercise (whether by humans or by LLMs) becomes redundant because the model immediately emits the right code, is nonsensical: the frontier of what LLMs are capable of will move gradually, so for each generation of LLMs there will always be problems for which it cannot instantly generate provably correct software (while omitting the according-to-you-unnecessary proof). That doesn't mean they can't find the proofs, just that they would have to search by reasoning, with no guarantee of ever finding one.
That search heuristic is Las Vegas, not Monte Carlo.
Companies will compare the levelized operating costs of different LLM's to decide which LLMs to use in the future on hard proving tasks.
Satellite data centers will consume ever more resources in a combined space/logic race for cryptographic breakthroughs.
But AI isn't used to verify the proof. It's used to find it. And if it can find it - one of the hardest things in software development - there's no reason to believe it can't do anything else associated with software development. If AI agents find formal verification helpful, they would probably opt to use it, but there would also be no need for a human in the loop at all.
> It is currently already much better than the average human at finding proofs.
The average human also can't write hello world. LLMs are currently significantly worse at finding proofs than the average formal-verification person (I say this as someone who's done formal verification for many years), though, just as they're significantly worse at writing code than the average programmer. I'm not saying they won't become better, it's just strange to expect that they'll become better than the average formal-verification person and at the same time they won't be better than the average product manager.
> there will always be problems it can not instantly generate provably correct software
Nobody said anything about "instantly". If the AI finds formal verification helpful, it will choose to use it, but if it can find proofs better than humans, why expect that it won't be able to do easier tasks better than humans?
I know this is just an imprecision-of-language thing, but they aren't 'proving' the software is correct; they're writing the proofs instead of the C++ (or whatever).
I had a bit of a discussion with one of them about this a while ago, to determine the viability of having one generate the proofs and use those to generate the actual code - just another abstraction over the compiler. The main takeaway I got from that (which may or may not be the way to do it) is to use the 'result' to do differential testing or to generate the test suite, but that was (maybe, I don't remember) in the context of proving existing software correct.
I mean, if they get to the point where they can prove an entire codebase is correct just in their robot brains I think we'll probably have a lot bigger things to worry about...
When I recently booted up Google Antigravity and had it make a change to a backend routine for a web site, I was quite surprised when it opened Chrome, navigated to the page, and started trying out the changes to see if they had worked. It was janky as hell, but a year from now it won't be.
It's also possible that LLMs can, by themselves, prove the correctness of some small subroutines, and produce a formal proof that you can check in a proof checker, provided you can at least read and understand the statement of the proposition.
This can certainly make formal verification easier, but not necessarily more mainstream.
But once we extrapolate the existing abilities to something that can reliably verify real large or medium-sized programs for a human who cannot read and understand the propositions (and the necessary simplifying assumptions), it's hard to see a machine do that and at the same time not be able to do everything else.
Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.
It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically is a nightmare. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all the main cases.
And then there's the fact that most software can tolerate bugs... If big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.
A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.
It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.
What I'm hearing is quite literally "I have a bridge to sell you."
People spend gobs of money on human security auditors who don't necessarily catch everything either, so verification easily fits in the budget. And once deployed, the code can't be changed.
Verification has also been used in embedded safety-critical code.
They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).
To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.
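For a sense of what that looks like, here is the contrived example written out in TLA-style temporal logic (pc, Done, and Inv are placeholder identifiers; this is notation about a program, not an executable program):

    % Termination as a liveness property: eventually the program counter reaches Done.
    \Diamond\,(pc = \mathit{Done})
    % Contrast with a safety property: an invariant that always holds from now on.
    \Box\,(\mathit{Inv})

Neither formula can be run; both are claims about a program that a checker or a proof can establish.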
Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate, versus giving them a framework or a language that does not have a for loop?
You really have to be able to reduce your models to: "at some point in the future, this will happen," or "it will always be true from now on."
Have probabilistic outcomes? Or even floats [0]? It becomes challenging, and strings are a mess.
> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.
TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.
[0] https://learntla.com/core/operators.html
Formal methods are really broad, ranging from lightweight type systems to theorem proving. Some techniques are fantastic for one type of problem but fail at another one. This is quite natural, the same thing happens with different programming paradigms.
[1] https://www.prismmodelchecker.org
[2] https://www.stormchecker.org
(Then again, AIUI it's basically a thin wrapper over stochastic matrices, so maybe that's asking too much...)
You really don't. It's not LTL. Abstraction/refinement relations are built into core TLA.
> Or even floats [0] and it becomes challenging and strings are a mess.
No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.
> TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.
TLA+ can specify anything that could be specified in mathematics. You're talking about the verification tools you can run on a TLA+ spec, and like all verification tools, they have their limitations. I never claimed otherwise.
I think it's disingenuous to say that TLA+ verifiers "may or may not have limitations" wrt floats when none of the available tools support floats. People should know going in that they won't be able to verify specs with floats!
This has burned me before, when I e.g. needed to take the mean of a sequence.
I don't know the state of contemporary model checkers that work with theories of reals and/or FP, and I'm sure you're much more familiar with that than me, but I believe that when it comes to numeric computation, deductive proofs or "sampling tests" (such as property-based testing) are still more common than model checking. It could be interesting to add a random sampling mode to TLC that could simulate many operations on reals using BigDecimal internally.
Might as well just learn Agda or Lean. There are good books out there. It’s not as hard as the author suggests. Hard, yes, but there’s no Royal road.
With a domain specific language you can add extra limitations on the kinds of outputs. This can also make formal verification faster.
Maybe like React components: you limit the format. You could limit which libraries can be imported, which hooks can be used, how expressive it can be.
Semgrep isn't a formal methods tool, but it's in the same space of rigor-improving tooling that sounds great but in practice is painful to use consistently.
If anyone does write a specification, the "AI" won't even get past the termination proof of a moderately complex function, which is the first step of accepting said function in the proof environment. Before you can even start the actual proof.
This article is pretty low on evidence, perhaps it is about getting funding by talking about "AI".
Future: AI generates incorrect code, and formal verification that proves that the code performs that incorrect behaviour.
The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
265 more comments available on Hacker News