AI Will Make Formal Verification Go Mainstream
Key topics
The debate around AI-assisted coding and formal verification has sparked a lively discussion, with some commenters convinced that AI will revolutionize the field, while others express skepticism. Proponents argue that AI coding agents, like Claude Code, can produce high-quality code when paired with robust testing and validation mechanisms, with some noting that agents can even conform to specific coding styles like Black. However, others point out that AI agents still struggle with complex issues, such as ownership and trait problems in Rust, and can produce non-conforming code despite explicit instructions. As the cost of using these AI tools is discussed, estimates range from $200 to $600 per month for heavy users, sparking curiosity about the financial investment required to leverage AI-assisted coding effectively.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 21m after posting
- Peak period: Day 1 (150 comments)
- Avg / period: 26.7 comments
Based on 160 loaded comments
Key moments
- Story posted: Dec 16, 2025 at 4:14 PM EST (17 days ago)
- First comment: Dec 16, 2025 at 4:35 PM EST (21m after posting)
- Peak activity: 150 comments on Day 1 (the hottest window of the conversation)
- Latest activity: Dec 29, 2025 at 3:41 PM EST (4d ago)
At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.
The next step up from that is a good automated test suite.
Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.
Debuggers are good too. They tend to be less coding-agent friendly because they often have directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.
I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.
If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.
> automatic code formatters
I haven't tried this because I assumed it would destroy agent productivity and massively increase the number of tokens needed: you're changing the file out from under the LLM, so it ends up constantly re-reading the changed bits to generate the correct str_replace JSON. Or are they smart enough that this quickly trains them to generate code with zero diff under autoformatting?
I haven't had this problem in a while, but I expect current LLMs would follow those formatting instructions more closely than the 3.5-era models did.
I don't go into Claude without everything already set up. Codex helps me curate the plan and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue, and implement it; then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge between “vanilla agent” and something fancy with just raw agents.
Going full speed ahead building a Rails app from scratch, it seemed like I was spending $50/hour, but it was worth it because the app was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers "Well, here's the codebase, that's all I'll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try, buddy! Now good luck getting those requirements implemented!"
I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.
Horses for courses, I suppose. Back in the day, when I wanted to play with some C(++) library, I'd quite often write a Python C-API extension so I could do the same thing using Python's repl.
The vision models (Claude Opus 4.5, Gemini 3 Pro, GPT-5.2) can even take screenshots via Playwright and then "look at them" with their vision capabilities.
It's a lot of fun to watch. You can even tell them to run Playwright not in headless mode at which point a Chrome window will pop up on your computer and you can see them interact with the site via it.
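For concreteness, here is a minimal sketch of that kind of check using Playwright's Python sync API (the URL, screenshot file name, and assertion are placeholder assumptions, not anything from the thread):

    # Hedged sketch: let an agent drive a real browser and capture evidence it can inspect.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False makes a visible Chrome window pop up so you can watch the agent work.
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("http://localhost:8000/")        # assumed local dev server
        page.screenshot(path="homepage.png")       # a vision-capable model can then "look" at this
        assert "Welcome" in page.content()         # placeholder check for expected page content
        browser.close()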
I've also found that LLMs typically just partially implement a given task/story/spec/whatever. The reviewer stage will also notice a mismatch between the spec and the implementation.
I have an orchestrator bounce the flow back and forth between developing and reviewing until the review comes back clean, and only then do I bother to review its work. It saves so much time and frustration.
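A minimal sketch of that develop/review loop, assuming nothing about any particular agent product (the Developer and Reviewer callables are hypothetical stand-ins for whatever CLI or API you drive):

    # Hedged sketch of an orchestrator that bounces work between a developing
    # agent and an independent reviewing agent until the review comes back clean.
    from typing import Callable, Optional, Tuple

    Developer = Callable[[str, Optional[str]], str]    # (spec, feedback) -> diff
    Reviewer = Callable[[str, str], Tuple[bool, str]]  # (spec, diff) -> (clean?, feedback)

    def orchestrate(spec: str, develop: Developer, review: Reviewer, max_rounds: int = 5) -> str:
        feedback: Optional[str] = None
        for _ in range(max_rounds):
            diff = develop(spec, feedback)
            clean, feedback = review(spec, diff)
            if clean:
                return diff  # only now does a human bother to review the work
        raise RuntimeError("review never converged; escalate to a human")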
I use cookiecutter for this, here's my latest Python library template: https://github.com/simonw/python-lib
We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.
Source code generation is possible due to the large training set and the effort put into reinforcing better outcomes.
I suspect debugging is not that straightforward to LLM-ize.
It's a non-sequential interaction: when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or in logs and gets associated with another abstraction.
I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried ChatGPT, and it was hopeless - try this, add this print, etc. That's not how you debug multi-threaded and async code. When I found the root cause, I analyzed how I did it and where I had learned that specific combination of techniques - each individually well documented, but never in combination - and it came from other people and my own experience.
I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.
They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.
Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.
I suspect that as it becomes more economical to play with training your own models, people will get better at including obscured malicious content in data that will be used during training, which could cause the LLM to intrinsically carry a trigger/path that would cause malicious content to be output by the LLM under certain conditions.
And of course we have to worry about malicious content being added to sources that we trust, but that already exists - we as an industry typically pull in public repositories without a complete review of what we're pulling. We outsource the verification to the owners of the repository. Just as we currently have cases of malicious code sneaking into common libraries, we'll have malicious content targeted at LLMs.
> The next step up from that is a good automated test suite.
And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
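A small sketch of why that pairing is attractive, using Python's Hypothesis library to stand in for QuickCheck-style property tests (the run-length codec is just an illustrative function, not something from the thread):

    # One round-trip property covers a very wide surface of possible inputs.
    from hypothesis import given, strategies as st

    def rle_encode(s: str) -> list[tuple[str, int]]:
        out: list[tuple[str, int]] = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)
            else:
                out.append((ch, 1))
        return out

    def rle_decode(pairs: list[tuple[str, int]]) -> str:
        return "".join(ch * n for ch, n in pairs)

    @given(st.text())
    def test_roundtrip(s: str) -> None:
        assert rle_decode(rle_encode(s)) == s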
The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.
Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell's. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.
It even lets you separate implementation from specification.
I think there are some interesting things going on if you can really tightly lock down the syntax to some simple subset with extremely straightforward, powerful, and expressive typing mechanisms.
I've noticed real advantages of functional languages for agents, for disposable code. Which is great, because we can leverage those without dictating the human's experience.
I think the correct way forward is to choose whatever language the humans on your team agree is most useful. For my personal projects, that means a beautiful language for the bits I'll be touching, and whatever gets the job done elsewhere.
My bet is on refinement types. There are already serious industrial efforts to generate Dafny using LLMs.
Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.
Some of the largest verification efforts have been achieved with this language [1].
[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf
Elevator pitch: "Blocks is a "semantic linter" for human-AI collaboration. Define your domain in YAML, let anyone (humans or AI) write code freely, then validate for drift."
The gist being you define a whole bunch of validators for a collection of modules you are building (with agentic coding), with a focus on qualifying semantic things:
- domain / business rules/measures
- branding
- data flow invariants — "user data never touches analytics without anonymization"
- accessibility
- anything you can think of
And you just tell your agentic coder to use the cli tool before committing, so it keeps the code in line with your engineering/business/philosophical values.
A (boring) example of it detecting whether blog posts have humour in them, running in Claude Code -> https://imgur.com/diKDZ8W
What you don't specify, it must assume. And therein lies a huge landscape of possibilities. And since the AIs can't read your mind (yet), their assumptions will probably not precisely match yours unless the task is very limited in scope.
It's not odd, they've just been trained to give helpful answers straight away.
If you tell them not to make assumptions and to rather first ask you all their questions together with the assumptions they would make because you want to confirm before they write the code, they'll do that too. I do that all the time, and I'll get a list of like 12 questions to confirm/change.
That's the great thing about LLMs -- if you want them to change their behavior, all you need to do is ask.
So I've been exploring the idea of going all-in on this "basic level" of validation. I'm assembling systems out of really small "services" (written in Go) that Claude Code can immediately run and interact with using curl, jq, etc. Plus when building a particular service I already have all of the downstream services (the dependencies) built and running so a lot of dependency management and integration challenges disappear. Only trying this out at a small scale as yet, but it's fascinating how the LLMs can potentially invert a lot of the economics that inform the current conventional wisdom.
(Shameless plug: I write about this here: https://twilightworld.ai/thoughts/atomic-programming/)
My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.
There is still a cost to having bugs, even if deploying fixes becomes much cheaper. Especially if your plan is to wait until they actually occur in practice to discover that you have a bug in the first place.
Put differently: would you want the app responsible for your payroll to be developed in this manner? Especially considering that the bug in question would be "oops, you didn't get paid."
Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...
For the front end, the verification is making sure that the UI looks as expected (which can be verified by an image model) and that clicking certain buttons results in certain things (which can be verified by a ChatGPT agent, but that's not public, I guess).
For back end it will involve firing API requests one by one and verifying the results.
To make this easier, we need to somehow give an environment for Claude or whatever agent to run these verifications in, and this is the gap that is missing. Claude Code and Codex should now start shipping verification environments that make it easy for them to verify frontend and backend tasks, and I think Antigravity already helps a bit here.
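As a sketch of the backend half of that loop (the endpoint paths and fields here are placeholders, not a real API), an agent only needs something it can run and assert against:

    # Hedged sketch: fire API requests one by one and verify the results.
    import requests

    def verify_item_roundtrip(base_url: str) -> None:
        created = requests.post(f"{base_url}/items", json={"name": "test"}, timeout=10).json()
        fetched = requests.get(f"{base_url}/items/{created['id']}", timeout=10).json()
        assert fetched["name"] == "test", "round-trip through the API lost data"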
Why not something like a dependently typed language with an effect system?
The type system can say "this function returns valid Sudoku puzzle solution without accessing the internet or hard drive", and then you just let AI go wild trying to fill in the definition. Once it compiles, it works. The type system can check all of this at compile time.
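One hedged way to picture the correctness-carrying half of that idea is a Lean 4 sketch (Puzzle, Grid, and IsSolution are hypothetical placeholders; tracking the "no internet, no disk" effects would need an effect system on top, which Lean does not provide out of the box):

    -- Placeholder types, introduced only for illustration.
    axiom Puzzle : Type
    axiom Grid : Type
    axiom IsSolution : Puzzle → Grid → Prop

    -- A value of this type is a grid together with a machine-checked proof that
    -- it solves the puzzle, so for this property "once it compiles, it works".
    structure SolvedSudoku (p : Puzzle) where
      grid : Grid
      valid : IsSolution p grid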
So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.
Longer term, I think it's more important to understand what's hard and what's easy for agents.
I like the idea that even languages like Rust and Haskell may become more accessible. Learn them, of course, but an LLM can steer you out of getting stuck.
Some students' work that I had to fix (pre-AI) was crashing all over the place, due to !'s instead of ?'s followed by guard let … {} and if let … {}.
Much better to have AI write deterministic test suites for your project.
AI can help write test code and suggest edge cases, but it shouldn’t be trusted to decide whether behavior is correct.
When software is hard to test, that’s usually a sign the design is too tightly coupled or full of side effects, or that the architecture is unnecessarily complicated. Not that the testing tools are bad.
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover the built in guarantees of rust seem like they are a great start for creating more ergonomic proof systems. Here I am both in awe of Kani, and disappointed by it. The awe is putting in good work to make things more ergonomic. The disappointment is using bounded model checking for formal analysis. That can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning, that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to exclude worries of mutable aliasing. Now you can lean on the borrow checker for that!
This makes me wonder if LLMs works better in Chinese.
It's both. A colorblind person will admit their shortcoming and, if compelled to be helpful like an LLM is, will reason their way to finding a solution that works around their limitations.
But as LLMs lack a way to reason, you get nonsense instead.
This assumes the colorblind person both believes it is true that they are colorblind, in a world where that can be verified, and possesses tools to overcome these limitations.
You have to be much more clever to 'see' an atom before the invention of the microscope; if the tool doesn't exist, most of the time you are SOL.
I suppose we should conclude humans don't reason either, since they do this all the time. Reasoning over blind spots without realizing it is endemic in humans.
LLMs reason differently because their architecture is different, the way they learn is different, and their mode of inference is different. They will be bad at some things we are good at.
And yet their reasoning is far wider than ours (able to weave together virtually any disparate topics sensibly), and almost always better than humans at moderate depth one-shot responses.
Imperfect, not always correct, sometimes dumb. But I have yet to meet this mythical super reliable reasoner, who by their existence shames LLM reasoning abilities so much.
Humans are truly awful at one-shot reasoning.
https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...
It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?
If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
I added LSP support for images to get better feedback loops, and Opus was able to debug github.com/alok/LeanPlot. The entire library was vibe coded by older AI. It also wrote GitHub.com/alok/hexluthor (a hex color syntax highlighting extension that uses Lean's metaprogramming and LSP to show you what color a hex literal is) by using feedback and me saying "keep goign" (yes, I misspelled it).
It has serious issues with slop and the limitations of small data, but the rate of progress is really really fast. Opus 4.5 and Gemini were a huge step change.
The language is also improving very fast, though not as fast as AI.
The feedback loop is very real even for ordinary programming. The model really resists it though because it’s super hard, but again this is rapidly improving.
I started vibe coding Lean about 3 years ago and I’ve used Lean 3 (which was far worse). It’s my favorite language after churning through idk 30?
A big aspect of being successful with them is not being all or nothing with proofs. It’s very useful to write down properties as executable code and then just not prove them because they still have to type check and fit together and make sense. github.com/lecopivo/scilean is a good example (search “sorry_proof”).
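A tiny Lean 4 sketch of that style (the statement is only an illustration): the property is written down as code that must elaborate and type-check, while the proof itself is deferred with sorry.

    -- The statement has to type-check and fit the rest of the development,
    -- even though the proof is postponed (Lean emits a warning for `sorry`).
    theorem reverse_twice (xs : List Nat) :
        xs.reverse.reverse = xs := by
      sorry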
Even the documentation system for lean (verso) is (dependently!) typed.
Check out my Lean vibe codes at https://github.com/alok?tab=repositories&q=Lean&type=&langua...
The frameworks are to improve lean’s programming ecosystem and not just its proving. Metaprogramming is pretty well covered already too, but not ordinary programs.
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
As for those who use LLMs to impersonate humans, which is the kind of verification at issue (verify that a solution purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
So far so good, though the smaller amount of training data is noticeable.
Verification is substantially more challenging.
Currently, even for an expert in the domains of the software to be verified and the process of verification, defining a specification (even partial) is both difficult and tedious. Try reading/comparing the specifications of e.g. a pure crypto function, then a storage or clustering algorithm, then seL4.
(It's possible that brute force specification generation, iteration, and simplification by an LLM might help. It's possible an LLM could help eat away complexity from the other direction, unifying methods and languages, optimising provers, etc.)
The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps that can be taken are just basic logic rules. People can gain more confidence in its validity by:
- Reading it very carefully (doable since it's very small)
- Having multiple independent implementations and compare the results
- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
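To make the "one step at a time, basic rules only" point concrete, here is a hedged toy sketch in Python of a kernel for a propositional fragment with a single rule, modus ponens:

    # Formulas are atoms ("p") or implications ("->", p, q).
    # A proof is a list of steps: ("premise", f) or ("mp", i, j), where step i
    # proved the antecedent and step j proved the implication.
    def check(premises, proof):
        proved = []
        for step in proof:
            if step[0] == "premise":
                if step[1] not in premises:
                    raise ValueError(f"{step[1]!r} is not an assumed premise")
                proved.append(step[1])
            elif step[0] == "mp":
                antecedent, implication = proved[step[1]], proved[step[2]]
                if implication[0] != "->" or implication[1] != antecedent:
                    raise ValueError("modus ponens does not apply")
                proved.append(implication[2])
            else:
                raise ValueError(f"unknown rule {step[0]!r}")
        return proved[-1]  # the formula established by the whole proof

    # From p and p -> q, conclude q.
    assert check(
        premises=["p", ("->", "p", "q")],
        proof=[("premise", "p"), ("premise", ("->", "p", "q")), ("mp", 0, 1)],
    ) == "q"

Real kernels are of course richer, but this is the shape that keeps them small enough to read very carefully.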
Kernels are usually verified by humans. A good design can make them very small: 500 to 5000 lines of code. Systems designers brag about how small their kernels are!
The kernel could be proved correct by another system, which introduces another turtle below. Or the kernel can be reflected upwards and proved by the system itself. That is virtually putting the bottom turtle on top of a turtle higher in the stack. It will find some problems, but it still leaves the possibility that the kernel has a flaw that accepts bad proofs, including the kernel itself.
There is a problem with this argument, similar to one made about imagining the future possibilities of vibe coding: once we imagine AI doing this task, i.e. fully automatically proving software correct and/or reliably writing reliable software (and the two aren't all that different), we can just as easily imagine it not having to do these things in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better.
In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place.
Of course, we can imagine any conceivable capability or limitation we like, but then we are literally entering the realm of science fiction.
TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.
But that is much simpler to understand: eventually, finding a proof using guided search (machines searching for proofs, multiple inference attempts) takes more effort than verifying a proof. Formal verification does not disappear, because communicating a valuable succinct proof is much cheaper than having to search for the proof anew. Proofs will become the inevitable lingua franca for computers as well, as they already are among capable humans. Basic economics will result in the adoption of formal verification.
Whenever humans found an original proof, their notes contained a lot of deductions that were ultimately not used; they were searching for a proof, using intuition gained from reading and finding proofs of other theorems. LLMs are similarly gaining intuition, and at some point they will become better than humans at finding proofs. They are currently already much better than the average human at finding proofs. The question is how long it takes until they get better than any human being at finding proofs.
The future you see, where the whole proving exercise (whether by humans or by LLMs) becomes redundant because the model immediately emits the right code, is nonsensical: the frontier of what LLMs are capable of will move gradually, so for each generation of LLMs there will always be problems for which it cannot instantly generate provably correct software (while omitting the according-to-you-unnecessary proof). That doesn't mean they can't find the proofs, just that they would have to search by reasoning, with no guarantee of ever finding one.
That search heuristic is Las Vegas, not Monte Carlo.
Companies will compare the levelized operating costs of different LLM's to decide which LLMs to use in the future on hard proving tasks.
Satellite data centers will consume ever more resources in a combined space/logic race for cryptographic breakthroughs.
But AI isn't used to verify the proof. It's used to find it. And if it can find it - one of the hardest things in software development - there's no reason to believe it can't do anything else associated with software development. If AI agents find formal verification helpful, they would probably opt to use it, but there would also be no need for a human in the loop at all.
> It is currently already much better than the average human at finding proofs.
The average human also can't write hello world. LLMs are currently significantly worse at finding proofs than the average formal-verification person (I say this as someone who's done formal verification for many years), though, just as they're significantly worse at writing code than the average programmer. I'm not saying they won't become better, it's just strange to expect that they'll become better than the average formal-verification person and at the same time they won't be better than the average product manager.
> there will always be problems it can not instantly generate provably correct software
Nobody said anything about "instantly". If the AI finds formal verification helpful, it will choose to use it, but if it can find proofs better than humans, why expect that it won't be able to do easier tasks better than humans?
I know this is just an imprecision-of-language thing, but they aren't 'proving' the software is correct; they're writing the proofs instead of the C++ (or whatever).
I had a bit of a discussion with one of them about this a while ago, to determine the viability of having one generate the proofs and use those to generate the actual code - just another abstraction over the compiler. The main takeaway I got from that (which may or may not be the way to do it) is to use the 'result' to do differential testing or to generate the test suite, but that was (maybe, I don't remember) in the context of proving existing software correct.
I mean, if they get to the point where they can prove an entire codebase is correct just in their robot brains I think we'll probably have a lot bigger things to worry about...
When I recently booted up Google Antigravity and had it make a change to a backend routine for a web site, I was quite surprised when it opened Chrome, navigated to the page, and started trying out the changes to see if they had worked. It was janky as hell, but a year from now it won't be.
It's also possible that LLMs can, by themselves, prove the correctness of some small subroutines, and produce a formal proof that you can check in a proof checker, provided you can at least read and understand the statement of the proposition.
This can certainly make formal verification easier, but not necessarily more mainstream.
But once we extrapolate the existing abilities to something that can reliably verify real large or medium-sized programs for a human who cannot read and understand the propositions (and the necessary simplifying assumptions), it's hard to see a machine do that and at the same time not be able to do everything else.
Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.
It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically is a nightmare. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all the main cases.
And then there's the fact that most software can tolerate bugs... If big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.
A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.
It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.
What I'm hearing is quite literally "I have a bridge to sell you."
People spend gobs of money on human security auditors who don't necessarily catch everything either, so verification easily fits in the budget. And once deployed, the code can't be changed.
Verification has also been used in embedded safety-critical code.
They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).
To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.
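For a sense of what that looks like, here is the contrived example written out in TLA-style temporal logic (pc, Done, and Inv are placeholder identifiers; this is notation about a program, not an executable program):

    % Termination as a liveness property: eventually the program counter reaches Done.
    \Diamond\,(pc = \mathit{Done})
    % Contrast with a safety property: an invariant that always holds from now on.
    \Box\,(\mathit{Inv})

Neither formula can be run; both are claims about a program that a checker or a proof can establish.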
Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate, versus giving them a framework or a language that does not have a for loop?
You really have to be able to reduce your models to: "at some point in the future, this will happen," or "it will always be true from now on."
Have probabilistic outcomes? Or even floats [0]? It becomes challenging, and strings are a mess.
> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.
TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.
[0] https://learntla.com/core/operators.html
Formal methods are really broad, ranging from lightweight type systems to theorem proving. Some techniques are fantastic for one type of problem but fail at another one. This is quite natural, the same thing happens with different programming paradigms.
[1] https://www.prismmodelchecker.org
[2] https://www.stormchecker.org
(Then again, AIUI it's basically a thin wrapper over stochastic matrices, so maybe that's asking too much...)
You really don't. It's not LTL. Abstraction/refinement relations are built into core TLA.
> Or even floats [0] and it becomes challenging and strings are a mess.
No problem with floats or strings as far as specification goes. The particular verification tools you choose to run on your TLA+ spec may or may not have limitations in these areas, though.
> TLA+ works for the problems it is suitable for, try and extend past that and it simply fails.
TLA+ can specify anything that could be specified in mathematics. You're talking about the verification tools you can run on a TLA+ spec, and like all verification tools, they have their limitations. I never claimed otherwise.
I think it's disingenuous to say that TLA+ verifiers "may or may not have limitations" wrt floats when none of the available tools support floats. People should know going in that they won't be able to verify specs with floats!
This has burned me before, when I e.g. needed to take the mean of a sequence.
I don't know the state of contemporary model checkers that work with theories of reals and/or FP, and I'm sure you're much more familiar with that than me, but I believe that when it comes to numeric computation, deductive proofs or "sampling tests" (such as property-based testing) are still more common than model checking. It could be interesting to add a random sampling mode to TLC that could simulate many operations on reals using BigDecimal internally.
Might as well just learn Agda or Lean. There are good books out there. It’s not as hard as the author suggests. Hard, yes, but there’s no Royal road.
With a domain specific language you can add extra limitations on the kinds of outputs. This can also make formal verification faster.
Maybe like React components: you limit the format. You could limit which libraries can be imported, which hooks can be used, how expressive it can be.
Semgrep isn't a formal methods tool, but it's in the same space of rigor-improving tooling that sounds great but in practice is painful to use consistently.
If anyone does write a specification, the "AI" won't even get past the termination proof of a moderately complex function, which is the first step of accepting said function in the proof environment. Before you can even start the actual proof.
This article is pretty low on evidence, perhaps it is about getting funding by talking about "AI".
Future: AI generates incorrect code, and formal verification that proves that the code performs that incorrect behaviour.
The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
265 more comments available on Hacker News