Fawk: Llms Can Write a Language Interpreter

about 2 months ago

2 replies

Yes! I'm currently using copilot + antigravity to implement a language with ergonomic syntax and semantics that lowers cleanly to machine code targeting multiple platforms, with a focus on safety, determinism, auditability and fail-fast bugs. It's more work than I thought but the LLMs are very capable.

I was dreaming of a JS to machine code, but then thought, why not just start from scratch and have what I want? It's a lot of fun.

lionkor

about 2 months ago

4 replies

Curious why you do this with AI instead of just writing it yourself?

You should be able to whip up a Lexer, Parser and compiler with a couple weeks of time.

TechDebtDevin

about 2 months ago

Because this is someone in a "spiral" or "AI psychosis" Its pretty clear by how they are talking.

epolanski

about 2 months ago

I'm not the previous user, but I imagine that weeks of investment might be a commitment one does not have.

I have implemented an interpreter for a very basic stack-based language (you can imagine it being one of the simplest interpreters you can have) and it took me a lot of time and effort to have something solid and functional.

Thus I can absolutely relate to the idea of having an LLM who's seen many interpreters lay out the ground for you and make you play as quickly as possible with your ideas while procrastinating delving in details till necessary.

about 2 months ago

It would be very new to me. I'd have to learn a lot to do that. And I can't spare the time or attention. It's more of a fun side project.

The machine code would also be tedious, tho fun. But I really can't spare the time for it.

My_Name

about 2 months ago

Because he did it in a day, not a few weeks.

If I want to go from Bristol to Swindon, I could walk there in about 12 hours. It's totally possible to do it by foot. Or I could use a car and be there in an hour. There and back, with a full work day in-between done, in a day. Using the tool doesn't change what you can do, it speeds up getting the end result.

64718283661

about 2 months ago

4 replies

What's the point of making something like this if you don't get to deeply understand what your doing?

ModernMech

about 2 months ago

1 reply

If they go far enough with it they will be forced to understand it deeply. The LLM provides more leverage at the beginning because this project is a final exam for a first semester undergrad PL course, therefore there are a billion examples of “vaguely Java/Python/C imperative language with objects and functions” to train the LLM on.

Ultimately though, the LLM is going to become less useful as the language grows past its capabilities. If the language author doesn’t have a sufficient map of the language and a solid plan at that point, it will be the blind leading the blind. Which is how most lang dev goes so it should all work out.

https://github.com/artpar/jslike

about 2 months ago

Lol thank you for this. It’s more worth I work than i thought!

My_Name

about 2 months ago

What's the point of owning a car if you don't build it by hand yourself?

Anyway, all it will do is stop you being able to run as well as you used to be able to do when you had to go everywhere on foot.

johnisgood

about 2 months ago

I have made a lot of things using LLMs and I fully understood everything. It is doable.

afpx

about 2 months ago

How deep do you need to know?

"Imagination is more important than knowledge."

At least for me that fits. I have quite enough graduate-level knowledge of physics, math, and computer science to rarely be stumped by a research paper or anything an LLM spits out. That may get me scorn from those tested on those subjects. Yet, I'm still an effective ignoramus.

skydhash

about 2 months ago

2 replies

Commendable effort, but I expected at least a demo, which would showcase working code (even if it’s hacky). It’s like someone talking about a sheet music without playing it once.

epolanski

about 2 months ago

Even more, it's like talking about a sheet without seeing the sheet itself.

johnisgood

about 2 months ago

See https://github.com/Janiczek/fawk and .fawk files in https://github.com/Janiczek/fawk/tree/main/tests.

artpar

about 2 months ago

1 reply

I wrote two

jslike (acorn based parser)

https://www.npmjs.com/package/jslike

wang-lang ( i couldn't get ASI to work like javascript in this nearley based grammar )

https://www.npmjs.com/package/wang-lang

https://artpar.github.io/wang/playground.html

https://github.com/artpar/wang

about 2 months ago

wang-lang? Is that a naughty language?

about 2 months ago

2 replies

It's a fun post, and I love language experiments with LLMs (I'm close to hitting the weekly limit of my Claude Max subscription because I have a near-constantly running session working on my Ruby compiler; Claude can fix -- albeit with messy code sometimes -- issues that requires complex tracing of backtraces with gdb, and fix complex parser interactions almost entirely unaided as long as it has a test suite to run).

But here's the Ruby version of one of the scripts:

    BEGIN {
      result = [1, 2, 3, 4, 5]
        .filter {|x| x % 2 == 0 }
        .map {|x| x * x}
        .reduce {|acc,x| acc + x }
     puts "Result: #{result}"
    }

The point being that running a script with the "-n" switch un runs BEGIN/END blocks and puts an implicit "while gets ... end" around the rest. Adding "-a" auto-splits the line like awk. Adding "-p" also prints $_ at the end of each iteration.

So here's a more typical Awk-like experience:

    ruby -pe '$_.upcase!' somefile.txt ($_ has the whole line)

Or:

    ruby -F, -ane '$F[1]' # Extracts the second field field -F sets the default character to split on, and -a adds an implicit $F = $_.split.

That is not to detract from what he's doing because it's fun. But if your goal is just to use a better Awk, then Ruby is usually better Awk, and so, for that matter, is Perl, and for most things where an Awk script doesn't fit on the command line the only reason to really use Awk is that it is more likely to be available.

about 2 months ago

3 replies

So I have had to work very hard to use $80 worth of my $250 free Claude code credits. What am I doing wrong?

sceptic123

about 2 months ago

2 replies

> free

how do you get free credits?

throwup238

about 2 months ago

1 reply

They were given out for the Claude Code on Web launch. Mine expired November 18 (but I managed to use them all before then).

about 2 months ago

Mine were set to expire then but got extended to the 23.

about 2 months ago

1 reply

Pro users got $250 and max users got $1000

sceptic123

about 2 months ago

1 reply

Okay, so not really free then, more bonus credits

about 2 months ago

Just a "free sample" to get you hooked

about 2 months ago

Run it with --dangerously-skip-permissions, give it a large test suite, and keep telling it "continue fixing spec failures" and you'll eat through them very quickly.

Or it will format your drives, and set fire to your cat; might be worth doing it in a VM.

Though a couple of days ago, I gave Claude Code root access to a Raspberry Pi and told it to set up Home Assistant and a voice agent... It likes to tweak settings and reboot it.

EDIT: It just spoke to me, by ssh'ing into the Pi and running Espeak (I'd asked it to figure it out; it decided the HA API was too difficult, and decided on its own to pivot to that approach...)

throwup238

about 2 months ago

I used all of my credits working on a PySide QT desktop app last weekend. What worked:

I first had Claude write an E2E testing framework that functioned a lot like Cypress, with tests using element selectors like Jquery and high level actions like 'click' with screenshots at every step.

Then I had Claude write an MCP server that could run the GUI in the background (headless in Claude's VM) and take screenshots, execute actions, etc. This gave Claude the ability to test the app in real time with visual feedback.

Once that was done, I was able to run half a dozen or more agents at the same time running in parallel working on different features. It was relatively easy to blow through credits at that point, especially since I think VM times counts so whenever I spent 4-5 min running the full e2e test suite that cost money. At the end of an agents run, I'd ask them to pull master and merge conflicts, then I'd watch the e2e tests run locally before doing manual acceptance testing.

about 2 months ago

1 reply

> That is not to detract from what he's doing because it's fun. But if your goal is just to use a better Awk, then Ruby is usually better Awk

I agree, but I also would not use such one liners in ruby. I tend to write more elaborate scripts that do the filtering. It is more work, but I hate to burden my brain with hard to remember sigils. That's why I don't really use sed or awk myself, though I do use it when other people write it. I find it much simpler to just write the equivalent ruby code and use e. g. .filter or .select instead. So something like:

   ruby -F, -ane '$F[1]'

I'd never use because I wouldn't have the faintest idea what $F[1] would do. I assume it is a global variable and we access the second element of whatever is stored in F? But either way, I try to not have to think when using ruby, so my code ends up being really dumb and simple at all times.

> for that matter, is Perl

I'd agree but perl itself is a truly ugly language. The advantages over awk/sed are fairly small here.

> the only reason to really use Awk is that it is more likely to be available.

People used the same explanation with regard to bash shell scripts or perl (typically more often available on a cluster than python or ruby). I understand this but still reject it; I try to use the tool that is best. So, for me, python and ruby are better than perl; and all are better than awk/sed/shell scripts. I am not in the camp of users who want to use shell scripts + awk + sed for everything. I understand that it can be useful, but I much prefer just writing the solution in a ruby script and then use that. I actually wrote numerous ruby scripts and aliases, so I kind of use these in pipes too, e. g. "delem" is just my alias for delete_empty_files (defaults to the current working directory), so if I use a pipe in bash, with delem between two | |, then it just does this specific action. The same is true for numerous other actions, so ruby kind of "powers" my system. Of course people can use awk or sed or rm and so forth and pipe the correct stuff in there, which also works, but I found that my brain just can not want to be bothered to remember all flags. I just want to think in terms of super-simple instructions at all times and keep on re-using them; and extending them if I need to. So ruby kind of functions as a replacement for me for all computer-related actions in general. It is the ultimate glue for me to efficiently work with a computer system. Anything that can be scripted and automated and I may do more than once, I end up writing into ruby and then just tapping into that functionality. I could do the same in python too for the most part, so this is a very comparable use case. I did not do it in perl, largely because I find perl just to be too ugly to use efficiently.

https://github.com/igravious/cosmoruby

about 2 months ago

> I'd never use because I wouldn't have the faintest idea what $F[1] would do.

I don't use it often either, and most people probably don't know about it. But $F will contain each row of the input split by the field separator, which you can set with -F, hence the comparison to Awk.

Basically, each of -n, -p, -a, -F conceptually just does some simple transforms to your code:

-n: wrap "while gets; <your code>; end around your code and call the BEGIN and END blocks.

-a: Insert $F = $_.split at the start of the while loop from a. $_ contains the last line read by gets.

-p: Insert the same loop as -n, but add "puts $_" at the end of the while loop.

These are sort-of inherited from Perl. like a lot of Ruby's sigils, hence my mention of it (I agree its ugly). They're not that much harder to remember than Awk, and it saves me from having to use a language I use so rarely that I invariably end up reading the manual every time I need more than the most basic expressions.

> I understand this but still reject it; I try to use the tool that is best.

I do too, but sometimes you need to access servers you can't install stuff on.

Like you I have lots of my own Ruby scripts (and a Ruby WM, a Ruby editor, a Ruby terminal emulator, a file manager, a shell; I'm turning into a bit of a zealot in my old age...) and much prefer them when I can.

ikari_pl

about 2 months ago

3 replies

Today, Gemini wrote a python script for me, that connects to Fibaro API (local home automation system), and renames all the rooms and devices to English automatically.

Worked on the first run. I mean, the second, because the first run was by default a dry run printing a beautiful table, and the actual run requires a CLI arg, and it also makes a backup.

It was a complete solution.

igravious

about 2 months ago

3 replies

I've gotten Claude Code to port Ruby 3.4.7 to Cosmopolitan: https://github.com/jart/cosmopolitan

I kid you not. Took between a week and ten days. Cost about €10 . After that I became a firm convert.

I'm still getting my head around how incredible that is. I tell friends and family and they're like "ok, so?"

rogual

about 2 months ago

3 replies

It seems like AIs work how non-programmers already thought computers worked.

ACCount37

about 2 months ago

1 reply

That's apt.

One of the first thing you learn in CS 101 is "computers are impeccable at math and logic but have zero common sense, and can easily understand megabytes of code but not two sentences of instructions in plain English."

LLMs break that old fundamental assumption. How people can claim that it's not a ground-shattering breakthrough is beyond me.

skydhash

about 2 months ago

Then build a LLM shell and make it your login shell. And you’ll see how well the computer understands english.

love2read

about 2 months ago

I love this, thank you

zelphirkalt

about 2 months ago

"Why didn't you do that earlier?"

RealityVoid

about 2 months ago

1 reply

I am incredibly curious how you did that. You just told it... Port ruby to cosmopolitan and let it crank out for a week? Or what did you do?

I'll use these tools, and at times they give good results. But I would not trust it to work that much on a problem by itself.

igravious

about 2 months ago

1 reply

unzipped Ruby 3.4.7 into the appropriate place (third-party) in the repo and explained what i wanted (it used the Lua and Python port for reference)

first it built the Cosmo Make tooling integration and then we (ha "we" !) started iterating and iterating compiling Ruby with the Cosmo compiler … every time we hit some snag Claude Code would figure it out

I would have completed it sooner but I kept hitting the 5 hourly session token limits on my Pro account

simonw

about 2 months ago

Looks like this is the relevant code https://github.com/jart/cosmopolitan/compare/master...igravi...

darkwater

about 2 months ago

1 reply

This seems cool! Can you share the link to the repository?

igravious

about 2 months ago

here you go, still early days, rough round the edges :)

https://github.com/igravious/cosmoruby

about 2 months ago

Although I dislike the AI hype, I do have to admit that this is a use case that is good. You saved time here, right?

I personally still prefer the oldschool way, the slower way - I write the code, I document it, I add examples, then if I feel like it I add random cat images to the documentation to make it appear less boring, so people also read things.

about 2 months ago

I've been surprised by how often Sonnet 4.5 writes working code the first try.

runeks

about 2 months ago

1 reply

> I only interacted with the agent by telling it to implement a thing and write tests for it, and I only really reviewed the tests.

Did you also review the code that runs the tests?

https://williamcotton.com/articles/introducing-web-pipe

about 2 months ago

Yes :)

andsoitis

about 2 months ago

1 reply

> And it did it.

it would be nice when people do these things give us a transcript or recording of their dialog with the LLM so that more people can learn.

chrisweekly

about 2 months ago

Yes! This. It'd take so little effort to share, thereby validating your credibility, providing value, teaching,... it's so full of win I can't understand why so few people do this.

williamcotton

about 2 months ago

5 replies

I've been working on my own web app DSL, with most of the typing done by Claude Code, eg,

  GET /hello/:world
    |> jq: `{ world: .params.world }`
    |> handlebars: `<p>hello, {{world}}</p>`
  
  describe "hello, world"
    it "calls the route"
      when calling GET /hello/world
      then status is 200
      and output equals `<p>hello, world</p>`

Here's a WIP article about the DSL:

And the DSL itself (written in Rust):

https://github.com/williamcotton/webpipe

And an LSP for the language:

https://github.com/williamcotton/webpipe-lsp

And of course my blog is built on top of Web Pipe:

https://github.com/williamcotton/williamcotton.com/blob/mast...

It is absolutely amazing that a solo developer (with a demanding job, kids, etc) with just some spare hours here and there can write all of this with the help of these tools.

about 2 months ago

2 replies

That is impressive, but it also looks like a babelfish language. The |> seems to have been inspired by Elixir? But this is like a mish-mash of javascript-like entities; and then Rust is also used? It also seems rather verbose. I mean it's great that it did not require a lot of effort, but why would people favour this over less verbose DSL?

williamcotton

about 2 months ago

1 reply

> babelfish language

Yes, exactly! It's more akin to a bash pipeline, but instead of plain text flowing through sed/grep/awk/perl it uses json flowing through jq/lua/handlebars.

> The |> seems to have been inspired by Elixir

For me, F#!

> and then Rust is also used

Rust is what the runtime is written in.

> It also seems rather verbose.

IMO, it's rather terse, especially because it is more of a configuration of a web application runtime.

> why would people favour this

I dunno why anyone would use this but it's just plain fun to write your own blog in your own DSL!

The BDD-style testing framework being part of the language itself does allow for some pretty interesting features for a language server, eg, the LSP knows if a route that is trying to be tested has been defined. So who knows, maybe someone finds parts of it inspiring.

travisjungroth

about 2 months ago

> it's just plain fun to write your own blog in your own DSL!

It’s the perfect thing for skill development, too. Stakes are low compared to a project at work, even one that’s not “mission critical”.

AdieuToLogic

about 2 months ago

> The |> seems to have been inspired by Elixir?

This is an infix operator commonly used to define the Thrush combinator, which transcends Elixir (or any other programming language). It is effectively:

  f (g (x)) = g (x) |> f

https://www.jetbrains.com/help/idea/http-client-in-product-c...

about 2 months ago

I like this syntax. And yes it amazing. And fun, so fun!

mike_hearn

about 2 months ago

FWIW if someone wants a tool like this with better support, JetBrains has defined a .http file format that contains a DSL for making HTTP requests and running JS on the results.

There's a CLI tool for executing these files:

https://www.jetbrains.com/help/idea/http-client-cli.html

There's a substantially similar plugin for VSCode here: https://github.com/Huachao/vscode-restclient

about 2 months ago

I like the pipe approach. I build a large web app with a custom framework that was built around a pipeline years ago, and it was an interesting way to decompose things.

cdaringe

about 2 months ago

Cool! Have you seen https://camlworks.github.io/dream/

I get OCaml isnt for everybody, but dream is the web framework i wish i knew first

about 2 months ago

1 reply

But the question is: will the language suck?

I have a slight feeling it would suck even more than, say, PHP or JavaScript.

about 2 months ago

Yes, I'll only have an answer to this later, as I use it, and there's a real chances my changes to the language won't mix well with the original AWK. (Or is your comment more about AWK sucking for programs larger than 30 LOC? I think that's a given already.)

Thankfully, if that's the case, then I've only lost a few hours """implementing""" the language, rather than days/weeks/more.

low_tech_love

about 2 months ago

5 replies

Slightly off-topic: I have an honest question for all of you out there who love Advent of Code, please don't take this the wrong way, it is a real curiosity: what is it for you that makes the AoC challenge so special when compared with all of the thousands of other coding challenges/exercises/competitions out there? I've been doing coding challenges for a long time and I never got anything special out of AoC, so I'm really curious. Is it simply that it reached a wider audience?

qsort

about 2 months ago

1 reply

Personally it's the community factor. Everyone is doing the same problem each day and you get to talk about it, discuss with your friends, etc.

cdaringe

about 2 months ago

Community plus problem solving in low stakes fun setting.

some_random

about 2 months ago

For me, it's a bunch of things. It happens once a year, so it feels special. Many of my friends (and sometimes coworkers) try it as well, so it turns into something to chat about. Because they're one a day they end up being timeboxed, I can focus on just hammering out a solution or dig in and optimize but I can't move on so when I'm done for the day I'm done. It's also pretty nostalgic for me, I started working on it in high school.

zelphirkalt

about 2 months ago

I think the corny stories about how the elves f up and their ridiculous machines and processes add a lot of flavor. It is not as dry as Project Euler for example, which is great in its own right. And you collect ASCII art golden stars!

torginus

about 2 months ago

I think one is the feeling of community - we have a workplace leaderboard and we compete with each other, discuss solutions to the problems, how we overcame them etc.

The second is the timing and pacing - the fist few days are about warming up, then comes a couple decently challenging puzzles, after which the whole thing gets very difficult. Having the discipline to actually spend the time every day to do the puzzles feels like going back to the gym and actually sticking to it.

I also get to solve these kind of coding puzzles at work very rarely - maybe once every couple of months - so the whole thing feels like an intense workout for my brain.

The downside is of course is that it's exhausting - later puzzles often took 1-2 hours for me to solve - during days where I have work related stress, this is not easy.

[0]: https://news.ycombinator.com/item?id=46005813 [1]: https://github.com/philpax/perchance-interpreter/pulls?q=is%...

about 2 months ago

I have only had some previous experience with Project Euler, which I liked for the loop of "try to bruteforce it -> doesn't work -> analyze the problem, exploit patterns, take shortcuts". (I hit a skill ceiling after 166 problems solved.)

Advent of Code has this mass hysteria feel about it (in a good sense), probably fueled by the scarcity principle / looking forward to it as December comes closer. In my programming circles, a bunch of people share frustration and joy over the problems, compete in private leaderboards; there are people streaming these problems, YouTubers speedrunning them or solving them in crazy languages like Excel or Factorio... it's a community thing, I think.

If I wanted to start doing something like LeetCode, it feels like I'd be alone in there, though that's likely false and there probably are Discords and forums dedicated to it. But somehow it doesn't have the same appeal as AoC.

root_axis

about 2 months ago

1 reply

It'd be interesting to see how well the LLM would be able to write code using the new language since it doesn't exist in the training data.

ModernMech

about 2 months ago

I've tested this, the LLM will tend to strongly pattern match to the closest language syntactically, so if your language is too divergent then you have continually remind it of your syntax or semantics. But if your language is just a skin for C or JavaScript then it'll do fine.

runeks

about 2 months ago

3 replies

I think it would be super interesting to see how the LLM handles extending/modifying the code it has written. Ie. adding/removing features, in order to simulate the life cycle of a normal software project. After all, LLM-produced code would only be of limited use if it’s worse at adding new features than humans are.

As I understand, this would require somehow “saving the state” of the LLM, as it exists after the last prompt — since I don’t think the LLM can arrive at the same state by just being fed the code it has written.

roywiggins

about 2 months ago

Claude can just poke around the codebase as-is. You can also have it synthesize a README and update that as it goes.

I've found it perfectly capable of adding eg new entities and forms to existing CRUD apps.

Philpax

about 2 months ago

I described my experience using Claude Code Web to vibe-code a language interpreter here [0], with a link to the closed PRs [1].

As it turns out, you don't really need to "save the state"; with decent-enough code and documentation (both of which the LLM can write), it can figure out what needs to be done and go from there. This is obviously not perfect - and a human developer with a working memory could get to the problem faster - but its reorientation process is fast enough that you generally don't have to worry about it.

rogeliodh

about 2 months ago

They are very good at understanding current code and its architecture so no need to save state. In any case, it is good to explicitly ask them to generate proper comments for their architectural decisions and to keep updated AGENT.md file

l9o

about 2 months ago

1 reply

I've been working on something similar, a typed shell scripting language called shady (hehe). haven't shared it because like 99% of the code was written by claude and I'm definitely not a programming language expert. it's a toy really.

but I learned a ton building this thing. it has an LSP server now with autocompletion and go to definition, a type checker, a very much broken auto formatter (this was surprisingly harder to get done than the LSP), the whole deal. all the stuff previously would take months or a whole team to build. there's tons of bugs and it's not something I'd use for anything, nu shell is obviously way better.

the language itself is pretty straightforward. you write functions that manipulate processes and strings, and any public function automatically becomes a CLI command. so like if you write "public deploy $env: str $version: str = ..." you get a ./script.shady deploy command with proper --help and everything. it does so by converting the function signatures into clap commands.

while building it I had lots of process pipelines deadlocking, type errors pointing at the wrong spans, that kind of thing. it seems like LLMs really struggle understanding race conditions and the concept of time, but they seem to be getting better. fixed a 3-process pipeline hanging bug last week that required actually understanding how the pipe handles worked. but as others pointed out, I have also been impressed at how frequently sonnet 4.5 writes working code if given a bit of guidance.

one thing that blew my mind: I started with pest for parsing but when I got to the LSP I realized incremental parsing would be essential. because I was diligent about test coverage, sonnet 4.5 perfectly converted the entire parser to tree-sitter for me. all tests passed. that was wild. earlier versions of the model like 3.5 or 3.7 struggled with Rust quite a bit from my experience.

claude wrote most of the code but I made the design decisions and had to understand enough to fix bugs and add features. learned about tree-sitter, LSP protocol, stuff I wouldn't have touched otherwise.

still feels kinda lame to say "I built this with AI" but also... I did build it? and it works? not sure where to draw the line between "AI did it" and "AI helped me do it"

anyway just wanted to chime in from someone else doing this kind of experiment :)

simonw

about 2 months ago

"because I was diligent about test coverage, sonnet 4.5 perfectly converted the entire parser to tree-sitter for me. all tests passed."

I often suspect that people who complain about getting poor results from agents haven't yet started treating automated tests as a hard requirement for working with them.

If you don't have substantial test coverage your coding agents are effectively flying blind. If you DO have good test coverage prompts like "port this parser to tree-sitter" become surprisingly effective.

fsmv

about 2 months ago

1 reply

I await your blog post about how it only appeared to work at first and then had major problems when you actually dug in.

ht-syseng

about 2 months ago

I just looked at the code, the

ast: https://github.com/Janiczek/fawk/pull/2/files#diff-b531ba932...

module has 167 lines and the

interpreter module: https://github.com/Janiczek/fawk/pull/2/files#diff-a96536fc3...

has 691 lines. I expect it would work, as FAWK seems to be a very simple language. I'm currently working on a similar project with a different language, and the equivalent AST module is around 20,000 lines and only partially implemented according to the standard. I have tried to use LLMs without any luck. I think in addition to the language size, something they currently fail at seems to be, for lack of a better description, "understanding the propagation of changes across a complex codebase where the combinatoric space of behavioral effects of any given change is massive". When I ask Claude to help in the codebase I'm working in, it starts making edits and going down paths I know are dead ends, and I end up having to spend way more time explaining why things wouldn't work to it, than if I had just implemented it myself...

We seem to be moving in the right direction, but I think absent a fundamental change in model architecture we're going to end up with models that consume gigawatts to do what a brain can do for 20 watts. Maybe a metaphorical pointer to the underlying issue, whatever it is, is that if a human sits down and works on a problem for 10 hours, they will be fundamentally closer to having solved the problem (deeper understanding of the problem space), whereas if you throw 10 hours worth of human or LLM generated context into an LLM and ask it to work on the problem, it will perform significantly worse than if it had no context, as context rot (sparse training data for the "area" of the latent space associated with the prior sequence of tokens) will degrade its performance. The exception would be like, when the prior context is documentation for how to solve the problem, in which case the LLM would perform better, but also the problem was already solved. I mention that case because I imagine it would be easy to game a benchmark that intends to test this, without actually solving the underlying problem of building a system that can dynamically create arbitrary novel representations of the world around it and use those to make predictions and solve problems.

alganet

about 2 months ago

1 reply

> "Take a look at those tests!"

A math module that is not tested for division by zero. Classical LLM development.

The suite is mostly happy paths, which is consistent with what I've seen LLMs do.

Once you setup coverage, and tell it "there's a hidden branch that the report isn't able to display on line 95 that we need to cover", things get less fun.