Building an Internal Agent: Code-Driven Vs. LLM-Driven Workflows
Key topics
The debate around building internal agents is heating up, with a key question emerging: should workflows be driven by code or LLMs (Large Language Models)? As commenters weigh in, a third option gains traction - letting AI write workflow code, which combines the benefits of both approaches. While some argue that starting with LLMs is easier and effective for many cases, others point out that this approach can be slow, error-prone, and costly in the long run, with deterministic workflows ultimately being more reliable. The discussion reveals a consensus that LLMs excel at problem-solving, but struggle with cost and reliability, making a hybrid approach an attractive solution.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 1h after posting
- Peak period: 17 comments in 0-6h
- Avg / period: 5.7
Based on 34 loaded comments
Key moments
- 01 Story posted: Jan 1, 2026 at 1:34 PM EST (8 days ago)
- 02 First comment: Jan 1, 2026 at 2:58 PM EST (1h after posting)
- 03 Peak activity: 17 comments in 0-6h, the hottest window of the conversation
- 04 Latest activity: Jan 4, 2026 at 7:04 AM EST (5d ago)
Why always start with an LLM to solve problems? Using an LLM adds a judgment call, and (at least for now) those judgment calls are not reliable. For something like the motivating example in this article of "is this PR approved" it seems straightforward to get the deterministic right answer using the github API without muddying the waters with an LLM.
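As a concrete illustration of that point, here is a minimal sketch of answering "is this PR approved?" deterministically with the GitHub REST API, no LLM involved. The repo, PR number, and token handling are placeholders, and error handling is omitted.

```python
# Minimal sketch: answer "is this PR approved?" deterministically via the
# GitHub REST API instead of asking an LLM. Repo, PR number, and token are
# placeholders; error handling is omitted for brevity.
import os
import requests

def is_pr_approved(owner: str, repo: str, pr_number: int) -> bool:
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/reviews"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    }
    reviews = requests.get(url, headers=headers, timeout=10).json()

    # Keep only the latest review from each reviewer (reviews are returned
    # in chronological order).
    latest = {}
    for review in reviews:
        latest[review["user"]["login"]] = review["state"]

    states = set(latest.values())
    return "APPROVED" in states and "CHANGES_REQUESTED" not in states

if __name__ == "__main__":
    print(is_pr_approved("someorg", "somerepo", 1234))  # hypothetical PR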
It's the old principle of avoiding premature optimization.
In mapping out the problems that need to be solved with internal workflows, it’s wise to clarify upfront where probabilistic judgment is helpful or required and where it isn’t. If the process is fixed and requires determinism, why not just write scripts (code-gen’ed, of course)?
Of course the specific example in the post seems like it could be one-shotted pretty easily, so it's a strange motivating example.
These days I do everything I can with straightforward automation and only get the agent involved when it’s impossible to move forward without it.
If I start out with a "spec" that tells the AI what I want, it can create working software for me. Seems great. But let's say weeks, months, or even years later I realize I need to change my spec a bit. I would like to give the new spec to the AI and have it produce an improved version of "my" software. But there seems to be no way to then evaluate how the solution has changed/improved. Because the AI's outputs are nondeterministic, the new solution might be totally different. So AI would not seem to support "iterative development", would it?
My question then really is: why can't there be an LLM that would always give the exact same output for the exact same input? I could then still explore multiple answers by changing my input incrementally. It just seems to me that a small change in inputs/specs should produce only a small change in outputs. Does any current LLM support this way of working?
1) How many bits and bobs of like, GPLed or proprietary code are finding their way into the LLM's output? Without careful training, this is impossible to eliminate, just like you can't prevent insect parts from finding their way into grain processing.
2) Prompt injection is a doddle to pull off: malicious HTML, PDF, and JPEG files with "ignore all previous instructions" type input can pop many current models. It's also very difficult to defend against. With agents running higgledy-piggledy on people's dev stations (container discipline is NOT being practiced at many shops), who knows what kind of IDs and credentials are being lifted?
In response to the idea of iterative development: it is still possible, actually! You run something more akin to integration tests and measure the output against either deterministic processes, or have an LLM judge its own output. These are called evals, and in my experience they are a pretty hard requirement for trusting deployed AI.
Or would it help if a different LLM wrote the unit-tests than the one writing the implementation? Or, should the unit-tests perhaps be in an .md file?
I also have a question about using .md files with AI: Why .md, why not .txt?
Let's take the example of the GitHub PR Slack bot from the blog post. I would expect 2-3 evals out of that.
Starting at the core, the first eval could be that, given a list of Slack messages, the AI correctly identifies the PRs and calls the correct tool to look up the status of said PR. None of this has to be real and the tool doesn't have to be called, but we can write a test, much like a unit test, that confirms the AI is responding correctly in that instance.
Next, we can set up another scenario for the AI using effectively mocked history that shows what happens when the AI finds Slack messages with open PRs, Slack messages with merged PRs, and messages with no PR links, and determine again whether the AI tries to add the correct reaction given our expectations.
These are both deterministic or code-based evals that you could use to iterate on your solutions.
The use for an LLM-as-a-Judge eval is more nuanced and usually there to measure subjective results. Things like: did the LLM make assumptions not present in the context window (hallucinate), or did it respond with something completely out of context? These should be simple yes-or-no questions that would be easy for a human to answer but hard to code up as a deterministic test case.
Once you have your evals defined, you can begin running them with some regularity, and you're at a point where you can iterate on your prompts with a higher level of confidence than vibes.
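To make the deterministic, code-based eval described above concrete, here is a hedged sketch in Python. The `run_agent` entry point, its return shape, and the `get_pr_status` tool name are all hypothetical; the Slack messages are mocked, so no real PR or workspace is needed.

```python
# Sketch of a code-based eval, assuming a hypothetical run_agent() that
# returns the tool calls the model wants to make. The Slack messages are
# mocked; no real PR or Slack workspace is needed.
from my_bot import run_agent  # hypothetical entry point for the PR bot

MOCK_MESSAGES = [
    {"user": "alice", "text": "please review https://github.com/acme/app/pull/42"},
    {"user": "bob", "text": "lunch anyone?"},
]

def test_identifies_pr_and_calls_status_tool():
    result = run_agent(messages=MOCK_MESSAGES)

    # Deterministic assertions: did the model pick out the PR link, ask for
    # its status, and ignore the unrelated message?
    tool_calls = [c for c in result.tool_calls if c.name == "get_pr_status"]
    assert len(tool_calls) == 1
    assert tool_calls[0].arguments == {"owner": "acme", "repo": "app", "pr_number": 42}
```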
LLMs are inherently deterministic, but LLM providers add randomness through “temperature” and random seeds.
Without the random seed and variable randomness (temperature setting), LLMs will always produce the same output for the same input.
Of course, the context you pass to the LLM also affects the determinism in a production system.
Theoretically, with a detailed enough spec, the LLM would produce the same output, regardless of temp/seed.
Side note: a neat trick to force more “random” output for prompts (when temperature isn’t variable enough) is to add some “noise” data to the input (i.e. off-topic data that the LLM “ignores” in its response).
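For reference, this is roughly what pinning down the provider-side randomness looks like with the OpenAI API: temperature 0 plus a fixed `seed` gives best-effort reproducibility (the seed parameter is documented as best-effort, not a guarantee). The model name here is illustrative.

```python
# Sketch: temperature=0 plus a fixed seed for best-effort reproducibility.
from openai import OpenAI

client = OpenAI()

def run(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,  # same seed + same input -> usually the same output
    )
    return response.choices[0].message.content

print(run("Summarize what JSON is in one sentence."))
```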
Random seeds might be a thing, but from what I see there's a lot of demand for reproducibility and yet no certain way to achieve it.
The batch size influences the order of atomic float operations, and because float operations are not associative, the results can differ.
Except they won't.
Even at temperature 0, you will not always get the same output for the same input. And it's not because of random noise from inference providers.
There are papers that explore this subject because for some use cases it is extremely important. Everything from floating-point precision to hardware timing differences makes this difficult.
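A tiny self-contained demonstration of the non-associativity point above: the same three floats summed in a different order give different results, which is why reduction/batching order can shift model outputs even at temperature 0.

```python
# Float addition is not associative, so summation order changes the result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)  # 0.0   (0.1 is lost when added to 1e20 first)
print(a + (b + c))  # 0.1
```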
So, regardless of the architecture of your software, the AI will output the most updated version of the solution if the problem is common enough among large codebases?
> exact same output for the exact same input?
If you set temp to zero it gets close, but as I understand it, it's not perfect.
For some computer science definition of deterministic, sure, but who gives a shit about that? If I ask it build a login page, and it puts GitHub login first one day, and Google login first the next day, do I care? I'm not building login pages every other day. What point do you want to define as "sufficiently deterministic", for which use case?
Rather, it's more like having an employee in 1975, asking them to write you a program to do something. Then time-machine to the present day and you want that program enhanced somehow. You're going to summon your 2026 intern and tell them that you have this old program from 1975 that you need updated. That person is going to look at the program's code, your notes on what you need added, and probably some of their own "training data" on programming in general. Then they're going to edit the program.
Note that in no case did you ask for the program to be completely re-written from scratch based on the original spec plus some add-ons. Same for the human as for the LLM.
Sometimes people just don't know better.
Ugh. I think this is me. I’m self-taught (never once made a compiler in a course or class) and I make scripts for ETL at work, mostly from CSV input. And JSON/APIs are aggravating to me. I’ve yet to ‘crack the code’ for intuition about the structures. I can follow instructions in documentation, but I struggle to put parts together to get the data view I need. For a while I thought some kind of UML diagramming project would do it for me.
So, yes, if I can “vibe” code with ChatIA to get over the mental structural hump to make the right joins and calls, I’m all in.
https://docs.clover.com/dev/docs/making-rest-api-calls
https://api.mobilebytes.com/
Yes.
It's just a standardised way to represent data structures in text. You can then save that text to a file for storage, or send the text over the wire for data transfer. As long as everyone involved knows they're saving/loading or talking JSON then everyone knows exactly how to read/write the data.
It is a very literal representation of (specifically JavaScript, but generally any) data-structures in text.
A JSON Schema file that can be directly linked in your .JSON file!
But otherwise it's the same way you know anything. Documentation and trial and error
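A quick illustration of the explanation above, using Python's standard `json` module: JSON is just a text encoding of ordinary data structures, and the optional `$schema` key (the URL here is made up) is how editors can link a JSON Schema for validation hints.

```python
# JSON round-trip: data structure -> text -> data structure, losslessly for
# the basic types (objects/dicts, arrays/lists, strings, numbers, booleans, null).
import json

order = {
    "$schema": "https://example.com/order.schema.json",  # optional, illustrative schema link
    "id": 1017,
    "items": [{"sku": "ABC-1", "qty": 2}],
    "paid": False,
}

text = json.dumps(order, indent=2)   # data structure -> text
restored = json.loads(text)          # text -> data structure
assert restored == order
print(text)
```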
So we gave the Tasklet agent a filesystem, shell, code runtime, general-purpose triggering system, etc., so that it could build the automation system it needed.
The key insight from production: LLMs excel at the "what should I do next given this unexpected state" decisions, but they're terrible at the mechanical execution. An agent that encounters a CAPTCHA, an OAuth redirect, or an anti-bot challenge needs judgment to adapt. But once it knows what to do, you want deterministic execution.
The evals discussion is critical. We found that unit-test style evals don't capture the real failure modes - agents fail at composition, not individual steps. Testing "does it correctly identify a PR link" misses "does it correctly handle the 47th message in a channel where someone pasted a broken link in a code block". Trajectory-level evals against real edge cases matter more than step-level correctness.
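A hedged sketch of what a trajectory-level eval might look like, following the comment above: replay a realistic channel history end-to-end and assert on the agent's final behavior rather than on a single step. `run_agent_to_completion`, its return shape, and the fixture are all hypothetical.

```python
# Trajectory-level eval sketch: feed a full, noisy channel history through the
# agent and check the end state, not an isolated step. run_agent_to_completion
# is hypothetical.
from my_bot import run_agent_to_completion  # hypothetical

# A "real edge case" style fixture: lots of noise, plus a malformed link pasted
# inside a code block that the agent should not treat as a live PR.
CHANNEL_HISTORY = (
    [{"user": f"user{i}", "text": "unrelated chatter"} for i in range(46)]
    + [{"user": "carol", "text": "```https://github.com/acme/app/pull/9x9```"}]
)

def test_ignores_broken_link_in_code_block():
    trajectory = run_agent_to_completion(messages=CHANNEL_HISTORY)

    # The whole trajectory should end without any PR-status lookup or reaction,
    # because the only "PR link" was malformed and quoted in a code block.
    assert not any(step.tool_name == "get_pr_status" for step in trajectory.steps)
    assert trajectory.final_action == "no_op"
```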
https://youtu.be/zzkSC26fPPE
You get the benefit of AI CodeGen along with the determinism of conventional logic.
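One way to read that hybrid in code: have the model generate the workflow script once (human-reviewed), then run the saved script deterministically with no LLM in the loop on subsequent runs. A rough sketch under those assumptions; the model name, paths, and spec are illustrative, and in practice you would strip markdown fences and review the generated code before saving it.

```python
# Hybrid pattern sketch: one LLM call to generate a workflow script, then
# plain deterministic execution of that script from then on.
from pathlib import Path
import subprocess

from openai import OpenAI

SCRIPT = Path("workflows/check_pr_approval.py")  # illustrative location

def generate_workflow_once() -> None:
    """One-time, human-reviewed code generation step."""
    client = OpenAI()
    spec = "Write a Python script that checks whether a GitHub PR is approved via the REST API."
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model
        messages=[{"role": "user", "content": spec}],
        temperature=0,
    )
    code = response.choices[0].message.content  # review (and strip fences) before saving
    SCRIPT.parent.mkdir(parents=True, exist_ok=True)
    SCRIPT.write_text(code)

if not SCRIPT.exists():
    generate_workflow_once()

# Every subsequent run is ordinary, deterministic code execution -- no LLM call.
subprocess.run(["python", str(SCRIPT)], check=True)
```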