Lessons From Interviews on Deploying AI Agents in Production
Key topics
The article discusses lessons learned from deploying AI agents in production, highlighting non-technical challenges such as workflow integration and employee trust, while the discussion in the comments reveals skepticism and concerns about the limitations and potential pitfalls of AI adoption.
Snapshot generated from the HN discussion
Discussion activity: very active, based on 90 loaded comments
- Story posted: Nov 4, 2025 at 2:26 AM EST
- First comment: Nov 4, 2025 at 2:26 AM EST (0s after posting)
- Peak activity: 61 comments in the 0-6h window
- Average per period: 30 comments
- Latest activity: Nov 6, 2025 at 10:57 AM EST
A few patterns emerged that might be relevant to anyone working on applied AI or automation:
- The main blockers aren’t technical. Most founders pointed to workflow integration, employee trust, and data privacy as the toughest challenges — not model performance.
- Incremental deployment beats ambition. Successful teams focus on narrow, verifiable use cases that deliver measurable ROI and build user trust before scaling autonomy.
- Enterprise adoption is uneven. Many companies have “some agents” in production, but most use them with strong human oversight. The fully autonomous cases remain rare.
- Pricing is unresolved. Hybrid models dominate; pure outcome-based pricing is uncommon due to attribution and monitoring challenges.
- Infrastructure is mostly homegrown. Over half of surveyed startups build their own agentic stacks, citing limited flexibility in existing frameworks.
The article also includes detailed case studies, commentary on autonomy vs. accuracy trade-offs, and what’s next for ambient and proactive agents.
If you’re building in this space, the full report is free here: https://mmc.vc/research/state-of-agentic-ai-founders-edition...
Would be interested to hear how others on HN are thinking about real-world deployment challenges — especially around trust, evaluation, and scaling agentic systems.
Sure, it then becomes a technical challenge to work around those limits, but that may be cost/time prohibitive.
Some key challenges around workflow are that while the fundamental white-board task flow is the same, different companies may distribute those tasks between people and over time in different ways.
Workflow is about flowing the task and associated information between people - not just doing the tasks.
Same goes for integration: the timing of when certain necessary information becomes available is again not uniform, and timing concerns are often missed on the high-level whiteboard.
Here's a classic example of ignoring timing issues.
https://www.harrowell.org.uk/blog/2017/03/19/universal-credi...
One is technical (it’s a hassle to connect things to a specific system because you’d need to deal with the api or there is no api)
The other isn’t, because it’s figuring out how and where to use these new tools in an existing workflow. Maybe you could design something from scratch but you have lots of business processes right now, how do you smoothly modify that? Where does it make sense?
Frankly understanding what the systems can and can’t do takes at least some time even if only because the field is moving so fast (I worked with a small local firm who I was able to help by showing them the dramatic improvements in transcription quality vs cost recently - people here are more used to whisper and the like but it’s not as common knowledge how and where you can use these things).
What does that even mean? Are you trying to say that the problem isn’t that the AI models are bad — it’s that it’s hard to get people to use them naturally in their daily work?
It just looks like the highly-polished marketing copy I’ve read, all my career. It’s entirely possible that it was edited by AI (a task that I have found useful), but I think that it’s actually a fairly important (to the firm) paper, and was likely originally written by their staff (or a consultant), and carefully edited.
I do feel as if it’s a promotional effort, but HN often features promotional material, if it is of interest to our community.
1. Agentic AI systems are hard to measure and evaluate methodologically.
2. Quote from Salesforce analyst day: "it's been so easy to build a killer demo, but why has it been so hard to get agents that actually deliver the goods."
3. Unfortunately, small errors tend to compound over time, which means most systems need a human in the loop as of 2025.
4. A lot of enterprise buyers feel the huge potential (and FOMO), yet ROI is still unclear as of 2025. MIT report "State of AI in business 2025": Despite $30–40 billion in enterprise investment into GenAI, 95% of organizations are not seeing profit and loss impact.
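Point 3 above can be made concrete with a little arithmetic. A rough sketch, assuming each step of an agent's task succeeds independently with the same probability (the 95% figure is an illustrative assumption, not a number from the report):

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


# Illustrative only: 95% per-step accuracy is an assumed figure.
for steps in (1, 5, 10, 20, 50):
    p = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% per step: {p:.1%} end-to-end")
```

Under that assumption a ten-step task finishes cleanly only about 60% of the time, which is one way to read why a human in the loop is still common. Real steps are rarely independent, so treat this as intuition rather than a model.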
We've got a sort of "business intelligence" AI they poured a lot of time and money into, and I don't think anyone really uses it because it makes stuff up.
I'm sure there are things. I just haven't seen them. I would love to hear concrete examples.
The cynic in me says I wouldn't want something with the error aptitude and truth telling of a small child taking any sort of important action on my behalf.
1) Enhanced User guides/manuals.
We sell some very complicated and expensive instruments. As such, making them work is quite hard. One of our biggest expenses is our engineers that go out, physically, to help customers. Company policy is that the first visit is always free. These customers can be very remote (Deep sea oil platforms, Australian outback, quite nice ski country ;) , etc). Often their issue is simple but they can also be very complex. We have phone trees, email, texts, iridium phones, etc. to talk customers through things to avoid these first visits and then help them afterwards. So adding in AI chatbots is a natural way to help out. People don't feel quite the same 'shame' in asking really dumb questions to a chatbot that they do to a real person. So, to make these chatbots smarter, we use some of this AI mumbo-jumbo (RAG), to help them out. So far, it seems successful and the customer and engineers like the enhanced/AI manuals.
2) Making said manuals
We support 35 languages and many regulatory environments. Our instruments are all compliant with whatever version of a government agency you've got (modulo a lot of time, money, ITAR regulations). As such, making all that paper (manuals, compliance docs, contracts, etc) takes a lot of time and effort and has to pass the legal tests too. So AI is really helpful with it. Most of the work for these large stacks of paper is essentially boilerplate, but all subtly different so that literal copy-pasting doesn't get you quite that far. AI systems have been able to, last I checked, get that team about 5x faster, as it cuts out ~85% of the process and drudgery. Since these documents get hauled into courts, they can't just be blindly AI made, and a human always has to go over everything with a sharp eye still, but AI helps out there a bit too. Last lunch I had with them, they were saying that they were actually working on their burn-down charts now and not just going from panic to panic. As in, they could actually do their jobs.
>> While AI is strictly deterministic - it is technically chaotic
AI is neither deterministic nor chaotic. It is nondeterministic because it works based on probability, which means that for open-ended contexts it can be unpredictable. But properly engineered agentic AI workflows can drastically reduce and even completely eliminate the unpredictability. Having proper guardrails such as well-defined prompts, validations and fallbacks in place can help ensure mistakes made by AIs don't result in errors in your system.
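As a rough illustration of what "validations and fallbacks" might look like, here is a minimal sketch. The action names, the JSON shape, and the 0.9 confidence threshold are all hypothetical; the idea is only that the model's output is treated as a proposal and anything that fails a check is routed to a person.

```python
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"reply", "archive", "escalate"}  # hypothetical action set
MIN_CONFIDENCE = 0.9                                # hypothetical threshold


@dataclass
class Decision:
    action: str
    needs_human_review: bool
    reason: str


def validate_model_output(raw: str) -> Decision:
    """Accept the model's proposal only if it parses and passes explicit checks."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Decision("escalate", True, "output was not valid JSON")
    if not isinstance(data, dict):
        return Decision("escalate", True, "output was not a JSON object")

    action = data.get("action")
    confidence = data.get("confidence")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return Decision("escalate", True, f"unknown action: {action!r}")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return Decision("escalate", True, "confidence missing or not a number")
    if confidence < MIN_CONFIDENCE:
        return Decision("escalate", True, "confidence below threshold")
    return Decision(action, False, "passed all checks")


print(validate_model_output('{"action": "archive", "confidence": 0.97}'))
print(validate_model_output("Sure! I archived it for you."))
```

Checks like these catch malformed or out-of-range output. They do not catch a well-formed answer that is simply wrong, so how much unpredictability they remove depends on how much of the task can be verified.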
LLMs are computation, they are very complex, but they are deterministic. If you run one on the same device, in the same state, with exactly the same input parameters multiple times, you will always get the same result. This is the case for every possible program. Most of the time, we don’t run them with exactly the same input parameters, or we run them on different devices, or some part of the state of the system has changed between runs, which could all potentially result in a different outcome (which, incidentally, is also the case for every possible program).
GPU operations on floating point are generally not deterministic and are subject to the whims of the scheduler
the mathematics might be
but not on a GPU, because floating point numbers are an approximation, and their operations are not associative
if the GPU's internal scheduler reorders the operations you will get a different outcome
remember GPUs were designed to render quake, where drawing pixels slightly off is imperceptible
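The reordering effect is easy to reproduce on a CPU. A small sketch in plain Python (no GPU involved; the values are chosen only to make the effect visible): floating-point addition is not associative, so summing the same numbers in a different order can give a different result.

```python
def sum_in_order(values):
    """Add values left to right, the way a sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two different orders of addition.
forward = sum_in_order([1e16, 1.0, -1e16])    # the 1.0 is lost to rounding
reordered = sum_in_order([1e16, -1e16, 1.0])  # the large terms cancel first

print(forward, reordered)
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # grouping matters too
```

A parallel reduction that combines partial sums in whatever order threads finish is doing the same thing at scale, which is why bit-identical results are not guaranteed unless the order is pinned.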
E.g., an auto email parser that extracts an "action": I just don't trust that the action will be accurate and precise enough to execute without rereading the email (hence defeating the purpose of the agent).
A deterministic function/algorithm always gives the same output given the same input.
LLMs are deterministic if you control all parameters, including the “temperature” and random “seed”. Same input (and params) -> same output.
Large Language Models (LLMs) are not perfectly deterministic even with temperature set to zero, due to factors like dynamic batching, floating-point variations, and internal model implementation details. While temperature zero makes the model choose the most probable token at each step, which is a greedy, "deterministic" strategy, these other technical factors introduce subtle, non-deterministic variations in the output.
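To separate the two claims, here is a toy sketch of the sampling step only. The logits are made up and `pick_token` is a hypothetical helper, not any library's API. It shows why temperature zero and a fixed seed make this step repeatable; it deliberately does not model batching or floating-point effects, which are where the remaining variation is said to come from.

```python
import math
import random


def softmax(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Greedy argmax at temperature 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

greedy = [pick_token(logits, 0, random.Random()) for _ in range(5)]

rng_a, rng_b = random.Random(42), random.Random(42)
run_a = [pick_token(logits, 1.0, rng_a) for _ in range(5)]
run_b = [pick_token(logits, 1.0, rng_b) for _ in range(5)]

print(greedy)          # always the index of the largest logit
print(run_a == run_b)  # same seed, same draws
```

The repeatability here depends on the logits themselves being identical between runs, which is exactly what varies when they are computed on hardware that reorders arithmetic.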
If your system is only deterministic if it processes its huge web of interconnected agentic prompts in exactly the same order, then its behavior is not deterministic in any sense that could ever be important in the context of predictable and repeatable system behavior. If I ask you whether it will handle the same task the same exact way, and its handling of it involves lots of concurrent calls that are never guaranteed to be ordered the same way, then you can't answer "yes".
This is not true. Even my LLM told me this isn't true: https://www.perplexity.ai/search/are-llms-deterministic-if-y...
... until people decide they are OK with things being less than 100% and relax the regulations. Helped along by the purveyors of the AI tools no doubt
That's already the case. Payments are not deterministic. It can take multiple days for things to settle. The real world is messy.
When I make a payment I have no clue if the money is actually going to make it to a merchant or if some fraud system will block it.
If you were on a real flight, asking a qualified human - like a trained pilot - would result in a very deterministic checklist.
Deterministic responses to emergencies are at least half of the training from the time we get a PPL.
Some things need to be deterministic. Many don’t.
Even your business will have many such problems that don’t need 100% all those properties - every task performed by a human for example. You as a developer are not all of these things 100%!
And your help query may need to be deterministic but does it need to be explainable? Many ml solutions aren’t really explainable, certainly not to 100% whatever that may mean, but can easily be deterministic.
A lot of the software industry has been moving away from assigning humans individual responsibility for failure (e.g. blameless post mortems).
I further suspect that most actors will still want someone responsible to take the blame when an incident takes place. Even if they have to make one up.
Not 100% deterministic workers but workflow. The auditability and explainability of your system becomes difficult with AI and LLMs in between because you don't know at what point in the reasoning things turned wrong.
You need, for a lot of things, to know at every step of the way who is culpable and what part of the work they were doing and why it went wrong and how
You log the interaction, you see what happened, no?
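For what "log the interaction" could mean in practice, a minimal sketch follows. The `AuditTrail` class, its field names, and the sample prompts are hypothetical. It records who produced what at each step, in order, so a later reviewer can at least find the step where things diverged.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    step: int
    actor: str   # hypothetical labels such as "model" or "human"
    prompt: str
    output: str
    timestamp: float = field(default_factory=time.time)


class AuditTrail:
    """Append-only list of step records, exportable as JSON Lines."""

    def __init__(self):
        self.records = []

    def log(self, actor, prompt, output):
        record = StepRecord(len(self.records) + 1, actor, prompt, output)
        self.records.append(record)
        return record

    def to_jsonl(self):
        return "\n".join(json.dumps(asdict(r)) for r in self.records)


trail = AuditTrail()
trail.log("model", "Classify invoice 1 (made-up example)", "category: travel")
trail.log("human", "Review classification", "approved")
print(trail.to_jsonl())
```

A trail like this shows what happened and when. It does not explain why the model produced a given output, which is the part of the auditability concern that logging alone leaves open.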
If you measure success by unit test failures or by the presence of the bug those behaviors can obscure that the LLM wasn't able to do the intended fix. Of course a closer inspection will still reveal what happened, but using proxy measurements to track success is dangerous, especially if the LLM knows about them or if the task description implies improving that metric "a unit test is failing, fix that"
*shrug*
To be honest, I don't think I'm going to get an answer.
The feedback from the architect was that the vendor was way too cautious in using AI. Nearly all vendors he has seen so far were too cautious. He lamented that no one was fully unleashing AI. They could achieve that by allowing read/write access to confidential data like ERP/CRMs and access to internet while being fully non-deterministic. Then AI could achieve lot more.
I explained that AI being right 95% of the time is still not good enough for finance workflows but he wouldn't budge. He kept repeating that non-deterministic and remove human in the loop is the way to go. I silently promised myself to stay away from any AI projects he might be part of.
Then tell us how he sees that 5% error rate.
For an "Architect" this is extremely troubling.
> He lamented that no one was fully unleashing AI
More than likely he will never be the one cleaning up the mess, probably he will be the one contracted to design proper systems though so maybe it's a genius move.
You just described how you get your google account locked... :-)
https://www.youtube.com/watch?v=_zfN9wnPvU0
These, outside of employee resistance, are technical problems. The insistence they aren’t seems to be the root of the misunderstanding around these tools. The reality is that “computers that speak English” are, at face value, incredibly impressive. But there’s nothing inherent to said systems that makes them easier to integrate with than computers which speak C. In fact I’d argue it’s harder because natural languages are significantly less precise.
Communication and integration are incredibly challenging because you’re trying to transfer states between systems. When you let “the machine carry a larger share of the burden,” as Dijkstra described of the presumed benefit of natural language programming but actual downside[0], you’re also forfeiting a large amount of control. It is for the same reason that word problems are considered more challenging than equations in math class. With natural languages the states being communicated are much less clear than with formal languages, and much of the burden assumed to be transferred to the machine comes back as a need for more specificity and precision, which formal languages already provide.
None of this is to say these tools aren’t useful nor that they cannot be deployed successfully. It is instead to say that the seduction of computers which speak English is exactly that: a seduction. These tools are incredibly easy to use to impress, and much more challenging to use to extract value.
0: https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
The integration parts aren’t natural language issues but connecting systems and how to put these things in your workflow.
For example. I have a bunch of off the cuff questions and problems and tasks. I want to have these in one place and have that trigger a conversation with ChatGPT, which shows the results in the first place but can be continued easily.
Before it was added the other week, I could track issues in Linear and I could have Codex write the code to solve them but only by manually moving text from one place to another, manually moving the tickets, checking on progress, clicking buttons to open PRs - all of that is integration hassle but none is about the model itself. I think now with GitHub Copilot I could just assign a task.
We’re saying the same thing. Integration is the hard (still technical) part.
For example we will accept 'probabilistic bookkeeping' because it's cheaper than requiring ledgers to balance to the penny.
But this leeway won't be equally applied. Powerful institutions like banks will use “probabilistic models” to decide they probably don’t owe you that refund, but if they decide you owe them money, they will still hold you to every cent.
Nondeterminism for the powerful, determinism for everyone else. Yay!
1. Laws can change.
2. Blackbox models can provide specific reasons, even if they need to hallucinate them.
People have used machine learning for fraud detection for a long time at this point. They do tolerate the false positives.
I've yet to see an 'agentic' setup that actually learns or improves over time. There are many techniques for this, but I don't see them used.
Why do you think that is?
Except it's worse than that because we'll all end up having to do it anyway, because the overall velocity of emitted working code will be faster, and productivity > all.
Maybe at least these charts are based on real data - albeit self-reported by AI startups likely talking to their investors.
Either way it's useless unless hopping on this train is a pastime of yours or you make a living taking investors - too poor to fund an OpenAI, but just rich enough to fund someone eating OpenAI's scraps - for their money.
It has been how many years of people trying to create businesses around chatgpt prompts? I think we need to bring bullying back. This is getting ridiculous.