LLM Output Drift in Financial Workflows: Validation and Mitigation (arxiv)
Key topics
Large Language Models
Financial Workflows
AI Governance
Reproducibility
A study on LLM output consistency in financial tasks reveals significant differences between smaller and larger models, sparking discussion on the use of LLMs in regulated financial workflows.
Key moments
- Story posted: Nov 12, 2025 at 2:53 PM EST
- First comment: Nov 12, 2025 at 2:53 PM EST (0s after posting)
- Peak activity: 8 comments in 1-2h
- Latest activity: Nov 13, 2025 at 3:35 AM EST
ID: 45905451 | Type: story | Last synced: 11/20/2025, 12:26:32 PM
Caveat: this measures reproducibility (edit distance), not full accuracy. Determinism is necessary for compliance, but it still needs semantic checks on top (e.g., embeddings compared against ground truth). It includes a harness, invariants (±5%), and attestation.
Thoughts on the inverse relationship between model size and reliability? I'm planning a follow-up with accuracy metrics rather than just reproducibility.
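For readers who want something concrete, here is a minimal sketch of the two checks mentioned above: drift measured as edit distance between repeated runs, and a ±5% numeric invariant. The function names, the normalization by the longer string, and the reading of ±5% as a relative tolerance against a reference value are all assumptions made for illustration; the comment does not say how the harness actually implements them.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def max_normalized_drift(outputs: list[str]) -> float:
    """Worst pairwise edit distance across runs, scaled to [0, 1].

    Assumption: distance is divided by the longer string's length.
    """
    worst = 0.0
    for a, b in combinations(outputs, 2):
        longest = max(len(a), len(b)) or 1  # avoid dividing by zero
        worst = max(worst, levenshtein(a, b) / longest)
    return worst


def within_tolerance(value: float, reference: float, tol: float = 0.05) -> bool:
    """Assumed invariant: value lies within +/- tol (relative) of reference."""
    return abs(value - reference) <= tol * abs(reference)
```

A drift of 0.0 means every run produced the same string. As the caveat says, that is only reproducibility: identical outputs can still be identically wrong, so a check like this would sit alongside an accuracy or semantic check rather than replace it.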
Is this perhaps inference implementation details somehow introducing randomness?
https://news.ycombinator.com/item?id=45200925
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
tl;dr: the way inference is batched introduces non-determinism.
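A toy illustration of why reduction order matters, offered as a sketch rather than a description of any real inference kernel: floating-point addition is not associative, so splitting the same sum into different chunks can change the result. Here `chunked_sum` and its `chunk` parameter are hypothetical stand-ins for a reduction strategy that varies with batch size. The loop accumulates by hand on purpose, because Python's built-in `sum` may apply compensated summation to floats in recent versions and would hide the effect.

```python
def naive_sum(xs):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for x in xs:
        total += x
    return total


def chunked_sum(xs, chunk):
    """Sum fixed-size chunks first, then sum the partial results.

    Assumption: `chunk` mimics a reduction split that depends on batch size.
    """
    partials = [naive_sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return naive_sum(partials)


# Grouping alone changes the low-order bits.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

# Same four numbers, different chunking, different answers.
values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 4))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

Neither result equals the exact sum of 2.0. In a model, differences this small can occasionally flip which token has the highest score, after which the generated text diverges.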
Says who?
The stuff you comply with changes in real time. How’s that for determinism?
Most groups I work with stick to traditional automation/rules systems, but top-down mandates are pushing them toward frontier models for general tasks, which then get plugged into these workflows. A lot stays in sandbox, but you'd be surprised what's already live in financial services.
The authorities I cited (FSB/BIS/CFTC) said just last month that AI monitoring is "still at early stage"; see https://www.fsb.org/2024/11/the-financial-stability-implicat...
Curious how you'd tackle regulation that changes in real time?
This was the link I meant, from Oct '25, reiterating that AI monitoring is still in its early stages
That's not the way regulations work. Your compliance is measured against a fixed version of legislation.
My bro, the tariffs. The first table of tariffs was written by ChatGPT!
> That's not the way regulations work.
Whatever regulations you are thinking of, they are myths now. I'm not saying deregulation - that isn't happening. In every industry - I know more about healthcare than finance - clear, complex, well specified regulations are being replaced by vague, mercurial ones. The SEC has changed many things too.
Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.
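The comment does not say how "~70% deterministic" was computed. One plausible reading, sketched here purely as an assumption, is the share of runs that match the most common output for a given prompt. The function name and the sample strings below are hypothetical and are not data from the study.

```python
from collections import Counter


def modal_agreement_rate(outputs: list[str]) -> float:
    """Fraction of runs whose (whitespace-trimmed) output equals the modal output."""
    if not outputs:
        raise ValueError("need at least one run")
    counts = Counter(o.strip() for o in outputs)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(outputs)


# Hypothetical example: 16 runs of one text-to-SQL prompt.
runs = ["SELECT 1"] * 11 + ["SELECT 1;"] * 3 + ["select 1"] * 2
print(modal_agreement_rate(runs))  # 0.6875
```

Exact string matching is strict: the three variants above are arguably the same query, so a looser metric (normalized SQL, or parsed JSON compared field by field) would report a higher rate for the same runs.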
What do you call the fallacy where the universe is imperfect, therefore nobody can have higher standards for anything?
Mankind has spent literal centuries observing deficiencies and faults in human bookkeeping and calculation, constantly trying to improve it with processes and machinery. There's no good reason to suddenly stop caring about those issues simply because the latest proposal is marketed as "AI".