TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
Source: arxiv.org · Tech story · Posted 3 months ago · Active 2 months ago
Key topics: AI, Taxation, LLMs
Researchers released TaxCalcBench, a benchmark for evaluating frontier models on tax calculation tasks, highlighting the challenges of applying AI to complex legal tasks like taxation.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 2h after posting
- Peak period: 22 comments on Day 1
- Average per period: 8 comments
- Comment distribution: 24 data points (based on 24 loaded comments)
Key moments
- Story posted: Oct 15, 2025 at 11:45 PM EDT (3 months ago)
- First comment: Oct 16, 2025 at 1:31 AM EDT (2h after posting)
- Peak activity: 22 comments in Day 1, the hottest window of the conversation
- Latest activity: Oct 29, 2025 at 7:55 PM EDT (2 months ago)
ID: 45601230 · Type: story · Last synced: 11/20/2025, 12:26:32 PM
If a ton of these mistakes are genuinely simple calculation errors, it seems like giving the models access to a calculator tool would help a fair bit.
I'm surprised they haven't tried this; I'm running my own taxes in parallel against my accountant in exactly this way.
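The calculator-tool idea above can be sketched with a minimal, safe arithmetic evaluator of the kind one might register as a tool for a model; the tool name, signature, and the bracket figures in the example are illustrative assumptions, not part of any real tax code or API:

```python
import ast
import operator

# Operators the tool is allowed to evaluate; everything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))

# Illustrative only: tax on income above a made-up bracket threshold.
print(round(calculator("(97000 - 44725) * 0.22"), 2))  # → 11500.5
```

The point is that the model only has to emit the expression; the deterministic tool does the arithmetic, which is exactly the failure class the parent comment suspects.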
Just a week ago I used Claude Code for my personal finances (not taxes): I downloaded over a year's worth of my bank account data. Since I pay for most things by card, if I buy lunch, it's there.
With a single prompt (and about 10 minutes), it produced an analysis. It solved all the technical issues by itself (e.g., realizing it wasn’t CSV but TSV) and ran quite a few different explorations with Pandas. It was able to write an overview, find items that were likely misclassified, etc.
Everything I checked by hand was correct.
So, instead of pursuing a project to write an AI tool for personal finance, I ended up concluding: "just use Claude Code." As a side note, I used 14 months of data by mistake: I wanted to analyze only 2 months, since I didn't believe it would handle a larger set, but I misclicked the year. The file was 350 KB.
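The CSV-vs-TSV detection the model worked out on its own can also be done deterministically with the standard library's `csv.Sniffer`. A minimal sketch, with a made-up bank-export sample (the real export format will differ):

```python
import csv
import io

def load_transactions(text: str) -> list[dict]:
    """Detect the delimiter (comma, tab, or semicolon) and parse rows into dicts."""
    dialect = csv.Sniffer().sniff(text.splitlines()[0], delimiters=",\t;")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    return list(reader)

# A bank export that is actually tab-separated despite a .csv extension.
sample = "date\tamount\tdescription\n2025-01-03\t-12.40\tlunch\n"
rows = load_transactions(sample)
print(rows[0]["description"])  # → lunch
```

Sniffing the header line first, as here, is the same "realize it isn't CSV but TSV" step, just made explicit.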
So until there's umbrella AI insurance...
As of now, I would not let an AI make any financial decisions with direct consequences automatically, unless the system has been tested and benchmarked against accountants.
I wonder what an average accountant would score.
I know LLMs have helped me identify many mistakes accountants have made on my behalf. Some mistakes that could have cost me a lot of money if not caught.
Let me know if you have any questions; happy to discuss!
Given that only short instructions are in context, I would not have expected even a frontier model to score well on this benchmark. For better results, I'd think that giving the model access to the entire tax code is required (which likely requires RAG due to its sheer size).
That said, we agree, and it's what we've built with our internal tax coding agent, Iris: https://www.columntax.com/blog/introducing-iris-our-ai-tax-d... (the ability to pull in just the right tax-form context on a per-line basis to turn tax law into code).
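The retrieval half of the RAG setup suggested above can be sketched with nothing but the standard library: rank tax-code sections by cosine similarity of term counts and hand the top hits to the model as context. The section texts below are abbreviated placeholders, not real statute text, and a production system would use proper embeddings rather than raw counts:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query: str, sections: dict[str, str], k: int = 2) -> list[str]:
    """Return the k section keys most similar to the query by term-count cosine."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))

    def score(body: str) -> float:
        d = Counter(tokenize(body))
        dot = sum(q[t] * d[t] for t in q)
        norm = q_norm * math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0

    return sorted(sections, key=lambda s: score(sections[s]), reverse=True)[:k]

# Placeholder section snippets; a real index would cover the full tax code.
sections = {
    "§1": "Tax imposed on taxable income of every individual ...",
    "§63": "Taxable income means gross income minus deductions ...",
    "§151": "Allowance of deductions for personal exemptions ...",
}
print(retrieve("how is taxable income defined", sections, k=1))  # → ['§63']
```

Even this toy ranker shows why retrieval helps: the model never needs the whole code in context, only the handful of sections that actually govern the line being computed.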
From another article today, I discovered the IRS has a GitHub repo with (what seem to be) XML versions of tax questions... surely some combination of an LLM and structured data querying could solve this? https://github.com/IRS-Public/direct-file/tree/main
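Structured querying over such XML is straightforward with the standard library's ElementTree. The document shape below (`question` elements with `id` attributes, `text` and `option` children) is a hypothetical stand-in; the actual schema in the IRS-Public/direct-file repo may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML shape, not the real direct-file schema.
doc = """
<questions>
  <question id="filing-status">
    <text>What is your filing status?</text>
    <option>single</option>
    <option>married-filing-jointly</option>
  </question>
</questions>
"""

root = ET.fromstring(doc)

def options_for(question_id: str) -> list[str]:
    """Return the answer options for a given question id, or [] if not found."""
    q = root.find(f".//question[@id='{question_id}']")
    return [opt.text for opt in q.findall("option")] if q is not None else []

print(options_for("filing-status"))  # → ['single', 'married-filing-jointly']
```

The appeal of the combination is that the LLM handles the natural-language side while lookups like this stay exact and auditable.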
Honestly, I think humans have trouble with this as well.
Unsurprisingly. Sometimes I feel like I am in a madhouse. Or in an alchemist's laboratory.