Structured Outputs Create False Confidence

Posted12 days agoActive11 days ago

gmays

151 points

66 comments

boundaryml.comTech DiscussionstoryHigh profile

skepticalnegative

Debate

60/100

AI Performance AnalysisStructured OutputsFalse Confidence

Key topics

AI Performance Analysis

Structured Outputs

False Confidence

The debate around structured outputs in AI models has sparked a lively discussion, with some commenters sharing their positive experiences with tools like Vercel AI SDK, while others highlighted the inconsistencies they've encountered, particularly with Gemini's responses. A notable exchange centered around the effectiveness of "Response Healing" features in addressing JSON syntax errors, with one commenter dismissing its capabilities. As the conversation unfolded, it became clear that the community is grappling with the nuances of AI output parsing, with some defending the author's exploration of model limitations as a valuable investigation, rather than an "anti-AI hot take."

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

2-4h

Avg / period

6.1

Comment distribution67 data points

Loading chart...

Based on 67 loaded comments

Key moments

01Story posted
Dec 21, 2025 at 10:06 AM EST
12 days ago
Step 01
02First comment
Dec 21, 2025 at 10:11 AM EST
5m after posting
Step 02
03Peak activity
23 comments in 2-4h
Hottest window of the conversation
Step 03
04Latest activity
Dec 22, 2025 at 9:53 AM EST
11 days ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (66 comments)

Showing 67 comments

swe_dima

12 days ago

2 replies

OpenAI structured outputs are pretty stable for me. Gemini sometimes responds with a completely different structure. Gemini 3 flash with grounding sometimes returns json inside ```json...``` causing parsing errors.

euazOn

12 days ago

1 reply

In case you're using OpenRouter, check out their new Response Healing feature that claims to solve exactly this issue.

https://openrouter.ai/announcements/response-healing-reduce-...

red2awn

12 days ago

It does NOT. Response healing only fixes JSON syntax errors, not schema differences.

See https://news.ycombinator.com/item?id=46332119

codegladiator

12 days ago

https://github.com/josdejong/jsonrepair

might be useful

mikert89

12 days ago

3 replies

please i cant take anymore anti ai hot takes.

swiftcoder

12 days ago

How is this an "anti AI hot take"? It's discussing using one type of LLM output versus another...

emp17344

12 days ago

Sounds like a you problem. I’m all for people investigating the boundaries of model capability - if you take that as a personal attack, you’re going to have a bad time over the next few years.

Leynos

12 days ago

While I suspect "hot take" is apt, this wasn't exactly anti-AI. Rather, the author is advocating for their particular way of doing genAI output parsing (as opposed to constrained decoding structured output, they advocate unconstrained decoding with a permissive parsing framework)

NitpickLawyer

12 days ago

1 reply

A 3rd alternative is to use the best of both worlds. Have the model respond in free-form. Then use that response + structured output APIs to ask it for json. More expensive, but better overall results. (and you can cross-check between your heuristic parsing vs. the structured output, and retry / alert on miss-matches)

theoli

12 days ago

I am doing this with good success parsing receipts with ministral3:14b. The first prompt describes the data being sought, and asks for it to be put at the end of the response. The format tends to vary between json, bulleted lists, and name: value pairs. I was never able to find a good way to get just JSON.

The second pass is configured for structured output via guided decoding, and is asked to just put the field values from the analyzer's response into JSON fitting a specified schema.

I have processed several hundred receipts this way with very high accuracy; 99.7% of extracted fields are correct. Unfortunately it still needs human review because I can't seem to get a VLM to see the errors in the very few examples that have errors. But this setup does save a lot of time.

dcastm

12 days ago

1 reply

While I agree that you must be careful when using structured outputs, the article doesn't provide good arguments:

1. In the examples provided, the author compares freeform CoT + JSON output vs. non-CoT structured output. This is unfair and biases the results towards what they wanted to show. These days, you don't need to include a "reasoning" field in the schema as mentioned in the article; you can just use thinking tokens (e.g., reasoning_effort for OpenAI models). You get the best of both worlds: freeform reasoning and structured output. I tested this, and the results were very similar for both.

2. Let Me Speak Freely? had several methodological issues. I address some of them (and .txt's rebuttal) here: https://dylancastillo.co/posts/say-what-you-mean-sometimes.h...

3. The truth is that structured outputs might improve or worsen your results depending on the use case. What you really need to do is run your evals and make a decision based on the data.

Der_Einzige

12 days ago

BTW, the structured outputs debate is significantly more complicated than even your own post implies.

You aren't testing structured outputs+model alone, you are testing

1. The structured outputs backend used. There are at least 3 major free ones, outlines, xgrammer, lm-format-enforcer and guidance. OpenAI, Anthropic, Google, and Grok will all have different ones. They all do things SIGNIFICANTLY differently. That's at least 8 different backends to compare.

2. The settings used for each structured output backend. Oh, you didn't know that there's often 5+ settings related to how they handle subtle stuff like whitespaces? Better learn to figure out what these settings do and how to tweak them!

3. The models underlying sampling settings, i.e. any default temperature, top_p/top_k, etc going on. Remember that the ORDER of application of samplers matters here! Huggingface transformers and vLLM have opposite defaults on if temperature happens before sampling or after!

4. The model, and don't forget about differences around quants/variants of the model!

Almost no one who does any kinds of these analysis even talk about these additional factors, including academics.

Sometimes it feels like I'm the only one in this world who actually uses this feature at the extremes of its capabilities.

pizzathyme

12 days ago

2 replies

The very first example, which is held up as an error, is actually arguably correct. If you asked a human (me) how many bananas were purchased, they clearly purchased one banana.

Yes the banana weighs 0.4 pounds. But the question was not to return the weight or the quantity, the question was to return the quantity.

It seems like more instructions are needed in the prompt that the author is not even aware of.

esafak

12 days ago

2 replies

Or one batch of bananas, weighing 0.4 pounds. The number of bananas is not specified in the receipt, and I would not expect the model to estimate it.

banandys

11 days ago

1 reply

it is very unlikely for 0.4 lbs of bananas to be more than one.

https://fdc.nal.usda.gov/food-details/1105314/measures

Sugar bananas or apple bananas would weigh less, but would cost more and probably not just be listed as bananas.

esafak

11 days ago

That would require the model to know or look up the average weight of a banana, and do arithmetic.

AmbroseBierce

12 days ago

The prompt literally tells it "if not specified assume 1"

banandys

12 days ago

A very common peeled banana weight is 100g (“metric banana”). This is convenient for calorie counting. 0.4lbs for a single banana as the peeled weight is probably around 125g.

https://www.reddit.com/r/dataisbeautiful/comments/bs741l/oc_...

simonw

12 days ago

3 replies

I wonder if a good workaround for this problem - the problem that turning on structured outputs makes errors more likely - would be to do this:

1. Prompt the LLM "extract numbers from this receipt, return data in this JSON format: ..." - without using the structured output mechanism.

2. If the returned JSON does indeed fit the schema then great, you're finished! But if it doesn't...

3. Round-trip the response from the previous call through the LLM again, this time with structured outputs configured. This should give you back the higher quality extracted data in the exact format you want.

kemiller

12 days ago

1 reply

That is more or less what BAML does

refulgentis

12 days ago

I understand this but A) then they should have done it here B) the idea that you can't get CoT x JSON without sacrificing JSON formatting is flat out wrong with ~any 2025 model. (i.e. reasoning models and their APIs specifically enable this)

hellovai

12 days ago

1 reply

(on of the creators of BAML here) yep! exactly!

that workaround we've found works quite well, but the problem is that its not sufficient to just retry in the case of failed schema matches (its both inefficient and also imo incorrect).

Take these two scenarios for example:

Scenario 1. My system is designed to output receipts, but the user does something malicious and gives me an invoice. during step 2, it fails to fit the schema, but then you try with step 3, and now you have a receipt! Its close, but your business logic is not expecting that. Often when schema alignment fails, its usually because the schema was ambiguous or the input was not valid.

Scenario 2. I ask the LLM to produce this schema:

    class Person {
      name string
      past_jobs string[]
    }

However the person only has ever worked at 1 job. so the LLM outputs: { "name": "Vaibhav", "past_jobs": "Google" }. Technically since you know you expect an array, you could just transform the string -> string[].

thats the algorithm we created: schema-aligned parsing. More here if you're interested: https://boundaryml.com/blog/schema-aligned-parsing

Benchmark wise, when we tested last, it seems to help on top of every model (especially the smaller ones) https://www.reddit.com/r/LocalLLaMA/comments/1esd9xc/beating...

Hope this helps with some of the ambiguities in the post :)

joatmon-snoo

12 days ago

1 reply

(author here) To be more specific, here's a benchmark that we ran last year, where we compared schema-aligned parsing against constrained decoding (then called "Function Calling (Strict)", the orange ƒ): https://boundaryml.com/blog/sota-function-calling

skybrian

12 days ago

[delayed]

ramraj07

12 days ago

Isn't it better to put it in an agent loop, with the structured output json just specified as a tool? The function call can then just return a summary of the parsed input. We can add in the system prompt a validation step to ask the llm to verify it has provided inputs correctly. This will allow the llm itself to self reflect and correct if needed.

supermdguy

12 days ago

1 reply

If your output schema doesn’t capture all correct outputs, that’s a problem with your schema, not the LLM. A human using a data entry tool would run into the wrong issue. Letting the LLM output whatever it wants just makes it so you have to deal with ambiguities manually, instead of teaching the LLM what to do.

I usually start by adding an error type that will be overused by the LLM, and use that to gain visibility into the types of ambiguities that come up in real-world data. Then over time you can build a more correct schema and better prompts that help the LLM deal with ambiguities the way you want it to.

Also, a lot of the chain of thought issues are solved by using a reasoning model (which allows chain of thought that isn’t included in the output) or by using an agentic loop with a tool call to return output.

dhruvbird

12 days ago

This ^^^^

While the provided schema has a "quantity" field, it doesn't mention the units.

<code>

class Item(BaseModel):

    name: str

    price: float = Field(description="per-unit item price")

    quantity: float = Field(default=1, description="If not specified, assume 1")

class Receipt(BaseModel):

    establishment_name: str

    date: str = Field(description="YYYY-MM-DD")

    total: float = Field(description="The total amount of the receipt")

    currency: str = Field(description="The currency used for everything on the receipt")

    items: list[Item] = Field(description="The items on the receipt")

</code>

There needs to be a better evaluation and a better provided schema that captures the full details of what is expected to be captured.

> What kind of error should it return if there's no total listed on the receipt? Should it even return an error or is it OK for it to return total = null?

Additionally, the schema allows optional fields, so the LLM is free to skip missing fields if they are specified as such.

Aurornis

12 days ago

1 reply

Does anyone have more benchmarks or evals with data on this topic? The claimed 20% accuracy reduction is significant.

Structured output was one of the lesser known topics that AI consultants and course writers got a lot of mileage out of because it felt like magic. A lot of management people would use ChatGPT but didn’t know how to bridge the text output into a familiar API format, so using a trick to turn it into JSON felt like the missing link. Now that I think about it, I don’t recall seeing any content actually evaluating the impact of constrained output on quality though.

This blog post blurs the lines between output quality reduction and incorrect error handling, though. I’d like to see some more thorough benchmarking that doesn’t try to include obvious schema issues in the quality reduction measurements.

crystal_revenge

12 days ago

(repeating an earlier comment). The team behind Outlines has repeatedly provided evaluations that show constrained decoding improves the outputs:

- https://blog.dottxt.ai/performance-gsm8k.html

- https://blog.dottxt.ai/oss-v-gpt4.html

- https://blog.dottxt.ai/say-what-you-mean.html

rybosome

12 days ago

1 reply

I have heard this argument before, but never actually seen concrete evals.

The argument goes that because we are intentionally constraining the model - I believe OAI’s method is a soft max (I think, rusty on my ML math) to get tokens sorted by probability then taking the first that aligns with the current state machine - we get less creativity.

Maybe, but a one-off vibes example is hardly proof. I still use structured output regularly.

Oh, and tool calling is almost certainly implemented atop structured output. After all, it’s forcing the model to respond with a JSON schema representing the tool arguments. I struggle to believe that this is adequate for tool calling but inadequate for general purpose use.

crystal_revenge

12 days ago

1 reply

> but never actually seen concrete evals.

The team behind the Outlines library has produced several sets of evals and repeatedly shown the opposite: that constrained decoding improves model performance (including examples of "CoT" which the post claims isn't possible). [0,1]

There was a paper that claimed constrained decoding hurt performance, but it had some fundamental errors which they also wrote about [2].

People get weirdly superstitious when it comes to constrained decoding as though t somehow "limiting the model" when it's just a simple as applying a conditional probably distribution to the logits. I also suspect this post is largely to justify the fact that BAML parses the results (since the post is written by them).

0. https://blog.dottxt.ai/performance-gsm8k.html

1. https://blog.dottxt.ai/oss-v-gpt4.html

2. https://blog.dottxt.ai/say-what-you-mean.html

Der_Einzige

12 days ago

1 reply

To be fair, there is "real harm" from constraining LLM outputs related to, for example, forcing lipograms or the letter "E" and a model responding with misspellings of words (deleted E) with the letter "E" in it rather than words that don't actually have the letter "E" at all. This is why some authors propose special decoders to fix that diversity problem. See this paper and most of what it cites around it for examples of this: https://arxiv.org/abs/2410.01103

This is independent from a "quality" or "reasoning" problem which simply does not exist/happen when using structured generation.

crystal_revenge

12 days ago

> "reasoning" problem which simply does not exist/happen when using structured generation

The first article demonstrates exactly how to implement structured generation with CoT. Do you mean “reasoning” other than traditional CoT (like DeepSeek)? I’ll have to look for an reference but I recall the Outlines team also handling this latter case.

michaelgiba

12 days ago

1 reply

It’s not surprising that there could be a very slight quality drop off for making the model return its answer in a constrained way. You’re essentially forcing the model to express the actual answer it wants to express in a constrained language.

However I would say two things: 1. I doubt this quality drop couldn’t be mitigated by first letting the model answer in its regular language and then doing a second constrained step to convert that into structured outputs. 2. For the smaller models I have seen instances where the constrained sampling of structured outputs actually HELPS with output quality. If you can sufficiently encode information in the structure of the output it can help the model. It can effectively let you encode simple branching mechanisms to execute at sample time

altmanaltman

12 days ago

1 reply

> You’re essentially forcing the model to express the actual answer it wants to express in a constrained language.

You surely aren't implying that the model is sentient or has any "desire" to give an answer, right?

And how is that different from prompting in general? Isn't using english already a constraint? And isn't that what it is designed for, to work with prompts that provide limits in which to determine the output text? Like there is no "real" answer that you supress by changing your prompt.

So I don't think its a plausible explanation to say this happens because we are "making" the model return its answerr in a "constrained language" at all.

michaelgiba

12 days ago

> You surely aren't implying that the model is sentient or has any "desire" to give an answer, right?

The model is a probabilistic machine that was trained to generate completions and then fine tuned to generate chat style interactions. There is an output, given the prompt and weights, that is most likely under the model. That’s what one could call the model’s “desired” answer if you want to anthropomorphize. When you constrain which tokens can be sampled at a given timestep you by definition diverge from that

throw-qqqqq

12 days ago

2 replies

Interesting! .TXT has the opposite conclusion, that structured output improves performance:

https://blog.dottxt.ai/say-what-you-mean.html

https://blog.dottxt.ai/prompt-efficiency.html

flagos10

12 days ago

Same for me. Using structured output was much better than without.

Der_Einzige

12 days ago

Yup. I instantly linked these because the multiple papers who claim structured outputs harm quality are not just wrong, but fatally damaging to the whole AI ecosystem especially AI agents.

There are places where structured outputs harms creativity, but usually that's a decoding time problem which is similarly solved with better sampling, like they talk about in this paper: https://arxiv.org/abs/2410.01103

Claims of harmed reasoning performance are really evidence that 1. Your structured generation backend is bad or 2. Some shenanigans/interactions with temperature/samplers (this is the most common by far) or 3. You are bad at benchmarking.

kgeist

12 days ago

1 reply

Just a week ago, I rewrote our RAG pipeline to use structured outputs, and the tests showed no significant difference in quality with vLLM after a few tweaks. It highly depends on the model, the prompt, the schema, etc. What helped was that we have a pipeline where another LLM automatically scores 'question-expected answer' pairs, so what we did was: tweak the schema/prompt => evaluate => tweak again, until we got good results in most cases, just like with free-form prompts.

Several issues were found:

1. A model may sometimes get stuck generating whitespace at the end forever (the JSON schema allows it), which can lock up the entire vLLM instance. The solution was to use xgrammer, because it has a handy feature that disallows whitespace outside of strings.

2. In some cases I had to add fiddle with metainformation like minItems/maxItems for arrays, or the model would either hallucinate or refuse to generate anything.

3. Inference engines may reorder the fields during generation, which can impact the quality due to the autoregressive nature of LLMs (like, the "calculation" field must come before the "result" field). Make sure the fields are not reordered.

4. Field names must be as descriptive as possible, to guide the model to generate expected data in the expected form. For example, "durationInMilliseconds" instead of just "duration".

Basically, you can't expect a model to give you good results out of the box with structured outputs if the schema is poorly designed or underspecified.

Der_Einzige

11 days ago

Ding ding ding ding, we have another person who actually understands how to use this feature.

The fact that most people don't know any of these things that you are mentioning is one of the myriad reasons why the most killer feature of LLMs continues to languish in obscurity.

hamasho

12 days ago

1 reply

Story time.

I used Python's Instructor[1], a package to force the model output to match the predefined Pydantic model. It's used like in the example below, and the output is guaranteed to fit the model.

    import instructor
    from pydantic import BaseModel

    class Person(BaseModel):
        name: str
        age: int

    client = instructor.from_provider("openai/gpt-5-nano")
    person = client.create(
        response_model=Person,
        messages=[{"role": "user", "content": "Extract: John is a 30-year-old"}]
    )
    print(person)

I defined a response model for chain of thought prompt with answers and its thinking process, then asked questions.

    class MathAnswer(BaseModel):
        value: int
        reasoning: str

    answer = client.create(
        response_model=MathAnswer,
        messages=[{"role": "user", "content": "What's the answer to 17*4+1? Think step by step"}]
    )
    print(f"answer={answer.value}, {answer.reasoning}")

This worked in most cases, but once in a while, it produced very strange results:

    67, First I calculated 17*4=68, then I added 1 so the answer is 69

The actual implementation was much more complicated with a lot of inserted context and long, engineered prompt, and it happened only a few times, so I took hours to figure out if it's caused by a programming bug or just LLM's randomness.

Turned out, because I defined MathAnswer in that order, the model output was in the same order and it put the `reasoning` after the `answer`, so the thinking process didn't influence the answer like `{"answer": 67, "reasoning": "..."}` instead of `{"reasoning": "...", "answer": 69}`. I just changed the order of the model's properties and the problem was gone.

    class MathAnswer(BaseModel):
        reasoning: str
        value: int

[1] https://python.useinstructor.com/#what-is-instructor

danwilsonthomas

11 days ago

I've seen something similar when using Pydantic to get a list from structured output. The input included (among many other things) a chunk of CSV formatted data. I found that if the CSV had an empty column - a literal ",," - then there was a high likelihood that the structured output would also have a literal ",," which is of course invalid Python syntax for a list and Pydantic would throw an error when attempting to construct the output. Simply changing all instances of ",," to ",NA," in the input CSV chunk reduced that to effectively zero. A simple hack, but one that fit within the known constraints.

polyrand

11 days ago

I enjoyed the post. I was about to link the "Let Me Speak Freely" paper and "Say What You Mean" response from dottxt, but that's already been posted in the comments.

I'm a huge fan of structured outputs, but also recently started splitting both steps, and I think it has a bunch of upsides normally not discussed:

1. Separate concerns, schema validation errors don't invalidate the whole LLM response. If the only error is in generating schema-compliant tokens (something I've seen frequently), retries are much cheaper.

2. Having the original response as free text AND the structured output has value.

3. In line with point 1, it allows using a more expensive (reasoning) model for free-text generation, then a smaller model like gemini-2.5-flash to convert the outputs to structured text.

crystal_revenge

12 days ago

Ultimately the entire discussion around whether or not constrained decoding (the specific implementation of producing structured outputs discussed, also the method used by OpenAI for structured outputs) harms or helps the output of an LLM is a poorly specified question that cannot be answered with evals alone (though not using evals is worse than having some).

First, we need to recognized that "JSON" is just one specific application of constrained decoding. While structured generation is often to used produce JSON, there's absolutely no reason for this to be the case. Constrained decoding is ultimately using a FSM/regex to efficiently transform the default token probability distribution to a conditional distribution based on admissible tokens at any given state in the generation.

This is why statements like:

> Chain-of-thought is crippled by structured outputs

Are pretty silly, since you can easily allow for any chain-of-though logic you want using structured generation (in fact, I've done some really interesting experiments with controlling the chain-of-though process). The reasoning example they show in the post could be easily implemented in any major structured generation library.

But this gets to my original point: it isn't really meaningful to say "structure generation hurts/helps" performance. I can trivially create a structure that cannot possibly hurt performance:

> r'.'

Which will not change the output in any way. Likewise I can easily create structure that

will* hurt performance. Suppose for the 'last letter' eval set (where the answer uses only letters) I have the following structure:

> r'[0-9]^4'

That will unquestionably do worse (in fact it will fail perfectly).

Aside from those two extremes the question is really the probability that certain tokens will lead to paths that uniquely find the answer or have an increased probability of creating some sort of parsing error. We have plenty of cases that poor implementation so structure get worse results, and good implementations of it get better. But whose to say that even in those "better" cases we aren't missing an even more superior prompting strategy that is made worse by reasonable structure?

The real answer to this question involves not a binary decision around the use of structure, but much deeper understanding of the sampling properties of LLMs. I suspect we're still aggressively under utilizing constrained decoding, with the caveat that we need to be thinking much bigger than just "give me JSON" if we want to really answer these questions. Unfortunately this remains a fairly niche topic.

Oras

12 days ago

I would like to see a real example, the one given is assuming wanting float and assigning int.

What if you put “float” instead of int to get the required number?

Also the post is missing another use case, enums in structured data. I’ve been using it successfully for a few months now and it’s doing a fantastic job.

raw_anon_1111

12 days ago

From what I have found text -> structured text works well. I do a lot of call center based projects where I need to get intents (what API I need to call to fulfill the user’s request) and add slots (the variable part of the message like addresses).

Even Amazon’s cheapest and fastest model does that well - Nova Lite.

But even without using his framework, he did give me an obvious in hindsight method of handling image understanding.

I should have used a more advanced model to describe the image as free text and then used a cheap model to convert text to JSON.

I also had the problem that my process hallucinated that it understood the “image” contained in a Mac .DS_Store file

ursAxZA

12 days ago

This seems less like a failure of structured outputs and more like expecting LLMs to behave like deterministic parsers — or am I missing something?

armcat

12 days ago

I really like BAML but this post seems a little too much like a BAML funnel. Here are three methods that worked for me consistently since constrained sampling first came out:

1. Add a validation step (using a mini model) right at the beginning - sub-second response times; the validation will either emit True/False or emit a function call

2. Use a sequence of (1) large model without structured outputs for reasoning/parsing, chained to (2) small model for constrained sampling/structured output

3. Keep your Pydantic models/schemas as flat (not too nested and not too many enumarations) and "help" the model in the system prompt as much as you can

TZubiri

12 days ago

They worked fine for me. Keep working at it until results are positive instead of rabbit holing into a failure mode with a blog post.

It's usually more productive to right about how LLMs work rather than how they don't. In this case especially, there's improvements that can be made to the schema, without forfeiting on the idea of schemas altogether

ojr

12 days ago

I didn't have reliable structured outputs until I switched to Gemini 2.5

whakim

12 days ago

I don't really understand the point around error handling. Sure, with structured outputs you need to be explicit about what errors you're handling and how you're handling them. But if you ask the model to return pure text, you now have a universe of possible errors that you still need to handle explicitly (you're using structured outputs, so your LLM response is presumably being consumed programmatically?), including a whole bunch of new errors that structured outputs help you avoid.

Also, meta gripe: this article felt like a total bait-and-switch in that it only became clear that it was promoting a product right at the end.

andy12_

12 days ago

It seems like this could be solved by partial structured output, where the structure of the JSON itself is constrained, but the values of the JSON entries is not (so even if "quantity" here is set to int, the model can output "52.2"). Of course, we would need additional parsing, but I think it's a fair compromise.

And about structured outputs messing with chain-of-thought... Is CoT really used with normal models nowadays? I think that if you need CoT you might as well use a reasoning model, and that solves the problem.

villgax

12 days ago

Skill problem not an LLM problem

IshKebab

11 days ago

This seems pretty silly to me. Their solution for how do get structured output is pretty much just "don't". Well we still need the structured output so what do we do then?

> you need a parser that can find JSON in your output and, when working with non-frontier models, can handle unquoted strings, key-value pairs without comma delimiters, unescaped quotes and newlines; and you need a parser that can coerce the JSON into your output schema, if the model, say, returns a float where you wanted an int, or a string where you wanted a string[].

Oh cool I'm sure that will be really reliably. Facepalm.

> Allow it to respond in a free-form style: let it refuse to count the number of entries in a list, let it warn you when you've given it contradictory information, let it tell you the correct approach when you inadvertently ask it to use the wrong approach

This makes zero sense. The whole point of structured output is that it's a (non-AI) program reading it. That program needs JSON input with a given schema. If it is able to handle contradictory-information warnings, or being told you're using the wrong approach then that will be in the schema anyway!

I think the point about thinking models is interesting, but the solution to that is obviously to allow it to think without the structuring constraint, and then feed the output from that into a query with the structured output constraint.

alienbaby

12 days ago

I've wondered if it's because structured outputs rely on visual cues to impart meaning, and turning them into token streams looses that spatial structure.

Veen

12 days ago

Doesn't the Claude APIs recently introduced ability to combine extended thinking with structured outputs overcome this issue? You get the unconstrained(ish) generation in the extended thinking blocks and then structured formatting informed by that thinking in the final output.

learningmore

11 days ago

“However, with the same model, if you just use the completions API and then parse the output, it will return the correct quantity…”

dzrmb

12 days ago

Interesting read and perspective. I had very good results with structured outputs, both text, images and tool calling. Also a lot of SDKs are using it, including Vercel AI SDK.

Thanks for sharing

noreplydev

12 days ago

I don’t know if 0,42 should be the quantity

A_SIGINT

12 days ago

> Chain-of-thought is crippled by structured outputs

I don't know if this is true. Libraries such as Pydantic AI and I would assume the model provider SDKs stream different events. If COT is needed then a <think> section would be emitted and then later the structured response would occur when the model begins its final response.

Structured outputs can be quite reliable if used correctly. For example, I designed an AST structure that allows me to reliably generate SQL. The model has tools to inspect data-points, view their value distributions (quartiles, medians, etc). Then once I get the AST structure back I can perform semantic validation easily (just walk the tree like a compiler). Once semantic validation passes (or forces a re-prompt with the error), I can just walk the tree again to generate SQL. This helps me reliably generate SQL where I know it won't fail during execution, and have a lot of control over what data-points are used together, and ensuring valid values are used for them.

I think the trick is just generating the right schema to model your problem, and understanding the depth of an answer that might come back.

refulgentis

12 days ago

"CoT x JSON means you can't get JSON" is 2024.

Every model has built-in segmentation between reasoning/CoT + JSON.

machinationu

12 days ago

or tell it to output the data at the end as markdown and then do a second pass with a cheaper model to build the structured output

also, xml works much better than json, all the model guides say this

sebazzz

12 days ago

If this analysis is sound, I wonder if it can be mitigated by using tools instead of structured outputs.

cmews

12 days ago

Structured outputs work well depending on the tasks. The example mentioned in the blog post output doesn’t say anything because we are missing the prompt/schema definition. Also quantity is quite ambiguous because it could be bananas as a term is readable once on the receipt.

I would love some more detailed and reproducible examples, because the claims don’t make sense for all use cases I had.

Der_Einzige

12 days ago

No, structured outputs do NOT degrade output quality, at least not in the ways you claim. How many times do we have to debunk this FUD, old man?

https://blog.dottxt.ai/say-what-you-mean.html

softwaredoug

12 days ago

Without structured outputs, for some classification tasks you can use the hallucinate -> resolve pattern

Step one ask the LLM to extract something from the prompt. Like extract the color or category of a product or user request. Give examples of what valid instance of these entities look like, ask for output that looks like this (encourage the LLM to engage in creative hallucination)

Step two, with hallucinated entites, lookup via embedding similarity to find the most similar “real” entities. Then return these.

It can save you a lot of tokens (you don’t have to enumerate every legal value). And you can get by with a cheaper model.

View full discussion on Hacker News

ID: 46345333Type: storyLast synced: 12/24/2025, 3:00:23 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN