Trying Out Gemini 3 Pro with Audio Transcription and a New Pelican Benchmark
Posted about 2 months ago · Active about 2 months ago
simonwillison.net · Tech · story · High profile
Sentiment: excited / positive
Debate: 60/100
Key topics
Artificial Intelligence
Large Language Models
Gemini 3 Pro
Audio Transcription
The author tests Gemini 3 Pro with audio transcription and introduces a new pelican benchmark, sparking discussion on AI applications in journalism and benchmarking.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 36m after posting
Peak period: 20 comments in 0-2h
Avg / period: 7.3
Comment distribution: 51 data points
Based on 51 loaded comments
Key moments
1. Story posted: Nov 18, 2025 at 2:05 PM EST (about 2 months ago)
2. First comment: Nov 18, 2025 at 2:41 PM EST (36m after posting)
3. Peak activity: 20 comments in 0-2h (the hottest window of the conversation)
4. Latest activity: Nov 19, 2025 at 8:38 AM EST (about 2 months ago)
ID: 45970519 · Type: story · Last synced: 11/20/2025, 5:36:19 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
If you need diarization, you can use something like https://github.com/m-bain/whisperX
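For reference, a minimal sketch of the whisperX flow, adapted from its README; the exact class paths and model names have moved around between releases, so treat this as illustrative rather than copy-paste ready:

```python
import whisperx

device = "cuda"  # or "cpu"
audio_file = "council_meeting.mp3"  # placeholder path

# 1. Transcribe with a Whisper model (batched inference)
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the output to get accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels to each segment
# (newer releases may expose this as whisperx.diarize.DiarizationPipeline)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg["start"], seg.get("speaker"), seg["text"])
```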
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
I wonder whether, if you put the audio into a video that is nothing but a black screen with a timer running, it would be able to timestamp correctly.
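If anyone wants to try that, a rough sketch; it shells out to ffmpeg (which must be installed), burns a running HH:MM:SS timer onto a black frame with the drawtext filter, and muxes in the original audio. The file names are placeholders:

```python
import subprocess

subprocess.run([
    "ffmpeg",
    "-f", "lavfi", "-i", "color=c=black:s=1280x720:r=25",  # synthetic black video source
    "-i", "meeting_audio.mp3",                              # the original audio
    "-vf", "drawtext=text='%{pts\\:hms}':fontcolor=white:fontsize=72:"
           "x=(w-text_w)/2:y=(h-text_h)/2",                 # burn in a running timer
    "-shortest",                                            # stop when the audio ends
    "-c:a", "aac",
    "timer_video.mp4",
], check=True)
```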
IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high quality raw transcript that is pretty much exactly what was actually said.
It would be rough in places (e.g., Speaker 1, Speaker 2, etc. rather than actual speaker names).
Then you want to post-process with a LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and query against it.
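A rough sketch of that second pass, using the OpenAI Python client purely as a stand-in; the model name, prompt, and file path are placeholders, and any chat-style LLM API would work:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

raw_transcript = open("diarized_transcript.txt").read()  # "Speaker 1: ..." lines

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": (
            "You are cleaning up a diarized council-meeting transcript. "
            "Infer real names for Speaker 1, Speaker 2, ... from context "
            "(e.g. when speakers address each other by name), replace the labels, "
            "and fix obvious punctuation errors. Do not change the wording."
        )},
        {"role": "user", "content": raw_transcript},
    ],
)

print(response.choices[0].message.content)
```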
I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.
(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )
Separately, in my role as wizened 16-year veteran of HN: it was jarring to read that. There's a "rules" section, but don't be turned off by the name; it's more like a collection of guidelines for interacting in a way that encourages productive, illuminating discussion. One of the key rules is not to take the weakest interpretation of what someone wrote. Here, someone spelled out exactly how to do it, and we shouldn't then assume it's not AI, tie it to a vague, demeaning description of "AI hype", and then ask an unanswerable question about what the point of "AI hype" is.
To be clear: if you're nontechnical and new to HN, it would be hard to know how to ask that a different way, I suppose.
LLM summarization is utterly useless when you want 100% accuracy on the final binding decisions of things like council meetings. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including revisiting earlier agenda items later in the meeting, etc.
With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with "convincing"-sounding but hallucinated sentences any day.
Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).
But not impossible. I've had success with prompts that identify all topics, then map all conversation tied to each topic (each as a separate LLM query), and then pull together a summary and conclusions by topic, roughly the sketch below.
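Something like this shape; call_llm here is a stand-in for whatever model API you're using, and the prompts are illustrative only:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM API you're using."""
    raise NotImplementedError

transcript = open("meeting_transcript.txt").read()

# Pass 1: identify the topics actually discussed
topics = call_llm(
    "List every distinct topic discussed in this meeting transcript, "
    "one per line:\n\n" + transcript
).splitlines()

# Pass 2: for each topic, pull out all related discussion in its own query,
# including agenda items that get revisited later in the meeting
per_topic = {
    topic: call_llm(
        f"Extract every exchange in this transcript that relates to '{topic}', "
        "including any later revisiting of the item, then summarise the "
        "discussion and any decision reached:\n\n" + transcript
    )
    for topic in topics if topic.strip()
}

# Pass 3: stitch the per-topic notes into one report
report = call_llm(
    "Combine these per-topic notes into a single meeting summary with a "
    "'Decisions' section:\n\n" +
    "\n\n".join(f"## {t}\n{s}" for t, s in per_topic.items())
)
print(report)
```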
I've also had success with one-shot prompts, especially with the right context on the event and phrasing shared. But honestly I end up spending about 5-10 minutes reviewing and cleaning up the output before it's solid.
Still, that's worlds better than attending the event and then manually pulling together notes from your fast in-flight shorthand.
(Former BA; ran JADs, etc.; lived and died by accuracy and the right color / expression / context in notes.)
Almost makes me wonder if, behind the scenes, it's doing something like: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> slaps whatever timestamps it does have onto the summaries.
I would be very curious to see whether it does better on something like an hour-long chunk of audio, to see if this is just some sort of context issue, or whether the timestamps fix themselves if the same audio is fed to it in, say, 45-minute chunks.
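Chunking the audio for that kind of experiment is straightforward with something like pydub; a sketch with a 45-minute window per the suggestion above and a small overlap so sentences aren't cut mid-word (file names are placeholders):

```python
from pydub import AudioSegment

audio = AudioSegment.from_file("council_meeting.mp3")

chunk_ms = 45 * 60 * 1000   # 45-minute windows
overlap_ms = 30 * 1000      # 30s overlap between chunks

start = 0
index = 0
while start < len(audio):
    chunk = audio[start : start + chunk_ms]
    chunk.export(f"chunk_{index:02d}.mp3", format="mp3")
    # The true offset of each chunk is known, so any per-chunk timestamps the
    # model produces can be shifted back to absolute meeting time.
    print(f"chunk_{index:02d}.mp3 starts at {start / 1000:.0f}s")
    start += chunk_ms - overlap_ms
    index += 1
```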
Since plain decoder models stole the show, Google DeepMind demonstrated a way to adapt LLMs, adding a T5 encoder to an existing Gemma model to get the benefits of more grounded text-to-text tasks WITHOUT instruction tuning (and its increased risk of prompt injection). They've shared a few different variants on Hugging Face. I haven't got around to fine-tuning one for summarisation yet, but it could well be a good route to more reliable summarisation. I did try out some models for inference, though, and made a Gist here, which is useful since I found the HF default code example a bit broken:
https://gist.github.com/lukestanley/ee89758ea315b68fd66ba52c...
Google's minisite: https://deepmind.google/models/gemma/t5gemma/
Paper: https://arxiv.org/abs/2504.06225
Here is one such model that didn't hallucinate and actually did summarise on HF: https://huggingface.co/google/t5gemma-l-l-prefixlm
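For anyone who doesn't want to dig through the Gist, a heavily hedged sketch of what inference might look like, assuming the checkpoint loads through the standard seq2seq Auto classes in a recent transformers release; as noted above, the default HF example was a bit broken, so expect to adjust this (prompt format and file path are assumptions):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-l-l-prefixlm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Plain text in, summary out; the exact prompt format the checkpoint expects
# may differ, so treat this as a starting point.
text = open("meeting_transcript.txt").read()[:4000]
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```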
The cutting step is simple, and the token count is pretty much the same, but the crucial additional detail allows for excellent transcription fidelity, time-wise.
We've also experimented with passing in a regular speech-to-text (non-LLM) transcript for reference, which again helps the LLM do better.
Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
I imagine that you could theoretically also guess and check without the web browser by manually rendering the SVG using some graph paper, a compass, a straightedge, and coloured pencils, but that sounds unbelievably tedious and also very error-prone.
0: https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
Without having an LLM figure out the required command line parameters? Mad props!
> Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
I think a more challenging, well, challenge, would be to offer an even more absurd scenario and see how the model handles it.
Example: generate an svg of a pelican and a mongoose eating popcorn inside a pyramid-shaped vehicle flying around Jupiter. Result: https://imgur.com/a/TBGYChc
I was inspired by Max Woolf's nano banana test prompts: https://minimaxir.com/2025/11/nano-banana-prompts/
Do you think it would be reasonable to include both in future reviews, at least for the sake of back-compatibility (and comparability)?
I'll need to update for V2!
Love the pivot in pelican generation bench.
These things are getting really good at just regular transcription (as long as you don't care about verbatimicity), but every additional dimension you add (timestamps, speaker assignment, etc) will make the others worse. These work much better as independent processes that then get reconciled and refined by a multimodal LLM.
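A minimal sketch of that reconciliation step, with made-up data shapes: take segment timestamps from the ASR pass and speaker turns from the diarization pass, and label each segment with the speaker whose turn overlaps it most.

```python
def assign_speakers(segments, turns):
    """segments: [{"start": s, "end": e, "text": ...}] from the ASR pass.
    turns: [{"start": s, "end": e, "speaker": "SPEAKER_01"}] from diarization.
    Labels each segment with the speaker whose turn overlaps it the most."""
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        seg["speaker"] = best_speaker or "UNKNOWN"
    return segments

# Example shapes (made up):
segments = [{"start": 0.0, "end": 4.2, "text": "Call the meeting to order."}]
turns = [{"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"}]
print(assign_speakers(segments, turns))
```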
5 more comments available on Hacker News