Trying Out Gemini 3 Pro with Audio Transcription and a New Pelican Benchmark
Posted about 2 months ago · Active about 2 months ago
simonwillison.net · Tech · story · High profile
Sentiment: excited / positive
Debate: 60/100
Key topics
Artificial Intelligence
Large Language Models
Gemini 3 Pro
Audio Transcription
The author tests Gemini 3 Pro with audio transcription and introduces a new pelican benchmark, sparking discussion on AI applications in journalism and benchmarking.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 36m after posting
Peak period: 20 comments in 0-2h
Avg / period: 7.3
Comment distribution: 51 data points
Based on 51 loaded comments
Key moments
1. Story posted: Nov 18, 2025 at 2:05 PM EST (about 2 months ago)
2. First comment: Nov 18, 2025 at 2:41 PM EST (36m after posting)
3. Peak activity: 20 comments in 0-2h (the hottest window of the conversation)
4. Latest activity: Nov 19, 2025 at 8:38 AM EST (about 2 months ago)
ID: 45970519 · Type: story · Last synced: 11/20/2025, 5:36:19 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
If you need diarization, you can use something like https://github.com/m-bain/whisperX
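For reference, a minimal sketch of the whisperX flow, adapted from its README; the exact class paths and model names have moved around between releases, so treat this as illustrative rather than copy-paste ready:

```python
import whisperx

device = "cuda"  # or "cpu"
audio_file = "council_meeting.mp3"  # placeholder path

# 1. Transcribe with a Whisper model (batched inference)
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the output to get accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels to each segment
# (newer releases may expose this as whisperx.diarize.DiarizationPipeline)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg["start"], seg.get("speaker"), seg["text"])
```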
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
I wonder whether, if you put the audio into a video that is nothing but a black screen with a timer running, it would be able to timestamp correctly.
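If anyone wants to try that, a rough sketch; it shells out to ffmpeg (which must be installed), burns a running HH:MM:SS timer onto a black frame with the drawtext filter, and muxes in the original audio. The file names are placeholders:

```python
import subprocess

subprocess.run([
    "ffmpeg",
    "-f", "lavfi", "-i", "color=c=black:s=1280x720:r=25",  # synthetic black video source
    "-i", "meeting_audio.mp3",                              # the original audio
    "-vf", "drawtext=text='%{pts\\:hms}':fontcolor=white:fontsize=72:"
           "x=(w-text_w)/2:y=(h-text_h)/2",                 # burn in a running timer
    "-shortest",                                            # stop when the audio ends
    "-c:a", "aac",
    "timer_video.mp4",
], check=True)
```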
IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high quality raw transcript that is pretty much exactly what was actually said.
It would be rough in places (e.g., Speaker 1, Speaker 2, etc. rather than actual speaker names).
Then you want to post-process with a LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and query against it.
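A rough sketch of that second pass, using the OpenAI Python client purely as a stand-in; the model name, prompt, and file path are placeholders, and any chat-style LLM API would work:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

raw_transcript = open("diarized_transcript.txt").read()  # "Speaker 1: ..." lines

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": (
            "You are cleaning up a diarized council-meeting transcript. "
            "Infer real names for Speaker 1, Speaker 2, ... from context "
            "(e.g. when speakers address each other by name), replace the labels, "
            "and fix obvious punctuation errors. Do not change the wording."
        )},
        {"role": "user", "content": raw_transcript},
    ],
)

print(response.choices[0].message.content)
```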
I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.
(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )
Separately, in my role as wizened 16-year veteran of HN: it was jarring to read that. There's a "rules" section, but don't be turned off by the name; it's more like a collection of guidelines for interacting in a way that encourages productive, illuminating discussion. One of the key rules is not to take the weakest interpretation of what someone wrote. Here, someone spelled out exactly how to do it, and we shouldn't then assume it's not AI, tie it to a vague, demeaning description of "AI hype", and then ask an unanswerable question about what the point of "AI hype" is.
To be clear: if you're nontechnical and new to HN, it would be hard to know how to ask that a different way, I suppose.
LLM summarization is utterly useless when you want 100% accuracy on the final binding decisions of things like council meetings. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including revisiting earlier agenda items later in the meeting, etc.
With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with "convincing"-sounding but hallucinated sentences any day.
Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).
But not impossible. I've had success with prompts that identify all topics, then map all conversation tied to each topic (each as a separate LLM query), and then pull together a summary and conclusions by topic, roughly the sketch below.
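Something like this shape; call_llm here is a stand-in for whatever model API you're using, and the prompts are illustrative only:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM API you're using."""
    raise NotImplementedError

transcript = open("meeting_transcript.txt").read()

# Pass 1: identify the topics actually discussed
topics = call_llm(
    "List every distinct topic discussed in this meeting transcript, "
    "one per line:\n\n" + transcript
).splitlines()

# Pass 2: for each topic, pull out all related discussion in its own query,
# including agenda items that get revisited later in the meeting
per_topic = {
    topic: call_llm(
        f"Extract every exchange in this transcript that relates to '{topic}', "
        "including any later revisiting of the item, then summarise the "
        "discussion and any decision reached:\n\n" + transcript
    )
    for topic in topics if topic.strip()
}

# Pass 3: stitch the per-topic notes into one report
report = call_llm(
    "Combine these per-topic notes into a single meeting summary with a "
    "'Decisions' section:\n\n" +
    "\n\n".join(f"## {t}\n{s}" for t, s in per_topic.items())
)
print(report)
```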
I've also had success with one-shot prompts, especially with the right context on the event and phrasing shared. But honestly I end up spending about 5-10 minutes reviewing and cleaning up the output before it's solid.
Still, that's worlds better than attending the event and then manually pulling together notes from your fast in-flight shorthand.
(Former BA; ran JADs, etc.; lived and died by accuracy and the right color / expression / context in notes.)
Almost makes me wonder if, behind the scenes, it's doing something like: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> slaps whatever timestamps it does have onto the summaries.
I would be very curious to see whether it does better on something like an hour-long chunk of audio, to see if this is just some sort of context issue, or whether the timestamps fix themselves if the same audio is fed to it in, say, 45-minute chunks.
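Chunking the audio for that kind of experiment is straightforward with something like pydub; a sketch with a 45-minute window per the suggestion above and a small overlap so sentences aren't cut mid-word (file names are placeholders):

```python
from pydub import AudioSegment

audio = AudioSegment.from_file("council_meeting.mp3")

chunk_ms = 45 * 60 * 1000   # 45-minute windows
overlap_ms = 30 * 1000      # 30s overlap between chunks

start = 0
index = 0
while start < len(audio):
    chunk = audio[start : start + chunk_ms]
    chunk.export(f"chunk_{index:02d}.mp3", format="mp3")
    # The true offset of each chunk is known, so any per-chunk timestamps the
    # model produces can be shifted back to absolute meeting time.
    print(f"chunk_{index:02d}.mp3 starts at {start / 1000:.0f}s")
    start += chunk_ms - overlap_ms
    index += 1
```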
Since plain decoder models stole the show, Google DeepMind demonstrated a way to adapt LLMs, adding a T5 encoder to an existing Gemma model to get the benefits of more grounded text-to-text tasks WITHOUT instruction tuning (and its increased risk of prompt injection). They've shared a few different variants on Hugging Face. I haven't got around to fine-tuning one for summarisation yet, but it could well be a good route to more reliable summarisation. I did try out some models for inference, though, and made a Gist here, which is useful since I found the HF default code example a bit broken:
https://gist.github.com/lukestanley/ee89758ea315b68fd66ba52c...
Google's minisite: https://deepmind.google/models/gemma/t5gemma/
Paper: https://arxiv.org/abs/2504.06225
Here is one such model that didn't hallucinate and actually did summarise on HF: https://huggingface.co/google/t5gemma-l-l-prefixlm
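For anyone who doesn't want to dig through the Gist, a heavily hedged sketch of what inference might look like, assuming the checkpoint loads through the standard seq2seq Auto classes in a recent transformers release; as noted above, the default HF example was a bit broken, so expect to adjust this (prompt format and file path are assumptions):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-l-l-prefixlm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Plain text in, summary out; the exact prompt format the checkpoint expects
# may differ, so treat this as a starting point.
text = open("meeting_transcript.txt").read()[:4000]
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```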
The cutting step is simple, and the token count is pretty much the same, but the crucial additional detail allows for excellent transcription fidelity, time-wise.
We've also experimented with passing in a regular speech-to-text (non-LLM) transcript for reference, which again helps the LLM do better.
Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
I imagine that you could theoretically also guess and check without the web browser by manually rendering the SVG using some graph paper, a compass, a straightedge, and coloured pencils, but that sounds unbelievably tedious and also very error-prone.
0: https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
Without having an LLM figure out the required command line parameters? Mad props!
> Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
I think a more challenging, well, challenge, would be to offer an even more absurd scenario and see how the model handles it.
Example: generate an svg of a pelican and a mongoose eating popcorn inside a pyramid-shaped vehicle flying around Jupiter. Result: https://imgur.com/a/TBGYChc
I was inspired by Max Woolf's nano banana test prompts: https://minimaxir.com/2025/11/nano-banana-prompts/
Do you think it would be reasonable to include both in future reviews, at least for the sake of back-compatibility (and comparability)?
I'll need to update for V2!
Love the pivot in pelican generation bench.
These things are getting really good at just regular transcription (as long as you don't care about verbatimicity), but every additional dimension you add (timestamps, speaker assignment, etc) will make the others worse. These work much better as independent processes that then get reconciled and refined by a multimodal LLM.
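A minimal sketch of that reconciliation step, with made-up data shapes: take segment timestamps from the ASR pass and speaker turns from the diarization pass, and label each segment with the speaker whose turn overlaps it most.

```python
def assign_speakers(segments, turns):
    """segments: [{"start": s, "end": e, "text": ...}] from the ASR pass.
    turns: [{"start": s, "end": e, "speaker": "SPEAKER_01"}] from diarization.
    Labels each segment with the speaker whose turn overlaps it the most."""
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        seg["speaker"] = best_speaker or "UNKNOWN"
    return segments

# Example shapes (made up):
segments = [{"start": 0.0, "end": 4.2, "text": "Call the meeting to order."}]
turns = [{"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"}]
print(assign_speakers(segments, turns))
```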
5 more comments available on Hacker News