Experiment: Grid-Indexed Localization with LLMs for Bounding Boxes
Posted 4 months ago · github.com · Research story
Key topics
Large Language Models
DETRs
Localization
Key moments
- Story posted: Aug 26, 2025 at 5:36 PM EDT (4 months ago)
- First comment: Aug 26, 2025 at 5:36 PM EDT (0s after posting)
- Peak activity: 1 comment, in the opening window of the conversation
- Latest activity: Aug 26, 2025 at 5:36 PM EDT (4 months ago)
As a workaround, I tried a different approach: instead of asking the model for raw pixel coordinates, I divide the image into a grid and let the LLM reason in terms of grid cells (e.g. row/column indices). These grid indices are then mapped back into pixel coordinates.
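To make the mapping step concrete, here is a minimal sketch in JavaScript of how grid indices could be converted back to a pixel-space box. This is not the repo's actual code; the function name, the field names (rowStart/colStart/rowEnd/colEnd), and the 10x10 grid in the example are assumptions made for illustration.

```js
// Sketch: convert inclusive grid-cell indices (as returned by the model)
// back into a pixel-space bounding box for an image of known size.
function gridCellsToPixelBox(cells, imageWidth, imageHeight, gridRows, gridCols) {
  const cellW = imageWidth / gridCols;   // width of one grid cell in pixels
  const cellH = imageHeight / gridRows;  // height of one grid cell in pixels
  return {
    // top-left corner of the first cell in the span
    x: cells.colStart * cellW,
    y: cells.rowStart * cellH,
    // span extends to the bottom-right corner of the last cell (indices are inclusive)
    width: (cells.colEnd - cells.colStart + 1) * cellW,
    height: (cells.rowEnd - cells.rowStart + 1) * cellH,
  };
}

// Example: a 10x10 grid over a 1920x1080 image; the model says the object
// spans rows 2-4 and columns 5-7.
const box = gridCellsToPixelBox(
  { rowStart: 2, colStart: 5, rowEnd: 4, colEnd: 7 },
  1920, 1080, 10, 10
);
// box => { x: 960, y: 216, width: 576, height: 324 }
```

The coarseness of the grid sets the localization resolution, so there is a trade-off: finer grids give tighter boxes but put more indices back in play for the model to get wrong.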
This “grid-indexed” method doesn’t solve everything, but it seems to reduce randomness and make outputs more stable across providers (OpenAI, Anthropic, Gemini, etc.). It’s lightweight: just a single JS file plus an example HTML demo.
Code and README are here: https://github.com/IntelligenzaArtificiale/GILM-Grid-Indexed...
I’d be curious if others have tried similar approaches, or if anyone has ideas on how to improve robustness of bounding box detection with LLMs.