Nov 22, 2025 at 4:41 PM EST

Running a 270M LLM on Android (architecture and benchmarks)

ayushranjan99
1 point
0 comments

Mood: informative
Sentiment: neutral
Category: tech_discussion
Key topics: LLM, Android, Mobile Hardware, AI, Privacy

I’ve been experimenting with running small LLMs directly on mobile hardware (low-end Android devices), without relying on cloud inference. This is a summary of what worked, what didn’t, and why.

Cloud-based LLM APIs are convenient, but come with:

- latency from network round-trips
- unpredictable API costs
- privacy concerns (content leaving the device)
- the need for connectivity

For simple tasks like news summarization, small models seem “good enough,” so I tested whether a ~270M-parameter model (Gemma 3 270M) could run entirely on-device.

Setup

- Model: Gemma 3 270M, INT8 quantized
- Runtime: Cactus SDK (Android NPU/GPU acceleration)
- App framework: Flutter
- Device: MediaTek 7300 with 8 GB RAM

Architecture

- User shares a URL to the app (Android share sheet).
- App fetches the article HTML and extracts readable text.
- The local model generates a summary.
- On-device TTS reads the summary.

Everything runs offline except the initial page fetch (a sketch of the flow is below).
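
The app itself is built with Flutter, but the flow is easiest to sketch in plain Android terms. Below is a minimal Kotlin sketch of the same pipeline; `LocalSummarizer.summarize` and `extractReadableText` are hypothetical stand-ins for the Cactus SDK inference call and the HTML-to-text step, not the actual APIs used in the repo.

```kotlin
import android.content.Intent
import android.os.Bundle
import android.speech.tts.TextToSpeech
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import java.net.URL

// Hypothetical stand-in for the on-device runtime call (e.g. via the Cactus SDK).
object LocalSummarizer {
    fun summarize(text: String, maxTokens: Int): String =
        TODO("run the INT8 Gemma 3 270M model here")
}

// Naive placeholder for HTML-to-readable-text extraction.
fun extractReadableText(html: String): String =
    html.replace(Regex("<[^>]*>"), " ").replace(Regex("\\s+"), " ").trim()

class ShareSummaryActivity : AppCompatActivity(), TextToSpeech.OnInitListener {

    private lateinit var tts: TextToSpeech

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        tts = TextToSpeech(this, this)

        // 1. Receive the URL from the Android share sheet (ACTION_SEND).
        val url = intent?.takeIf { it.action == Intent.ACTION_SEND }
            ?.getStringExtra(Intent.EXTRA_TEXT) ?: return

        lifecycleScope.launch {
            // 2. Fetch the article HTML -- the only step that touches the network.
            val html = withContext(Dispatchers.IO) { URL(url).readText() }

            // 3. Strip markup/boilerplate down to readable article text.
            val article = extractReadableText(html)

            // 4. Summarize with the local 270M model, entirely on-device.
            val summary = withContext(Dispatchers.Default) {
                LocalSummarizer.summarize(article, maxTokens = 200)
            }

            // 5. Read the summary aloud with the device TTS engine.
            tts.speak(summary, TextToSpeech.QUEUE_FLUSH, null, "summary")
        }
    }

    override fun onInit(status: Int) { /* TTS engine ready */ }
}
```

This assumes the activity is registered in the manifest with an ACTION_SEND intent filter for text/plain so it appears in the share sheet, and that TTS initialization has completed before speaking.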

Performance

- ~450–900 ms latency for a short summary (100–200 tokens).
- On devices without NPU acceleration, CPU-only inference takes 2–3× longer.
- Peak RAM: ~350–450 MB.
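
The peak-RAM number is roughly what you'd expect from the weights alone; a quick back-of-envelope check (my own estimate, assuming about one byte per parameter at INT8):

```kotlin
// Back-of-envelope check on the peak-RAM figure
// (assumption: roughly one byte per weight after INT8 quantization).
fun main() {
    val params = 270_000_000L
    val bytesPerWeight = 1L                              // INT8
    val weightMB = params * bytesPerWeight / 1_000_000.0
    println("Weights alone: ~${weightMB.toInt()} MB")    // ~270 MB
    // KV cache, activations, and runtime overhead plausibly account for the
    // rest of the observed ~350–450 MB peak.
}
```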

Limitations

- Quality is noticeably worse than GPT-5 for complex articles.
- Long-form summarization (>1k words) gets inconsistent.
- Web scraping is fragile for JS-heavy or paywalled sites.
- Some low-end phones throttle the CPU/GPU aggressively.

| Metric  | Local (Gemma 270M)   | GPT-4o (Cloud)       |
| ------- | -------------------- | -------------------- |
| Latency | 0.5–1.5 s            | 0.7–1.5 s + network  |
| Cost    | 0                    | API cost per request |
| Privacy | Text stays on device | Sent over network    |
| Quality | Medium               | High                 |

GitHub: https://github.com/ayusrjn/briefly

Running small LLMs on-device is viable for narrow tasks like summarization. For more complex reasoning tasks, cloud models still outperform by a large margin, but the “local-first” approach seems promising for privacy-sensitive or offline-first applications. The Cactus SDK does a good job of handling the model and hardware acceleration.

Happy to answer questions :)

Discussion (0 comments)

Discussion hasn't started yet.

ID: 46018506 · Type: story · Last synced: 11/23/2025, 12:07:05 AM

