Running a 270M LLM on Android (architecture and benchmarks)
Cloud-based LLM APIs are convenient, but come with:
- latency from network round-trips
- unpredictable API costs
- privacy concerns (content leaving the device)
- the need for connectivity
For simple tasks like news summarization, small models seem “good enough,” so I tested whether a ~270M-parameter model (Gemma 3 270M) could run entirely on-device.
Setup
- Model: Gemma 3 270M, INT8 quantized
- Runtime: Cactus SDK (Android NPU/GPU acceleration)
- App framework: Flutter
- Device: MediaTek 7300 with 8GB RAM
Architecture
- User shares a URL to the app (Android share sheet).
- App fetches the article HTML → extracts readable text.
- Local model generates a summary.
- Device TTS reads the summary.

Everything runs offline except the initial page fetch.
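To make the flow concrete, here is a minimal Dart sketch of that pipeline. Everything in it is illustrative rather than the app's actual code: `summarizeAndSpeak`, `extractReadableText`, and `summarizeLocally` are hypothetical names, `summarizeLocally` is a placeholder where the real app would invoke the Cactus runtime with Gemma 3 270M (here it just returns the first sentences so the sketch runs without the model), and the `http`/`flutter_tts` packages are assumed dependencies.

```dart
import 'package:http/http.dart' as http;
import 'package:flutter_tts/flutter_tts.dart';

/// Hypothetical stand-in for the Cactus SDK call that would run
/// Gemma 3 270M on-device. Returns the first few sentences so the
/// sketch is runnable without the model; see the SDK docs for the
/// real inference API.
Future<String> summarizeLocally(String text) async {
  final sentences = text.split(RegExp(r'(?<=[.!?])\s+'));
  return sentences.take(3).join(' ');
}

/// Naive tag stripping; a real app needs a readability-style extractor.
String extractReadableText(String html) => html
    .replaceAll(RegExp(r'<script[\s\S]*?</script>', caseSensitive: false), '')
    .replaceAll(RegExp(r'<style[\s\S]*?</style>', caseSensitive: false), '')
    .replaceAll(RegExp(r'<[^>]+>'), ' ')
    .replaceAll(RegExp(r'\s+'), ' ')
    .trim();

/// End-to-end pipeline: fetch (the only network step), extract,
/// summarize on-device, then speak via the platform TTS engine.
Future<void> summarizeAndSpeak(String url) async {
  final resp = await http.get(Uri.parse(url));
  final article = extractReadableText(resp.body);
  final summary = await summarizeLocally(article);
  await FlutterTts().speak(summary);
}
```

In the real app, the share-sheet intent handler would call something like `summarizeAndSpeak(sharedUrl)`, and a proper readability extractor would replace the regex stripping.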
Performance
- Latency: ~450–900 ms for a short summary (100–200 tokens).
- On devices without NPU acceleration, CPU-only inference takes 2–3× longer.
- Peak RAM: ~350–450 MB.
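For anyone wanting to reproduce numbers like these, a trivial way to time a single summary, reusing the hypothetical `summarizeLocally` from the sketch above (not necessarily how the figures here were measured):

```dart
/// Rough per-summary latency measurement around the on-device call.
/// `summarizeLocally` is the hypothetical wrapper from the earlier sketch.
Future<void> timeSummary(String article) async {
  final sw = Stopwatch()..start();
  final summary = await summarizeLocally(article);
  sw.stop();
  print('${summary.length} chars in ${sw.elapsedMilliseconds} ms');
}
```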
Limitations
- Quality is noticeably worse than GPT-5 for complex articles.
- Long-form summarization (>1k words) gets inconsistent.
- Web scraping is fragile for JS-heavy or paywalled sites.
- Some low-end phones throttle the CPU/GPU aggressively.
| Metric  | Local (Gemma 270M)   | GPT-4o Cloud         |
| ------- | -------------------- | -------------------- |
| Latency | 0.5–1.5s             | 0.7–1.5s + network   |
| Cost    | $0                   | API cost per request |
| Privacy | Text stays on device | Sent over network    |
| Quality | Medium               | High                 |
GitHub: https://github.com/ayusrjn/briefly

Running small LLMs on-device is viable for narrow tasks like summarization. For more complex reasoning, cloud models still outperform by a large margin, but the local-first approach seems promising for privacy-sensitive or offline-first applications. The Cactus SDK does a pretty good job of handling the model and hardware acceleration.

Happy to answer questions :)