High-Resolution Efficient Image Generation From Wifi Mapping
Posted 3 months ago · Active 3 months ago
arxiv.org · Research story
Key topics: WiFi Imaging · Diffusion Models · Surveillance
Researchers have developed a method to generate high-resolution images from WiFi mapping data using diffusion models, sparking debate about the technology's capabilities, limitations, and potential surveillance implications.
Snapshot generated from the HN discussion
Discussion activity: 35 comments. First comment 1h after posting; peak of 15 comments in the 2-4h window (avg 4.4 per period).
Key moments
- 01Story posted
Oct 1, 2025 at 2:33 AM EDT
3 months ago
Step 01 - 02First comment
Oct 1, 2025 at 3:42 AM EDT
1h after posting
Step 02 - 03Peak activity
15 comments in 2-4h
Hottest window of the conversation
Step 03 - 04Latest activity
Oct 2, 2025 at 2:35 AM EDT
3 months ago
Step 04
ID: 45434941 · Type: story · Last synced: 11/20/2025, 4:44:33 PM
Is this just extremely overfitted?
Is there a way for us to test this? Even if the model isn't open source, I'd pay $1 to upload a capture from my WiFi card on my Linux box to the researchers, have them generate a picture, and see if it's accurate.
The more space you take up in the frequency domain, the higher your resolution in the time domain is. Wifi sensing results that detect heart rate or breathing, for example, use even larger bandwidth, to the point where it'd be more accurate to call them radars than wifi access points.
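As a back-of-the-envelope illustration of that bandwidth/resolution tradeoff (my numbers, not from the paper): time resolution scales as roughly 1/B, and the equivalent radar range resolution as c/(2B).

```python
# Rough illustration of why wider bandwidth means finer sensing resolution.
# delta_t ~ 1 / B; two-way range resolution ~ c / (2 * B) in a radar-style view.

C = 3e8  # speed of light, m/s

def range_resolution_m(bandwidth_hz: float) -> float:
    """Approximate radar range resolution for a given sounding bandwidth."""
    return C / (2.0 * bandwidth_hz)

for label, bw in [("802.11n, 20 MHz", 20e6),
                  ("802.11ac, 160 MHz", 160e6),
                  ("UWB radar, 1.5 GHz", 1.5e9)]:
    print(f"{label}: ~{range_resolution_m(bw):.2f} m")  # 7.50, 0.94, 0.10
```

So a 20 MHz WiFi channel resolves on the order of meters, while the gigahertz bandwidths used in vital-sign sensing get down to centimeters, which is why those systems behave more like radars.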
If you uploaded a random room to the model without retraining it, you wouldn't get anything as accurate as the images in the paper.
It seems like they might be giving it more information besides the WiFi data, or else maybe training it on photos of the actual person in the actual room, in which case it's not obvious how well it would generalise.
The model was trained on the room.
It would produce images of the room even without any WiFi data input at all.
The WiFi signal is used as a modulator on the input to the pre-trained model.
It’s not actually generating an image of the room from only WiFi signals.
The encoder itself is trained on latent embeddings of images in the same environment with the same subject, so it learns visual details (that are preserved through the original autoencoder; this is why the model can't overfit on, say, text or faces).
The interesting part of the whole setup is that the wifi signal seems to contain the information required to predict the posture of the individual to a reasonably high degree of accuracy, which is actually pretty cool.
"We consider a WiFi sensing system designed to monitor indoor environments by capturing human activity through wireless signals. The system consists of a WiFi access point, a WiFi terminal, and an RGB camera that is available only during the training phase. This setup enables the collection of paired channel state information (CSI) and image data, which are used to train an image generation model"
I know that is a subjective metric, but by anyone's measure a 4x4 grid of postage-stamp-sized images is not high resolution.
2. “Postage stamp sized” is not a resolution. Zoom in on them and you’ll see that they’re quite crisp.
"When a brilliant, driven industrialist harnesses the cutting edge of quantum physics to enable people everywhere, at trivial cost, to see one another at all times: around every corner, through every wall, into everyone's most private, hidden, and even intimate moments. It amounts to the sudden and complete abolition of human privacy--forever."
I think the results here are much less important and surprising than what some people seem to be thinking. To summarize the core of the paper, we took stable diffusion (which is a 3-part system of an encoder, u-net, decoder), and replaced the encoder to use WiFi data instead of images. This gives you two advantages: you get text-based guidance for free, and the encoder model can be smaller. The smaller model combined with the semantic compression from the autoencoder gives you better (SOTA resolution) results, much faster.
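A toy numpy sketch of that encoder swap, with everything invented for illustration (the 4x64x64 latent loosely matches SD's latent space; the CSI dimensions and the single linear map are placeholders for the trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only, not the paper's actual dimensions.
N_SUBCARRIERS, N_RX = 64, 3          # assumed CSI dimensions
LATENT_SHAPE = (4, 64, 64)           # SD-style latent tensor
latent_dim = int(np.prod(LATENT_SHAPE))

# Hypothetical linear "CSI encoder" standing in for the trained network:
# it maps flattened CSI features straight into the diffusion latent space,
# replacing the image-side VAE encoder.
feat_dim = N_SUBCARRIERS * N_RX * 2  # amplitude + phase per subcarrier/antenna
W = rng.normal(0, 0.01, size=(latent_dim, feat_dim))

def csi_to_latent(csi: np.ndarray) -> np.ndarray:
    """Encode complex CSI (subcarriers x rx antennas) into a latent tensor."""
    feats = np.concatenate([np.abs(csi).ravel(), np.angle(csi).ravel()])
    return (W @ feats).reshape(LATENT_SHAPE)

csi = rng.normal(size=(N_SUBCARRIERS, N_RX)) + 1j * rng.normal(size=(N_SUBCARRIERS, N_RX))
z = csi_to_latent(csi)
print(z.shape)  # (4, 64, 64) -- this latent then feeds the frozen U-Net/decoder
```

The point of the swap is that the U-Net and decoder (and hence text guidance) are reused unchanged; only the small CSI-to-latent encoder has to be trained.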
I noticed a lot of discussion about how the model can possibly be so accurate. It wouldn't be wrong to consider the model overfit, in the sense that the visual details of the scene are moved from the training data to the model weights. These kinds of models are meant to be trained & deployed in a single environment. What's interesting about this work is that learning the environment well has become really fast because the output dimension is smaller than image space. In fact, it's so fast that you can basically do it in real time... you turn on a data collection node and can train a model from scratch online, in a new environment that gets decent results with at least a little bit of interesting generalization in ~10min. I'm presenting a demonstration of this at Mobicom 2025 next month in Hong Kong.
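The online, single-environment training described above can be caricatured as streaming regression from CSI features to camera-derived latents; every number here (sizes, the linear model, the learning rate) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
feat_dim, latent_dim = 32, 8   # toy sizes; real latents are much larger

# Hypothetical room-specific mapping the encoder must learn.
W_true = rng.normal(size=(latent_dim, feat_dim))

W = np.zeros((latent_dim, feat_dim))  # encoder weights, trained online
lr = 0.01

# Stream of (CSI feature, image latent) pairs collected live in the room.
for step in range(2000):
    f = rng.normal(size=feat_dim)
    z_target = W_true @ f                     # latent from the camera-side autoencoder
    z_pred = W @ f
    W += lr * np.outer(z_target - z_pred, f)  # SGD on squared error

print(np.max(np.abs(W - W_true)))  # shrinks toward the room-specific mapping
```

Because the regression target is a low-dimensional latent rather than raw pixels, even this naive streaming setup converges quickly, which is the "learn the environment fast" point the comment makes.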
What people call "WiFi sensing" is now mostly CSI (channel state information) sensing. When you transmit a packet on many subcarriers (frequencies), the CSI represents how the data on each frequency changed during transmission. So, CSI is inherently quite sensitive to environmental changes.
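In its simplest form, the per-subcarrier CSI estimate is just the received symbol divided by the known transmitted symbol; a minimal numpy sketch (sizes loosely 802.11-like, noise level arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sc = 52  # data subcarriers in a 20 MHz 802.11 OFDM symbol

# Known training symbols X (e.g. from a preamble), received symbols Y.
X = np.exp(1j * rng.choice([0, np.pi / 2, np.pi, 3 * np.pi / 2], size=n_sc))
H_true = rng.normal(size=n_sc) + 1j * rng.normal(size=n_sc)  # unknown channel
noise = 0.01 * (rng.normal(size=n_sc) + 1j * rng.normal(size=n_sc))
Y = H_true * X + noise

# Least-squares CSI estimate: one complex gain per subcarrier.
H_est = Y / X

print(np.max(np.abs(H_est - H_true)))  # small: the estimate tracks the channel
```

Each element of `H_est` captures how that subcarrier's amplitude and phase were altered in flight, which is exactly the environmental sensitivity the comment describes.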
I want to point out something that most everybody working in the CSI sensing/general ISAC space seems to know: generalization is hard and most definitely unsolved for any reasonably high-dimensional sensing problem (like image generation and to some extent pose estimation). I see a lot of fearmongering online about wifi sensing killing privacy for good, but in my opinion we're still quite far off.
I've made the project's code and some formatted data public since this paper is starting to pick up some attention: https://github.com/nishio-laboratory/latentcsi
What is available at the low level? Are researchers using SDRs, or are there common WiFi chips that properly report CSI? Do most people feed in the CSI of literally every packet, or is it sampled?
As for the low level:
The most common early hardware was afaik esp32s & https://stevenmhernandez.github.io/ESP32-CSI-Tool/, and also old intel NICs & https://dhalperi.github.io/linux-80211n-csitool/.
Now many people use https://ps.zpj.io/ which supports some hardware including SDRs, but I must discourage using it, especially for research, as it's not free software and has a restrictive license. I used https://feitcsi.kuskosoft.com/ which uses a slightly modified iwlwifi driver, since iwlwifi needs to compute CSI anyway. There are free software alternatives for SDR CSI extraction as well; it's not hard to build an OFDM chain with GNUradio and extract CSI, although this might require a slightly more in-depth understanding of how wifi works.
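A stripped-down version of such an OFDM chain (no GNU Radio, just numpy, with 802.11-like FFT and cyclic-prefix sizes) shows where the CSI falls out:

```python
import numpy as np

rng = np.random.default_rng(2)
N, CP = 64, 16  # FFT size and cyclic prefix length (802.11-like)

# Frequency-domain training symbol known to the receiver.
X = rng.choice([1 + 0j, -1 + 0j], size=N)

# Transmit: IFFT, then prepend the cyclic prefix.
x = np.fft.ifft(X)
tx = np.concatenate([x[-CP:], x])

# Multipath channel: a direct path plus one delayed echo.
h = np.zeros(CP, dtype=complex)
h[0], h[3] = 1.0, 0.4j
rx = np.convolve(tx, h)[:len(tx)]

# Receive: drop the CP, FFT, divide by the known symbol -> CSI.
Y = np.fft.fft(rx[CP:CP + N])
H_est = Y / X

# With a CP longer than the channel, this equals the channel's frequency response.
H_true = np.fft.fft(h, N)
print(np.max(np.abs(H_est - H_true)))  # ~0 up to numerical error
```

The SDR tools mentioned above are doing essentially this, plus synchronization, packet detection, and the 802.11-specific framing around it.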
Basically, researchers figured out how to use the invisible radio waves from your Wi-Fi router to create surprisingly clear pictures of whatever is around it, even if there are walls in the way.
Your router is constantly firing out radio signals, right? When those signals hit a person, a dog, or a chair, they bounce off and create a unique echo pattern. This echo pattern is called CSI (Channel State Information). It's a precise digital "shadow" of everything in the room. Turning that messy echo pattern into an actual picture used to be super difficult and slow. But now, they use a fancy type of AI—the same kind that generates images when you type a prompt—to do the heavy lifting.
The AI is super smart and knows how to instantly translate that invisible echo pattern into a high-resolution image.
So the big picture is: it's like they've figured out how to use your average home Wi-Fi to "see" without light or a camera, and they can do it so efficiently (quickly and cheaply) that it might become a normal thing.
It’s pretty wild, and the applications are huge—especially for things like monitoring the health of older people without putting cameras in their rooms. Of course, it also means walls don't stop surveillance anymore, which is kind of unsettling!
Any "unknown" state of the scene is bound to confuse it.