Experimental Optical Encoder for Qwen3-VLM-2B-Instruct
Key topics: Optical Computing, Machine Learning, Experimental Hardware
A GitHub repository shares an experimental optical encoder for the Qwen3-VLM-2B-Instruct model, sparking interest in transplanting encoders between existing VLMs.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion: 1 comment in total (average of 1 per period, peaking at the start).
Key moments
- Story posted: Oct 23, 2025 at 4:18 PM EDT (3 months ago)
- First comment: Oct 23, 2025 at 4:18 PM EDT (0s after posting)
- Peak activity: 1 comment in the opening window
- Latest activity: Oct 23, 2025 at 4:18 PM EDT (3 months ago)
So I am quite amazed by the innovation in the DeepSeek-OCR model! I wanted to break it apart and try it out myself, so I asked myself - what if I extracted the encoder and fit it to other existing VLMs?
https://huggingface.co/Volkopat/DeepSeek-DeepEncoder
I didn't have any expectations and was doing this just for fun, because why not? After vibe-scripting with the encoder, I tried to patch it into Qwen3-VLM 2B. Because of the difference in input dimensions between Qwen and the DeepSeek encoder, I pretrained a custom adapter to fit this piece of the puzzle.
https://huggingface.co/Volkopat/Qwen-VLM-Optical-Encoder
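To make the adapter idea concrete: it's essentially a small projection that maps the DeepSeek encoder's output features into the embedding space Qwen expects for its visual tokens. Here's a simplified sketch of that idea in PyTorch; the class name, hidden sizes, and dimensions are illustrative placeholders, not the exact values used in the repo.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Maps features from the DeepSeek optical encoder into the embedding
    space the Qwen language model expects for its visual tokens.
    All names and dimensions here are illustrative placeholders."""

    def __init__(self, in_dim: int = 1280, out_dim: int = 2048, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_visual_tokens, in_dim) from the optical encoder
        return self.proj(visual_feats)

# Toy forward pass: 256 compressed visual tokens for one image.
adapter = VisualAdapter()
feats = torch.randn(1, 256, 1280)
visual_embeds = adapter(feats)  # (1, 256, 2048), ready to splice into the LLM input sequence
```

A two-layer MLP is just the simplest possible bridge across the dimension mismatch; the real adapter in the repo may be more involved.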
Long story short - I noticed some performance gains on my experimental synthetic dataset as well as on LongBench V2. You can check the project out and try it -
https://github.com/Volkopat/VLM-Optical-Encoder
I have added the training and test scripts to the repo.
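The scripts spell out the details; as a rough mental model, the adapter pretraining freezes both the optical encoder and the Qwen language model and updates only the adapter with a language-modeling loss on image-text pairs. Below is a toy, self-contained sketch of that setup; every module is a dummy stand-in rather than the real models, and the actual scripts differ in the particulars.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs on its own; in the real setup these are the
# DeepSeek optical encoder, the adapter above, and the Qwen3-VLM language model.
encoder = nn.Linear(1280, 1280)          # pretend optical encoder (frozen)
adapter = nn.Linear(1280, 2048)          # the only trainable piece
text_embed = nn.Embedding(32000, 2048)   # pretend token embedding table (frozen)
lm_head = nn.Linear(2048, 32000)         # pretend LM head over a 32k vocab (frozen)

for module in (encoder, text_embed, lm_head):
    for p in module.parameters():
        p.requires_grad_(False)          # freeze everything except the adapter

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# One toy step on a fake image-text pair: 256 visual tokens followed by 16 text tokens.
visual_feats = encoder(torch.randn(1, 256, 1280))    # compressed visual tokens
visual_embeds = adapter(visual_feats)                # project into the LLM embedding space
input_ids = torch.randint(0, 32000, (1, 16))
inputs_embeds = torch.cat([visual_embeds, text_embed(input_ids)], dim=1)

logits = lm_head(inputs_embeds)                      # stand-in for the frozen LLM forward pass
# Supervise only the text positions; visual positions get ignore_index=-100.
# (Shifting labels for next-token prediction is omitted to keep the sketch short.)
labels = torch.cat([torch.full((1, 256), -100), input_ids], dim=1)
loss = nn.functional.cross_entropy(logits.view(-1, 32000), labels.view(-1), ignore_index=-100)
loss.backward()
optimizer.step()
```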
In a minuscule test run of 50 cases from the LongBench V2 benchmark, I noticed that the custom optical encoder with compressed visual tokens performed slightly better than the original Qwen encoder. It could be that the 2B model is simply too weak for this benchmark.
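For intuition on why compressed visual tokens could matter on a long-context benchmark: if long inputs are rendered as images and passed through the optical encoder (the DeepSeek-OCR idea), each page costs far fewer tokens of the model's context budget. The numbers below are made-up round figures purely for illustration, not measurements from my runs.

```python
# Illustrative arithmetic only: all numbers are made-up round figures.
context_window = 32_768            # tokens the small LLM can attend to
tokens_per_page_standard = 1_024   # a conventional patch-based encoder, many tokens per page
tokens_per_page_optical = 256      # an optical encoder emitting compressed visual tokens

pages_standard = context_window // tokens_per_page_standard   # 32 pages fit in context
pages_optical = context_window // tokens_per_page_optical     # 128 pages fit in context
print(f"standard encoder: {pages_standard} pages, optical encoder: {pages_optical} pages")
```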
I could be wrong in my approach, so I don't want to hype this too much, and I am more curious to find out whether this scales beyond 2B. I'm GPU poor with a 12 GB 5070, so I would love it if someone gave this a shot and tried to take it further. Hope this helps!