Archestra's Dual LLM Pattern: Using "guess Who?" Logic to Stop Prompt Injections
Posted3 months agoActive3 months ago
archestra.aiTechstory
calmpositive
Debate
20/100
LLMAI SecurityPrompt Injection
Key topics
LLM
AI Security
Prompt Injection
The article discusses Archestra's Dual LLM pattern, a technique to prevent prompt injections in large language models (LLMs), and the HN community discusses its effectiveness and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagementFirst comment
N/A
Peak period
6
0-3h
Avg / period
2.2
Comment distribution11 data points
Loading chart...
Based on 11 loaded comments
Key moments
- 01Story posted
Oct 13, 2025 at 9:51 AM EDT
3 months ago
Step 01 - 02First comment
Oct 13, 2025 at 9:51 AM EDT
0s after posting
Step 02 - 03Peak activity
6 comments in 0-3h
Hottest window of the conversation
Step 03 - 04Latest activity
Oct 15, 2025 at 5:22 AM EDT
3 months ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
ID: 45568339Type: storyLast synced: 11/17/2025, 10:05:08 AM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
However, in your example, I don't see how the agent decides what to do and how to do it. So it is unclear for me how the main agent is protected. That is, what is preventing the quarantined LLM to act on the malicious instructions instead, ignoring the documentation update, causing the agent to act on those?
That is, what is preventing the quarantined LLM to make the agent think it should generate a bug report with all the API keys in it?
Anyway, I do think having a secondary quarantined LLM seems like a good idea for agentic systems. In general, having a second LLM review the primary LLM in seems to identify a lot of problematic issues and leads to significantly better results.
The main LLM does have access to the tools or sensitive data, but doesn't have direct access to untrusted data (quarantine LLM is restricted at the controller level to respond only with integer digits, and only to legitimate questions from the main llm)
In the example case, without having access to the issue text (the evil data), how does the main LLM actually figure out what to do if the quarantined LLM can just answer with digits?
Sure it can discover that it's a request to update the documentation, but how does it get the information it needs to actually change the erroneous part of the documentation?
I know LLama folks have a special Guard model for example, which I imagine is for such tasks.
So my ignorant questions are this:
Do these MCP endpoints not run such guard models, and if so why not?
If they do, how come they don't stop such blatant attacks that seemingly even an old local model like Gemma 2 can sniff out?
Joey here from Archestra. Good question. I recently was evaluating what you mention, against the latest/"smartest" models from the big LLM providers, and I was able to trick all of them.
Take a look at https://www.archestra.ai/blog/what-is-a-prompt-injection which has all the details on how I did this.
Why on earth would you not consider that as a very dangerous operation that needs to be carefully managed? It's like parking your bike downtown hoping it wont be stolen. Like, at least use a zip tie or something.
That said, I agree with your post that this won't catch everything. So something else, like a quarantined LLM like you suggest is likely needed.
However I just didn't expect such blatant attacks to pass.
It’s really a cat-and-mouse game, where for each new model version, new jailbreaks and injections are found