We Tested 6 AI Models on 3 Common Security Exploits
Posted about 2 months ago · Active about 2 months ago
blog.kilocode.ai · Tech · story
Sentiment: skeptical, negative
Debate: 40/100
Key topics
AI Security
LLM Evaluation
Model Bias
The post tests 6 AI models on 3 common security exploits, but the discussion raises concerns about the methodology, particularly the practice of using one model to judge another model's output.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 3h
Peak period: 1 comment (2-3h)
Avg / period: 1
Key moments
01. Story posted: Nov 5, 2025 at 5:23 PM EST (about 2 months ago)
02. First comment: Nov 5, 2025 at 8:01 PM EST (3h after posting)
03. Peak activity: 1 comment in the 2-3h window, the hottest stretch of the conversation
04. Latest activity: Nov 5, 2025 at 8:01 PM EST (about 2 months ago)
ID: 45828917 · Type: story · Last synced: 11/17/2025, 7:54:12 AM
That's a bit silly, especially since all OpenAI models will share some elements, so the points lose their meaning there. They could, for example, use GLM for all judging instead. Or go all the way and do a full matrix of everything judging everything else.
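For a sense of what that full cross-judging matrix might look like, here is a minimal sketch in Python. Everything in it is hypothetical: `ask_model` is a stubbed placeholder for a real API client, and the model names are illustrative, not the six from the post.

```python
from itertools import product

# Hypothetical model identifiers; the actual post's lineup may differ.
MODELS = ["model-a", "model-b", "model-c"]

def ask_model(judge: str, prompt: str) -> float:
    """Stand-in for a real API call asking `judge` to score a response
    from 0 to 10. Stubbed with a constant so the sketch runs standalone;
    swap in an actual client call here."""
    return 5.0

def judge_matrix(responses: dict[str, str], exploit: str) -> dict[tuple[str, str], float]:
    """Have every model judge every other model's response. Self-judging
    is skipped to sidestep the shared-lineage bias the comment raises."""
    scores: dict[tuple[str, str], float] = {}
    for judge, author in product(MODELS, repeat=2):
        if judge == author:
            continue  # a model never grades its own output
        prompt = (
            f"Exploit: {exploit}\n"
            f"Candidate analysis:\n{responses[author]}\n"
            "Score this from 0 (wrong) to 10 (correct). Reply with a number only."
        )
        scores[(judge, author)] = ask_model(judge, prompt)
    return scores

def aggregate(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average each author's marks across all judges so no single
    judge's quirks dominate the final ranking."""
    per_author: dict[str, list[float]] = {}
    for (_judge, author), score in scores.items():
        per_author.setdefault(author, []).append(score)
    return {a: sum(v) / len(v) for a, v in per_author.items()}

# Toy usage: dummy responses for one exploit.
responses = {m: f"patch proposed by {m}" for m in MODELS}
print(aggregate(judge_matrix(responses, "SQL injection in a login form")))
```

Averaging across a full judge matrix doesn't remove bias, but it keeps any one judge (or judge family) from dominating the scores, which is exactly the commenter's objection to grading one OpenAI model with another.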