To solve the benchmark crisis, evals must think | Not Hacker News!