SherlockBench

An LLM Benchmarking System

SherlockBench is a benchmarking system designed to test an LLM's ability to proactively investigate a problem.

SherlockBench is a good benchmark because:

  • It's resistant to memorisation because it doesn't use a Q&A format. Essentially, it tests whether the model can practise the scientific method: hypothesise, experiment, analyse.
  • A model has to use function-calling and structured outputs to perform well on this benchmark, so the results indicate whether the model is actually useful for professional and business workloads.
  • The benchmarking system is open source, and it is easy to make new problem-sets for it.
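As a toy illustration of the hypothesise, experiment, analyse loop mentioned above, the sketch below probes a hidden function through a callable "tool" and checks a guess against the observations. Everything here is an assumption made for illustration: `mystery`, `investigate`, and the linear hypothesis are hypothetical and are not taken from SherlockBench's own code or problem-sets.

```python
from typing import Callable, List, Tuple


def mystery(x: int) -> int:
    # Hypothetical hidden function standing in for a benchmark problem.
    return 3 * x + 2


def investigate(tool: Callable[[int], int], probes: List[int]) -> Tuple[int, int, bool]:
    # Experiment: call the tool on each chosen input and record the output.
    observations = [(x, tool(x)) for x in probes]

    # Hypothesise: assume y = a*x + b and fit it from the first two observations.
    (x0, y0), (x1, y1) = observations[0], observations[1]
    a = (y1 - y0) // (x1 - x0)
    b = y0 - a * x0

    # Analyse: check whether the hypothesis explains every observation.
    consistent = all(a * x + b == y for x, y in observations)
    return a, b, consistent


a, b, consistent = investigate(mystery, [0, 1, 5])
print(a, b, consistent)
```

In a real run the model itself would choose the probes and form the hypothesis through function calls; the fixed linear guess here only keeps the example short.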

Leaderboard

Model                                Model ID                             Provider   Average  pass@10
o4-mini (high effort)                o4-mini-2025-04-16 (high effort)     OpenAI     73%      92%
Grok 4                               grok-4-0709                          xAI        72%      NT (pass@3 = 83%)
o3 (high effort)                     o3-2025-04-16 (high effort)          OpenAI     72%      83%
o3 (medium effort)                   o3-2025-04-16 (medium effort)        OpenAI     73%      83%
Claude Sonnet 4 + extended thinking  claude-sonnet-4-20250514 + thinking  Anthropic  64%      NT (pass@3 = 83%)
o4-mini (medium effort)              o4-mini-2025-04-16 (medium effort)   OpenAI     63%      83%
Qwen3 235B-A22B                      qwen3-235b-a22b-instruct-2507        Fireworks  50%      NT (pass@3 = 67%)
Gemini 2.5 Pro                       gemini-2.5-pro-preview-06-05         Google     49%      83%
Deepseek R1 0528                     DeepSeek-R1-0528                     Deepseek   48%      83%
GPT-4.1                              gpt-4.1-2025-04-14                   OpenAI     46%      75%
Grok-3-mini (high effort)            grok-3-mini-beta (high effort)       xAI        45%      75%
o3-mini (medium effort)              o3-mini-2025-01-31 (medium effort)   OpenAI     43%      58%
Kimi-k2                              kimi-k2-0711-preview                 Moonshot   42%      NT (pass@3 = 50%)
Grok 3                               grok-3-beta                          xAI        41%      67%
GPT-4.1 mini                         gpt-4.1-mini-2025-04-14              OpenAI     38%      75%
Deepseek V3                          deepseek-v3-0324                     Deepseek   36%      67%
Grok-3-mini (low effort)             grok-3-mini-beta (low effort)        xAI        35%      67%
Gemini 2.5 Flash                     gemini-2.5-flash-preview-05-20       Google     31%      58%
Gemini 2.5 Flash Lite                gemini-2.5-flash-lite-preview-06-17  Google     20%      58%
GPT-4o                               gpt-4o-2024-08-06                    OpenAI     16%      42%
GPT-4o mini                          gpt-4o-mini-2024-07-18               OpenAI     10%      25%
Claude 3.5 Haiku                     claude-3-5-haiku-20241022            Anthropic  8%       17%
Llama 3.1 405B Instruct              llama-v3p1-405b-instruct             Fireworks  4%       25%

NT = Not Tested.
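The page does not define the "Average" and "pass@10" columns. The sketch below shows one common reading, stated here as an assumption rather than SherlockBench's actual scoring code: "Average" is the fraction of all attempts that passed, and pass@k is the fraction of problems solved in at least one of the first k attempts. The function names and the `results` data are hypothetical.

```python
from typing import Dict, List


def average_pass_rate(results: Dict[str, List[bool]]) -> float:
    # Assumed meaning of "Average": share of passing attempts over all problems and runs.
    attempts = [ok for runs in results.values() for ok in runs]
    return sum(attempts) / len(attempts)


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    # Assumed meaning of pass@k: share of problems with at least one pass in the first k runs.
    solved = sum(1 for runs in results.values() if any(runs[:k]))
    return solved / len(results)


# Made-up outcomes: three problems, four attempts each.
results = {
    "problem_a": [True, False, True, True],
    "problem_b": [False, False, False, False],
    "problem_c": [False, True, False, False],
}

print(f"average: {average_pass_rate(results):.0%}")
print(f"pass@4:  {pass_at_k(results, 4):.0%}")
```

Under this reading pass@k can never fall as k grows, which would explain why a pass@3 figure is shown as a partial substitute where pass@10 was not run.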

Moriarty problem-set

Models which score at least 50% on Sherlock1 get to try the Moriarty problem-set.

Model                                Model ID                             Provider   Average  pass@10
o3 (high effort)                     o3-2025-04-16 (high effort)          OpenAI     45%      70%
o4-mini (high effort)                o4-mini-2025-04-16 (high effort)     OpenAI     38%      NT (pass@5 = 70%)
Grok 4                               grok-4-0709                          xAI        37%      NT (pass@3 = 60%)
Claude Sonnet 4 + extended thinking  claude-sonnet-4-20250514 + thinking  Anthropic  33%      NT (pass@3 = 40%)
o3 (medium effort)                   o3-2025-04-16 (medium effort)        OpenAI     32%      60%
o4-mini (medium effort)              o4-mini-2025-04-16 (medium effort)   OpenAI     25%      50%

NT = Not Tested.