SherlockBench

An LLM Benchmarking System

SherlockBench is a benchmarking system designed to test an LLM's ability to proactively investigate a problem.

SherlockBench is a good benchmark because:

  • It's resistant to memorisation because it doesn't use a Q&A format. Essentially it tests whether the model can practise the scientific method: hypothesise, experiment, analyse.
  • A model has to use function calling and structured outputs to perform well on this benchmark, so it gives an indication of whether the model is actually useful for professional/business workloads.
  • The benchmarking system is Open Source, and it is easy to make new problem-sets for it.
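To illustrate the idea, here is a hypothetical sketch (not the actual SherlockBench harness or its API) of the kind of black-box investigation a model must perform: it can only call a mystery function with inputs of its choosing, and must infer the hidden rule by experimenting, then verify its hypothesis on fresh inputs.

```python
# Hypothetical sketch of a SherlockBench-style problem. The hidden rule
# here (digit reversal) is invented for illustration only.

def mystery_fn(n: int) -> int:
    # Hidden from the model: returns the input with its digits reversed.
    return int(str(abs(n))[::-1])

# Experiment: probe the function with chosen inputs.
observations = {n: mystery_fn(n) for n in (12, 340, 5, 1001)}
# e.g. {12: 21, 340: 43, 5: 5, 1001: 1001}

# Hypothesise a rule from the observations, then test it on new inputs.
hypothesis = lambda n: int(str(abs(n))[::-1])
assert all(hypothesis(n) == mystery_fn(n) for n in (78, 905, 42))
```

In the real benchmark the probing happens via function-calling, and the model must finally commit to an answer in a structured output.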

Leaderboard

| Model | Model ID | Provider | Average | pass@10 |
|---|---|---|---|---|
| o4-mini (high effort) | o4-mini-2025-04-16 | OpenAI | 77% | Not tested (pass@5 = 83%) |
| o3 (high effort) | o3-2025-04-16 | OpenAI | 73% | Not tested (pass@5 = 83%) |
| o3 (medium effort) | o3-2025-04-16 | OpenAI | 73% | 83% |
| o4-mini (medium effort) | o4-mini-2025-04-16 | OpenAI | 63% | 83% |
| Gemini 2.5 Pro | gemini-2.5-pro-preview-06-05 | Google | 49% | 83% |
| Deepseek R1 0528 | DeepSeek-R1-0528 | Deepseek | 48% | 83% |
| GPT-4.1 | gpt-4.1-2025-04-14 | OpenAI | 46% | 75% |
| Grok-3-mini (high effort) | grok-3-mini-beta | xAI | 45% | 75% |
| o3-mini (medium effort) | o3-mini-2025-01-31 | OpenAI | 43% | 58% |
| Grok 3 | grok-3-beta | xAI | 41% | 67% |
| GPT-4.1 mini | gpt-4.1-mini-2025-04-14 | OpenAI | 38% | 75% |
| Deepseek V3 | deepseek-v3-0324 | Deepseek | 36% | 67% |
| Grok-3-mini (low effort) | grok-3-mini-beta | xAI | 35% | 67% |
| Gemini 2.5 Flash | gemini-2.5-flash-preview-05-20 | Google | 31% | 58% |
| Qwen3 235B-A22B | qwen3-235b-a22b | Fireworks | 28% | 58% |
| GPT-4o | gpt-4o-2024-08-06 | OpenAI | 16% | 42% |
| GPT-4o mini | gpt-4o-mini-2024-07-18 | OpenAI | 10% | 25% |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 | Anthropic | 8% | 17% |
| Llama 3.1 405B Instruct | llama-v3p1-405b-instruct | Fireworks | 4% | 25% |

Provisional

Usually I test each model 10 times and average the score. However, due to budget or time constraints, the following models have only been tested once; consider these scores approximate.
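As a hypothetical illustration of how the two scoring columns above relate (my assumed semantics, not the actual SherlockBench scoring code): the Average column is the mean solve rate across the 10 runs, while pass@10 counts a problem as passed if any of the 10 runs solves it.

```python
# Assumed scoring semantics, for illustration only.
# Per-problem results over 10 runs; True = solved on that run.
runs = [
    [True, False, True, False, False, True, False, False, False, False],  # problem A
    [False] * 10,                                                         # problem B
]

average = sum(sum(r) for r in runs) / (len(runs) * 10)  # mean solve rate
pass_at_10 = sum(any(r) for r in runs) / len(runs)      # solved at least once

# average -> 0.15, pass_at_10 -> 0.5
```

This also shows why a single-run score is noisy: one run of problem A could land on either a pass or a fail.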

| Model | Model ID | Provider | Score |
|---|---|---|---|
| Claude Sonnet 4 + extended thinking | claude-sonnet-4-20250514 | Anthropic | 58% |
| Claude Opus 4 + extended thinking | claude-opus-4-20250514 | Anthropic | 50% |
| Claude Opus 4 | claude-opus-4-20250514 | Anthropic | 42% |
| Claude Sonnet 4 | claude-sonnet-4-20250514 | Anthropic | 25% |

Moriarty problem-set

Models which score at least 50% on Sherlock1 get to try the Moriarty problem-set.

| Model | Model ID | Provider | Average | pass@10 |
|---|---|---|---|---|
| o3 (high effort) | o3-2025-04-16 | OpenAI | 42% | Not tested (pass@5 = 70%) |
| o3 (medium effort) | o3-2025-04-16 | OpenAI | 32% | 60% |
| o4-mini (medium effort) | o4-mini-2025-04-16 | OpenAI | 25% | 50% |