An LLM Benchmarking System
SherlockBench is a benchmarking system designed to test an LLM's ability to proactively investigate a problem.
SherlockBench is a good benchmark because:
- It is resistant to memorisation because it doesn't use a Q&A format. Essentially it tests whether the model can practise the scientific method: hypothesise, experiment, analyse.
- A model has to use function-calling and structured outputs to perform well on this benchmark (see the sketch after this list), so it gives an indication of whether the model is actually useful for professional/business workloads.
- The benchmarking system is open source, and it is easy to make new problem-sets for it.
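To make the task format concrete, here is a minimal hypothetical sketch of the kind of problem involved. It is not the actual SherlockBench harness or API; every name in it (`mystery_function`, `run_probe`, `verify`) is illustrative. The idea is that the model is handed a black-box function as a callable tool, probes it with inputs of its own choosing, forms a hypothesis about what it does, and is then checked on whether that hypothesis predicts the function's outputs on inputs it has not yet seen.

```python
# Illustrative sketch only: the real SherlockBench harness and its API differ.
# A "mystery function" is exposed to the model as a callable tool; the model
# experiments with inputs, forms a hypothesis, and is scored on whether it can
# predict the function's output for held-out verification inputs.

from typing import Callable


def mystery_function(x: int) -> int:
    """Hidden implementation the model must characterise (hypothetical example)."""
    return x * x + 1


def run_probe(fn: Callable[[int], int], probe_inputs: list[int]) -> dict[int, int]:
    """Experiment phase: call the black-box function with chosen inputs."""
    return {x: fn(x) for x in probe_inputs}


def verify(fn: Callable[[int], int], predictions: dict[int, int]) -> bool:
    """Verification phase: the model's predicted outputs must all match."""
    return all(fn(x) == y for x, y in predictions.items())


if __name__ == "__main__":
    # In the benchmark, the model would choose these probes itself via function-calling.
    observations = run_probe(mystery_function, [0, 1, 2, 3])
    print("observations:", observations)  # {0: 1, 1: 2, 2: 5, 3: 10}

    # Hypothesis: f(x) = x^2 + 1. Predictions for unseen verification inputs:
    predictions = {5: 26, 10: 101}
    print("verified:", verify(mystery_function, predictions))  # True
```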
Leaderboard
| Model | Provider | Average | pass@10 |
|---|---|---|---|
| o4-mini (high effort) `o4-mini-2025-04-16` | OpenAI | 73% | 92% |
| Grok 4 `grok-4-0709` | xAI | 72% | Not tested (pass@3 = 83%) |
| o3 (high effort) `o3-2025-04-16` | OpenAI | 72% | 83% |
| o3 (medium effort) `o3-2025-04-16` | OpenAI | 73% | 83% |
| Claude Sonnet 4 + extended thinking `claude-sonnet-4-20250514` | Anthropic | 64% | Not tested (pass@3 = 83%) |
| o4-mini (medium effort) `o4-mini-2025-04-16` | OpenAI | 63% | 83% |
| Qwen3 235B-A22B `qwen3-235b-a22b-instruct-2507` | Fireworks | 50% | Not tested (pass@3 = 67%) |
| Gemini 2.5 Pro `gemini-2.5-pro-preview-06-05` | Google | 49% | 83% |
| Deepseek R1 0528 `DeepSeek-R1-0528` | Deepseek | 48% | 83% |
| GPT-4.1 `gpt-4.1-2025-04-14` | OpenAI | 46% | 75% |
| Grok-3-mini (high effort) `grok-3-mini-beta` | xAI | 45% | 75% |
| o3-mini (medium effort) `o3-mini-2025-01-31` | OpenAI | 43% | 58% |
| Kimi-k2 `kimi-k2-0711-preview` | Moonshot | 42% | Not tested (pass@3 = 50%) |
| Grok 3 `grok-3-beta` | xAI | 41% | 67% |
| GPT-4.1 mini `gpt-4.1-mini-2025-04-14` | OpenAI | 38% | 75% |
| Deepseek V3 `deepseek-v3-0324` | Deepseek | 36% | 67% |
| Grok-3-mini (low effort) `grok-3-mini-beta` | xAI | 35% | 67% |
| Gemini 2.5 Flash `gemini-2.5-flash-preview-05-20` | Google | 31% | 58% |
| Gemini 2.5 Flash Lite `gemini-2.5-flash-lite-preview-06-17` | Google | 20% | 58% |
| GPT-4o `gpt-4o-2024-08-06` | OpenAI | 16% | 42% |
| GPT-4o mini `gpt-4o-mini-2024-07-18` | OpenAI | 10% | 25% |
| Claude 3.5 Haiku `claude-3-5-haiku-20241022` | Anthropic | 8% | 17% |
| Llama 3.1 405B Instruct `llama-v3p1-405b-instruct` | Fireworks | 4% | 25% |
Moriarty problem-set
Models which score at least 50% on Sherlock1 get to try the Moriarty problem-set.
| Model | Provider | Average | pass@10 |
|---|---|---|---|
| o3 (high effort) `o3-2025-04-16` | OpenAI | 45% | 70% |
| o4-mini (high effort) `o4-mini-2025-04-16` | OpenAI | 38% | Not tested (pass@5 = 70%) |
| Grok 4 `grok-4-0709` | xAI | 37% | Not tested (pass@3 = 60%) |
| Claude Sonnet 4 + extended thinking `claude-sonnet-4-20250514` | Anthropic | 33% | Not tested (pass@3 = 40%) |
| o3 (medium effort) `o3-2025-04-16` | OpenAI | 32% | 60% |
| o4-mini (medium effort) `o4-mini-2025-04-16` | OpenAI | 25% | 50% |