An LLM Benchmarking System
SherlockBench is a benchmarking system designed to test an LLM's ability to proactively investigate a problem.
SherlockBench is a good benchmark because:
- It's resistant to memorisation because it doesn't use a Q&A format. Essentially it tests whether the model can practise the scientific method, i.e. hypothesise, experiment, analyse (see the sketch after this list).
- A model has to use function-calling and structured outputs to perform well at this benchmark, so it gives an indication of whether the model is actually useful for professional/business workloads.
- The benchmarking system is open source, and it is easy to make new problem-sets for it.
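To make that concrete, here is a minimal sketch of what a SherlockBench-style problem looks like, assuming the problems take the shape of a hidden function that the model can only probe through a function-calling tool before predicting its behaviour on fresh inputs. All names here (mystery_function, run_experiment, verify) are illustrative, not the real SherlockBench API.

```python
# Hypothetical sketch of a SherlockBench-style problem, not the actual API.
# The model never sees the source of `mystery_function`; it can only call a
# "test" tool with inputs of its choosing, then must predict outputs for
# unseen verification inputs.

def mystery_function(x: int) -> int:
    """Hidden from the model; it must infer the behaviour experimentally."""
    return x * x - 1

# The tool exposed to the model via function-calling:
def run_experiment(x: int) -> int:
    return mystery_function(x)

# The model's investigation loop (hypothesise -> experiment -> analyse):
observations = {x: run_experiment(x) for x in [0, 1, 2, -3]}
# e.g. {0: -1, 1: 0, 2: 3, -3: 8} -> hypothesis: f(x) = x^2 - 1

# Verification: the model must correctly predict outputs for fresh inputs.
def verify(predict) -> bool:
    return all(predict(x) == mystery_function(x) for x in [5, -7, 10])

print(verify(lambda x: x * x - 1))  # True if the hypothesis is correct
```

Because the model only ever sees input/output pairs it chose itself, memorised answers don't help; it has to form a hypothesis and test it.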
Leaderboard
Model | Model ID | Provider | Average | pass@10 |
---|---|---|---|---|
o4-mini (high effort) | o4-mini-2025-04-16 | OpenAI | 77% | NT (pass@5 = 83%) |
o3 (high effort) | o3-2025-04-16 | OpenAI | 73% | NT (pass@5 = 83%) |
o3 (medium effort) | o3-2025-04-16 | OpenAI | 73% | 83% |
o4-mini (medium effort) | o4-mini-2025-04-16 | OpenAI | 63% | 83% |
Gemini 2.5 Pro | gemini-2.5-pro-preview-06-05 | Google | 49% | 83% |
Deepseek R1 0528 | DeepSeek-R1-0528 | Deepseek | 48% | 83% |
GPT-4.1 | gpt-4.1-2025-04-14 | OpenAI | 46% | 75% |
Grok-3-mini (high effort) | grok-3-mini-beta | xAI | 45% | 75% |
o3-mini (medium effort) | o3-mini-2025-01-31 | OpenAI | 43% | 58% |
Grok 3 | grok-3-beta | xAI | 41% | 67% |
GPT-4.1 mini | gpt-4.1-mini-2025-04-14 | OpenAI | 38% | 75% |
Deepseek V3 | deepseek-v3-0324 | Deepseek | 36% | 67% |
Grok-3-mini (low effort) | grok-3-mini-beta | xAI | 35% | 67% |
Gemini 2.5 Flash | gemini-2.5-flash-preview-05-20 | Google | 31% | 58% |
Qwen3 235B-A22B | qwen3-235b-a22b | Fireworks | 28% | 58% |
GPT-4o | gpt-4o-2024-08-06 | OpenAI | 16% | 42% |
GPT-4o mini | gpt-4o-mini-2024-07-18 | OpenAI | 10% | 25% |
Claude 3.5 Haiku | claude-3-5-haiku-20241022 | Anthropic | 8% | 17% |
Llama 3.1 405B Instruct | llama-v3p1-405b-instruct | Fireworks | 4% | 25% |

NT = Not Tested at pass@10; pass@5 is reported instead.
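For reference, this is roughly how the two score columns can be computed, assuming Average is the mean per-run solve rate across all runs and pass@10 counts a problem as passed if any of the 10 runs solved it. This is an illustrative sketch; the actual SherlockBench scoring code may differ.

```python
# Illustrative scoring sketch. `results[r][p]` is True if run r solved
# problem p; a fully tested model has 10 rows.
results = [
    [True, False, True],   # run 1
    [True, True,  False],  # run 2
    # ... one row per run
]

n_runs = len(results)
n_problems = len(results[0])

# Average: mean solve rate over every (run, problem) pair.
average = sum(sum(run) for run in results) / (n_runs * n_problems)

# pass@k (here k = n_runs): a problem counts if any run solved it.
pass_at_k = sum(
    any(results[r][p] for r in range(n_runs)) for p in range(n_problems)
) / n_problems

print(f"Average = {average:.0%}, pass@{n_runs} = {pass_at_k:.0%}")
```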
Provisional
Usually I test each model 10 times and average the score. However, due to budget or time constraints, the following models have only been tested once. Consider these scores approximate.
Model | Model ID | Provider | Score |
---|---|---|---|
Claude Sonnet 4 + extended thinking | claude-sonnet-4-20250514 | Anthropic | 58% |
Claude Opus 4 + extended thinking | claude-opus-4-20250514 | Anthropic | 50% |
Claude Opus 4 | claude-opus-4-20250514 | Anthropic | 42% |
Claude Sonnet 4 | claude-sonnet-4-20250514 | Anthropic | 25% |
Moriarty problem-set
Models which score at least 50% on the Sherlock1 problem-set get to try the Moriarty problem-set.
Model | Model ID | Provider | Average | pass@10 |
---|---|---|---|---|
o3 (high effort) | o3-2025-04-16 | OpenAI | 42% | NT (pass@5 = 70%) |
o3 (medium effort) | o3-2025-04-16 | OpenAI | 32% | 60% |
o4-mini (medium effort) | o4-mini-2025-04-16 | OpenAI | 25% | 50% |

NT = Not Tested at pass@10; pass@5 is reported instead.