The SherlockBench Problem-sets
SherlockBench problem-sets are files written in Clojure that the API server can load. Each has its own repository on GitHub.
Irene1 Problem-Set
This is our newest problem-set. It is deliberately small, which makes it easy to run, and it is designed to test fluid intelligence.
GitHub Link: sherlockbench-problems-irene1
Mycroft1 Problem-Set
This problem-set contains many problems that strong models solve only some of the time.
GitHub Link: sherlockbench-problems-mycroft1
Sherlock1 Problem-Set
This is the problem-set we use in our research paper. It has a mixture of easy and hard problems.
GitHub Link: sherlockbench-problems-sherlock1
Sherlock1 Leaderboard
| Model | Provider | Average | pass@10 |
|---|---|---|---|
| o4-mini (high effort) o4-mini-2025-04-16 (high effort) | OpenAI | 73% | 92% |
| Grok 4 grok-4-0709 | xAI | 72% | Not tested (pass@3 = 83%) |
| o3 (high effort) o3-2025-04-16 (high effort) | OpenAI | 72% | 83% |
| o3 (medium effort) o3-2025-04-16 (medium effort) | OpenAI | 73% | 83% |
| Claude Sonnet 4 + extended thinking claude-sonnet-4-20250514 + thinking | Anthropic | 64% | Not tested (pass@3 = 83%) |
| o4-mini (medium effort) o4-mini-2025-04-16 (medium effort) | OpenAI | 63% | 83% |
| Qwen3 235B-A22B qwen3-235b-a22b-instruct-2507 | Fireworks | 50% | Not tested (pass@3 = 67%) |
| Gemini 2.5 Pro gemini-2.5-pro-preview-06-05 | | 49% | 83% |
| Deepseek R1 0528 DeepSeek-R1-0528 | Deepseek | 48% | 83% |
| GPT-4.1 gpt-4.1-2025-04-14 | OpenAI | 46% | 75% |
| Grok-3-mini (high effort) grok-3-mini-beta (high effort) | xAI | 45% | 75% |
| o3-mini (medium effort) o3-mini-2025-01-31 (medium effort) | OpenAI | 43% | 58% |
| Kimi-k2 kimi-k2-0711-preview | Moonshot | 42% | Not tested (pass@3 = 50%) |
| Grok 3 grok-3-beta | xAI | 41% | 67% |
| GPT-4.1 mini gpt-4.1-mini-2025-04-14 | OpenAI | 38% | 75% |
| Deepseek V3 deepseek-v3-0324 | Deepseek | 36% | 67% |
| Grok-3-mini (low effort) grok-3-mini-beta (low effort) | xAI | 35% | 67% |
| Gemini 2.5 Flash gemini-2.5-flash-preview-05-20 | | 31% | 58% |
| Gemini 2.5 Flash Lite gemini-2.5-flash-lite-preview-06-17 | | 20% | 58% |
| GPT-4o gpt-4o-2024-08-06 | OpenAI | 16% | 42% |
| GPT-4o mini gpt-4o-mini-2024-07-18 | OpenAI | 10% | 25% |
| Claude 3.5 Haiku claude-3-5-haiku-20241022 | Anthropic | 8% | 17% |
| Llama 3.1 405B Instruct llama-v3p1-405b-instruct | Fireworks | 4% | 25% |
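This page does not define the Average and pass@10 columns. As a minimal sketch, assuming Average is the success rate over all individual attempts and pass@k is the share of problems solved at least once within k attempts, the two figures could be computed as below. The function names and the sample results are hypothetical and are not taken from the SherlockBench code.

```python
from typing import Dict, List


def average_score(runs: Dict[str, List[bool]]) -> float:
    """Success rate over every individual attempt (assumed meaning of 'Average')."""
    attempts = [ok for results in runs.values() for ok in results]
    return sum(attempts) / len(attempts)


def pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Share of problems solved at least once in the first k attempts (assumed meaning of pass@k)."""
    solved = sum(1 for results in runs.values() if any(results[:k]))
    return solved / len(runs)


# Hypothetical results: problem name -> outcome of each attempt.
example_runs = {
    "problem-a": [True, True, False, True],
    "problem-b": [False, False, False, False],
    "problem-c": [False, True, False, False],
}

print(f"Average: {average_score(example_runs):.0%}")
print(f"pass@3:  {pass_at_k(example_runs, 3):.0%}")
```

Under these assumptions, a model can have a modest Average and a much higher pass@k, because a problem counts as passed as soon as any one of the k attempts succeeds.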
Moriarty1 Problem-Set
This problem-set is designed for testing frontier reasoning models which get very high scores on Sherlock1.
GitHub Link: sherlockbench-problems-moriarty1
Moriarty1 Leaderboard
| Model | Provider | Average | pass@10 |
|---|---|---|---|
| o3 (high effort) o3-2025-04-16 (high effort) | OpenAI | 45% | 70% |
| o4-mini (high effort) o4-mini-2025-04-16 (high effort) | OpenAI | 38% | Not tested (pass@5 = 70%) |
| Grok 4 grok-4-0709 | xAI | 37% | Not tested (pass@3 = 60%) |
| Claude Sonnet 4 + extended thinking claude-sonnet-4-20250514 + thinking | Anthropic | 33% | Not tested (pass@3 = 40%) |
| o3 (medium effort) o3-2025-04-16 (medium effort) | OpenAI | 32% | 60% |
| o4-mini (medium effort) o4-mini-2025-04-16 (medium effort) | OpenAI | 25% | 50% |
InterroBench Problem-Set
This is a port of the version 6 problem-set from the old InterroBench benchmark.
GitHub Link: sherlockbench-problems-interrobench6