SherlockBench

The SherlockBench Problem-sets

SherlockBench problem-sets are just files written in Clojure which can be loaded by the API server. Each has its own repo on GitHub.

Irene1 Problem-Set

This is our newest problem-set. It is kept very small so that it is easy to run, and it is designed to test fluid intelligence.

GitHub Link: sherlockbench-problems-irene1

Mycroft1 Problem-Set

This contains many problems that strong models solve only some of the time.

GitHub Link: sherlockbench-problems-mycroft1

Sherlock1 Problem-Set

This is the problem-set we use in our research paper. It has a mixture of easy and hard problems.

GitHub Link: sherlockbench-problems-sherlock1

Sherlock1 Leaderboard
| Model | Model ID | Provider | Average | pass@10 |
| --- | --- | --- | --- | --- |
| o4-mini (high effort) | o4-mini-2025-04-16 (high effort) | OpenAI | 73% | 92% |
| Grok 4 | grok-4-0709 | xAI | 72% | NT (Pass@3 = 83%) |
| o3 (high effort) | o3-2025-04-16 (high effort) | OpenAI | 72% | 83% |
| o3 (medium effort) | o3-2025-04-16 (medium effort) | OpenAI | 73% | 83% |
| Claude Sonnet 4 + extended thinking | claude-sonnet-4-20250514 + thinking | Anthropic | 64% | NT (Pass@3 = 83%) |
| o4-mini (medium effort) | o4-mini-2025-04-16 (medium effort) | OpenAI | 63% | 83% |
| Qwen3 235B-A22B | qwen3-235b-a22b-instruct-2507 | Fireworks | 50% | NT (Pass@3 = 67%) |
| Gemini 2.5 Pro | gemini-2.5-pro-preview-06-05 | Google | 49% | 83% |
| Deepseek R1 0528 | DeepSeek-R1-0528 | Deepseek | 48% | 83% |
| GPT-4.1 | gpt-4.1-2025-04-14 | OpenAI | 46% | 75% |
| Grok-3-mini (high effort) | grok-3-mini-beta (high effort) | xAI | 45% | 75% |
| o3-mini (medium effort) | o3-mini-2025-01-31 (medium effort) | OpenAI | 43% | 58% |
| Kimi-k2 | kimi-k2-0711-preview | Moonshot | 42% | NT (Pass@3 = 50%) |
| Grok 3 | grok-3-beta | xAI | 41% | 67% |
| GPT-4.1 mini | gpt-4.1-mini-2025-04-14 | OpenAI | 38% | 75% |
| Deepseek V3 | deepseek-v3-0324 | Deepseek | 36% | 67% |
| Grok-3-mini (low effort) | grok-3-mini-beta (low effort) | xAI | 35% | 67% |
| Gemini 2.5 Flash | gemini-2.5-flash-preview-05-20 | Google | 31% | 58% |
| Gemini 2.5 Flash Lite | gemini-2.5-flash-lite-preview-06-17 | Google | 20% | 58% |
| GPT-4o | gpt-4o-2024-08-06 | OpenAI | 16% | 42% |
| GPT-4o mini | gpt-4o-mini-2024-07-18 | OpenAI | 10% | 25% |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 | Anthropic | 8% | 17% |
| Llama 3.1 405B Instruct | llama-v3p1-405b-instruct | Fireworks | 4% | 25% |

NT = Not Tested at pass@10. Where a model was tested with fewer attempts, that result is shown in parentheses.
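
The leaderboard does not spell out how its two score columns are computed. As a rough illustration only, the sketch below assumes that "Average" is the share of all attempts that succeeded and that pass@k is the share of problems solved at least once within the first k attempts. The function names, the `results` layout, and the problem names are hypothetical and are not taken from the SherlockBench code. This is the simple empirical form of pass@k, not the unbiased estimator used in some other benchmarks.

```python
from typing import Dict, List


def average_score(runs: Dict[str, List[bool]]) -> float:
    """Share of all attempts, across all problems, that succeeded (assumed meaning of "Average")."""
    total_attempts = sum(len(attempts) for attempts in runs.values())
    passed_attempts = sum(sum(attempts) for attempts in runs.values())
    return passed_attempts / total_attempts


def pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Share of problems solved at least once within the first k attempts (assumed meaning of pass@k)."""
    solved = sum(1 for attempts in runs.values() if any(attempts[:k]))
    return solved / len(runs)


# Hypothetical data: four made-up problems, three attempts each.
results = {
    "problem-a": [True, True, False],
    "problem-b": [False, False, False],
    "problem-c": [False, True, False],
    "problem-d": [True, True, True],
}

print(f"Average: {average_score(results):.0%}")
print(f"pass@3:  {pass_at_k(results, 3):.0%}")
```

Under these assumptions, pass@k can only rise as k grows, which would be consistent with the pass@k column always being at least as high as the Average column.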

Moriarty1 Problem-Set

This problem-set is designed for testing frontier reasoning models which get very high scores on Sherlock1.

GitHub Link: sherlockbench-problems-moriarty1

Moriarty1 Leaderboard
| Model | Model ID | Provider | Average | pass@10 |
| --- | --- | --- | --- | --- |
| o3 (high effort) | o3-2025-04-16 (high effort) | OpenAI | 45% | 70% |
| o4-mini (high effort) | o4-mini-2025-04-16 (high effort) | OpenAI | 38% | NT (Pass@5 = 70%) |
| Grok 4 | grok-4-0709 | xAI | 37% | NT (Pass@3 = 60%) |
| Claude Sonnet 4 + extended thinking | claude-sonnet-4-20250514 + thinking | Anthropic | 33% | NT (Pass@3 = 40%) |
| o3 (medium effort) | o3-2025-04-16 (medium effort) | OpenAI | 32% | 60% |
| o4-mini (medium effort) | o4-mini-2025-04-16 (medium effort) | OpenAI | 25% | 50% |

NT = Not Tested at pass@10. Where a model was tested with fewer attempts, that result is shown in parentheses.

InterroBench Problem-Set

This is a port of the version 6 problem-set from the old InterroBench benchmark.

GitHub Link: sherlockbench-problems-interrobench6