An LLM Benchmarking System
SherlockBench is a benchmarking system designed to test an LLM's ability to pro-actively investigate a problem. The LLM is given a mystery function, and has to test it to determine what it does.
It tests "pro-active learning", which means it requires the LLM to reason about a strategy for its own learning.
Example problems
Due to the extreme cost of current frontier models (some costing several dollars to solve a single problem) I am no longer providing a leaderboard of all models. Instead, I provide a few example problems, with an explanation of how well different models perform on them.
Valley or Mountain
This is an easy problem for most LLMs. The LLM is provided with a function which takes three integers. The output of the function follows these rules (a sketch of such a function is shown after the list):
- if all equal -> "plateau"
- if middle number is smallest -> "valley"
- if middle number is largest -> "mountain"
- if descending -> "decline"
- if ascending -> "incline"
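To make the task concrete, here is a minimal Python sketch of what such a mystery function might look like. It is an illustration only, not SherlockBench's actual implementation, and the handling of ties outside the "plateau" case is my own assumption.

```python
def mystery(a: int, b: int, c: int) -> str:
    # Hypothetical reconstruction, not the benchmark's actual code.
    if a == b == c:
        return "plateau"
    if b < a and b < c:      # middle number is the smallest
        return "valley"
    if b > a and b > c:      # middle number is the largest
        return "mountain"
    if a > b > c:            # strictly descending
        return "decline"
    return "incline"         # remaining (ascending) case

print(mystery(1, 5, 2))  # "mountain"
print(mystery(3, 2, 1))  # "decline"
```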
```mermaid
xychart-beta
    title "Valley or Mountain"
    x-axis ["Claude", "Gemini", "GPT-5", "o3", "Grok 4", "Qwen3", "Deepseek", "Kimi-k2"]
    y-axis "success rate %" 0 --> 100
    bar [100, 100, 100, 100, 100, 50, 90, 100]
```
Why did Qwen do so badly? In its failed runs it never tried any numbers that would produce a mountain or a valley, only ascending, descending or equal numbers. Sample test logs:
Set Heading
This problem is easy for most humans but for some reason impossible for LLMs.
The function takes two integers. It treats them as a latitude and longitude, and returns directions to a secret co-ordinate. The LLM is expected to follow the directions and find the treasure.
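For illustration, here is a hedged Python sketch of how such a direction-giving function could behave. The secret co-ordinate, the wording of the directions and the return format are all assumptions; the benchmark's real function may differ.

```python
SECRET = (41, -3)  # made-up target; the real secret co-ordinate is hidden

def mystery(lat: int, lon: int) -> str:
    # Hypothetical sketch: report which way to move to reach the target.
    if (lat, lon) == SECRET:
        return "you found the treasure!"
    ns = "north" if lat < SECRET[0] else "south" if lat > SECRET[0] else ""
    ew = "east" if lon < SECRET[1] else "west" if lon > SECRET[1] else ""
    return "go " + " and ".join(d for d in (ns, ew) if d)

print(mystery(0, 0))    # "go north and west"
print(mystery(41, -3))  # "you found the treasure!"
```

Solving it only requires feeding the returned directions back into the next guess until the target is reached.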
```mermaid
xychart-beta
    title "Set Heading"
    x-axis ["Claude", "Gemini", "GPT-5", "o3", "Grok 4", "Qwen3", "Deepseek", "Kimi-k2"]
    y-axis "success rate %" 0 --> 100
    bar [0, 0, 0, 0, 0, 0, 0, 0]
```
In my opinion this is a super-easy problem, but most models just keep trying wild combinations that never lead to a result. Sample test logs:
Crack Lock
This problem is really easy once you figure out the rule, but again LLMs almost never get it right.
It represents a number lock, and you have to guess the code. However, it gives you a clue about how close you are as you go! Just tweak the inputs one at a time until the clue reaches zero to find the secret code.
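Here is a small Python sketch of the idea, assuming for illustration a three-digit code and a clue equal to the sum of per-digit distances; the real lock's size, code and clue formula are not documented here.

```python
SECRET_CODE = (4, 7, 1)  # made-up code; the real one is hidden from the solver

def mystery(a: int, b: int, c: int) -> int:
    # Hypothetical clue: total distance between the guess and the secret code.
    # A return value of 0 means the lock is cracked.
    return sum(abs(g - s) for g, s in zip((a, b, c), SECRET_CODE))

print(mystery(0, 0, 0))  # 12
print(mystery(4, 0, 0))  # 8 (closer: only the first digit changed)
print(mystery(4, 7, 1))  # 0 (cracked)
```

Changing one input at a time and watching whether the clue rises or falls is enough to converge on the code digit by digit.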
```mermaid
xychart-beta
    title "Crack Lock"
    x-axis ["Claude", "Gemini", "GPT-5", "o3", "Grok 4", "Qwen3", "Deepseek", "Kimi-k2"]
    y-axis "success rate %" 0 --> 100
    bar [0, 10, 0, 0, 0, 0, 0, 0]
```
The one time Gemini got it right was sheer luck: it guessed the code on its second test. Sample test logs:
If you want to work through these three problems yourself, navigate to our Test Me page and select the "Irene1" problem-set.
Appendix
Method
The results on this page were collected using:
- the "Irene1" problem-set
- with 10 attempts per problem
- with the 2-phase test mode (the model investigates and decides in a single context window)
Models
Here are the models used above and how we configured them:
- Claude
- Anthropic's Claude Sonnet 4 + extended thinking
- Gemini
- Google's Gemini 2.5 Pro
- GPT-5
- OpenAI's GPT-5 with "reasoning" set to "medium"
- o3
- OpenAI's o3 with "reasoning" set to "medium"
- Grok 4
- xAI's Grok 4
- Qwen3
- Alibaba's Qwen3 235B-A22B
- Deepseek
- Deepseek's Deepseek R1 0528
- Kimi-k2
- Moonshot's Kimi-k2