SherlockBench

An LLM Benchmarking System

SherlockBench is a benchmarking system designed to test an LLM's ability to proactively investigate a problem. The model is given a mystery function and must experiment with it to determine what it does.

It tests "proactive learning", meaning the LLM has to reason about a strategy for its own learning.

Example problems

Due to the extreme cost of current frontier models (some cost several dollars to solve a single problem), I am no longer providing a leaderboard of all models. Instead, this page presents a few example problems, with an explanation of how well different models perform on them.

Valley or Mountain

This is an easy problem for most LLMs. The model is given a function which takes three integers. The output of the function follows these rules (a sketch follows the list):

  • if all three numbers are equal -> "plateau"
  • if the middle number is the smallest -> "valley"
  • if the middle number is the largest -> "mountain"
  • if the numbers are descending -> "decline"
  • if the numbers are ascending -> "incline"
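
For illustration, here is a minimal Python sketch of a function following the rules above. The function name and the handling of partial ties are my own assumptions; the benchmark's real implementation is hidden.

    def valley_or_mountain(a: int, b: int, c: int) -> str:
        # Illustrative sketch only; behaviour on partial ties (e.g. 5, 5, 3)
        # is not specified by the rules above.
        if a == b == c:
            return "plateau"
        if b < a and b < c:
            return "valley"
        if b > a and b > c:
            return "mountain"
        if a > b > c:
            return "decline"
        if a < b < c:
            return "incline"
        raise ValueError("input not covered by the rules listed above")

For example, valley_or_mountain(3, 1, 3) returns "valley", exactly the kind of case Qwen3 tended not to probe.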
[Chart: Valley or Mountain] Success rate (%): Claude 100, Gemini 100, GPT-5 100, o3 100, Grok 4 100, Qwen3 50, Deepseek 90, Kimi-k2 100

Why did Qwen do so badly? In many runs it never tried any inputs that would produce a mountain or valley, testing only ascending, descending, or all-equal numbers. Sample test logs:

Set Heading

This problem is easy for most humans but for some reason impossible for LLMs.

The function takes two integers. It treats them as a latitude and longitude, and returns directions to a secret coordinate. The LLM is expected to follow the directions and find the treasure.
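
The exact output format of the directions is not documented on this page, so the following Python sketch is only an assumption: a hidden target coordinate plus a compass-style hint toward it. The target values and the wording of the hint are entirely made up.

    def set_heading(lat: int, lon: int) -> str:
        # Hypothetical sketch; the real function's output format may differ.
        TARGET_LAT, TARGET_LON = 17, -42  # made-up secret coordinate
        if (lat, lon) == (TARGET_LAT, TARGET_LON):
            return "you found the treasure"
        north_south = "north" if lat < TARGET_LAT else "south" if lat > TARGET_LAT else ""
        east_west = "east" if lon < TARGET_LON else "west" if lon > TARGET_LON else ""
        return "go " + "-".join(d for d in (north_south, east_west) if d)

With a function like this, the intended strategy is simply to apply each hint to the previous guess and call the function again until the treasure is found.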

[Chart: Set Heading] Success rate (%): Claude 0, Gemini 0, GPT-5 0, o3 0, Grok 4 0, Qwen3 0, Deepseek 0, Kimi-k2 0

In my opinion this is a super-easy problem, yet most models just keep trying wild input combinations that never lead to a result. Sample test logs:

Crack Lock

This problem is really easy once you figure out the rule, but again LLMs almost never get it right.

The function represents a number lock, and you have to guess the code. However, it gives you a clue about how close you are as you go! Just tweak the inputs one at a time until the returned number reaches zero, and you have found the secret code.
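
The number of inputs and the exact closeness metric are not given here, so this Python sketch is only an assumption: a three-digit code whose clue is the summed distance from the secret digits, reaching zero exactly when the code is correct.

    def crack_lock(a: int, b: int, c: int) -> int:
        # Hypothetical sketch; the real code and closeness metric are hidden.
        SECRET = (3, 7, 1)  # made-up code
        return abs(a - SECRET[0]) + abs(b - SECRET[1]) + abs(c - SECRET[2])

With a clue like this, changing one input at a time and watching whether the returned number rises or falls is enough to home in on each digit.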

[Chart: Crack Lock] Success rate (%): Claude 0, Gemini 10, GPT-5 0, o3 0, Grok 4 0, Qwen3 0, Deepseek 0, Kimi-k2 0

The one time Gemini got it right was sheer luck: it happened to guess the code on its second test. Sample test logs:

If you want to work through these three problems yourself, navigate to our Test Me page and select the "Irene1" problem-set.

Appendix

Method

The results on this page were collected using:

  • the "Irene1" problem-set
  • with 10 attempts per problem
  • with the 2-phase test mode (the model investigates and decides in a single context window)

Models

Here are the models used above and how we configured them:
  • Claude: Anthropic's Claude Sonnet 4 with extended thinking
  • Gemini: Google's Gemini 2.5 Pro
  • GPT-5: OpenAI's GPT-5 with "reasoning" set to "medium"
  • o3: OpenAI's o3 with "reasoning" set to "medium"
  • Grok 4: xAI's Grok 4
  • Qwen3: Alibaba's Qwen3 235B-A22B
  • Deepseek: DeepSeek's DeepSeek R1 0528
  • Kimi-k2: Moonshot's Kimi K2