LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

LR²Bench is a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR²Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios.

Note: We have released the LR²Bench dataset here. For evaluation, submit your model's answers here, following the submission guidelines. The leaderboard automatically evaluates submissions with rule-based matching. For further questions, please contact us at chenjianghao2022@ia.ac.cn.

Model	#Params (B)	Completion Rate	Subtask Accuracy	Exact Match	Partial Match (0.5)	Tokens
OpenAI-o1-preview	<NA>	96.3	58.7	23.6	61.7	11436
deepseek-ai/DeepSeek-R1	671	100.0	58.4	20.0	62.0	9856
OpenAI-o1-mini	<NA>	97.7	41.3	9.1	32.8	9576
Qwen/QwQ-32B-Preview	32	65.0	26.6	8.5	19.3	6709
Gemini-2.0-flash-thinking	<NA>	88.2	39.4	4.3	35.0	3725
OpenAI-gpt-4o	<NA>	99.8	43.7	3.2	41.7	1486
mistralai/Mistral-Large-Instruct-2411	123	96.1	36.4	2.5	30.0	2313
Gemini-2.0-flash	<NA>	81.1	37.0	2.4	34.5	2637
meta-llama/Llama-3.3-70B-Instruct	70	92.4	33.1	1.3	25.8	1842
Qwen/Qwen2.5-72B-Instruct	72	95.0	33.9	0.9	20.8	2359
Qwen/Qwen2.5-32B-Instruct	32	96.2	29.9	0.6	14.8	1924
meta-llama/Llama-3.1-70B-Instruct	70	71.8	27.4	0.5	21.9	2090
Qwen/Qwen2.5-7B-Instruct	7	85.1	17.7	0.3	5.1	2086
mistralai/Mistral-Small-Instruct-2409	22	91.0	23.1	0.2	13.3	2273
mistralai/Mistral-7B-Instruct-v0.3	7	85.8	12.1	0.0	2.3	2736
meta-llama/Llama-3.1-8B-Instruct	8	42.6	9.9	0.0	3.8	2478
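
The column headers above are not defined on this page. The following is a minimal sketch of one plausible reading, not the official scorer. It assumes each sample is split into subtasks (for example, one crossword word or one Sudoku cell), that Completion Rate counts samples for which an answer could be extracted, that Subtask Accuracy averages the fraction of correct subtasks over all samples, that Exact Match requires every subtask to be correct, and that Partial Match (0.5) requires at least half of them to be correct. The function name `score_samples` and its input format are hypothetical.

```python
from typing import Dict, List, Optional


def score_samples(
    predictions: List[Optional[List[str]]],
    references: List[List[str]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Return leaderboard-style percentages under the assumptions stated above.

    predictions[i] is None when no answer could be extracted for sample i;
    otherwise it holds one predicted value per subtask.
    """
    if not references or len(predictions) != len(references):
        raise ValueError("need one prediction slot per reference sample")

    completed = exact = partial = 0
    accuracy_sum = 0.0
    for pred, ref in zip(predictions, references):
        if pred is None:
            continue  # incomplete sample: contributes zero to every metric
        completed += 1
        # Subtasks missing from a short prediction simply count as wrong.
        correct = sum(p == r for p, r in zip(pred, ref))
        accuracy = correct / len(ref)
        accuracy_sum += accuracy
        exact += int(correct == len(ref))
        partial += int(accuracy >= threshold)

    n = len(references)
    return {
        "completion_rate": 100 * completed / n,
        "subtask_accuracy": 100 * accuracy_sum / n,
        "exact_match": 100 * exact / n,
        "partial_match": 100 * partial / n,
    }
```

Under this reading, Exact Match can never exceed Partial Match (0.5), which agrees with every row in the tables on this page.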

The Crossword task requires inferring the correct words from given clues and filling them into a grid. A key challenge lies in satisfying the constraint that horizontal and vertical words share the same letter wherever they intersect. We collected 150 Crossword samples published in 2024 from the Los Angeles Times and Vulture in three sizes: $5 \times 5$, $10 \times 10$, and $15 \times 15$, with 50 samples of each size.
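
The intersection constraint can be illustrated with a short sketch. The entry format `(row, col, direction, answer)` and the function name `find_conflicts` are assumptions made for this example and do not reflect the dataset's actual schema.

```python
from typing import Dict, List, Tuple

# (row, col, direction, answer); direction is "A" (across) or "D" (down).
Entry = Tuple[int, int, str, str]


def find_conflicts(entries: List[Entry]) -> List[Tuple[int, int]]:
    """Place every answer on a shared grid and return the clashing cells."""
    grid: Dict[Tuple[int, int], str] = {}
    conflicts: List[Tuple[int, int]] = []
    for row, col, direction, word in entries:
        for offset, letter in enumerate(word.upper()):
            cell = (row, col + offset) if direction == "A" else (row + offset, col)
            if cell in grid and grid[cell] != letter:
                conflicts.append(cell)  # two words disagree on this cell
            else:
                grid[cell] = letter
    return conflicts


consistent = [(0, 0, "A", "CAT"), (0, 0, "D", "COW"), (0, 2, "D", "TEN")]
clashing = [(0, 0, "A", "CAT"), (0, 0, "D", "COW"), (0, 2, "D", "PEN")]
print(find_conflicts(consistent))  # no clashing cells
print(find_conflicts(clashing))    # the cell shared by CAT and PEN clashes
```

A word that fits its own clue can still be wrong if it clashes at an intersection, which is why a solver may need to go back and revise earlier answers.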

Model	#Params (B)	Completion Rate	Subtask Accuracy	Exact Match	Partial Match (0.5)	Tokens
OpenAI-o1-preview	<NA>	98.0	77.7	24.7	89.3	10098
deepseek-ai/DeepSeek-R1	671	100.0	75.3	16.7	94.0	9810
mistralai/Mistral-Large-Instruct-2411	123	99.3	62.8	2.0	86.0	3237
Gemini-2.0-flash-thinking	<NA>	94.7	57.7	1.3	79.3	2648
OpenAI-gpt-4o	<NA>	100.0	66.0	1.3	86.7	1726
OpenAI-o1-mini	<NA>	95.3	45.5	1.3	54.0	7840
meta-llama/Llama-3.1-70B-Instruct	70	77.3	46.8	0.0	62.0	3072
Gemini-2.0-flash	<NA>	98.7	61.6	0.0	83.3	2555
mistralai/Mistral-7B-Instruct-v0.3	7	94.0	23.0	0.0	6.7	3655
meta-llama/Llama-3.3-70B-Instruct	70	85.3	47.6	0.0	65.3	2613
meta-llama/Llama-3.1-8B-Instruct	8	61.3	23.3	0.0	14.0	2888
mistralai/Mistral-Small-Instruct-2409	22	98.7	48.3	0.0	54.0	3135
Qwen/QwQ-32B-Preview	32	80.0	30.2	0.0	18.0	4817
Qwen/Qwen2.5-32B-Instruct	32	100.0	34.6	0.0	20.0	2560
Qwen/Qwen2.5-72B-Instruct	72	100.0	44.1	0.0	36.7	2735
Qwen/Qwen2.5-7B-Instruct	7	98.7	21.1	0.0	3.3	2441

Like Crossword, the Acrostic task involves word clues, but its objective is to form a hidden quotation or sentence from the answers to those clues. The answer words must therefore not only satisfy their corresponding clues but also combine to spell out the hidden message. We collected 50 easy and 50 hard Acrostic samples from Printable Puzzles, with timestamps ranging from September 2024 to December 2024.
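
How clue answers combine into the hidden message can be sketched as follows. This assumes, purely for illustration, that each letter of each answer is tagged with its index in the quotation; the function name `assemble_quote` and the data layout are hypothetical.

```python
from typing import List, Optional


def assemble_quote(answers: List[str], positions: List[List[int]], length: int) -> str:
    """Write each answer letter into its quotation slot; '_' marks unfilled slots."""
    quote: List[Optional[str]] = [None] * length
    for word, slots in zip(answers, positions):
        if len(word) != len(slots):
            raise ValueError(f"answer {word!r} does not fit its {len(slots)} slots")
        for letter, index in zip(word.upper(), slots):
            if quote[index] not in (None, letter):
                raise ValueError(f"slot {index} receives two different letters")
            quote[index] = letter
    return "".join(ch if ch is not None else "_" for ch in quote)


# Toy instance: the quotation letters are "TOBE" (from "TO BE").
print(assemble_quote(["BO", "ET"], [[2, 1], [3, 0]], 4))  # TOBE
print(assemble_quote(["BO"], [[2, 1]], 4))                # _OB_
```

A partially filled quotation such as `_OB_` gives feedback in the other direction too: a guessed word in the quotation constrains the letters of clue answers that are still unsolved.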

Model	#Params (B)	Completion Rate	Subtask Accuracy	Partial Match (0.5)	Tokens
deepseek-ai/DeepSeek-R1	671	100.0	62.2	83.0	10077
Gemini-2.0-flash	<NA>	98.0	48.0	48.0	4020
Gemini-2.0-flash-thinking	<NA>	92.0	40.7	27.0	4257
meta-llama/Llama-3.1-70B-Instruct	70	84.0	35.8	21.0	3565
meta-llama/Llama-3.1-8B-Instruct	8	43.0	5.6	0.0	3712
meta-llama/Llama-3.3-70B-Instruct	70	97.0	40.8	28.0	3584
mistralai/Mistral-7B-Instruct-v0.3	7	75.0	7.9	0.0	4600
mistralai/Mistral-Large-Instruct-2411	123	98.0	39.4	20.0	4279
mistralai/Mistral-Small-Instruct-2409	22	67.0	5.5	0.0	4171
OpenAI-gpt-4o	<NA>	100.0	56.0	67.0	3229
OpenAI-o1-mini	<NA>	97.0	34.7	12.0	10952
OpenAI-o1-preview	<NA>	100.0	67.2	90.0	14847
Qwen/QwQ-32B-Preview	32	97.0	31.6	6.0	4964
Qwen/Qwen2.5-32B-Instruct	32	100.0	31.8	2.0	4073
Qwen/Qwen2.5-72B-Instruct	72	100.0	39.3	18.0	4111
Qwen/Qwen2.5-7B-Instruct	7	42.0	3.6	0.0	4159

The Logic Puzzle task requires logical reasoning to deduce the relationships among a set of entities from the given constraints and clues. The objective is to analyze the given information systematically, using techniques such as hypothesis formation, elimination, and deductive inference, to determine the unique solution that satisfies all constraints. We collected 50 puzzles for each of the four sizes ($4 \times 4$, $4 \times 5$, $4 \times 6$, and $4 \times 7$) from Printable Puzzles, with timestamps ranging from September 2024 to December 2024.
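
For intuition, a toy puzzle of this kind can be solved by exhaustive elimination. This sketch swaps in brute-force search over permutations for the step-by-step deduction described above, and the puzzle, names, and clues are invented for illustration rather than taken from the benchmark.

```python
from itertools import permutations
from typing import Callable, Dict, List

Assignment = Dict[str, str]
Clue = Callable[[Assignment], bool]


def solve(people: List[str], pets: List[str], clues: List[Clue]) -> List[Assignment]:
    """Try every one-to-one pairing and keep those that satisfy all clues."""
    solutions = []
    for ordering in permutations(pets):
        assignment = dict(zip(people, ordering))
        if all(clue(assignment) for clue in clues):
            solutions.append(assignment)
    return solutions


people = ["Ann", "Bob", "Cy"]
pets = ["cat", "dog", "eel"]
clues = [
    lambda a: a["Ann"] != "dog",  # Ann does not own the dog.
    lambda a: a["Bob"] == "eel",  # Bob owns the eel.
]
print(solve(people, pets, clues))  # exactly one pairing survives
```

Exhaustive search is only practical for tiny instances. A puzzle with four categories of seven items each has far too many joint assignments to enumerate naively, which is why clue-by-clue elimination matters.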

Model	#Params (B)	Completion Rate	Subtask Accuracy	Exact Match	Partial Match (0.5)	Tokens
deepseek-ai/DeepSeek-R1	671	100.0	69.4	42.5	68.0	9205
OpenAI-o1-preview	<NA>	99.0	68.8	41.0	68.5	9449
OpenAI-o1-mini	<NA>	99.0	57.2	23.5	53.5	10242
Qwen/QwQ-32B-Preview	32	78.5	46.3	19.5	48.0	9524
Gemini-2.0-flash-thinking	<NA>	99.0	45.9	8.0	37.5	4038
OpenAI-gpt-4o	<NA>	100.0	39.3	3.5	29.5	953
mistralai/Mistral-Large-Instruct-2411	123	100.0	38.3	3.0	30.5	1637
Gemini-2.0-flash	<NA>	58.0	24.2	2.0	20.0	2104
meta-llama/Llama-3.1-70B-Instruct	70	56.0	22.8	2.0	18.0	1165
meta-llama/Llama-3.3-70B-Instruct	70	80.5	32.2	1.0	25.0	1738
mistralai/Mistral-Small-Instruct-2409	22	99.5	30.7	0.5	12.5	1514
meta-llama/Llama-3.1-8B-Instruct	8	57.0	16.0	0.0	8.0	1293
mistralai/Mistral-7B-Instruct-v0.3	7	97.0	19.1	0.0	4.5	1618
Qwen/Qwen2.5-32B-Instruct	32	93.0	32.2	0.0	22.5	1208
Qwen/Qwen2.5-72B-Instruct	72	93.5	34.0	0.0	23.0	1810
Qwen/Qwen2.5-7B-Instruct	7	96.5	25.8	0.0	8.5	1396

The Cryptogram task involves decrypting an encrypted quotation or sentence in which each letter of the original text has been substituted with another, producing apparently nonsensical text. Decryption requires identifying patterns, common letter frequencies, and word structures to deduce the letter-to-letter correspondences and ultimately reconstruct the original content. We collected 50 easy and 50 hard samples from Printable Puzzles, with timestamps ranging from September 2024 to December 2024.
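
The substitution mechanics can be sketched with a partial key. The helper names and the toy ciphertext below are illustrative assumptions, not benchmark data; the sketch also assumes a one-to-one letter mapping.

```python
from collections import Counter
from typing import Dict, List, Tuple


def letter_frequencies(ciphertext: str) -> List[Tuple[str, int]]:
    """Count cipher letters, most frequent first, as a starting point for guesses."""
    return Counter(ch for ch in ciphertext.upper() if ch.isalpha()).most_common()


def is_one_to_one(key: Dict[str, str]) -> bool:
    """A substitution key must not map two cipher letters to the same plain letter."""
    return len(set(key.values())) == len(key)


def decrypt(ciphertext: str, key: Dict[str, str]) -> str:
    """Apply a (possibly partial) key; letters not yet deduced are shown as '_'."""
    return "".join(
        key.get(ch.upper(), "_") if ch.isalpha() else ch for ch in ciphertext
    )


key = {"X": "H", "Y": "E", "Z": "L", "W": "O"}
print(letter_frequencies("XYZZW, QWZZ")[0])  # ('Z', 4)
print(decrypt("XYZZW, QWZZ", key))           # HELLO, _OLL
```

The partially decrypted output is what drives revision: if no English word fits the pattern `_OLL` in context, one of the earlier letter guesses has to be reconsidered.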

Model	#Params (B)	Completion Rate	Subtask Accuracy	Exact Match	Partial Match (0.5)	Tokens
OpenAI-o1-preview	<NA>	92.0	34.8	13.0	29.0	12567
deepseek-ai/DeepSeek-R1	671	100.0	26.0	4.0	21.0	10344
OpenAI-o1-mini	<NA>	100.0	22.7	1.0	13.0	11208
Gemini-2.0-flash	<NA>	47.0	8.5	0.0	1.0	1585
meta-llama/Llama-3.1-8B-Instruct	8	43.0	2.3	0.0	0.0	2068
meta-llama/Llama-3.3-70B-Instruct	70	99.0	14.3	0.0	1.0	1137
Gemini-2.0-flash-thinking	<NA>	68.0	11.2	0.0	2.0	4167
meta-llama/Llama-3.1-70B-Instruct	70	62.0	6.9	0.0	1.0	1298
mistralai/Mistral-Large-Instruct-2411	123	96.0	13.7	0.0	1.0	1204
mistralai/Mistral-7B-Instruct-v0.3	7	99.0	4.3	0.0	0.0	1096
OpenAI-gpt-4o	<NA>	100.0	20.7	0.0	5.0	740
mistralai/Mistral-Small-Instruct-2409	22	95.0	7.0	0.0	0.0	1233
Qwen/QwQ-32B-Preview	32	47.0	3.6	0.0	0.0	6492
Qwen/Qwen2.5-32B-Instruct	32	89.0	9.8	0.0	0.0	1303
Qwen/Qwen2.5-72B-Instruct	72	85.0	11.8	0.0	0.0	1727
Qwen/Qwen2.5-7B-Instruct	7	81.0	3.5	0.0	0.0	1181

The Sudoku task consists of filling an $n^2 \times n^2$ grid with digits from 1 to $n^2$, subject to the constraint that each row, column, and $n \times n$ subgrid contains all digits from 1 to $n^2$ without repetition. Success in Sudoku relies on logical deduction and careful consideration of the existing digits to determine valid placements for the remaining numbers. From 1sudoku, we collected 200 Sudoku samples in total: 50 easy and 50 hard samples for each of the $4 \times 4$ and $9 \times 9$ sizes.
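
The three constraints translate directly into a validity check for a completed grid. This is a minimal sketch; the function name `is_valid_sudoku` and the list-of-lists grid representation are assumptions for illustration.

```python
from typing import List


def is_valid_sudoku(grid: List[List[int]], n: int) -> bool:
    """Check a filled n^2 x n^2 grid against the row, column, and subgrid rules."""
    size = n * n
    expected = set(range(1, size + 1))
    if len(grid) != size or any(len(row) != size for row in grid):
        return False
    rows_ok = all(set(row) == expected for row in grid)
    cols_ok = all(
        {grid[r][c] for r in range(size)} == expected for c in range(size)
    )
    boxes_ok = all(
        {grid[top + r][left + c] for r in range(n) for c in range(n)} == expected
        for top in range(0, size, n)
        for left in range(0, size, n)
    )
    return rows_ok and cols_ok and boxes_ok


solved = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
latin_only = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]]
print(is_valid_sudoku(solved, 2))      # True
print(is_valid_sudoku(latin_only, 2))  # False: rows and columns pass, subgrids do not
```

The second grid shows why all three constraints must hold together: it satisfies every row and column but repeats the digit 2 in its top-left $2 \times 2$ subgrid.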

Model	#Params (B)	Completion Rate	Subtask Accuracy	Exact Match	Partial Match (0.5)	Tokens
deepseek-ai/DeepSeek-R1	671	100.0	70.3	50.0	64.0	8277
OpenAI-o1-preview	<NA>	91.5	65.1	50.0	55.5	8062
Qwen/QwQ-32B-Preview	32	54.5	40.1	31.5	35.5	8381
OpenAI-o1-mini	<NA>	99.0	53.4	27.0	43.0	3961
Gemini-2.0-flash-thinking	<NA>	79.5	46.5	16.5	41.0	3853
OpenAI-gpt-4o	<NA>	100.0	52.2	14.5	48.0	1104
Gemini-2.0-flash	<NA>	93.0	45.3	12.5	37.5	2842
mistralai/Mistral-Large-Instruct-2411	123	85.5	39.5	10.0	33.5	1955
meta-llama/Llama-3.3-70B-Instruct	70	93.5	34.8	7.0	22.5	1062
Qwen/Qwen2.5-72B-Instruct	72	97.5	43.0	5.5	34.0	2013
Qwen/Qwen2.5-32B-Instruct	32	100.0	42.8	3.5	30.5	1202
Qwen/Qwen2.5-7B-Instruct	7	94.5	30.2	1.5	15.0	1486
meta-llama/Llama-3.1-70B-Instruct	70	69.5	24.2	1.0	17.5	1940
mistralai/Mistral-Small-Instruct-2409	22	89.0	20.5	0.5	7.5	1968
mistralai/Mistral-7B-Instruct-v0.3	7	84.0	11.9	0.0	1.5	3108
meta-llama/Llama-3.1-8B-Instruct	8	7.5	1.2	0.0	0.0	2782

The Drop Quote task comprises a grid of multiple rows and columns, with each column providing a set of candidate letters. The task requires determining the correct row for each letter in its column, effectively "dropping" the letters into their target places to reveal the hidden quotation. We created 50 easy samples by manually compiling common quotations, and collected 50 hard samples from Printable Puzzles, with timestamps ranging from September 2024 to December 2024.
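
The column constraint can be checked mechanically, as in the sketch below. The representation (one candidate string per column, grid rows as strings, `#` for blocked cells) and the function name are assumptions for illustration, not the dataset's format.

```python
from collections import Counter
from typing import List


def uses_column_letters(candidates: List[str], rows: List[str], blank: str = "#") -> bool:
    """Check that every column of the filled grid uses exactly its candidate letters."""
    if any(len(row) != len(candidates) for row in rows):
        return False
    for col, letters in enumerate(candidates):
        placed = [row[col] for row in rows if row[col] != blank]
        if Counter(placed) != Counter(letters):
            return False
    return True


# Toy instance hiding "CARPE DIEM" in a 2 x 5 grid.
candidates = ["C", "AD", "IR", "EP", "EM"]
print(uses_column_letters(candidates, ["CARPE", "#DIEM"]))  # True
print(uses_column_letters(candidates, ["CDRPE", "#AIEM"]))  # True, yet not a quotation
print(uses_column_letters(candidates, ["CARPE", "#DIEE"]))  # False
```

The second call shows that the column check is necessary but not sufficient: many arrangements use the right letters, and only one of them reads as the intended quotation.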

Model	#Params (B)	Completion Rate	Subtask Accuracy	Exact Match	Partial Match (0.5)	Tokens
OpenAI-o1-preview	<NA>	97.0	38.8	13.0	38.0	13595
deepseek-ai/DeepSeek-R1	671	100.0	47.3	7.0	42.0	11422
OpenAI-o1-mini	<NA>	96.0	34.3	2.0	21.0	13255
Gemini-2.0-flash	<NA>	92.0	34.3	0.0	17.0	2717
meta-llama/Llama-3.1-8B-Instruct	8	44.0	11.2	0.0	1.0	2123
meta-llama/Llama-3.3-70B-Instruct	70	99.0	29.0	0.0	13.0	918
Gemini-2.0-flash-thinking	<NA>	96.0	34.4	0.0	23.0	3386
meta-llama/Llama-3.1-70B-Instruct	70	82.0	27.7	0.0	12.0	1498
mistralai/Mistral-Large-Instruct-2411	123	98.0	24.7	0.0	9.0	1566
mistralai/Mistral-7B-Instruct-v0.3	7	66.0	6.6	0.0	1.0	2337
OpenAI-gpt-4o	<NA>	99.0	31.1	0.0	14.0	1165
mistralai/Mistral-Small-Instruct-2409	22	97.0	26.9	0.0	6.0	1615
Qwen/QwQ-32B-Preview	32	33.0	7.5	0.0	8.0	6078
Qwen/Qwen2.5-32B-Instruct	32	95.0	28.4	0.0	14.0	1197
Qwen/Qwen2.5-72B-Instruct	72	94.0	30.9	0.0	13.0	1757
Qwen/Qwen2.5-7B-Instruct	7	98.0	21.9	0.0	4.0	1852

✉️✨ Submit your results here!

Submission Template

See submission_template.json for details. The following example shows the JSON structure. In "config", "model_name" is your model name, "link" is your model link if available, "Params" is the number of parameters in billions if available, and "show_on_leaderboard" controls whether your model is shown on the leaderboard. In "results", each task maps to a list of entries with "tag", "level", and "answer" fields.

{
    "config": {
        "model_name": "deepseek-ai/DeepSeek-R1",
        "link": "https://huggingface.co/deepseek-ai/DeepSeek-R1",
        "Params": 671,
        "show_on_leaderboard": true
    },
    "results": {
        "crossword": [
            {"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}
        ],
        "acrostic": [
            {"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}
        ],
        "logic": [
            {"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}
        ],
        "cryptogram": [
            {"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}
        ],
        "sudoku": [
            {"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}
        ],
        "drop": [
            {"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}
        ]
    }
}
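
A submission with the structure above can be assembled and sanity-checked before upload. The sketch below is an assumption about what a reasonable local check looks like; the leaderboard's own validation rules are not shown on this page, and `validate_submission` is a hypothetical helper. Writing the file with Python's `json` module produces strict JSON, without comments or trailing commas.

```python
import json
from typing import Any, Dict, List

TASKS = ["crossword", "acrostic", "logic", "cryptogram", "sudoku", "drop"]
CONFIG_KEYS = ["model_name", "link", "Params", "show_on_leaderboard"]
ENTRY_KEYS = {"tag", "level", "answer"}


def validate_submission(submission: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    config = submission.get("config", {})
    for key in CONFIG_KEYS:
        if key not in config:
            problems.append(f"config is missing '{key}'")
    results = submission.get("results", {})
    for task in TASKS:
        entries = results.get(task)
        if not isinstance(entries, list):
            problems.append(f"results['{task}'] should be a list")
            continue
        for index, entry in enumerate(entries):
            if not isinstance(entry, dict) or set(entry) != ENTRY_KEYS:
                problems.append(f"results['{task}'][{index}] needs tag, level, answer")
    return problems


submission = {
    "config": {
        "model_name": "deepseek-ai/DeepSeek-R1",
        "link": "https://huggingface.co/deepseek-ai/DeepSeek-R1",
        "Params": 671,
        "show_on_leaderboard": True,
    },
    # Placeholder values, as in the template; replace with real tags and answers.
    "results": {
        task: [{"tag": "TAG", "level": "LEVEL", "answer": "ANSWER"}] for task in TASKS
    },
}

if not validate_submission(submission):
    with open("submission.json", "w", encoding="utf-8") as handle:
        json.dump(submission, handle, indent=4, ensure_ascii=False)
```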
