LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

LR2Bench is a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR2Bench comprises 850 samples spanning six Constraint Satisfaction Problem (CSP) tasks in which reflective reasoning is crucial for deriving solutions that satisfy all given constraints. Each task type focuses on a distinct constraint pattern, such as knowledge-based, logical, or spatial constraints, providing a comprehensive evaluation across diverse problem-solving scenarios.

Note: We have released the LR2Bench dataset here. For evaluation, you can submit your model's answers here following the submission guidelines. The leaderboard will automatically evaluate performance via rule-based matching. If you have further questions, please feel free to contact us at chenjianghao2022@ia.ac.cn.
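Because the leaderboard scores submissions with rule-based matching, metrics such as Subtask Accuracy, Exact Match, and Partial Match (0.5) can be approximated locally before submitting. The sketch below is a minimal, hypothetical implementation: the answer format, field names, and exact metric definitions are assumptions for illustration and may differ from the official scoring code.

```python
# Minimal sketch of rule-based answer matching (assumed definitions, not the
# official LR2Bench scorer). Each puzzle is represented as a mapping from
# subtask identifiers (e.g., grid cells) to answer strings.

def _normalize(value) -> str:
    return str(value).strip().lower()

def score_puzzle(pred: dict, gold: dict) -> dict:
    """Score one puzzle by comparing predicted subtask answers to the gold answers."""
    correct = sum(
        1 for key, ans in gold.items()
        if _normalize(pred.get(key, "")) == _normalize(ans)
    )
    acc = correct / len(gold) if gold else 0.0
    return {
        "subtask_accuracy": acc,
        "exact_match": float(correct == len(gold)),  # all subtasks correct
        "partial_match_0.5": float(acc >= 0.5),      # at least half of the subtasks correct
    }

def aggregate(per_puzzle: list) -> dict:
    """Macro-average per-puzzle scores over the whole benchmark."""
    if not per_puzzle:
        return {}
    return {key: sum(p[key] for p in per_puzzle) / len(per_puzzle) for key in per_puzzle[0]}
```

Under this sketch, missing or unparseable subtask answers simply count as incorrect; the leaderboard's Completion Rate presumably tracks how many subtasks receive a parseable answer at all.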

| Model | #Params (B) | Completion Rate (%) | Subtask Accuracy (%) | Exact Match (%) | Partial Match (0.5) (%) | Tokens |
|-------|-------------|---------------------|----------------------|-----------------|-------------------------|--------|
| —     | 671         | 96.3                | 58.7                 | 23.6            | 61.7                    | 11436  |