VisualPuzzles

Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

VisualPuzzles: a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. One major source of our questions is logical reasoning questions manually translated from the Chinese Civil Service Examination (Logic Test, 中国国家公务员考试行测·逻辑推理). VisualPuzzles consists of 1168 diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. Each puzzle is labeled as easy, medium, or hard.
Complex Reasoning: Humans outperform models on easy and medium tasks, while both degrade on harder ones. No current model surpasses even the 5th-percentile human performance.
Knowledge Light: Experiments show that VisualPuzzles is less knowledge-intensive than existing reasoning benchmarks, and models already possess the knowledge required to solve it. Possessing more knowledge does not strongly correlate with better performance on VisualPuzzles.

Correspondence to: {yueqis,tianyueo,gneubig,xyue2}@cs.cmu.edu

Overview

Figure 1: Model accuracy on VisualPuzzles compared to human performance percentiles. All evaluated models fall below the human 5th percentile (57.5%), highlighting the difficulty of VisualPuzzles. Interestingly, models with explicit "thinking" modes do not consistently outperform their base versions, suggesting that current reasoning strategies do not yet generalize well to VisualPuzzles's scenarios, even though these strategies have proven effective in existing reasoning tasks that often rely heavily on domain-specific knowledge.

Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

Category Statistics
Total Questions: 1168
– Algorithmic Reasoning: 262
– Analogical Reasoning: 211
– Deductive Reasoning: 200
– Inductive Reasoning: 209
– Spatial Reasoning: 286
Easy / Medium / Hard: 46% / 39% / 15%
Option Type (Image / Text): 57% / 43%
Avg. Question Length: 154.9
% Easy Words: 54%
Table 1: Statistics of VisualPuzzles

VisualPuzzles includes five core reasoning categories: Algorithmic Reasoning, which involves reasoning over algorithmic rules; Analogical Reasoning, which requires analyzing the relationships between a pair of entities; Deductive Reasoning, which involves logically drawing conclusions from known premises; Inductive Reasoning, which focuses on generalizing rules from observed patterns; and Spatial Reasoning, which requires interpreting and manipulating spatial relationships. Each question is also labeled as easy, medium, or hard, based on annotators' estimated cognitive load and time-to-solve metrics.

Experimental Results

Model / Size / Algorithmic / Analogical / Deductive / Inductive / Spatial / Overall
Table 2: Performance of different models on VisualPuzzles. The leaderboard is interactive; for example, click "Size" to sort the models by size and compare their performance.

Disentangling Reasoning from Domain Knowledge

Is VisualPuzzles less knowledge-intensive than existing reasoning benchmarks?

This question is central to our goal of disentangling reasoning ability from domain-specific knowledge. Many current benchmarks blur this line, making it difficult to assess general reasoning in non-expert settings. VisualPuzzles was designed to target visual reasoning skills while deliberately minimizing reliance on specialized knowledge. To test whether VisualPuzzles achieves this goal, we prompted GPT-4o to generate "knowledge concept checklists" for 50 randomly selected questions from MMMU, a widely used knowledge-intensive reasoning dataset, and 50 from VisualPuzzles. Each checklist comprises knowledge-specific questions intended to assess whether a model possesses the background information required to solve the original problem. For example, if a question depends on understanding two distinct physics laws, its checklist would include a question about each. The number of checklist items per instance serves as a proxy for knowledge intensity. As shown in Table 3, MMMU problems yielded significantly more checklist items on average (3.9) than VisualPuzzles (1.1). This supports the hypothesis that VisualPuzzles is substantially less reliant on domain knowledge. As a result, performance on VisualPuzzles more directly reflects a model's ability to reason over visual and textual content, offering a clearer signal of progress in multimodal reasoning.

Benchmark # Knowledge Qs.
MMMU 3.9
VisualPuzzles 1.1
Table 3: Average number of knowledge concept questions generated per instance on MMMU vs. VisualPuzzles.
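As a concrete illustration of this checklist-based measurement, the sketch below prompts GPT-4o for a knowledge checklist per question and uses the average number of items as a knowledge-intensity proxy. The prompt wording, the JSON output format, and the text-only simplification (our actual questions are multimodal) are assumptions for illustration, not the exact setup used in our study.

```python
# Hypothetical sketch of the knowledge-intensity measurement described above.
# Assumptions: prompt wording, JSON output format, and text-only inputs.
import json
from openai import OpenAI

client = OpenAI()

CHECKLIST_PROMPT = (
    "List the background knowledge concepts a solver must already know to answer "
    "the question below. Return only a JSON list of short questions, one per "
    "concept (e.g., [\"What does Ohm's law state?\"]). Return [] if no specialized "
    "knowledge is required.\n\nQuestion: {question}"
)

def checklist_size(question_text: str) -> int:
    """Ask GPT-4o for a knowledge checklist and count its items."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": CHECKLIST_PROMPT.format(question=question_text)}],
    )
    return len(json.loads(response.choices[0].message.content))  # assumes a clean JSON list

def knowledge_intensity(questions: list[str]) -> float:
    """Average checklist size over a sample of benchmark questions."""
    return sum(checklist_size(q) for q in questions) / len(questions)
```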

Do models already possess the knowledge required to solve VisualPuzzles?

To explore this, we measured models' knowledge accuracy, i.e., their ability to answer the knowledge checklist questions correctly, on both benchmarks. This metric reflects how much of the required knowledge a model already possesses, independent of reasoning. We found a stark contrast: while many models exceed 90% knowledge accuracy on VisualPuzzles, most score below 60% on MMMU, with smaller models frequently dropping under 50%. Only the largest models approach 80% accuracy on MMMU, underscoring its heavier reliance on domain-specific knowledge.
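A minimal sketch of this metric is shown below; `ask_model` and `is_correct` are hypothetical stand-ins for a model call and an answer check (e.g., string matching or an LLM judge), not functions from our codebase.

```python
# Hypothetical sketch of the knowledge-accuracy metric: the fraction of checklist
# questions a model answers correctly, independent of the original puzzle.
def knowledge_accuracy(checklists: list[list[dict]], ask_model, is_correct) -> float:
    correct = total = 0
    for checklist in checklists:            # one checklist per benchmark question
        for item in checklist:              # item = {"question": ..., "answer": ...} (assumed schema)
            prediction = ask_model(item["question"])
            correct += bool(is_correct(prediction, item["answer"]))
            total += 1
    return correct / total if total else 0.0
```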

Does scaling up model size improve performance?

Figure 2: Scatter plots with trend lines of the relationship between accuracy and model size (top) and the relationship between reasoning and knowledge accuracy (bottom) on MMMU and VisualPuzzles. The dots' sizes represent relative model sizes. The correlation between reasoning accuracy and knowledge accuracy is higher on MMMU (0.8) than on VisualPuzzles (0.4).

We also plot reasoning accuracy (i.e., overall performance on the benchmark) in Figure 2, revealing some interesting trends:
• MMMU. Larger models tend to have higher knowledge accuracy, and this often translates into higher overall benchmark performance. This aligns with MMMU's reliance on domain-specific understanding; models with more parameters and training data are better at recalling relevant factual knowledge, thus improving their overall performance.
• VisualPuzzles. Although many models achieve near-100% knowledge accuracy on VisualPuzzles, we observe no clear increase in reasoning accuracy as model size grows. In contrast to MMMU, simply scaling the number of parameters does not guarantee better performance on VisualPuzzles, implying that further gains must stem from improvements in models' reasoning abilities rather than from reliance on extensive knowledge.

What is the relationship between knowledge and reasoning?

Figure 2 also shows two scatter plots with trend lines that measure how knowledge accuracy correlates with reasoning accuracy across different open models, where the relative sizes of the dots represent the sizes of the models. On MMMU (left), there is a strong positive correlation (0.8), suggesting that possessing more knowledge strongly correlates with better reasoning performance. In contrast, VisualPuzzles (right) exhibits a more modest correlation (0.4). Although there is still an upward trend, gains in knowledge accuracy lead to smaller improvements in reasoning accuracy. This discrepancy implies that while overcoming knowledge gaps is central to reasoning success on MMMU, VisualPuzzles tasks demand more nuanced inference steps that depend less on domain knowledge.
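The correlation itself is a standard Pearson coefficient computed across models; a minimal sketch is given below (the usage note describes the general pattern, not specific values beyond those reported above).

```python
# Pearson correlation between per-model knowledge accuracy and reasoning
# (overall benchmark) accuracy, as plotted in Figure 2.
import numpy as np

def knowledge_reasoning_correlation(knowledge_acc: list[float], reasoning_acc: list[float]) -> float:
    return float(np.corrcoef(knowledge_acc, reasoning_acc)[0, 1])

# Called once per benchmark with per-model accuracies, this reproduces the kind of
# contrast reported above (a stronger correlation on MMMU than on VisualPuzzles).
```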

Do questions in VisualPuzzles require more complex reasoning than those in existing benchmarks like MMMU?

Model MMMU VisualPuzzles
GPT-4o 75.1% 87.0%
Gemini-2.0-Flash 67.9% 77.3%
Table 4: Percentage of solution steps classified as logical reasoning steps on each benchmark.

Besides observing that models generally achieve lower accuracy on VisualPuzzles than on MMMU, we further investigated whether this gap stems from increased reasoning complexity. To do so, we measured the proportion of reasoning steps required to solve each question. We began by gathering detailed, step-by-step solutions from the models for each question, which were manually verified for completeness. We then used an LLM to classify whether each step is a logical reasoning step. The results are shown in Table 4. On average, logical reasoning steps make up a 14.8% (relative) larger share of the solution on VisualPuzzles than on MMMU (82.1% vs. 71.5%). This analysis is based on GPT-4o and Gemini-2.0-Flash across 200 randomly sampled questions per benchmark. These results suggest that VisualPuzzles demands more extensive reasoning, aligning with its goal of evaluating deeper multimodal reasoning beyond factual recall.
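A minimal sketch of this step-level analysis is below; `is_logical_reasoning_step` stands in for the LLM-based classifier (its prompt is not shown here and is an assumption).

```python
# Hypothetical sketch of the analysis behind Table 4: given a verified step-by-step
# solution, compute the fraction of steps classified as logical reasoning steps.
def reasoning_step_ratio(solution_steps: list[str], is_logical_reasoning_step) -> float:
    if not solution_steps:
        return 0.0
    flags = [bool(is_logical_reasoning_step(step)) for step in solution_steps]
    return sum(flags) / len(flags)
```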

Do Reasoning Models Perform Better than Their Baselines?

Figure 3: Comparison of accuracy and average number of total completion tokens for reasoning models and their general counterparts on VisualPuzzles. We do not include the Gemini-2.0-Flash models here because Gemini-2.0-Flash-Thinking does not reveal the number of reasoning tokens in its responses. The accuracies of Gemini-2.0-Flash and Gemini-2.0-Flash-Thinking are 45.0% and 42.2%, respectively. Despite a much higher number of completion tokens, reasoning models often do not achieve better performance on VisualPuzzles.

Recent reasoning models often scale up inference compute by generating longer chains of thought (CoTs) to enhance reasoning ability. To assess the effectiveness of this strategy on VisualPuzzles, we compare several reasoning models with their non-reasoning counterparts in Figure 3. The reasoning model o1 outperforms GPT-4o overall. However, structured "thinking" modes, despite producing far more completion tokens, show no consistent benefit.
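For reference, the comparison in Figure 3 reduces to tracking two quantities per model, accuracy and average completion tokens; a minimal sketch under assumed helpers is below (`run_model` and `is_correct` are hypothetical).

```python
# Hypothetical sketch of the per-model measurement behind Figure 3.
# `run_model(question)` is assumed to return (answer_text, completion_tokens),
# where the token count is whatever the serving API reports for that response.
def accuracy_and_avg_tokens(questions, run_model, is_correct):
    correct, tokens = 0, 0
    for question in questions:
        answer, completion_tokens = run_model(question)
        correct += bool(is_correct(question, answer))
        tokens += completion_tokens
    n = len(questions)
    return correct / n, tokens / n
```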

Are Branching and Revalidation Reasoning Patterns Effective on VisualPuzzles?

Figure 4: Comparison of the reasoning patterns of Claude-3.7-Sonnet-Thinking on MMMU and VisualPuzzles. The left panel compares the accuracy of Claude-3.7-Sonnet and Claude-3.7-Sonnet-Thinking on MMMU and VisualPuzzles. The middle panel shows the frequency of each pattern. The right panel shows the correlation of each pattern with accuracy on the two benchmarks.

To better understand this discrepancy, we examine the reasoning behaviors present in Claude-3.7-Sonnet-Thinking's long CoTs, specifically branching and re-validation, which are known to play important roles in enhancing reasoning performance. As shown in Figure 4, our analysis reveals a striking contrast between benchmarks. On MMMU, both branching and re-validation correlate positively with model accuracy. These strategies help models explore alternative reasoning paths and revisit earlier steps, aiding in the retrieval of relevant factual knowledge, an essential component for solving MMMU's knowledge-intensive questions. Surprisingly, on VisualPuzzles these reasoning behaviors are more frequent, yet less predictive of success. Despite their increased presence in long-form responses, we observe no significant correlation between these strategies and task accuracy. This suggests that models may be using branching and re-validation in ways that do not meaningfully contribute to solving the problem. Figure 5 highlights this with an example from Claude-3.7-Sonnet-Thinking, where the model applies branching on a VisualPuzzles puzzle. However, the additional reasoning paths remain shallow and fail to engage with the core challenge: understanding the spatial arrangement of chairs in the image.

Figure 5: An example of Claude-3.7-Sonnet-Thinking utilizing branching to solve a VisualPuzzles puzzle.
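The pattern-versus-accuracy analysis above can be approximated with a point-biserial correlation (a Pearson correlation against a binary correctness outcome); the sketch below assumes per-response pattern counts have already been annotated, e.g., by an LLM judge.

```python
# Hypothetical sketch of the Figure 4 (right) analysis: correlate how often a
# reasoning pattern (branching or re-validation) appears in a CoT with whether
# the final answer was correct.
import numpy as np

def pattern_accuracy_correlation(pattern_counts: list[int], is_correct: list[bool]) -> float:
    return float(np.corrcoef(pattern_counts, np.asarray(is_correct, dtype=float))[0, 1])
```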

Analysis

Do Models Approach VisualPuzzles Questions Differently?

Benchmark Answer-First Option-First
MMMU 29.3% 70.7%
VisualPuzzles (Image Options) 72.5% 27.5%
VisualPuzzles (Text Options) 98.3% 1.7%
Table 5: Answering strategy of Claude-3.7-Sonnet-Thinking on MMMU and VisualPuzzles.

Table 5 shows the statistics of Claude-3.7-Sonnet-Thinking's answering strategy. We observe a clear divergence in answering strategies between MMMU and VisualPuzzles. On MMMU, the model tends to follow an option-driven approach, using the provided choices early to eliminate unlikely answers and select the most relevant one, often without explicitly solving the problem. In contrast, on VisualPuzzles the model more frequently adopts an answer-first strategy, attempting to solve the question independently before comparing the result to the answer choices. This pattern holds across both textual and image-based options, though the option-first approach appears more often (around 30% of the time) for image-based options, likely due to the added complexity of visual comparison.
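A sketch of how such a strategy breakdown could be produced with an LLM judge is below; the prompt wording and the `judge` callable are assumptions for illustration, not our exact annotation setup.

```python
# Hypothetical sketch of the answering-strategy classification behind Table 5.
from collections import Counter

STRATEGY_PROMPT = (
    "Does the following solution derive an answer first and then match it to the "
    "options (answer-first), or reason mainly by comparing and eliminating the given "
    "options (option-first)? Reply with exactly one label: answer-first or option-first.\n\n{response}"
)

def strategy_distribution(responses: list[str], judge) -> dict[str, float]:
    """`judge` is a hypothetical callable that sends a prompt to an LLM and returns its text."""
    labels = Counter(judge(STRATEGY_PROMPT.format(response=r)).strip().lower() for r in responses)
    total = sum(labels.values())
    return {label: count / total for label, count in labels.items()}
```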

Does model performance transfer between reasoning categories?

Figure 6: Correlation Heatmap among reasoning categories for models (averaged across all models we evaluated).

Figure 6 presents a correlation heatmap illustrating the relationships among the five reasoning categories in VisualPuzzles. We report correlations computed across all models in Table 2. For humans, each reasoning category likely engages different cognitive processes, so performance in one category might not transfer to another. However, the correlation heatmap of the models tells a different story. We observe correlations across reasoning categories ranging from 0.11 to as high as 0.94, with several pairs showing notably strong associations. In particular, algorithmic and deductive reasoning show a high correlation (0.94), and other pairs such as algorithmic-analogical and deductive-analogical also exhibit strong associations. This suggests that model performance tends to generalize across categories. However, this generalization may not reflect true reasoning abilities. Instead, the high correlations could indicate that models are leveraging shared surface-level patterns or shortcut strategies that happen to work across multiple structurally different categories, unlike humans, who may rely on distinct cognitive processes.
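Computationally, the heatmap boils down to pairwise Pearson correlations between category accuracies across models; a minimal sketch is below (the matrix layout is an assumption about how the per-model results are organized).

```python
# Hypothetical sketch of the Figure 6 heatmap: given a matrix of per-model accuracies
# (rows = models from Table 2, columns = the five reasoning categories), compute the
# 5x5 matrix of pairwise Pearson correlations between categories.
import numpy as np

CATEGORIES = ["Algorithmic", "Analogical", "Deductive", "Inductive", "Spatial"]

def category_correlation_matrix(accuracy: np.ndarray) -> np.ndarray:
    """accuracy has shape (num_models, 5); returns a 5x5 correlation matrix."""
    return np.corrcoef(accuracy, rowvar=False)
```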

Error Analysis

Figure 7: Error Distribution of Claude-3.7-Sonnet-Thinking.

Figure 7 shows a pie chart illustrating the distribution of error categories across 100 instances generated by Claude-3.7-Sonnet-Thinking on VisualPuzzles. Reasoning errors dominate at 56%, reinforcing that reasoning is the greatest challenge models face on VisualPuzzles. Perceptual errors (21%) and spatial/orientation errors (17%) also constitute substantial portions of failures, reflecting difficulties in interpreting visual elements and understanding spatial relationships. These three categories together account for 94% of mistakes, emphasizing the need for multimodal models with stronger reasoning capabilities as well as more robust perception and spatial understanding. Textual and visual understanding errors (4%) and reject-to-answer cases (2%) are relatively rare.

Conclusion

We presented VisualPuzzles, a novel multimodal benchmark carefully designed to minimize the impact of domain-specific knowledge and isolate models’ core reasoning capabilities. Our results show that while proprietary and large-scale open models achieve relatively higher performance, they still fall short of human-level reasoning—especially on more complex tasks such as analogical and inductive reasoning. Moreover, we observe that strong performance on knowledge-intensive benchmarks like MathVista and MMMU does not necessarily translate into high accuracy on VisualPuzzles, underscoring the distinct challenge of knowledge-light reasoning tasks.
These findings suggest that purely scaling model size and knowledge resources may not suffice for robust multimodal reasoning skills; rather, methods that promote structured reasoning, such as explicit thinking modes or recursive reasoning steps, can offer substantial improvements, particularly for hard questions. Future research can explore new training strategies, specialized architectures, or model interpretations tailored to reduce reliance on memorized facts and enhance logical inference. Extending VisualPuzzles to include additional types of multi-image reasoning or temporally dynamic visual information may further stress-test models' core inference abilities. By disentangling domain knowledge from multimodal reasoning, we hope VisualPuzzles will serve as a valuable tool for developing and evaluating next-generation MLLMs that excel at genuinely understanding and reasoning about the world without depending heavily on specialized factual knowledge.

Acknowledgement

This project was supported in part by a grant from DSTA Singapore and the Carnegie Bosch Institute. The authors would like to thank CMU NeuLab colleagues for their constructive comments, all volunteers who participated in the human evaluation, and the Cambrian team for their project webpage template.

BibTeX

@article{song2025visualpuzzles,
  title={VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge},
  author={Song, Yueqi and Ou, Tianyue and Kong, Yibo and Li, Zecheng and Neubig, Graham and Yue, Xiang},
  journal={arXiv preprint arXiv:2504.10342},
  year={2025}
}