We present CODE-VISION, a benchmark designed to evaluate the logical understanding and reasoning capabilities of Multimodal Large Language Models (MLLMs) through code generation. CODE-VISION challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. CODE-VISION comprises three subsets, HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' reasoning abilities across basic programming, algorithmic, and mathematical problem-solving domains. We evaluate 12 MLLMs on CODE-VISION. The results reveal a large performance gap between proprietary and open-source models: on hard problems, GPT-4o achieves 79.3% pass@1, while the best open-source model achieves only 15%. Further experiments show that CODE-VISION poses unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also investigate the reasons behind the poor performance of open-source models. All of our source code and data will be released on GitHub.
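For concreteness, pass@1 in this single-sample setting is simply the percentage of problems whose generated program passes all of that problem's test cases. The snippet below is a minimal, generic scoring sketch rather than CODE-VISION's actual evaluation harness; the `candidate` and `tests` field names are assumptions, and a real harness would sandbox and time-limit the executed code.

```python
# Minimal sketch of single-sample pass@1 scoring. Field names
# ("candidate", "tests") are illustrative, not CODE-VISION's schema.

def passes_tests(candidate: str, tests: str) -> bool:
    """Execute the generated program and the problem's assert-based tests
    in a fresh namespace; any exception or failed assert counts as a fail."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # define the generated function(s)
        exec(tests, namespace)      # run the problem's test cases
        return True
    except Exception:
        return False


def pass_at_1(problems: list[dict]) -> float:
    """With one sampled program per problem, pass@1 is the percentage of
    problems whose sample passes every test case."""
    solved = sum(passes_tests(p["candidate"], p["tests"]) for p in problems)
    return 100.0 * solved / len(problems)


# Example: one toy problem in the HumanEval style.
toy = [{
    "candidate": "def add(a, b):\n    return a + b",
    "tests": "assert add(1, 2) == 3\nassert add(-1, 1) == 0",
}]
print(pass_at_1(toy))  # 100.0
```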
| Model | HumanEval-V | Algorithm Easy | Algorithm Medium | Algorithm Hard | Algorithm Overall | MATH Easy | MATH Medium | MATH Hard | MATH Overall | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | |
| GPT-4o | 94.5 | 91.1 | 89.3 | 79.3 | 87.9 | 100.0 | 97.5 | 77.5 | 92.0 | 91.5 |
| Gemini 1.5 Pro (May 2024) | 89.6 | 88.9 | 72.0 | 44.8 | 71.8 | 95.6 | 92.5 | 57.5 | 82.4 | 81.3 |
| Claude 3.5 Sonnet 20240620 | 82.3 | 84.4 | 77.3 | 48.3 | 73.8 | 97.8 | 95.0 | 52.5 | 82.4 | 79.5 |
| Claude 3 Sonnet | 53.0 | 31.1 | 14.7 | 3.4 | 17.4 | 88.9 | 72.5 | 17.5 | 60.8 | 43.7 |
| Claude 3 Haiku | 48.8 | 22.2 | 4.0 | 0.0 | 8.7 | 73.3 | 62.5 | 12.5 | 50.4 | 37.0 |
| Gemini 1.5 Flash (April 2024) | 72.0 | 40.0 | 32.0 | 10.3 | 30.2 | 86.7 | 72.5 | 20.0 | 60.8 | 57.1 |
| **Open-source Models** | | | | | | | | | | |
| Llama-3.2-90B-Vision-Instruct | 40.9 | 17.8 | 8.0 | 0.0 | 9.4 | 80.0 | 75.5 | 15.0 | 57.6 | 36.9 |
| Llama-3.2-11B-Vision-Instruct | 29.3 | 8.9 | 1.3 | 0.0 | 3.4 | 62.2 | 27.5 | 2.5 | 32.0 | 23.8 |
| Phi-3-vision-128k-instruct | 29.3 | 15.6 | 4.0 | 0.0 | 6.7 | 35.6 | 7.5 | 2.5 | 16.0 | 19.6 |
| Phi-3.5-vision-instruct | 28.0 | 8.9 | 0.0 | 0.0 | 2.7 | 33.3 | 7.5 | 0.0 | 14.4 | 17.7 |
| MiniCPM-V 2.6 | 40.2 | 22.2 | 9.3 | 0.0 | 11.4 | 46.7 | 15.0 | 2.5 | 22.4 | 27.8 |
| Qwen-VL-Plus | 17.1 | 11.1 | 0.0 | 0.0 | 3.4 | 13.3 | 2.5 | 0.0 | 5.6 | 11.7 |
CODE-VISION consists of three subsets, HumanEval-V, Algorithm, and MATH, covering different domains and difficulty levels (Easy, Medium, Hard). HumanEval-V contains 164 problems, Algorithm contains 149 (45 Easy, 75 Medium, 29 Hard), and MATH contains 126 (45 Easy, 41 Medium, 40 Hard). The number of test cases also varies across subsets: HumanEval-V averages 8.08 test cases per problem and MATH 8.92, while Algorithm maintains 100 test cases per problem. In terms of flowchart complexity, the Algorithm subset has the highest average node and edge counts, followed by MATH and then HumanEval-V.
| | HumanEval-V | Algorithm Easy | Algorithm Medium | Algorithm Hard | Algorithm Total/Avg. | MATH Easy | MATH Medium | MATH Hard | MATH Total/Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Problems | 164 | 45 | 75 | 29 | 149 | 45 | 41 | 40 | 126 |
| Avg. Test Cases | 8.08 | 100.00 | 100.00 | 100.00 | 100.00 | 9.04 | 8.95 | 8.78 | 8.92 |
| Avg. Flowchart Nodes | 10.90 | 12.62 | 15.76 | 18.86 | 15.75 | 10.67 | 12.38 | 19.20 | 14.08 |
| Avg. Flowchart Edges | 11.07 | 13.29 | 15.87 | 19.14 | 16.10 | 10.49 | 12.00 | 19.08 | 13.85 |
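The averages above can be recomputed directly from per-problem metadata. The following is a hypothetical sketch that assumes each problem record carries a `subset`, a `difficulty`, a list of `test_cases`, and a flowchart stored as explicit node and edge lists; this schema is an assumption for illustration, not the benchmark's released format.

```python
# Hypothetical recomputation of the statistics table, assuming a JSON file of
# problem records with "subset", "difficulty", "test_cases", and "flowchart"
# fields; this schema is an assumption, not CODE-VISION's released format.
import json
from collections import defaultdict
from statistics import mean


def summarize(path: str) -> dict:
    """Group problems by (subset, difficulty) and report the counts and
    averages shown in the statistics table."""
    with open(path, encoding="utf-8") as f:
        problems = json.load(f)

    groups = defaultdict(list)
    for p in problems:
        groups[(p["subset"], p.get("difficulty", "-"))].append(p)

    stats = {}
    for key, items in groups.items():
        stats[key] = {
            "problems": len(items),
            "avg_test_cases": mean(len(p["test_cases"]) for p in items),
            "avg_nodes": mean(len(p["flowchart"]["nodes"]) for p in items),
            "avg_edges": mean(len(p["flowchart"]["edges"]) for p in items),
        }
    return stats


# Example usage (file name is illustrative):
# stats = summarize("code_vision.json")
# stats[("Algorithm", "Hard")]  # -> {"problems": 29, "avg_nodes": 18.86, ...}
```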
Figure 1: Comparison between the CODE-VISION and MMCode datasets.