CODE-VISION
Evaluating MLLMs' Logic Understanding and Reasoning Capabilities Through Code Generation

* These authors contributed equally to this work.

1 Peking University

2 Northeastern University

3 Chinese Academy of Sciences

4 The University of Hong Kong

Abstract

We present CODE-VISION, a benchmark designed to evaluate the logical understanding and reasoning capabilities of Multimodal Large Language Models (MLLMs) through code generation. CODE-VISION challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. CODE-VISION comprises three subsets (HumanEval-V, Algorithm, and MATH) that evaluate MLLMs' reasoning abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on CODE-VISION. The results demonstrate a large performance gap between proprietary and open-source models: on hard problems, GPT-4o achieves a 79.3% pass@1, while the best open-source model reaches only 15%. Further experiments reveal that CODE-VISION poses unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also investigate the reasons behind the poor performance of open-source models. All of our source code and data will be released on GitHub.
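For reference, we assume the pass@1 metric above follows the standard unbiased pass@k estimator from Chen et al. (2021); the formula is not spelled out on this page, so the sketch below reflects common practice in code-generation benchmarks rather than CODE-VISION's documented implementation.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total candidate programs sampled for a problem
    c: number of candidates that pass all test cases
    k: the k in pass@k (k = 1 for the numbers reported here)
    """
    if n - c < k:
        return 1.0  # every size-k sample must contain a passing candidate
    # 1 - C(n-c, k) / C(n, k), computed as a running product for stability
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With a single sample per problem, pass@1 reduces to the fraction of
# problems whose one generated program passes all tests.
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```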

Leaderboard on CODE-VISION

| Model | HumanEval-V | Alg. Easy | Alg. Medium | Alg. Hard | Alg. Overall | MATH Easy | MATH Medium | MATH Hard | MATH Overall | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | |
| GPT-4o | 94.5 | 91.1 | 89.3 | 79.3 | 87.9 | 100.0 | 97.5 | 77.5 | 92.0 | 91.5 |
| Gemini 1.5 Pro (May 2024) | 89.6 | 88.9 | 72.0 | 44.8 | 71.8 | 95.6 | 92.5 | 57.5 | 82.4 | 81.3 |
| Claude 3.5 Sonnet (20240620) | 82.3 | 84.4 | 77.3 | 48.3 | 73.8 | 97.8 | 95.0 | 52.5 | 82.4 | 79.5 |
| Claude 3 Sonnet | 53.0 | 31.1 | 14.7 | 3.4 | 17.4 | 88.9 | 72.5 | 17.5 | 60.8 | 43.7 |
| Claude 3 Haiku | 48.8 | 22.2 | 4.0 | 0.0 | 8.7 | 73.3 | 62.5 | 12.5 | 50.4 | 37.0 |
| Gemini 1.5 Flash (April 2024) | 72.0 | 40.0 | 32.0 | 10.3 | 30.2 | 86.7 | 72.5 | 20.0 | 60.8 | 57.1 |
| Open-source Models | | | | | | | | | | |
| Llama-3.2-90B-Vision-Instruct | 40.9 | 17.8 | 8.0 | 0.0 | 9.4 | 80.0 | 75.5 | 15.0 | 57.6 | 36.9 |
| Llama-3.2-11B-Vision-Instruct | 29.3 | 8.9 | 1.3 | 0.0 | 3.4 | 62.2 | 27.5 | 2.5 | 32.0 | 23.8 |
| Phi-3-vision-128k-instruct | 29.3 | 15.6 | 4.0 | 0.0 | 6.7 | 35.6 | 7.5 | 2.5 | 16.0 | 19.6 |
| Phi-3.5-vision-instruct | 28.0 | 8.9 | 0.0 | 0.0 | 2.7 | 33.3 | 7.5 | 0.0 | 14.4 | 17.7 |
| MiniCPM-V 2.6 | 40.2 | 22.2 | 9.3 | 0.0 | 11.4 | 46.7 | 15.0 | 2.5 | 22.4 | 27.8 |
| Qwen-VL-Plus | 17.1 | 11.1 | 0.0 | 0.0 | 3.4 | 13.3 | 2.5 | 0.0 | 5.6 | 11.7 |
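As a sanity check on the table (our inference; the scoring formula is not stated on this page), the Overall columns are consistent with problem-count-weighted means of the per-difficulty scores, using the subset sizes from Table 1 below:

```python
# Hypothetical reconstruction of the Overall columns; the weights are the
# per-difficulty problem counts from Table 1 (Algorithm: 45/75/29, MATH: 45/41/40).
def weighted_overall(scores: tuple[float, ...], counts: tuple[int, ...]) -> float:
    return sum(s * c for s, c in zip(scores, counts)) / sum(counts)

ALG_COUNTS, MATH_COUNTS = (45, 75, 29), (45, 41, 40)
print(f"{weighted_overall((91.1, 89.3, 79.3), ALG_COUNTS):.1f}")    # 87.9 = GPT-4o Algorithm Overall
print(f"{weighted_overall((100.0, 97.5, 77.5), MATH_COUNTS):.1f}")  # 92.0 = GPT-4o MATH Overall
```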

Dataset Overview

CODE-VISION consists of three subsets: HumanEval-V, Algorithm, and MATH, covering different domains and three difficulty levels (Easy, Medium, Hard). HumanEval-V contains 164 problems, Algorithm 149 (45 Easy, 75 Medium, 29 Hard), and MATH 126 (45 Easy, 41 Medium, 40 Hard). The number of test cases per problem varies across subsets: HumanEval-V averages 8.08 and MATH 8.92, while every Algorithm problem has 100. In terms of flowchart complexity, the Algorithm subset has the highest average node and edge counts, followed by MATH and then HumanEval-V.

|  | HumanEval-V | Alg. Easy | Alg. Medium | Alg. Hard | Alg. Total/Avg. | MATH Easy | MATH Medium | MATH Hard | MATH Total/Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Problems | 164 | 45 | 75 | 29 | 149 | 45 | 41 | 40 | 126 |
| Avg. Test Cases | 8.08 | 100.00 | 100.00 | 100.00 | 100.00 | 9.04 | 8.95 | 8.78 | 8.92 |
| Avg. Flowchart Nodes | 10.90 | 12.62 | 15.76 | 18.86 | 15.75 | 10.67 | 12.38 | 19.20 | 14.08 |
| Avg. Flowchart Edges | 11.07 | 13.29 | 15.87 | 19.14 | 16.10 | 10.49 | 12.00 | 19.08 | 13.85 |

Table 1: Statistics of the CODE-VISION benchmark, including the number of problems, average test cases per problem, and flowchart complexity.
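To make these statistics concrete, below is a minimal sketch of what a CODE-VISION problem record and a pass/fail check could look like. All field names are hypothetical, since the data format has not been released yet, and a real harness must sandbox model-generated code before executing it.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class CodeVisionProblem:
    # Hypothetical schema; the released data may use different fields.
    problem_id: str          # e.g. "HumanEval-93"
    subset: str              # "HumanEval-V", "Algorithm", or "MATH"
    difficulty: str | None   # "Easy" / "Medium" / "Hard"; None for HumanEval-V
    flowchart_path: str      # flowchart image shown to the MLLM
    entry_point: str         # name of the function the model must implement
    test_cases: list[tuple]  # (args, expected_output) pairs

def passes_all_tests(candidate_src: str, problem: CodeVisionProblem) -> bool:
    """Return True if the generated program passes every test case.

    Illustration only: a real harness should isolate execution
    (timeouts, separate process) before running untrusted code.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)
        fn = namespace[problem.entry_point]
        return all(fn(*args) == expected for args, expected in problem.test_cases)
    except Exception:
        return False
```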

Dataset Construction

Figure 1: Comparison between the CODE-VISION and MMCode datasets.

Dataset Showcase

Figure 2: A sample problem from HumanEval-V: HumanEval-93.
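
For readers who cannot see the flowchart image: HumanEval-93 is, to the best of our knowledge, the "encode" task from the original HumanEval suite (swap the case of every letter, then replace each vowel with the letter two places ahead of it in the alphabet). A plausible reference solution that such a flowchart would encode:

```python
def encode(message: str) -> str:
    """Swap the case of every letter; replace each (case-swapped) vowel
    with the character two code points ahead of it."""
    vowels = set("aeiouAEIOU")
    out = []
    for ch in message:
        ch = ch.swapcase()
        if ch in vowels:
            ch = chr(ord(ch) + 2)
        out.append(ch)
    return "".join(out)

assert encode("test") == "TGST"
```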

Figure 3: A sample problem from HumanEval-V: HumanEval-94.

Figure 4: A sample problem from Algorithm: weekly-contest-360-maximize-value-of-function-in-a-ball-passing-game.

Figure 5: A sample problem from Algorithm: weekly-contest-369-minimum-equal-sum-of-two-arrays-after-replacing-zeros.

Figure 6: A sample problem from MATH: 3116. Kth Smallest Amount With Single Denomination Combination.

Figure 7: A sample problem from MATH: weekly-contest-378-find-longest-special-substring-that-occurs-thrice-ii.

BibTeX

        
         // some bibtex