We present CODE-VISION, a benchmark designed to evaluate the logical understanding and reasoning capabilities of Multimodal Large Language Models (MLLMs) through code generation. CODE-VISION challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. CODE-VISION comprises three subsets, HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' reasoning abilities across basic programming, algorithmic, and mathematical problem-solving domains. We evaluate 12 MLLMs on CODE-VISION. The results reveal a large performance gap between proprietary and open-source models: on hard problems, GPT-4o achieves 79.3% pass@1, while the best open-source model achieves only 15%. Further experiments show that CODE-VISION poses unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also investigate the reasons behind the poor performance of open-source models. All of our source code and data will be released on GitHub.
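For concreteness, pass@1 in this single-sample setting is simply the percentage of problems whose generated program passes all of that problem's test cases. The snippet below is a minimal, generic scoring sketch rather than CODE-VISION's actual evaluation harness; the `candidate` and `tests` field names are assumptions, and a real harness would sandbox and time-limit the executed code.

```python
# Minimal sketch of single-sample pass@1 scoring. Field names
# ("candidate", "tests") are illustrative, not CODE-VISION's schema.

def passes_tests(candidate: str, tests: str) -> bool:
    """Execute the generated program and the problem's assert-based tests
    in a fresh namespace; any exception or failed assert counts as a fail."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # define the generated function(s)
        exec(tests, namespace)      # run the problem's test cases
        return True
    except Exception:
        return False


def pass_at_1(problems: list[dict]) -> float:
    """With one sampled program per problem, pass@1 is the percentage of
    problems whose sample passes every test case."""
    solved = sum(passes_tests(p["candidate"], p["tests"]) for p in problems)
    return 100.0 * solved / len(problems)


# Example: one toy problem in the HumanEval style.
toy = [{
    "candidate": "def add(a, b):\n    return a + b",
    "tests": "assert add(1, 2) == 3\nassert add(-1, 1) == 0",
}]
print(pass_at_1(toy))  # 100.0
```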
| Model | HumanEval-V | Algorithm Easy | Algorithm Medium | Algorithm Hard | Algorithm Overall | MATH Easy | MATH Medium | MATH Hard | MATH Overall | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | |
| GPT-4o | 94.5 | 91.1 | 89.3 | 79.3 | 87.9 | 100.0 | 97.5 | 77.5 | 92.0 | 91.5 |
| Gemini 1.5 Pro (May 2024) | 89.6 | 88.9 | 72.0 | 44.8 | 71.8 | 95.6 | 92.5 | 57.5 | 82.4 | 81.3 |
| Claude 3.5 Sonnet 20240620 | 82.3 | 84.4 | 77.3 | 48.3 | 73.8 | 97.8 | 95.0 | 52.5 | 82.4 | 79.5 |
| Claude 3 Sonnet | 53.0 | 31.1 | 14.7 | 3.4 | 17.4 | 88.9 | 72.5 | 17.5 | 60.8 | 43.7 |
| Claude 3 Haiku | 48.8 | 22.2 | 4.0 | 0.0 | 8.7 | 73.3 | 62.5 | 12.5 | 50.4 | 37.0 |
| Gemini 1.5 Flash (April 2024) | 72.0 | 40.0 | 32.0 | 10.3 | 30.2 | 86.7 | 72.5 | 20.0 | 60.8 | 57.1 |
| **Open-source Models** | | | | | | | | | | |
| Llama-3.2-90B-Vision-Instruct | 40.9 | 17.8 | 8.0 | 0.0 | 9.4 | 80.0 | 75.5 | 15.0 | 57.6 | 36.9 |
| Llama-3.2-11B-Vision-Instruct | 29.3 | 8.9 | 1.3 | 0.0 | 3.4 | 62.2 | 27.5 | 2.5 | 32.0 | 23.8 |
| Phi-3-vision-128k-instruct | 29.3 | 15.6 | 4.0 | 0.0 | 6.7 | 35.6 | 7.5 | 2.5 | 16.0 | 19.6 |
| Phi-3.5-vision-instruct | 28.0 | 8.9 | 0.0 | 0.0 | 2.7 | 33.3 | 7.5 | 0.0 | 14.4 | 17.7 |
| MiniCPM-V 2.6 | 40.2 | 22.2 | 9.3 | 0.0 | 11.4 | 46.7 | 15.0 | 2.5 | 22.4 | 27.8 |
| Qwen-VL-Plus | 17.1 | 11.1 | 0.0 | 0.0 | 3.4 | 13.3 | 2.5 | 0.0 | 5.6 | 11.7 |
CODE-VISION consists of three subsets, HumanEval-V, Algorithm, and MATH, covering different domains and difficulty levels (Easy, Medium, Hard). HumanEval-V contains 164 problems, Algorithm contains 149 (45 Easy, 75 Medium, 29 Hard), and MATH contains 126 (45 Easy, 41 Medium, 40 Hard). The number of test cases also varies across subsets: HumanEval-V averages 8.08 test cases per problem and MATH 8.92, while Algorithm maintains 100 test cases per problem. In terms of flowchart complexity, the Algorithm subset has the highest average node and edge counts, followed by MATH and then HumanEval-V.
| | HumanEval-V | Algorithm Easy | Algorithm Medium | Algorithm Hard | Algorithm Total/Avg. | MATH Easy | MATH Medium | MATH Hard | MATH Total/Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Problems | 164 | 45 | 75 | 29 | 149 | 45 | 41 | 40 | 126 |
| Avg. Test Cases | 8.08 | 100.00 | 100.00 | 100.00 | 100.00 | 9.04 | 8.95 | 8.78 | 8.92 |
| Avg. Flowchart Nodes | 10.90 | 12.62 | 15.76 | 18.86 | 15.75 | 10.67 | 12.38 | 19.20 | 14.08 |
| Avg. Flowchart Edges | 11.07 | 13.29 | 15.87 | 19.14 | 16.10 | 10.49 | 12.00 | 19.08 | 13.85 |
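The averages above can be recomputed directly from per-problem metadata. The following is a hypothetical sketch that assumes each problem record carries a `subset`, a `difficulty`, a list of `test_cases`, and a flowchart stored as explicit node and edge lists; this schema is an assumption for illustration, not the benchmark's released format.

```python
# Hypothetical recomputation of the statistics table, assuming a JSON file of
# problem records with "subset", "difficulty", "test_cases", and "flowchart"
# fields; this schema is an assumption, not CODE-VISION's released format.
import json
from collections import defaultdict
from statistics import mean


def summarize(path: str) -> dict:
    """Group problems by (subset, difficulty) and report the counts and
    averages shown in the statistics table."""
    with open(path, encoding="utf-8") as f:
        problems = json.load(f)

    groups = defaultdict(list)
    for p in problems:
        groups[(p["subset"], p.get("difficulty", "-"))].append(p)

    stats = {}
    for key, items in groups.items():
        stats[key] = {
            "problems": len(items),
            "avg_test_cases": mean(len(p["test_cases"]) for p in items),
            "avg_nodes": mean(len(p["flowchart"]["nodes"]) for p in items),
            "avg_edges": mean(len(p["flowchart"]["edges"]) for p in items),
        }
    return stats


# Example usage (file name is illustrative):
# stats = summarize("code_vision.json")
# stats[("Algorithm", "Hard")]  # -> {"problems": 29, "avg_nodes": 18.86, ...}
```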
Figure 1: Comparison between the CODE-VISION and MMCode datasets.