MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities across a total of 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
Qwen-VL-Chat achieves state-of-the-art results on both the perception and cognition evaluations.
### Perception Evaluation

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | Qwen-VL-Chat | Qwen-7B | 1487.57 |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |
### Cognition Evaluation

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | Qwen-VL-Chat | Qwen-7B | 360.71 |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |
### Full Metrics

```
=========== Perception ===========
total score: 1487.576330532213
existence score: 158.33333333333331
count score: 150.0
position score: 128.33333333333334
color score: 170.0
posters score: 178.57142857142856
celebrity score: 120.58823529411764
scene score: 152.25
landmark score: 164.0
artwork score: 125.5
OCR score: 140.0

=========== Cognition ===========
total score: 360.71428571428567
commonsense_reasoning score: 130.7142857142857
numerical_calculation score: 40.0
text_translation score: 147.5
code_reasoning score: 42.5
```
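As a sanity check, each total is simply the sum of its subtask scores. The snippet below reproduces the two totals from the per-subtask numbers above (the dictionaries just restate the values printed by the eval_tool).

```python
# Reproduce the MME totals above: each total is the sum of its subtask scores.
perception = {
    "existence": 158.33333333333331,
    "count": 150.0,
    "position": 128.33333333333334,
    "color": 170.0,
    "posters": 178.57142857142856,
    "celebrity": 120.58823529411764,
    "scene": 152.25,
    "landmark": 164.0,
    "artwork": 125.5,
    "OCR": 140.0,
}
cognition = {
    "commonsense_reasoning": 130.7142857142857,
    "numerical_calculation": 40.0,
    "text_translation": 147.5,
    "code_reasoning": 42.5,
}
print(f"Perception total: {sum(perception.values()):.2f}")  # 1487.58
print(f"Cognition total:  {sum(cognition.values()):.2f}")   # 360.71
```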
- Download the MME images and the eval_tool from the MME repo.
- Rearrange the images by running `python get_images.py`.
- Evaluate Qwen-VL-Chat by running `python eval.py`.
- Calculate the MME results by running `python calculation.py --results_dir Qwen-VL-Chat`. The calculation script comes from the MME eval_tool; a sketch of the scoring it performs is shown below.
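For reference, MME scores each subtask as accuracy plus accuracy+ (the per-image accuracy where both yes/no questions about an image must be answered correctly), each expressed as a percentage, so every subtask is worth up to 200 points. The sketch below is a toy illustration of that scoring, not the official script; the record layout (image id, ground truth, prediction tuples) is an assumption for illustration and need not match the format consumed by `calculation.py`.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """Toy re-implementation of MME subtask scoring (assumed, not the official script).

    Each record is (image_id, ground_truth, prediction) with yes/no answers;
    every MME image carries two such questions. The subtask score is
    acc + acc+ (both in percent), for a maximum of 200.
    """
    per_image = defaultdict(list)
    for image_id, gt, pred in records:
        per_image[image_id].append(pred.lower() == gt.lower())

    total_q = sum(len(v) for v in per_image.values())
    correct_q = sum(sum(v) for v in per_image.values())
    acc = 100.0 * correct_q / total_q  # question-level accuracy
    # acc+: an image counts only if both of its questions are answered correctly
    acc_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
    return acc + acc_plus

# Toy example: two images, two yes/no questions each.
records = [
    ("img_1", "yes", "yes"),
    ("img_1", "no",  "no"),   # both right -> img_1 counts toward acc+
    ("img_2", "yes", "yes"),
    ("img_2", "no",  "yes"),  # one wrong -> img_2 misses acc+
]
print(mme_subtask_score(records))  # acc 75.0 + acc+ 50.0 = 125.0
```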