# MME Benchmark

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
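For reference, those 14 subtasks split into 10 perception and 4 cognition subtasks, matching the grouping used in the Full Metrics section below. A minimal sketch of that split, using the subtask names as they appear in the eval_tool output:

```python
# Grouping of the 14 MME subtasks, as reflected in the Full Metrics section below.
MME_SUBTASKS = {
    "perception": [
        "existence", "count", "position", "color", "posters",
        "celebrity", "scene", "landmark", "artwork", "OCR",
    ],
    "cognition": [
        "commonsense_reasoning", "numerical_calculation",
        "text_translation", "code_reasoning",
    ],
}
```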

Qwen-VL-Chat achieves SOTA results on both the perception and cognition evaluations.

## Perception Evaluation

| Rank | Model | Version | Score |
|:----:|:-----:|:-------:|:-----:|
| 1 | Qwen-VL-Chat | Qwen-7B | 1487.57 |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |

## Cognition Evaluation

| Rank | Model | Version | Score |
|:----:|:-----:|:-------:|:-----:|
| 1 | Qwen-VL-Chat | Qwen-7B | 360.71 |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |

## Full Metrics

```
=========== Perception ===========
total score: 1487.576330532213

         existence  score: 158.33333333333331
         count  score: 150.0
         position  score: 128.33333333333334
         color  score: 170.0
         posters  score: 178.57142857142856
         celebrity  score: 120.58823529411764
         scene  score: 152.25
         landmark  score: 164.0
         artwork  score: 125.5
         OCR  score: 140.0


=========== Cognition ===========
total score: 360.71428571428567

         commonsense_reasoning  score: 130.7142857142857
         numerical_calculation  score: 40.0
         text_translation  score: 147.5
         code_reasoning  score: 42.5
```
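
Each category total above is simply the sum of its per-subtask scores. A quick sanity check, with the values copied from the output above:

```python
# Per-subtask scores copied from the Full Metrics output above.
perception_scores = {
    "existence": 158.33333333333331,
    "count": 150.0,
    "position": 128.33333333333334,
    "color": 170.0,
    "posters": 178.57142857142856,
    "celebrity": 120.58823529411764,
    "scene": 152.25,
    "landmark": 164.0,
    "artwork": 125.5,
    "OCR": 140.0,
}
cognition_scores = {
    "commonsense_reasoning": 130.7142857142857,
    "numerical_calculation": 40.0,
    "text_translation": 147.5,
    "code_reasoning": 42.5,
}

# Rounded to two decimals to match the leaderboard tables above.
print(round(sum(perception_scores.values()), 2))  # 1487.58
print(round(sum(cognition_scores.values()), 2))   # 360.71
```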

## How To Reproduce Results of MME Benchmark

1. Download the MME images and the `eval_tool` from the MME repo.
2. Rearrange the images by executing `python get_images.py`.
3. Evaluate Qwen-VL-Chat results by executing `python eval.py`.
4. Calculate MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`; the calculation script comes from the MME `eval_tool`.
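
After step 1 is done manually, steps 2-4 can be chained with a small wrapper like the sketch below. The flat directory layout is an assumption; in practice `calculation.py` lives inside the MME `eval_tool`, so adjust the paths to your checkout.

```python
# Hypothetical wrapper chaining the reproduction steps above.
# Assumes get_images.py, eval.py, and the eval_tool's calculation.py are all
# reachable from the current working directory (adjust paths as needed).
import subprocess

steps = [
    ["python", "get_images.py"],                                    # step 2: rearrange MME images
    ["python", "eval.py"],                                          # step 3: run Qwen-VL-Chat on MME
    ["python", "calculation.py", "--results_dir", "Qwen-VL-Chat"],  # step 4: compute MME scores
]

for cmd in steps:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop immediately if a step fails
```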