Overall evaluation results for object, attribute, and relation hallucination in VALOR-Eval, using GPT-4 as the LLM agent. F = Faithfulness, C = Coverage. The best score in each column is shown in **bold**, and the worst in *italics*.
| Model | Source | Average Faithfulness | Average Coverage | Object Existence (F) | Object Existence (C) | Object Attribute (F) | Object Attribute (C) | People Attribute (F) | People Attribute (C) | Positional Relation (F) | Positional Relation (C) | Comparative Relation (F) | Comparative Relation (C) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InstructBLIP (Vicuna-7B) | Link | 62.1 | 21.4 | 74.5 | 24.8 | 72.0 | 23.9 | 47.1 | 9.3 | 50.0 | 13.6 | 66.9 | 35.6 |
| LLaVA-1.5 (Vicuna-13B) | Link | 61.3 | 25.9 | 72.1 | 24.7 | 74.6 | 37.8 | 43.3 | 12.1 | 64.8 | 14.9 | 51.9 | **40.1** |
| MiniGPT-4 v2 (LLaMA-2-7B) | Link | *50.4* | 19.8 | 65.0 | 25.4 | *64.5* | 17.9 | 38.9 | 11.6 | *38.8* | **33.1** | 44.7 | *11.2* |
| mPLUG-Owl2 (LLaMA-2-7B) | Link | 55.6 | 23.0 | 71.5 | 24.8 | **79.9** | 32.7 | 39.7 | 16.2 | 45.2 | 10.8 | *41.6* | 30.6 |
| BLIVA (Vicuna-7B) | Link | 59.2 | 19.5 | 77.7 | 21.9 | 73.3 | 24.3 | 37.6 | 11.6 | 39.5 | 9.7 | 68.0 | 29.9 |
| CogVLM (Vicuna-7B) | Link | 58.2 | 25.7 | 71.2 | 35.5 | 75.3 | 24.3 | 43.7 | 22.4 | 51.9 | 10.5 | 49.0 | 35.9 |
| InternLM-XComposer2 | Link | 67.1 | 22.7 | 82.5 | 23.9 | 75.8 | 26.3 | 50.4 | 13.8 | 62.6 | 11.1 | 64.1 | 38.4 |
| Qwen-VL-Chat | Link | 58.7 | 23.2 | 70.6 | 28.4 | 75.1 | **38.6** | 38.8 | 16.0 | 56.9 | 8.5 | 51.9 | 24.3 |
| Emu2 | Link | **75.0** | *8.1* | **94.2** | *14.1* | 66.7 | *10.4* | **54.3** | *1.9* | **72.2** | *1.8* | **87.5** | 12.3 |
| GPT-4V | Link | 54.6 | **28.0** | *61.6* | **38.8** | 78.5 | 36.3 | *34.7* | **23.8** | 46.7 | 12.6 | 51.6\* | 28.5\* |
\*: For images containing people, GPT-4V refrains from generating comments, so the affected scores are marked with an asterisk.
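For reference, the two Average columns are simply the unweighted mean of the five per-category scores in each row, which you can verify directly from the table. Below is a minimal Python sketch using the Emu2 row as an example; the dictionary keys and the `average` helper are illustrative names, not identifiers from the VALOR-Eval codebase:

```python
# Per-category scores copied from the Emu2 row of the table above.
faithfulness = {
    "object_existence": 94.2,
    "object_attribute": 66.7,
    "people_attribute": 54.3,
    "positional_relation": 72.2,
    "comparative_relation": 87.5,
}
coverage = {
    "object_existence": 14.1,
    "object_attribute": 10.4,
    "people_attribute": 1.9,
    "positional_relation": 1.8,
    "comparative_relation": 12.3,
}

def average(scores: dict[str, float]) -> float:
    """Unweighted mean over the five hallucination categories."""
    return round(sum(scores.values()) / len(scores), 1)

print(average(faithfulness))  # 75.0, matches "Average Faithfulness" for Emu2
print(average(coverage))      # 8.1, matches "Average Coverage" for Emu2
```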
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.