VALOR-EVAL

Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

ACL 2024 (Findings)
*Equal Contribution, listed in alphabetical order by first name.
University of California, Los Angeles

An example of hallucination in the open-vocabulary generation task of LVLMs. Our proposed framework identifies objects, attributes, and relations in the generated captions and provides a more comprehensive evaluation of faithfulness and coverage. We highlight hallucinated features and uncovered features.

Introduction

Large Vision-Language Models (LVLMs) suffer from hallucination problems, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability.

A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness.

To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose an LLM-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. We evaluate 10 established LVLMs within our framework, and the results demonstrate that our evaluation is more comprehensive and better correlated with human judgments than existing work. Through this work, we highlight the critical balance between the faithfulness and coverage of model outputs, and we hope to encourage future progress on addressing hallucinations in LVLMs while keeping their outputs informative.
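For reference, the object-level CHAIR metric that our framework generalizes counts the fraction of mentioned objects that are hallucinated. A natural feature-level reading of faithfulness (F) and coverage (C) in terms of the matched-feature dictionaries described below is the following sketch (our notation, not necessarily the paper's exact formulation):

\[
\mathrm{CHAIR}_i = \frac{|\{\text{hallucinated objects}\}|}{|\{\text{all objects mentioned}\}|},
\qquad
F = \frac{|M|}{|G|},
\qquad
C = \frac{|M|}{|T|},
\]

where \(G\) is the set of features extracted from a generated caption, \(T\) is the annotated ground-truth feature set, and \(M \subseteq G\) is the subset of extracted features matched to \(T\).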

VALOR-Eval Framework


Overview of our proposed evaluation framework VALOR-Eval: (1) First, LVLMs generate captions from benchmark images. (2) Next, LLMs are employed to extract the pivotal features encapsulated in the generated descriptions. (3) These features are then aligned with a pre-defined list of ground-truth features using LLMs, producing two essential outputs: a dictionary of matched features and a more extensive dictionary encompassing broader conceptual matches. (4) Finally, we calculate two key metrics, faithfulness and coverage, which measure the LVLMs' comprehension by evaluating how well the generated captions encapsulate the salient features of the images and the breadth of concepts they cover, respectively.
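To make the four stages concrete, here is a minimal Python sketch of the pipeline. The LLM-backed stages (feature extraction and matching) are stubbed with trivial placeholders, and all function names and the matched-feature dictionary format are illustrative assumptions rather than the released implementation:

def extract_features(caption):
    """Stage (2): an LLM would extract objects, attributes, and relations
    from the generated caption; stubbed here with naive tokenization."""
    return [w.strip(".,") for w in caption.lower().split()]

def match_features(generated, ground_truth):
    """Stage (3): an LLM would align generated features with the pre-defined
    ground-truth list, allowing broader conceptual matches (e.g., synonyms);
    stubbed here with exact string matching."""
    gt = set(ground_truth)
    return {f: f for f in generated if f in gt}

def faithfulness_and_coverage(caption, ground_truth):
    """Stage (4): faithfulness = fraction of generated features matched;
    coverage = fraction of ground-truth features matched."""
    generated = extract_features(caption)              # Stage (2)
    matched = match_features(generated, ground_truth)  # Stage (3)
    faithfulness = len(matched) / max(len(generated), 1)
    coverage = len(set(matched.values())) / max(len(ground_truth), 1)
    return faithfulness, coverage

# Toy usage with a hypothetical caption and ground-truth feature list.
print(faithfulness_and_coverage("a red bus on the street",
                                ["bus", "street", "red", "tree"]))  # (0.5, 0.75)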

VALOR Benchmark

The overall evaluation results for object, attribute, and relation hallucination in VALOR-Eval, using GPT-4 as the LLM agent. The best score in each column is highlighted in blue and the worst in yellow.

| Model | Source | Average (F / C) | Object Existence (F / C) | Object Attribute (F / C) | People Attribute (F / C) | Positional Relation (F / C) | Comparative Relation (F / C) |
|---|---|---|---|---|---|---|---|
| InstructBLIP (Vicuna-7B) | Link | 62.1 / 21.4 | 74.5 / 24.8 | 72.0 / 23.9 | 47.1 / 9.3 | 50.0 / 13.6 | 66.9 / 35.6 |
| LLaVA-1.5 (Vicuna-13B) | Link | 61.3 / 25.9 | 72.1 / 24.7 | 74.6 / 37.8 | 43.3 / 12.1 | 64.8 / 14.9 | 51.9 / 40.1 |
| MiniGPT-4 v2 (LLaMA-2-7B) | Link | 50.4 / 19.8 | 65.0 / 25.4 | 64.5 / 17.9 | 38.9 / 11.6 | 38.8 / 33.1 | 44.7 / 11.2 |
| mPLUG-Owl2 (LLaMA-2-7B) | Link | 55.6 / 23.0 | 71.5 / 24.8 | 79.9 / 32.7 | 39.7 / 16.2 | 45.2 / 10.8 | 41.6 / 30.6 |
| BLIVA (Vicuna-7B) | Link | 59.2 / 19.5 | 77.7 / 21.9 | 73.3 / 24.3 | 37.6 / 11.6 | 39.5 / 9.7 | 68.0 / 29.9 |
| CogVLM (Vicuna-7B) | Link | 58.2 / 25.7 | 71.2 / 35.5 | 75.3 / 24.3 | 43.7 / 22.4 | 51.9 / 10.5 | 49.0 / 35.9 |
| InternLM-XComposer2 | Link | 67.1 / 22.7 | 82.5 / 23.9 | 75.8 / 26.3 | 50.4 / 13.8 | 62.6 / 11.1 | 64.1 / 38.4 |
| Qwen-VL-Chat | Link | 58.7 / 23.2 | 70.6 / 28.4 | 75.1 / 38.6 | 38.8 / 16.0 | 56.9 / 8.5 | 51.9 / 24.3 |
| Emu2 | Link | 75.0 / 8.1 | 94.2 / 14.1 | 66.7 / 10.4 | 54.3 / 1.9 | 72.2 / 1.8 | 87.5 / 12.3 |
| GPT-4V | Link | 54.6 / 28.0 | 61.6 / 38.8 | 78.5 / 36.3 | 34.7 / 23.8 | 46.7 / 12.6 | 51.6* / 28.5* |
Each cell reports F / C, where F denotes the Faithfulness score and C the Coverage score; both are percentages (%).
*: For images that contain people, GPT-4V refrains from generating captions; the affected scores are marked with an asterisk.
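As a sanity check on the table layout, the two average columns are consistent with unweighted means of the five per-dimension scores. A quick illustrative computation for the InstructBLIP row (not part of any released code):

# Unweighted means of the five per-dimension scores reproduce the
# reported averages for InstructBLIP (Vicuna-7B).
faithfulness = [74.5, 72.0, 47.1, 50.0, 66.9]
coverage = [24.8, 23.9, 9.3, 13.6, 35.6]
print(round(sum(faithfulness) / len(faithfulness), 1))  # 62.1
print(round(sum(coverage) / len(coverage), 1))          # 21.4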

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Dataset Collection of VALOR


Overview of the collection procedure for our proposed benchmark VALOR: (1) Co-occurrence statistics calculation: we employ two statistical measures (frequencies and conditional probabilities) to determine co-occurring features. In short, we use these statistics to select images that contain object A from a strongly associated pair (A, B) while B is absent, leading the model to hallucinate B. Our statistics cover three co-occurrence categories: object-object, object-attribute, and object-relation-object. (2) Image extraction: next, we leverage the identified co-occurrence statistics to systematically extract images from existing datasets. We first identify the objects O that exhibit the most pronounced co-occurrence dependencies. We then select the features minimally associated with each object in O, denoted as set I, thereby spotlighting instances where common co-occurrences are absent, and the features most frequently co-occurring with each object in O, denoted as set H, representing strong associative tendencies. Finally, we collect images for each feature in I corresponding to an object in O, with each chosen image including the specified feature and object yet excluding any feature from H, creating clear cases for testing a model's associative bias (see the sketch below). (3) Human annotation: finally, we manually annotate each image within the distinct feature subsets. The figure provides an example of how we use the co-occurrence statistics to select images for the object subsets and add human annotations for the later evaluation.
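Below is a minimal Python sketch of steps (1) and (2), assuming each image comes annotated with a set of features (objects only, for brevity); the selection threshold, toy data, and function names are hypothetical:

from collections import Counter
from itertools import permutations

def cooccurrence_stats(images):
    """Step (1): feature frequencies and conditional probabilities P(B | A)."""
    freq, pair = Counter(), Counter()
    for feats in images:
        freq.update(feats)
        pair.update(permutations(feats, 2))  # ordered pairs (A, B)
    return {(a, b): pair[(a, b)] / freq[a] for (a, b) in pair}

def select_images(images, cond, threshold=0.6):
    """Step (2): keep images that contain A but not its strong associate B,
    creating clear cases for probing a model's associative bias."""
    strong = [(a, b) for (a, b), p in cond.items() if p >= threshold]
    selected = []
    for feats in images:
        for a, b in strong:
            if a in feats and b not in feats:
                selected.append((feats, a, b))  # the model may hallucinate B
    return selected

# Toy data: surfboards usually co-occur with waves here, so an image with a
# surfboard but no wave is a good probe.
imgs = [{"surfboard", "wave"}, {"surfboard", "wave"}, {"surfboard", "beach"}]
print(select_images(imgs, cooccurrence_stats(imgs)))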

Qualitative Examples

Here are evaluation examples from three representative models on our benchmark VALOR. Text in red indicates model hallucinations.


BibTeX

@inproceedings{qiu-etal-2024-valor,
    title = "{VALOR}-{EVAL}: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models",
    author = "Qiu, Haoyi  and
      Hu, Wenbo  and
      Dou, Zi-Yi  and
      Peng, Nanyun",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.105",
    doi = "10.18653/v1/2024.findings-acl.105",
    pages = "1783--1805"}