BLIVA: Augmenting InstructBLIP for Image-Text Comprehension with Visual Assistant Integration

Work in Progress
*equal contribution
1University of California, San Diego, 2Coinbase Global, Inc.

Comparison of approaches for aligning the visual and language modalities. Both the (a) Flamingo and (b) BLIP-2 architectures employ a small, fixed number of query embeddings to extract visual information for the LLM. (c) LLaVA directly aligns the encoded patch embeddings with the LLM. (d) BLIVA combines learned query embeddings with additional encoded patch embeddings for enhanced visual understanding.

Abstract

Vision Language Models (VLMs), which extend Large Language Models (LLMs) with visual understanding capability, have demonstrated significant advancements on open-ended visual question-answering (VQA) tasks. However, these models reveal an evident shortfall when interpreting images infused with text, a common occurrence in real-world scenarios. A common practice for extracting image information is to learn a fixed set of query embeddings that encapsulate the image context, which are then treated as "foreign language tokens" within the LLM. Yet this process can hit a bottleneck due to the limited token count, curtailing the recognition of scenes with rich context, particularly those requiring Optical Character Recognition (OCR).

To address this challenge, we introduce BLIVA, an augmented version of InstructBLIP with a Visual Assistant: an end-to-end large multimodal model that retains the query embeddings from InstructBLIP while also directly projecting encoded patch embeddings into the LLM, inspired by LLaVA, thereby preserving intricate details that might otherwise be lost during the query decoding process.
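As a rough illustration of this design, here is a minimal PyTorch-style sketch, assuming a frozen image encoder and an InstructBLIP-style Q-Former; the module names, dimensions, and projection layers are ours for illustration, not the released implementation.

# Minimal sketch of how BLIVA-style visual tokens could be assembled.
# All names below are illustrative assumptions, not the exact released code.
import torch
import torch.nn as nn

class VisualAssistant(nn.Module):
    def __init__(self, vision_encoder, qformer, vit_dim, qformer_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder              # frozen image encoder (e.g., ViT)
        self.qformer = qformer                            # InstructBLIP-style Q-Former
        self.query_proj = nn.Linear(qformer_dim, llm_dim) # query embeddings -> LLM space
        self.patch_proj = nn.Linear(vit_dim, llm_dim)     # patch embeddings -> LLM space (LLaVA-style)

    def forward(self, image, instruction_tokens):
        patch_feats = self.vision_encoder(image)                     # (B, num_patches, vit_dim)
        query_feats = self.qformer(patch_feats, instruction_tokens)  # (B, num_queries, qformer_dim)
        # Concatenate the learned query embeddings with the directly projected
        # patch embeddings, so details missed by the fixed queries still reach the LLM.
        visual_tokens = torch.cat(
            [self.query_proj(query_feats), self.patch_proj(patch_feats)], dim=1
        )
        return visual_tokens  # prepended to the text embeddings fed into the LLM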

We initialized our model from InstructBLIP and used the same training data to demonstrate the effectiveness of our design. A typical two-stage training paradigm is further employed: the first stage establishes a global, caption-level alignment between visual features and the LLM, and the second stage applies instruction tuning so the model develops a detailed understanding of visual features as language.

Empirical evidence demonstrates that BLIVA notably improves performance on text-rich VQA tasks (up to 6.3% on the OCR-VQA benchmark) and on general VQA tasks requiring spatial reasoning (up to 7.9% on the Visual Spatial Reasoning benchmark). BLIVA exhibits considerable potential in decoding real-world images, regardless of the presence of text.

To showcase the wide-ranging industry applications made feasible by BLIVA, we evaluate the model on a new dataset of YouTube thumbnails with question-answer pairs spanning 13 diverse categories.


Two-Stage Training Paradigm

A typical multi-stage training paradigm, which we adopt following InstructBLIP. The first stage uses image-caption pairs to achieve a global alignment between the visual and language modalities, and the second stage uses instruction-tuning data to refine this alignment through detailed visual questions.
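A minimal sketch of this schedule is given below, assuming placeholder data loaders, a freeze_llm() helper, and a model wrapper that returns a language-modeling loss; the actual recipe follows InstructBLIP's setup and differs in its details.

# Hedged sketch of the two-stage schedule; loader names, freeze_llm(), and the
# model's loss interface are assumptions for illustration only.
def train_two_stage(model, caption_loader, instruction_loader, optimizer):
    # Stage 1: pre-train the vision-to-LLM projections on image-caption pairs
    # to establish a global alignment with the (frozen) LLM embedding space.
    model.freeze_llm()
    for image, caption in caption_loader:
        loss = model(image, prompt="", target=caption)  # next-token loss on the caption
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Stage 2: instruction tuning on visual question-answer data to refine the
    # alignment with detailed, question-specific visual understanding.
    for image, question, answer in instruction_loader:
        loss = model(image, prompt=question, target=answer)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model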


New Evaluation Dataset: YouTube Thumbnails Visual Question Answering (YTTB-VQA)

Two examples from our collected YouTube Thumbnails Visual Question Answering dataset, YTTB-VQA. We envision two usage scenarios for BLIVA: (1) Detailed Captions. BLIVA can give a detailed caption describing all the visual information in the image. (2) Short Captions + VQA. BLIVA can also summarize the visual information into a short caption, and then answer users' visual questions about the image for more detailed information.
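To make the two interaction modes concrete, a hypothetical inference snippet is shown below; the bliva.generate interface is an assumption for illustration, not the released API.

# Scenario (1): one detailed caption covering all visual information.
detailed = bliva.generate(image, "Describe this thumbnail in detail.")

# Scenario (2): a short caption first, then follow-up visual questions.
short = bliva.generate(image, "Give a one-sentence caption for this thumbnail.")
answer = bliva.generate(image, "What text is shown in the thumbnail?")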


Our YouTube thumbnail dataset, collected from the official YouTube website. This chart illustrates the distribution of its 13 categories: technology, shopping, sports, entertainment, business companies, transportation, food, movie, astronomy, history, music, geography, and academic sharing.


More Examples

We use real-life scene images, movie posters, webpages, and memes to demonstrate our model's performance when interacting with humans over text-rich images. BLIVA shows strong OCR ability in reading road signs, food packaging, movie posters, webpages, and the text in memes. BLIVA understands the visual information and can clearly localize text and objects in the images. Unlike InstructBLIP, BLIVA's replies are strictly grounded in the visual content, without hallucination. Beyond reading the text, BLIVA demonstrates its understanding of the meaning of memes by combining the textual and visual information.


BibTeX

@misc{BLIVA2023,
    title={BLIVA: Augmenting InstructBLIP for Image-Text Comprehension with Visual Assistant Integration},
    author={Wenbo Hu and Yifan Xu and Yi Li and Weiyue Li and Zeyuan Chen and Zhuowen Tu},
    year={2023},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}