
Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models reveal an evident shortfall when interpreting images infused with text, a common occurrence in real-world scenarios. Common practices for extracting image information typically revolve around learning a fixed set of query embeddings that encapsulate image contexts, which are subsequently treated as ``foreign language tokens'' within LLMs. Yet, this process may hit a bottleneck due to the limited token count, potentially curtailing the recognition of scenes with rich context, particularly those requiring Optical Character Recognition (OCR).
To address this challenge, we introduce BLIVA, an augmented version of InstructBLIP with Visual Assistant: an end-to-end large multimodal model that retains the query embeddings from InstructBLIP while also directly projecting encoded patch embeddings into the LLM, inspired by LLaVA, thereby enriching the model with intricate details that might otherwise be lost during the query decoding process.
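To make the architectural idea concrete, below is a minimal PyTorch sketch of how the two visual token streams could be assembled before being handed to the LLM. The module names, dimensions, and class structure are illustrative assumptions for exposition, not the released BLIVA implementation.

import torch
import torch.nn as nn

class BLIVAStyleConnector(nn.Module):
    # Combines the two visual token streams fed to the LLM:
    # (1) Q-Former query embeddings, as in InstructBLIP, and
    # (2) directly projected patch embeddings, as in LLaVA.
    def __init__(self, qformer_dim=768, patch_dim=1408, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(qformer_dim, llm_dim)  # query-embedding branch
        self.patch_proj = nn.Linear(patch_dim, llm_dim)    # patch-embedding branch

    def forward(self, query_embeds, patch_embeds):
        # query_embeds: (B, num_queries, qformer_dim) from the Q-Former
        # patch_embeds: (B, num_patches, patch_dim) from the vision encoder
        q = self.query_proj(query_embeds)
        p = self.patch_proj(patch_embeds)
        # Both streams are concatenated and prepended to the text tokens,
        # so the LLM consumes them as "foreign language" tokens.
        return torch.cat([q, p], dim=1)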
We initialize our model from InstructBLIP and employ the same training data to demonstrate the effectiveness of our design. A typical two-stage training paradigm is further adopted: the first stage introduces a global, description-level alignment between visual features and the LLM, and the second stage applies instruction tuning so that visual features are understood in detail, as if they were language.
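As a rough illustration of this two-stage schedule, the following Python sketch toggles which modules are trainable at each stage. Which components are frozen when is an assumption made for illustration (vision encoder and LLM frozen throughout, projection layers trained in stage 1, the Q-Former additionally tuned in stage 2), not the exact recipe; the attribute names on `model` are hypothetical.

def set_trainable(module, flag):
    # Enable or disable gradients for every parameter of a submodule.
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Assumption: the vision encoder and LLM stay frozen in both stages.
    set_trainable(model.vision_encoder, False)
    set_trainable(model.llm, False)
    if stage == 1:
        # Stage 1: global alignment on image-caption pairs; train only the projections.
        set_trainable(model.connector, True)
        set_trainable(model.qformer, False)
    else:
        # Stage 2: instruction tuning on detailed visual questions; also adapt the Q-Former.
        set_trainable(model.connector, True)
        set_trainable(model.qformer, True)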
Empirical evidence demonstrates that our model, BLIVA, notably improves performance on text-rich VQA tasks (up to 6.3\% on the OCR-VQA benchmark) and on typical VQA tasks requiring spatial reasoning (up to 7.9\% on the Visual Spatial Reasoning benchmark). BLIVA exhibits considerable potential in decoding real-world images, regardless of the presence of text.
To showcase the wide-ranging industry applications made feasible by BLIVA, we evaluate the model on a new dataset of YouTube thumbnails with question-answer pairs spanning a diverse spectrum of 13 categories.
The typical multi-stage training paradigm, which we adopt following InstructBLIP. The first stage uses image-caption pairs to achieve a global alignment between the visual and language modalities; the second stage uses instruction-tuning data to strengthen this alignment through detailed visual questions.
Two examples from our collected YouTube Thumbnails Visual Question Answering Dataset, \textbf{YTTB-VQA}. We consider two usage scenarios for BLIVA: (1) Detailed Captions: BLIVA gives a detailed caption describing all the visual information in the image. (2) Short Captions + VQA: BLIVA summarizes the visual information into a short caption, and can then answer users' visual questions about the image for more detailed information.
Our YouTube thumbnails dataset, collected from the official YouTube website. This chart illustrates the distribution of its 13 categories: technology, shopping, sports, entertainment, business companies, transportation, food, movie, astronomy, history, music, geography, and academic sharing.
We use real-life scene images, movie posters, webpages, and memes to demonstrate our model's performance when interacting with humans over text-rich images. BLIVA shows strong OCR ability in reading road signs, food packaging, movie posters, webpages, and the text in memes. BLIVA understands the visual information and can clearly localize the text and objects in the images. BLIVA's replies are strictly grounded in the visual content, without hallucinating as InstructBLIP does. Beyond reading text, BLIVA demonstrates an understanding of a meme's meaning by combining the textual and visual information.
@misc{BLIVA2023,
title={BLIVA: Augmenting InstructBLIP for Image-Text Comprehension with Visual Assistant Integration},
author={Wenbo Hu and Yifan Xu and Yi Li and Weiyue Li and Zeyuan Chen and Zhuowen Tu},
year={2023},
archivePrefix={arXiv},
primaryClass={cs.CV}
}