MQT-LLaVA

Abstract

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes.

Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly sample m <= M and train the model using only the first m latent query tokens, discarding the rest.
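The token-selection step described above can be illustrated with a minimal sketch. It is not the released implementation: the helper names `sample_m` and `keep_first_m` are hypothetical, the uniform draw over 1..M is an assumption (the text only says m is chosen randomly), and the latent queries are stood in for by integer ids rather than learned embeddings.

```python
import random

M = 256  # maximum number of latent query tokens (the value quoted in the text)


def sample_m(max_tokens, rng):
    """Pick how many query tokens to use for one training step.

    Assumption: m is drawn uniformly from 1..max_tokens. The source only
    states that m <= M is selected at random.
    """
    return rng.randint(1, max_tokens)


def keep_first_m(query_tokens, m):
    """Return the first m query tokens and discard the rest."""
    if not 1 <= m <= len(query_tokens):
        raise ValueError("m must satisfy 1 <= m <= M")
    return query_tokens[:m]


# Stand-in for M learned query embeddings (integer ids for illustration only).
latent_queries = list(range(M))

# Training: a fresh m is sampled at every step, and only the first m queries
# would be passed on to the query transformer.
rng = random.Random(0)
for step in range(3):
    m = sample_m(M, rng)
    active_queries = keep_first_m(latent_queries, m)
    assert len(active_queries) == m

# Inference: the caller fixes m to fit the compute budget.
tokens_16 = keep_first_m(latent_queries, 16)
tokens_8 = keep_first_m(latent_queries, 8)

# The selections are nested: the 8-token set is a prefix of the 16-token set.
assert tokens_16[:8] == tokens_8
```

Because a smaller selection is always a prefix of a larger one, the token sets nest like Matryoshka dolls, which is what lets a single trained model serve any budget m up to M at inference time.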

Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.

Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively.

Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.

Main Results

Our model, MQT-LLaVA, matches LLaVA-1.5 performance on 11 benchmarks using only 256 visual tokens instead of 576. We achieve a 2x speed-up with 256 tokens and an 8x reduction in TFLOPs with 16 tokens, at a cost of only 2.4 points on MMBench compared to LLaVA-1.5.

Comparison with state-of-the-art methods on 11 vision-language benchmarks. Our model (MQT-LLaVA) with up to 256 tokens performs on par with or better than LLaVA-1.5 across the 11 benchmarks, outperforming it on 6 of them. MQT-LLaVA also outperforms the baseline QT-LLaVA, which is trained with a fixed 256 tokens, on 9 of the 11 benchmarks.

Visualization

Grad-CAM visualization of one randomly picked token when an image is encoded with 8, 16, 64, and 256 visual tokens, respectively. The model concentrates on high-level concepts when using fewer tokens and attends to low-level details with more tokens. The complete input for the third image is "List all the objects on the desk. The objects on the desk include a computer monitor, a keyboard, a mouse, a cell phone, and a pair of headphones".

Tasks robust to visual token reduction.

On several benchmarks that primarily target the visual perception skills of models, performance remains consistent as the number of visual tokens is gradually reduced, until a threshold is reached. Beyond this threshold, performance drops significantly. This "turning point" is observed in benchmarks such as MME Cognition, MME Perception, POPE, and MMMU.

BibTeX

@inproceedings{hu2024matryoshka,
  author    = {Hu, Wenbo and Dou, Zi-Yi and Li, Liunian Harold and Kamath, Amita and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {The 38th Conference on Neural Information Processing Systems (NeurIPS)},
  title     = {MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models},
  year      = {2024}
}

Matryoshka Query Transformer for Large Vision-Language Models

Our model employs a query transformer to encode images as visual tokens. During training, we randomly sample m and keep only the first m tokens; during inference, any number of tokens m up to M can be chosen, where M is the maximum number of initialized tokens.

Analysis on Different Tasks

Full visualization of how the number of visual tokens affects different tasks across 11 benchmarks.

More Analysis

When are fewer visual tokens better?
