Matryoshka Query Transformer for Large Vision-Language Models

University of California, Los Angeles
Our model employs a query transformer to encode images as visual tokens. During training, we randomly sample a number m and keep only the first m tokens; at inference, any m up to M can be chosen, where M is the maximum number of initialized tokens.

Abstract

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes.

Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m tokens, discarding the rest.
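The token selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `sample_num_tokens` and `keep_first_m` are hypothetical, the uniform sampling of m is an assumption (the text only says m is chosen randomly), and plain Python lists stand in for the learned latent query embeddings.

```python
import random

M = 256  # maximum number of latent query tokens (MQT-LLaVA uses up to 256)


def sample_num_tokens(max_tokens, rng):
    """Draw m from 1..max_tokens. Uniform sampling is an assumption here."""
    return rng.randint(1, max_tokens)


def keep_first_m(query_tokens, m):
    """Keep the first m latent query tokens and discard the rest."""
    if not 1 <= m <= len(query_tokens):
        raise ValueError("m must satisfy 1 <= m <= M")
    return query_tokens[:m]


# Stand-in for the M learned latent queries (hypothetical 2-dim vectors).
latent_queries = [[float(i), float(i) + 0.5] for i in range(M)]

# Training: a fresh m is drawn at every step, and only the first m queries
# would be passed on to the query transformer and the language model.
rng = random.Random(0)
m_train = sample_num_tokens(M, rng)
train_tokens = keep_first_m(latent_queries, m_train)

# Inference: the caller picks any budget m <= M, for example 16.
infer_tokens = keep_first_m(latent_queries, 16)
```

Because selection always takes a prefix, the tokens used for a small budget are contained in those used for any larger budget, which is the nested "Matryoshka" structure the method's name refers to.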

Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.

Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points of performance on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively.

Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.

Main Results

Our model, MQT-LLaVA, matches LLaVA-1.5 performance on 11 benchmarks using only 256 visual tokens instead of 576. We achieve a 2x speed-up with 256 tokens and an 8x speed-up in TFLOPs using 16 tokens, with only a 2.4-point performance drop compared to LLaVA-1.5 on MMBench.

Comparison with state-of-the-art methods on 11 vision-language benchmarks. Our model (MQT-LLaVA) with up to 256 tokens performs on par with or better than LLaVA-1.5 across 11 benchmarks, outperforming it on 6 of the 11. MQT-LLaVA also outperforms the baseline QT-LLaVA, which is trained with a fixed 256 tokens, on 9 of the 11 benchmarks.

Visualization

Grad-CAM visualization of one randomly picked token when the image is encoded with 8, 16, 64, and 256 visual tokens, respectively. The model concentrates on high-level concepts when using fewer tokens and delves into low-level details with more tokens. The complete input for the third image is "List all the objects on the desk. The objects on the desk include a computer monitor, a keyboard, a mouse, a cell phone, and a pair of headphones".

Analysis on Different Tasks

Tasks robust to visual token reduction.

For several benchmarks that primarily target the visual perception skills of models, performance remains consistent as the number of visual tokens is gradually reduced, until a threshold is reached. Beyond this threshold, performance drops significantly. This "turning point" is observed in benchmarks such as MME Cognition, MME Perception, POPE, and MMMU.

Full visualization of how the number of visual tokens affects different tasks across 11 benchmarks.

More Analysis

When are fewer visual tokens better?

We show examples from MME-Cognition. Tasks involving commonsense reasoning, code reasoning, and numerical calculation can be performed effectively with as few as 16 visual tokens, allowing the model to focus on the relevant image sections.

We observe that MQT-LLaVA with 16 tokens can achieve better performance on ScienceQA than MQT-LLaVA with 144 tokens. To understand why fewer tokens may benefit this task, we qualitatively analyze instances where MQT-LLaVA succeeded with 16 visual tokens but failed with 144. In a representative example, MQT-LLaVA with 16 visual tokens attends to all three objects, allowing it to understand their mutual relationship and answer the question correctly.

BibTeX

@misc{hu2024matryoshka,
      title={Matryoshka Query Transformer for Large Vision-Language Models}, 
      author={Wenbo Hu and Zi-Yi Dou and Liunian Harold Li and Amita Kamath and Nanyun Peng and Kai-Wei Chang},
      year={2024},
      eprint={2405.19315},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}