Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes.
Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m tokens, discarding the rest.
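The prefix-selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `cross_attend` and `matryoshka_step` are hypothetical, the query transformer is reduced to a single dot-product cross-attention with no learned projections, and m is assumed to be drawn uniformly from 1..M (the text only says m is chosen randomly with m <= M).

```python
import math
import random


def cross_attend(queries, visual_embeddings):
    """Single-head dot-product cross-attention: one output token per query.

    Simplification (assumption): no learned projections and no
    self-attention among the queries.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every visual embedding.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in visual_embeddings
        ]
        # Numerically stable softmax over the visual embeddings.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # The output token is the attention-weighted average of the embeddings.
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, visual_embeddings))
            for d in range(dim)
        ])
    return outputs


def matryoshka_step(latent_queries, visual_embeddings, rng):
    """Sample m <= M and compress the image using only the first m queries."""
    m = rng.randint(1, len(latent_queries))  # assumption: uniform over 1..M
    kept = latent_queries[:m]                # nested prefix; the rest is discarded
    return cross_attend(kept, visual_embeddings)


# Toy sizes for illustration only: M = 8 queries, 20 visual embeddings, dim 4.
rng = random.Random(0)
latent_queries = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
visual_embeddings = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]
visual_tokens = matryoshka_step(latent_queries, visual_embeddings, rng)
print(len(visual_tokens), "visual tokens would be passed to the language model")
```

Because the kept queries are always a prefix, the same set of M queries can serve any token budget at inference time by slicing `latent_queries[:m]` for the desired m.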
Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.
Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively.
Our exploration of the trade-off between accuracy and computational cost as the number of visual tokens varies can help future research achieve the best of both worlds.
Our model, MQT-LLaVA, matches LLaVA-1.5 performance on 11 benchmarks using only 256 visual tokens instead of 576. We achieve a 2x speed-up with 256 tokens and an 8x speed-up in TFLOPs using 16 tokens, with only a 2.4-point performance drop compared to LLaVA-1.5 on MMBench.
Comparison with state-of-the-art methods on 11 vision-language benchmarks. Our model (MQT-LLaVA) with up to 256 tokens performs on par with or better than LLaVA-1.5 across the 11 benchmarks, outperforming it on 6 of them. MQT-LLaVA also outperforms the baseline QT-LLaVA, which is trained with a fixed 256 tokens, on 9 of the 11 benchmarks.
Grad-CAM visualization of one randomly picked token when using 8, 16, 64, and 256 visual tokens, respectively, to encode an image. The model concentrates on high-level concepts when using fewer tokens and delves into low-level details with more tokens. The complete input to the third image is "List all the objects on the desk. The objects on the desk include a computer monitor, a keyboard, a mouse, a cell phone, and a pair of headphones".
On several benchmarks that primarily target the visual perception skills of models, performance remains consistent as the number of visual tokens is gradually reduced until a threshold is reached. Beyond this threshold, performance drops significantly. This "turning point" is observed in benchmarks such as MME Cognition, MME Perception, POPE, and MMMU.
We show examples from MME Cognition. Tasks involving commonsense reasoning, code reasoning, and numerical calculation can be performed effectively with as few as 16 visual tokens, which allow the model to focus on the relevant image sections.
We observe that MQT-LLaVA with 16 tokens can achieve better performance on ScienceQA than MQT-LLaVA with 144 tokens. To understand why fewer tokens may benefit this task, we qualitatively analyze instances where MQT-LLaVA succeeded with 16 visual tokens but failed with 144. We show a representative example. MQT-LLaVA with 16 visual tokens attends to all three objects, allowing it to understand their mutual relationship and answer the question correctly.
@inproceedings{hu2024matryoshka,
author = {Hu, Wenbo and Dou, Zi-Yi and Li, Liunian Harold and Kamath, Amita and Peng, Nanyun and Chang, Kai-Wei},
booktitle = {The 38th Conference on Neural Information Processing Systems (NeurIPS)},
title = {MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models},
year = {2024}
}