OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

1University of California, Los Angeles (UCLA)
Performance Improvements over Qwen3-VL-Instruct-8B

Performance improvement (relative) of OpenVLThinkerV2 over its baseline Qwen3-VL-Instruct-8B across diverse visual tasks. Our model establishes new state-of-the-art results among open-source models.

Abstract

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities.

To address these issues, we introduce Gaussian GRPO (G2RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, G2RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G2RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.

Gaussian GRPO (G2RPO)

Comparison of advantage formulations

Overcoming Statistical Fragility: Standard multi-task normalization relies on linear transformations, leaving optimization vulnerable to structural pathologies like heavy-tail outliers. G2RPO abandons scalar standardization in favor of non-linear distributional matching via 1D Optimal Transport. By mapping the relative rank of responses directly to the inverse CDF of a standard normal distribution N(0,1), G2RPO mathematically caps outliers, smooths bimodal step-functions into balanced tails, and ensures inter-task gradient equity.
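The rank-to-quantile mapping can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the reference implementation; the tie-averaging and the (rank − 0.5)/n offset are assumptions we make so that the quantiles stay finite and identical rewards receive identical advantages.

```python
from statistics import NormalDist


def g2rpo_advantages(rewards):
    """Sketch of the G2RPO advantage: map each response's relative rank
    within its group to the inverse CDF (quantile) of N(0, 1)."""
    n = len(rewards)
    norm = NormalDist()  # standard normal N(0, 1)
    # Rank responses by reward, averaging ranks over ties so that
    # identical rewards get identical advantages.
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Map rank -> N(0,1) quantile; the 0.5 offset keeps p strictly in (0, 1),
    # so the advantage is bounded no matter how extreme the raw reward is.
    return [norm.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because only ranks enter the mapping, a single heavy-tail reward (say 1000 against a group of zeros) receives the same bounded advantage as any other top-ranked response, which is precisely the outlier-capping property described above.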

Task-Level Length & Entropy Shaping

Length and Entropy dynamics during training

To balance fine-grained perception and multi-step reasoning, we introduce task-level shaping:

1. Response Length Shaping: We apply a customized trapezoidal reward envelope. It scales up reasoning length for complex, reasoning-centric tasks (like Math VQA) while reducing overthinking on visual-centric tasks (like Grounding), effectively mitigating hallucinations.
2. Entropy Shaping: Reasoning tasks are prone to entropy explosion, while vision-centric tasks are susceptible to entropy collapse. We impose a strict margin-based penalty to bound the model's exploration within an optimal, task-specific zone, stabilizing training across diverse domains.
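A minimal sketch of the two shapers follows. The function names, the linear ramps, and the per-task band parameters (`lo`, `hi`, `target_lo`, `target_hi`) are illustrative assumptions, not the paper's exact formulation; the point is the shape of each signal.

```python
def length_reward(length, lo, hi, ramp):
    """Trapezoidal reward envelope over response length (a sketch).

    Full reward inside the per-task band [lo, hi]; linearly ramps down
    to 0 over `ramp` tokens on either side. A reasoning-centric task
    would use a wide, high band; a visual-centric task a narrow, low one.
    """
    if lo <= length <= hi:
        return 1.0
    if length < lo:  # too short: penalize truncated reasoning
        return max(0.0, 1.0 - (lo - length) / ramp)
    return max(0.0, 1.0 - (length - hi) / ramp)  # too long: overthinking


def entropy_penalty(entropy, target_lo, target_hi, weight=1.0):
    """Margin-based entropy penalty (a sketch): zero inside the
    task-specific band [target_lo, target_hi], growing linearly outside
    it, discouraging both entropy collapse and entropy explosion."""
    if entropy < target_lo:  # collapse: exploration has died out
        return weight * (target_lo - entropy)
    if entropy > target_hi:  # explosion: outputs becoming erratic
        return weight * (entropy - target_hi)
    return 0.0
```

Setting the bands per task rather than globally is what lets one training run serve both regimes: the same objective rewards long chains on math while keeping grounding outputs short and low-entropy.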

Main Results: Visual & Spatial Reasoning

Visual and Spatial Reasoning Results

OpenVLThinkerV2 establishes new state-of-the-art open-source results across a diverse suite of tasks encompassing general scientific knowledge, mathematics, chart understanding, and complex multimodal reasoning. Notably, OpenVLThinkerV2 reaches 71.6% on MMMU, 88.2% on MMBench, and 73.8% on MMStar, surpassing GPT-4o; on ChartQA it scores 87.4%, exceeding Gemini 2.5 Pro.

Main Results: Document Understanding & Grounding

Document Understanding and Grounding Results

Our model demonstrates superior vision-centric capabilities. In Document Understanding, OpenVLThinkerV2 attains 911 on OCRBench, surpassing specialized dynamic zoom-in models like DeepEyesV2. In Spatial Reasoning, despite not being finetuned on such data, our model achieves the highest performance on EmbSpatial and performs on par with the spatial expert SpatialRGPT on RoboSpatial. Furthermore, it significantly surpasses GPT-5 and Gemini 2.5 Pro on these tasks.

BibTeX

@article{hu2026openvlthinkerv2,
      title={OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks}, 
      author={Wenbo Hu and Xin Chen and Yan Gao-Tian and Yihe Deng and Nanyun Peng and Kai-Wei Chang},
      year={2026},
      journal={arXiv preprint}
}