We investigate a critical bottleneck in Multimodal Large Language Models (MLLMs): their limited visual perception, which hinders their ability to solve complex geometric reasoning tasks. We find that even state-of-the-art models struggle to accurately perceive basic geometric concepts, preventing effective reasoning.
📝 Our Solution:
We introduce GeoPQA, a benchmark to measure this gap, and propose a two-stage training framework (sketched below):
1️⃣ Perception First: We train the MLLM to accurately identify geometric structures.
2️⃣ Reasoning Second: With a solid visual foundation, we then train it on complex, multi-step reasoning.
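To make the schedule concrete, here is a minimal sketch of the perception-first, reasoning-second training loop. Everything in it (`StubMLLM`, `Example`, the exact-match rewards, `train_stage`) is an illustrative placeholder rather than our released training code; the actual reward design and optimization in the paper may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    diagram: str    # path to a geometric diagram image
    question: str
    answer: str


class StubMLLM:
    """Placeholder standing in for a multimodal LLM policy.
    A real model would generate text from (image, question) pairs
    and be updated from the reward signal."""

    def generate(self, diagram: str, question: str) -> str:
        return ""           # no-op for this sketch

    def update(self, reward: float) -> None:
        pass                # no-op for this sketch


def exact_match(prediction: str, answer: str) -> float:
    """Simple illustrative reward: 1.0 if the prediction matches the reference."""
    return 1.0 if prediction.strip() == answer.strip() else 0.0


def train_stage(model, data: List[Example],
                reward_fn: Callable[[str, str], float]):
    """One training stage: sample a response per example, score it with the
    stage-specific reward, and update the model."""
    for ex in data:
        pred = model.generate(ex.diagram, ex.question)
        model.update(reward_fn(pred, ex.answer))
    return model


def two_stage_training(model, perception_data, reasoning_data):
    # Stage 1: reward accurate perception of geometric structures
    # (e.g. correctly stating which segments are perpendicular).
    model = train_stage(model, perception_data, exact_match)
    # Stage 2: with perception in place, reward correct final answers
    # on multi-step geometric reasoning problems.
    model = train_stage(model, reasoning_data, exact_match)
    return model


if __name__ == "__main__":
    trained = two_stage_training(StubMLLM(), [], [])
```

The key design point is the schedule itself: reasoning training only begins after the perception stage, so reasoning rewards are not chasing mis-perceived diagrams.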
📈 The results are exciting! Our two-stage approach improves geometric problem-solving accuracy by 9.1% over training directly on reasoning tasks. Our work highlights a key principle: for MLLMs to truly reason, they must first learn to see.
If you find our work useful, please consider citing our paper:
@misc{chen2025geopqabridgingvisualperception,
  title={GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning},
  author={Guizhen Chen and Weiwen Xu and Hao Zhang and Hou Pong Chan and Deli Zhao and Anh Tuan Luu and Yu Rong},
  year={2025},
  eprint={2509.17437},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.17437},
}