Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
Yunan Zeng; Yan Huang; Jinjin Zhang; Zequn Jie; Zhenhua Chai; Liang Wang
2024-06-18
Conference Dates | 17-21 June 2024 |
Conference Location | Seattle WA, USA |
Abstract | Pre-trained vision-language models (VLMs) have achieved high performance on various downstream tasks and have been widely used for visual grounding in a weakly supervised manner. However, despite the performance gains contributed by large-scale vision and language pre-training, we find that state-of-the-art VLMs struggle with compositional reasoning on grounding tasks. To demonstrate this, we propose the Attribute, Relation, and Priority grounding (ARPGrounding) benchmark to test VLMs' compositional reasoning ability on visual grounding tasks. ARPGrounding contains 11,425 samples and evaluates the compositional understanding of VLMs along three dimensions: 1) attribute, denoting comprehension of objects' properties; 2) relation, indicating an understanding of relations between objects; 3) priority, reflecting an awareness of the part of speech associated with nouns. Using the ARPGrounding benchmark, we evaluate several mainstream VLMs. We empirically find that these models perform quite well on conventional visual grounding datasets, achieving performance comparable to or surpassing state-of-the-art methods, but show strong deficiencies in compositional reasoning. Furthermore, we propose a composition-aware fine-tuning pipeline, demonstrating the potential to leverage cost-effective image-text annotations to enhance the compositional understanding of VLMs in grounding tasks. |
Content Type | Conference Paper |
Source URL | [http://ir.ia.ac.cn/handle/173211/57210] |
Collection | Institute of Automation, Center for Research on Intelligent Perception and Computing |
Corresponding Author | Liang Wang |
Author Affiliations | 1. School of Artificial Intelligence, University of Chinese Academy of Sciences 2. Meituan 3. Center for Research on Intelligent Perception and Computing 4. Institute of Automation, Chinese Academy of Sciences |
Recommended Citation (GB/T 7714) | Yunan Zeng, Yan Huang, Jinjin Zhang, et al. Investigating Compositional Challenges in Vision-Language Models for Visual Grounding[C]. In: . Seattle WA, USA. 17-21 June 2024. |