Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation | |
Yichen Yan2,3; Xingjian He3; Sihan Chen2; Shichen Lu1; Jing Liu2,3 | |
2024-08 | |
Conference date | 2024/08/05 |
Conference location | Tianjin, China |
Keywords | Referring Image Segmentation, CLIP, Hierarchical Fusion, Computer Vision |
DOI | 3652583.3658095 |
Abstract | Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being a text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence. This bi-directional vision-language guided approach produces higher-quality multi-modal features sent to the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods. |
Content type | Conference paper |
Source URL | [http://ir.ia.ac.cn/handle/173211/58512] |
Collection | Institute of Automation_National Laboratory of Pattern Recognition_Image and Video Analysis Team |
Author affiliations | 1.Beihang University 2.School of Artificial Intelligence, University of Chinese Academy of Sciences 3.Institute of Automation, Chinese Academy of Sciences |
Recommended citation (GB/T 7714) | Yichen Yan, Xingjian He, Sihan Chen, et al. Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation[C]. In:. Tianjin, China. 2024/08/05. |
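
The two-stage guidance described in the abstract (vision-guided fusion followed by language-guided calibration) can be illustrated with a toy sketch. The abstract does not specify how either stage is implemented, so everything below is an assumption made for illustration: the function names `attend`, `vision_guided_fusion`, and `language_guided_calibration` are hypothetical, the fusion step is plain scaled dot-product cross-attention with a residual connection, and the calibration step is a simple sigmoid gate. This is not FCNet's actual architecture.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Scaled dot-product attention; memory vectors serve as both keys and values."""
    dim = len(memory[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, m) / scale for m in memory])
        out.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return out

def vision_guided_fusion(pixels, words):
    # Assumption: pixel features are the queries, so vision decides which
    # words to read; the residual add keeps the original visual information.
    context = attend(pixels, words)
    return [[p + c for p, c in zip(pix, ctx)] for pix, ctx in zip(pixels, context)]

def language_guided_calibration(fused, words):
    # Assumption: word features are the queries over the fused map, the
    # results are averaged into one sentence-level vector, and that vector
    # gates every fused feature with a sigmoid weight in (0, 1).
    word_context = attend(words, fused)
    dim = len(fused[0])
    sentence = [sum(w[d] for w in word_context) / len(word_context) for d in range(dim)]
    gates = [1.0 / (1.0 + math.exp(-dot(f, sentence) / math.sqrt(dim))) for f in fused]
    return [[g * x for x in f] for g, f in zip(gates, fused)]

# Toy inputs: three flattened pixel features and two word features, all 2-D.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0], [0.9, 0.1]]

fused = vision_guided_fusion(pixels, words)
calibrated = language_guided_calibration(fused, words)
```

In this sketch `calibrated` keeps the shape of `pixels` (one feature vector per pixel), which is what a mask decoder would consume. A real model would use learned projections, multiple heads, and hierarchical feature maps rather than raw dot products.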