Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization
Liqiang Jing¹; Yiren Li²; Junhao Xu¹; Yongcan Yu¹; Pei Shen²; Xuemeng Song¹
Journal: Machine Intelligence Research
Year: 2023
Volume: 20, Issue: 2, Pages: 289-298
Keywords: Multimodal sentence summarization (MMSS); generative pre-trained language model (GPLM); natural language generation; deep learning; artificial intelligence
ISSN: 2731-538X
DOI: 10.1007/s11633-022-1372-x
Abstract: Multimodal sentence summarization (MMSS) is a new yet challenging task that aims to generate a concise summary of a long sentence and its corresponding image. Although existing methods have achieved promising success in MMSS, they overlook the powerful generation ability of generative pre-trained language models (GPLMs), which have been shown to be effective in many text generation tasks. To fill this research gap, we propose to use GPLMs to promote the performance of MMSS. Notably, adopting GPLMs to solve MMSS inevitably faces two challenges: 1) What fusion strategy should we use to inject visual information into GPLMs properly? 2) How do we keep the GPLM's generation ability intact to the utmost extent when the visual feature is injected into the GPLM? To address these two challenges, we propose a vision enhanced generative pre-trained language model for MMSS, dubbed Vision-GPLM. In Vision-GPLM, we obtain features of the visual and textual modalities with two separate encoders and utilize a text decoder to produce a summary. In particular, we utilize multi-head attention to fuse the features extracted from the visual and textual modalities, injecting the visual feature into the GPLM. Meanwhile, we train Vision-GPLM in two stages: the vision-oriented pre-training stage and the fine-tuning stage. In the vision-oriented pre-training stage, we train only the visual encoder on the masked language model task while the other components are frozen, aiming to obtain homogeneous representations of text and image. In the fine-tuning stage, we train all the components of Vision-GPLM on the MMSS task. Extensive experiments on a public MMSS dataset verify the superiority of our model over existing baselines.
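The pipeline outlined in the abstract (two separate encoders, multi-head attention fusion, a text decoder, and freeze/unfreeze two-stage training) can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the class and function names (VisionGPLMSketch, set_pretraining_stage, set_finetuning_stage), layer counts, dimensions, and the residual fusion are assumptions for illustration; only the overall structure follows the abstract.

```python
import torch
import torch.nn as nn


class VisionGPLMSketch(nn.Module):
    """Illustrative sketch of the architecture described in the abstract:
    separate visual/textual encoders, multi-head attention fusion, and a
    text decoder. All names, sizes, and layer counts are assumptions."""

    def __init__(self, d_model=768, n_heads=12, vocab_size=50265):
        super().__init__()
        # Hypothetical stand-ins for the GPLM's text encoder/decoder.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        # Separate encoder for pre-extracted image-region features.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        # Multi-head attention that lets text tokens attend to image
        # regions, injecting the visual feature into the GPLM.
        self.fusion = nn.MultiheadAttention(d_model, n_heads,
                                            batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_emb, image_emb, summary_emb):
        text_h = self.text_encoder(text_emb)      # (B, T_text, D)
        vis_h = self.visual_encoder(image_emb)    # (B, T_img, D)
        # Fuse: text features query the visual features (one plausible
        # reading of the fusion strategy; the residual keeps text intact).
        fused, _ = self.fusion(text_h, vis_h, vis_h)
        memory = text_h + fused
        # Decode the summary from the fused memory (causal mask omitted
        # for brevity in this sketch).
        out = self.decoder(summary_emb, memory)   # (B, T_sum, D)
        return self.lm_head(out)


def set_pretraining_stage(model):
    """Stage 1 (vision-oriented pre-training): train only the visual
    encoder via the masked language model task; freeze the GPLM parts
    to keep their generation ability intact."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.visual_encoder.parameters():
        p.requires_grad = True


def set_finetuning_stage(model):
    """Stage 2 (fine-tuning): unfreeze all components and train on MMSS."""
    for p in model.parameters():
        p.requires_grad = True


if __name__ == "__main__":
    model = VisionGPLMSketch()
    set_pretraining_stage(model)
    text = torch.randn(2, 32, 768)     # source-sentence token embeddings
    image = torch.randn(2, 49, 768)    # projected image-region features
    summary = torch.randn(2, 16, 768)  # shifted summary embeddings
    logits = model(text, image, summary)
    print(logits.shape)                # torch.Size([2, 16, 50265])
```

Freezing everything but the visual encoder in stage 1 is what lets the image representation adapt toward the frozen GPLM's text space, which is one way to realize the abstract's goal of homogeneous representations of text and image.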
Content Type: Journal Article
Source URL: http://ir.ia.ac.cn/handle/173211/55981
Collection: Institute of Automation_Academic Journals_International Journal of Automation and Computing
Author Affiliations:
1. School of Computer Science and Technology, Shandong University, Qingdao 266237, China
2. HBIS Digital Technology Co., Ltd., Shijiazhuang 050035, China
Recommended Citation:
GB/T 7714: Liqiang Jing, Yiren Li, Junhao Xu, et al. Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization[J]. Machine Intelligence Research, 2023, 20(2): 289-298.
APA: Liqiang Jing, Yiren Li, Junhao Xu, Yongcan Yu, Pei Shen, & Xuemeng Song. (2023). Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization. Machine Intelligence Research, 20(2), 289-298.
MLA: Liqiang Jing, et al. "Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization". Machine Intelligence Research 20.2 (2023): 289-298.