Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization
Liqiang Jing¹; Yiren Li²; Junhao Xu¹; Yongcan Yu¹; Pei Shen²; Xuemeng Song¹
Journal: Machine Intelligence Research
Year: 2023
Volume: 20, Issue: 2, Pages: 289-298
Keywords: Multimodal sentence summarization (MMSS); generative pre-trained language model (GPLM); natural language generation; deep learning; artificial intelligence
ISSN: 2731-538X
DOI: 10.1007/s11633-022-1372-x
Abstract: Multimodal sentence summarization (MMSS) is a new yet challenging task that aims to generate a concise summary of a long sentence and its corresponding image. Although existing methods have achieved promising success in MMSS, they overlook the powerful generation ability of generative pre-trained language models (GPLMs), which have been shown to be effective in many text generation tasks. To fill this research gap, we propose to use GPLMs to promote the performance of MMSS. Notably, adopting GPLMs to solve MMSS inevitably faces two challenges: 1) What fusion strategy should we use to inject visual information into GPLMs properly? 2) How do we keep the GPLM's generation ability intact to the utmost extent when the visual feature is injected into the GPLM? To address these two challenges, we propose a vision enhanced generative pre-trained language model for MMSS, dubbed Vision-GPLM. In Vision-GPLM, we obtain features of the visual and textual modalities with two separate encoders and utilize a text decoder to produce a summary. In particular, we utilize multi-head attention to fuse the features extracted from the visual and textual modalities, injecting the visual feature into the GPLM. Meanwhile, we train Vision-GPLM in two stages: the vision-oriented pre-training stage and the fine-tuning stage. In the vision-oriented pre-training stage, we train only the visual encoder on the masked language model task while the other components are frozen, aiming to obtain homogeneous representations of text and image. In the fine-tuning stage, we train all the components of Vision-GPLM on the MMSS task. Extensive experiments on a public MMSS dataset verify the superiority of our model over existing baselines.
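The pipeline outlined in the abstract (two separate encoders, multi-head attention fusion, a text decoder, and freeze/unfreeze two-stage training) can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the class and function names (VisionGPLMSketch, set_pretraining_stage, set_finetuning_stage), layer counts, dimensions, and the residual fusion are assumptions for illustration; only the overall structure follows the abstract.

```python
import torch
import torch.nn as nn


class VisionGPLMSketch(nn.Module):
    """Illustrative sketch of the architecture described in the abstract:
    separate visual/textual encoders, multi-head attention fusion, and a
    text decoder. All names, sizes, and layer counts are assumptions."""

    def __init__(self, d_model=768, n_heads=12, vocab_size=50265):
        super().__init__()
        # Hypothetical stand-ins for the GPLM's text encoder/decoder.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        # Separate encoder for pre-extracted image-region features.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        # Multi-head attention that lets text tokens attend to image
        # regions, injecting the visual feature into the GPLM.
        self.fusion = nn.MultiheadAttention(d_model, n_heads,
                                            batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_emb, image_emb, summary_emb):
        text_h = self.text_encoder(text_emb)      # (B, T_text, D)
        vis_h = self.visual_encoder(image_emb)    # (B, T_img, D)
        # Fuse: text features query the visual features (one plausible
        # reading of the fusion strategy; the residual keeps text intact).
        fused, _ = self.fusion(text_h, vis_h, vis_h)
        memory = text_h + fused
        # Decode the summary from the fused memory (causal mask omitted
        # for brevity in this sketch).
        out = self.decoder(summary_emb, memory)   # (B, T_sum, D)
        return self.lm_head(out)


def set_pretraining_stage(model):
    """Stage 1 (vision-oriented pre-training): train only the visual
    encoder via the masked language model task; freeze the GPLM parts
    to keep their generation ability intact."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.visual_encoder.parameters():
        p.requires_grad = True


def set_finetuning_stage(model):
    """Stage 2 (fine-tuning): unfreeze all components and train on MMSS."""
    for p in model.parameters():
        p.requires_grad = True


if __name__ == "__main__":
    model = VisionGPLMSketch()
    set_pretraining_stage(model)
    text = torch.randn(2, 32, 768)     # source-sentence token embeddings
    image = torch.randn(2, 49, 768)    # projected image-region features
    summary = torch.randn(2, 16, 768)  # shifted summary embeddings
    logits = model(text, image, summary)
    print(logits.shape)                # torch.Size([2, 16, 50265])
```

Freezing everything but the visual encoder in stage 1 is what lets the image representation adapt toward the frozen GPLM's text space, which is one way to realize the abstract's goal of homogeneous representations of text and image.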
Content Type: Journal Article
Source URL: http://ir.ia.ac.cn/handle/173211/55981
Collection: Institute of Automation_Academic Journals_International Journal of Automation and Computing
Author Affiliations:
1. School of Computer Science and Technology, Shandong University, Qingdao 266237, China
2. HBIS Digital Technology Co., Ltd., Shijiazhuang 050035, China
Recommended Citation:
GB/T 7714: Liqiang Jing, Yiren Li, Junhao Xu, et al. Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization[J]. Machine Intelligence Research, 2023, 20(2): 289-298.
APA: Liqiang Jing, Yiren Li, Junhao Xu, Yongcan Yu, Pei Shen, & Xuemeng Song. (2023). Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization. Machine Intelligence Research, 20(2), 289-298.
MLA: Liqiang Jing, et al. "Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization". Machine Intelligence Research 20.2 (2023): 289-298.