Probe Efficient Feature Representation of Gapped K-mer Frequency Vectors from Sequences Using Deep Neural Networks
Cao, Zhen1,2; Zhang, Shihua1,2,3
Journal: IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
2020-03-01
Volume: 17, Issue: 2, Pages: 657-667
Keywords: DNA; Bioinformatics; Kernel; Feature extraction; Support vector machines; Genomics; Task analysis; machine learning; gapped k-mer; deep neural network; transcription factor binding site prediction
ISSN: 1545-5963
DOI: 10.1109/TCBB.2018.2868071
Abstract: Gapped k-mer frequency vectors (gkm-fvs) have been proposed for extracting sequence features. Coupled with a support vector machine (gkm-SVM), gkm-fvs have been used to achieve effective sequence-based predictions. However, the heavy computation of a large kernel matrix prevents gkm-SVM from using large amounts of data, and it is unclear how to combine gkm-fvs with other data sources in the context of a string kernel. On the other hand, the high dimensionality, collinearity, and sparsity of gkm-fvs hinder the use of many traditional machine learning methods without a kernel trick. Therefore, we proposed a flexible and scalable framework, gkm-DNN, to achieve feature representation from high-dimensional gkm-fvs using deep neural networks (DNNs). We first proposed a more concise version of gkm-fvs, which significantly reduces their dimension. Then, we implemented, for the first time, an efficient method to calculate the gkm-fv of a given sequence. Finally, we adopted a DNN model with gkm-fvs as inputs to achieve efficient feature representation and a prediction task. Here, we took transcription factor binding site prediction as an illustrative application and applied gkm-DNN to 467 small and 69 big human ENCODE ChIP-seq datasets to demonstrate its performance, comparing it with the state-of-the-art method gkm-SVM.
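As a rough illustration of what a gapped k-mer frequency vector contains, the following minimal Python sketch counts gapped k-mers under the commonly used definition: each window of length l contributes one count for every way of keeping k informative positions and masking the remaining l - k. The function names, the "-" gap symbol, and the values of l and k are illustrative assumptions; this is not the authors' implementation, and it reproduces neither the concise gkm-fv variant nor the efficient algorithm described in the abstract.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gapped_kmer_counts(seq, l=4, k=2):
    """Count gapped k-mers of a DNA sequence (illustrative sketch).

    Every length-l window contributes one count for each choice of k
    informative positions; the other l - k positions are masked with "-".
    """
    seq = seq.upper()
    masks = list(combinations(range(l), k))  # all C(l, k) gap layouts
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        # Skip windows with ambiguous bases such as N (an assumption).
        if any(base not in "ACGT" for base in window):
            continue
        for keep in masks:
            pattern = "".join(
                window[i] if i in keep else "-" for i in range(l)
            )
            counts[pattern] += 1
    return counts


def full_gkm_dimension(l, k):
    """Length of the uncompressed vector: C(l, k) layouts times 4^k fillings."""
    return comb(l, k) * 4 ** k
```

Calling gapped_kmer_counts("ACGTAC", l=4, k=2) returns a sparse Counter keyed by patterns such as "A-G-". A dense vector of length full_gkm_dimension(l, k) would be mostly zeros for short sequences, which matches the sparsity and high dimensionality that the abstract identifies as obstacles.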
Funding: National Natural Science Foundation of China [61621003]; National Natural Science Foundation of China [11661141019]; National Natural Science Foundation of China [61422309]; National Natural Science Foundation of China [61379092]; Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) [XDB13040600]; Ten Thousand Talent Program for Young Top-notch Talent; Key Research Program of the Chinese Academy of Sciences [KFZD-SW-219]; CAS Frontier Science Research Key Project for Top Young Scientist [QYZDB-SSW-SYS008]
WOS Research Areas: Biochemistry & Molecular Biology; Computer Science; Mathematics
Language: English
Publisher: IEEE COMPUTER SOC
WOS Accession Number: WOS:000524236800025
Content Type: Journal article
Source URL: http://ir.amss.ac.cn/handle/2S8OKBNM/51124
Collection: Institute of Applied Mathematics
Corresponding Author: Zhang, Shihua
Author Affiliations:
1. Univ Chinese Acad Sci, Sch Math Sci, Beijing 100049, Peoples R China
2. Chinese Acad Sci, NCMIS, CEMS, RCSDS, Acad Math & Syst Sci, Beijing 100190, Peoples R China
3. Chinese Acad Sci, Ctr Excellence Anim Evolut & Genet, Kunming 650223, Yunnan, Peoples R China
Recommended Citation:
GB/T 7714: Cao, Zhen, Zhang, Shihua. Probe Efficient Feature Representation of Gapped K-mer Frequency Vectors from Sequences Using Deep Neural Networks[J]. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2020, 17(2): 657-667.
APA: Cao, Zhen, & Zhang, Shihua. (2020). Probe Efficient Feature Representation of Gapped K-mer Frequency Vectors from Sequences Using Deep Neural Networks. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 17(2), 657-667.
MLA: Cao, Zhen, et al. "Probe Efficient Feature Representation of Gapped K-mer Frequency Vectors from Sequences Using Deep Neural Networks." IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 17.2 (2020): 657-667.