OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU | |
Zhang Xian-Yi ; Wang Qian ; Zhang Yun-Quan | |
Journal | Ruan Jian Xue Bao/Journal of Software |
2011 | |
Volume | 22 Issue: SUPPL. 2 Pages: 208-216 |
Keywords | Computer software; Software engineering |
ISSN | 1000-9825 |
Abstract | BLAS is a fundamental math library in scientific computing, so each CPU vendor releases an optimized BLAS library for its own CPUs. The Loongson CPU series is developed by the Institute of Computing Technology, Chinese Academy of Sciences, which released the Loongson 3 CPU series in 2010. This paper introduces OpenBLAS, an open source BLAS library forked from the GotoBLAS2 1.13 BSD version. The BLAS Level 3 functions of OpenBLAS are optimized for the Loongson 3A quad-core CPU. The sequential optimizations use blocking, a hand-coded assembly kernel, Loongson 3A special instructions, and instruction reordering. The BLAS Level 3 subroutines outperformed GotoBLAS and ATLAS by about 75% and 17%, respectively, and by about 103% and 36% in the double precision functions. For multithreaded parallel optimization, this study uses an interleaved data buffer layout to avoid conflicts among threads in the shared L2 cache. OpenBLAS achieved a speedup of 3.47 on four cores. With 4 threads, the OpenBLAS BLAS Level 3 functions outperformed GotoBLAS and ATLAS by about 69% and 34%, respectively, and by about 89% and 55% in the double precision functions. ©2011 Journal of Software. |
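Indexed by | EI |
Language | Chinese |
Date available | 2013-10-08 |
Content type | Journal article |
Source URL | [http://ir.iscas.ac.cn/handle/311060/16164] |
Collection | Institute of Software_Institute of Software Library_Journal Articles |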
Recommended citation GB/T 7714 | Zhang Xian-Yi, Wang Qian, Zhang Yun-Quan. OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU[J]. Ruan Jian Xue Bao/Journal of Software, 2011, 22(SUPPL. 2): 208-216. |
APA | Zhang Xian-Yi, Wang Qian, & Zhang Yun-Quan. (2011). OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU. Ruan Jian Xue Bao/Journal of Software, 22(SUPPL. 2), 208-216. |
MLA | Zhang Xian-Yi, et al. "OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU." Ruan Jian Xue Bao/Journal of Software 22.SUPPL. 2 (2011): 208-216. |