Auto-tuning dense matrix multiplication for GPGPU with cache


	Cui, Xiang ; Chen, Yifeng ; Zhang, Changyou ; Mei, Hong
	2010
Abstract	In this paper we discuss our experience in improving the performance of GEMM (both single and double precision) on the Fermi architecture using CUDA, and how new features of Fermi, such as the cache, affect performance. We find that the addition of a cache to the GPU on the one hand helps the processors exploit data locality that arises at run time, but on the other hand makes the dependence of performance on algorithmic parameters less predictable. Auto-tuning then becomes a useful technique for addressing this issue. Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops, respectively, on a Tesla C2050. The design and implementation use only CUDA and C and have not benefited from tuning at the level of binary code. © 2010 IEEE. Indexed in EI.
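Language	English
DOI	10.1109/ICPADS.2010.64
Content type	Other
Source URL	[http://ir.pku.edu.cn/handle/20.500.11897/295487]
Collection	School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University
Recommended citation (GB/T 7714)	Cui, Xiang, Chen, Yifeng, Zhang, Changyou, et al. Auto-tuning dense matrix multiplication for GPGPU with cache. 2010-01-01.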
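
The abstract describes auto-tuning as an empirical search over algorithmic parameters whose effect on performance is hard to predict. The following is a minimal CPU-side sketch of that idea in Python, not the authors' CUDA implementation: it times a blocked matrix multiplication for several candidate tile sizes and keeps the fastest. The function names (`blocked_matmul`, `autotune`), the candidate list, and the choice of tile size as the only tuned parameter are assumptions made for illustration; this record does not say which parameters the paper tunes.

```python
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) in tile x tile blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one block of A by one block of B and accumulate into C.
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time each candidate tile size; return (fastest tile, all timings)."""
    # Hypothetical test inputs; any dense n x n matrices would do.
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best  # keep the best of several runs to reduce noise
    best_tile = min(timings, key=timings.get)
    return best_tile, timings


if __name__ == "__main__":
    n = 64
    best_tile, timings = autotune(n, [4, 8, 16, 32])
    for tile, seconds in sorted(timings.items()):
        # A dense n x n multiply performs 2 * n^3 floating-point operations.
        print(f"tile={tile:3d}  {seconds:.4f} s  {2 * n**3 / seconds / 1e9:.4f} GFlops")
    print("fastest tile size:", best_tile)
```

The selected tile size depends on the machine and on the matrix size, which is the reason for measuring rather than predicting. A GPU auto-tuner would follow the same loop but would time kernel launches instead of Python function calls.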
