稀疏矩阵向量乘的存储访问复杂度及自动性能优化技术应用研究

CORC > 软件研究所 > 并行计算实验室 > 学位论文

题名	稀疏矩阵向量乘的存储访问复杂度及自动性能优化技术应用研究
作者	袁娥
学位类别	博士
答辩日期	2008-06-05
授予单位	中国科学院软件研究所
授予地点	软件研究所
关键词	稀疏矩阵向量乘 DRAM(h)模型存储访问复杂度自适应算法启发式-寄存器分块算法
中文摘要	在科学计算中，稀疏矩阵向量乘（SpMV, y=Ax）是一个十分重要的，且经常被大量调用的计算内核，广泛应用在科学计算、信息检索、气象、航天、油藏模拟、天体物理、数据挖掘等科学计算和实际应用中。在实际工程应用中，重复调用稀疏矩阵向量乘内核的次数常常会达到成千上万次。但在现代基于Cache存储层次的计算平台上，稀疏矩阵向量乘的性能很低。如果能够提高稀疏矩阵向量乘的运算速度，整个工程计算的运行效率将会得到很大的改善，计算时间也会大幅度的减少。因此优化稀疏矩阵向量乘的性能成为提高工程效率的关键，在实际应用中有着十分重要的意义。 SpMV的传统算法实现形式运行效率很低，主要原因是浮点计算操作和存储访问操作比率非常低，且稀疏矩阵非零元分布的不规则性使得存储访问模式非常复杂。寄存器分块算法和启发式选择分块算法,通过自适应选择性能最佳的分块大小，然后将稀疏矩阵分成小的稠密分块，所有的非零子块顺序计算，达到重用保存在寄存器中向量x元素的目的，减少存储访问次数和时间，从而提高这一重要内核的性能。我们在Pentium IV、Alpha EV6和AMD Athlon三个平台上，分别测试了十个矩阵下的两种不同算法形式(压缩行存储算法和寄存器分块算法)的性能，平均加速比分别达到1.69、1.90和 1.48。同时也测试了不同次数调用SpMV两种算法所用的时间，发现在实际的迭代算法应用过程中，若想采用启发式-寄存器分块算法达到性能提高的目的，一般情况下，迭代次数需要达到上百次才能有加速效果。 DRAM(h)模型是基于存储层次的并行计算模型，指出算法的复杂性包括计算复杂性和存储访问复杂性，具有近乎相同时间和空间复杂性的同一算法的不同实现形式，会有不同的存储访问复杂度，导致程序实际运行性能的差异；利用DRAM(h)模型进行分析并比较不同算法实现形式的存储访问复杂度，可以判断两种算法形式的优劣，从而为选取性能更高的实现形式提供指导。但利用DRAM(h)模型分析SpMV存储访问复杂度的工作以前没有人做过，并且SpMV的计算性能和存储访问行为跟具体的稀疏矩阵有关，只有到程序运行的时候才能知道。本文中，我们提出模板法和动态统计分析法两种分析SpMV存储访问复杂度的方法。在Pentium IV和Alpha EV6平台上，用RAM(h) 模型分析和计算了稀疏矩阵向量乘两种算法实现形式（即压缩行存储算法和寄存器分块算法）的存储访问复杂度，通过分析和计算在SpMV过程中需要访问的所有数据的存储访问复杂度，可知存储访问行为对整体程序的实际性能有直接影响。我们还在Pentium IV平台上，测试了七个稀疏矩阵的SpMV性能，并统计了两种算法中L1, L2,和TLB的缺失率，实验结果与模型分析的结果一致。
英文摘要	Sparse Matrix-Vector Multiplication is an important computational kernel in scientific applications that tends to perform poorly on modern processors due to its low computation-to-memory-access ratio and irregular memory access patterns. And the memory hierarchy makes the optimization even more difficult to perform. Register-level blocking algorithm and heuristic algorithm stores a matrix as a sequence of small dense blocks and organizes the computation to compute each block before moving on to the next, reuses the elements of x to optimize the performance of SpMV. Our experiences indicate that for matrix arising in scientific applications, register-level blocking algorithm and heuristic algorithm improve the performance greatly. We have tested ten matrices to compare the heuristic-register blocking algorithm with the CSR storage algorithm on three different platforms. The average speedups are 1.69 on Pentium IV platform, 1.90 on Alpha EV6 platform and 1.48 on AMD Athlon platform. But in practical applications, such as the iterative algorithm solver, with the above techniques, we generally need to call SpMV kernel more than one hundred times in order to improve the performance. DRAM(h) is a parallel computation model that has h-level memory hierarchies. It indicates that different implementation forms of one same algorithm can have different memory access complexity though they have almost the same time and space complexity under traditional RAM model. With RAM(h) model, we performed memory access complexity analysis on two implementation forms of SpMV(CSR storage algorithm and register-level Blocking algorithm) on Pentium IV platform and Alpha EV6 platform. At the same time, we tested SpMV performance of seven sparse matrix and the times of misses of L1 Cache, L2 Cache and TLB on Pentium IV platform. Our model analytical results matched well with experimental results, which indicated the effectiveness of DRAM(h) on clarifying the different memory access pattern of various forms of the same algorithm.
语种	中文
公开日期	2011-03-17
页码	63
内容类型	学位论文
源URL	[http://124.16.136.157/handle/311060/6714]
专题	软件研究所_并行计算实验室 _学位论文
推荐引用方式 GB/T 7714	袁娥. 稀疏矩阵向量乘的存储访问复杂度及自动性能优化技术应用研究[D]. 软件研究所. 中国科学院软件研究所. 2008.

个性服务

查看访问统计

相关权益政策

暂无数据

收藏/分享

所有评论 (0)

[发表评论/异议/意见]

暂无评论

评论
权益异议
反馈意见

评注功能仅针对注册用户开放，请您登录

您对该条目有什么异议，请向管理员反馈。
内容：
Email：	*
单位:
验证码：	刷新

您在知识库使用过程中有什么好的想法或者建议可以反馈给我们。
标题：	*
内容：
Email：	*
验证码：	刷新

相关链接

CORC

联系我们