An Efficient Vectorization of Triangular Matrix Multiplication Supporting In-Place Calculation



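Citation:	Liu Zhong, Tian Xi, Chen Lei. An efficient vectorization of triangular matrix multiplication supporting in-place calculation [J]. Journal of National University of Defense Technology, 2014, 36(6).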

Authors:	Liu Zhong (刘仲), Tian Xi (田希), Chen Lei (陈磊)

Affiliation:	College of Computer, National University of Defense Technology (all three authors)


Abstract:	Mapping algorithms onto vector processors through vectorization is a critical issue. This paper presents an efficient vectorization of triangular matrix multiplication that supports in-place calculation. The L1 data cache (L1D) is configured as SRAM, and a double-buffering (ping-pong) scheme smooths data transfers between SRAM and external memory, so that kernel computation fully overlaps DMA data transfers and the kernel always runs at peak speed, achieving optimal computational efficiency. The irregular triangular matrix multiplication workload is evenly distributed across all vector processing elements to fully exploit the multiple levels of parallelism in the vector processor. In-place calculation stores the result matrix in the multiplier matrix, saving memory space. Experimental results on the Matrix vector processor show that the proposed vectorization achieves 1053.7 GFLOPS for triangular matrix multiplication, an efficiency of 91.47%.

Keywords:	Triangular Matrix Multiplication; In-Place Calculation; Vectorization; Vector Processor
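The ping-pong double buffering described in the abstract can be illustrated with a minimal C sketch. The DMA calls `dma_start` and `dma_wait`, the tile size `BLOCK`, and the stand-in kernel `compute_tile` are all hypothetical: the actual DMA interface of Matrix is not given here, so `dma_start` simply copies synchronously and no real overlap happens on an ordinary CPU. What the sketch does show is the schedule: tile t+1 is requested into one buffer while the kernel computes on tile t in the other, and the two buffers swap roles every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4 /* hypothetical tile size (elements per DMA transfer) */

/* Hypothetical DMA interface. On real hardware dma_start would launch an
 * asynchronous transfer and dma_wait would block until it completes; here
 * dma_start copies synchronously so the sketch runs anywhere. */
static void dma_start(double *dst, const double *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
}
static void dma_wait(void) { /* would wait for the pending transfer */ }

/* Stand-in kernel: works on one tile already resident in on-chip memory. */
static double compute_tile(const double *tile, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += tile[i] * tile[i];
    return s;
}

/* Ping-pong schedule over an "external memory" array; len must be a
 * multiple of BLOCK. */
double process_double_buffered(const double *ext, size_t len) {
    assert(len % BLOCK == 0);
    static double buf[2][BLOCK]; /* stands in for the two SRAM buffers */
    size_t ntiles = len / BLOCK;
    double total = 0.0;
    int cur = 0;
    if (ntiles == 0) return 0.0;
    dma_start(buf[cur], ext, BLOCK);            /* prefetch tile 0 */
    for (size_t t = 0; t < ntiles; ++t) {
        dma_wait();                             /* tile t is now in buf[cur] */
        if (t + 1 < ntiles)                     /* fetch tile t+1 into the idle buffer */
            dma_start(buf[1 - cur], ext + (t + 1) * BLOCK, BLOCK);
        total += compute_tile(buf[cur], BLOCK); /* compute on tile t meanwhile */
        cur = 1 - cur;                          /* swap ping and pong */
    }
    return total;
}
```

With a truly asynchronous DMA engine, the transfer of each tile is hidden whenever computing a tile takes at least as long as transferring one, which is the condition under which the kernel can stay at peak speed.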
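The abstract does not say how the irregular triangular workload is split across vector processing elements (PEs). One common load-balancing technique, shown here only as an assumed illustration and not as the authors' scheme, is row folding: row i of an n × n lower-triangular matrix has i + 1 nonzeros, so rows i and n − 1 − i together always hold n + 1, and dealing these pairs round-robin to the PEs equalizes the work. The functions `assign_rows_folded` and `work_of_pe` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assigns each row of an n x n lower-triangular matrix to a PE by pairing
 * the short row p with the long row n-1-p and dealing pairs round-robin.
 * owner must have room for n entries. */
void assign_rows_folded(size_t n, size_t n_pe, size_t *owner) {
    assert(n_pe > 0);
    for (size_t p = 0; p < (n + 1) / 2; ++p) {
        size_t pe = p % n_pe;
        owner[p] = pe;         /* short row: p + 1 nonzeros */
        owner[n - 1 - p] = pe; /* long row: n - p nonzeros (same row if n is odd and p is the middle) */
    }
}

/* Number of triangular nonzeros (multiply-adds per column of the dense
 * operand) handled by one PE. */
size_t work_of_pe(size_t n, const size_t *owner, size_t pe) {
    size_t w = 0;
    for (size_t i = 0; i < n; ++i)
        if (owner[i] == pe) w += i + 1;
    return w;
}
```

For n = 8 and 4 PEs every PE receives 9 nonzeros, whereas giving each PE two consecutive rows would range from 3 (rows 0 and 1) to 15 (rows 6 and 7).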
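A minimal sketch of in-place calculation, assuming the result overwrites the dense operand B in B := L·B with L lower triangular, which is the same overwrite convention used by the BLAS `xTRMM` routines; the paper's actual data layout is not given in the abstract, and the function name `trmm_lower_inplace` and the row-major layout are assumptions. The point is the traversal order: row i of the result needs rows 0..i of the original B, so producing rows from the bottom up guarantees those rows are still unmodified when they are read.

```c
#include <assert.h>
#include <stddef.h>

/* In-place B := L * B. L is n x n lower triangular (entries above the
 * diagonal are never read), B is n x m; both are row-major. */
void trmm_lower_inplace(size_t n, size_t m, const double *L, double *B) {
    assert(L != NULL && B != NULL);
    for (size_t ii = n; ii-- > 0;) {       /* bottom row first */
        for (size_t j = 0; j < m; ++j) {
            double acc = 0.0;
            for (size_t k = 0; k <= ii; ++k) /* rows 0..ii of B are still original */
                acc += L[ii * n + k] * B[k * m + j];
            B[ii * m + j] = acc;           /* overwrite after all reads of column j */
        }
    }
}
```

For L = [[1, 0], [2, 3]] and B = [[1, 2], [3, 4]], the call leaves B = [[1, 2], [11, 16]] without allocating a separate n × m result buffer.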
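Assuming the reported efficiency is the achieved rate divided by the processor's theoretical peak (the abstract does not state the peak itself), the two reported figures imply a peak of roughly 1152 GFLOPS on the test configuration:

```latex
E = \frac{P_{\mathrm{achieved}}}{P_{\mathrm{peak}}}
\quad\Longrightarrow\quad
P_{\mathrm{peak}} = \frac{1053.7\ \mathrm{GFLOPS}}{0.9147} \approx 1152\ \mathrm{GFLOPS}
```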

