Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors



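Cite this article:	WANG Qinglin, PEI Xiangdong, LIAO Linyu, WANG Haoxu, LI Rongchun, MEI Songzhu, LI Dongsheng. Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors[J]. Journal of National University of Defense Technology, 2023, 45(1): 86-94.

Authors:	WANG Qinglin, PEI Xiangdong, LIAO Linyu, WANG Haoxu, LI Rongchun, MEI Songzhu, LI Dongsheng

Affiliation:	College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, Hunan, China

Funding:	National Natural Science Foundation of China (62002365)

Abstract:	The matrix multiplication-based convolution algorithm provides a high-performance baseline implementation for a wide range of convolution configurations and is the first choice when optimizing convolution performance for a given chip. Targeting the characteristics of the Phytium heterogeneous multi-core digital signal processor (DSP) chip developed independently by National University of Defense Technology, as well as the characteristics of the matrix multiplication-based convolution algorithm itself, this paper proposes ftmEConv, a high-performance parallel matrix multiplication-based convolution algorithm for multi-core DSP architectures. The algorithm consists of four parallelized parts, all of which run on the general-purpose multi-core DSP: input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation. The performance of each part is improved by effectively exploiting the potential of the functional units in the general-purpose DSP cores. Experimental results show that ftmEConv achieves a computational efficiency of up to 42.90% and a speedup of up to 7.79 times over other matrix multiplication-based convolution implementations on the chip.
Keywords:	multi-core digital signal processors; convolutional neural networks; convolution algorithms; algorithm optimization
Received:	2022/9/13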
Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors

WANG Qinglin, PEI Xiangdong, LIAO Linyu, WANG Haoxu, LI Rongchun, MEI Songzhu, LI Dongsheng. Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors[J]. Journal of National University of Defense Technology, 2023, 45(1): 86-94.

Authors:	WANG Qinglin, PEI Xiangdong, LIAO Linyu, WANG Haoxu, LI Rongchun, MEI Songzhu, LI Dongsheng

Institution:	College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China

Abstract:	The matrix multiplication-based convolution algorithm, which can efficiently implement convolutions with different parameters, is the first choice for optimizing convolution performance on a given chip. Based on the architecture of the Phytium heterogeneous multi-core digital signal processors (DSPs) developed by National University of Defense Technology and on the characteristics of the matrix multiplication-based convolution algorithm, a parallel implementation of the algorithm (called ftmEConv) for different convolutions on multi-core DSPs is proposed. ftmEConv consists of four parallelized parts (input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation), all of which are optimized for multi-core DSPs; the performance of each part is improved by effectively exploiting the potential of the functional units in the DSP cores. The experimental results demonstrate that ftmEConv achieves a computational efficiency of up to 42.90%. Compared with other implementations of the matrix multiplication-based convolution algorithm on heterogeneous chips, ftmEConv achieves a speedup of up to 7.79 times.
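
The four parts named in the abstract can be illustrated with a minimal NumPy sketch of a generic matrix multiplication-based (im2col-style) convolution. This is not the ftmEConv implementation: it is single-threaded, ignores padding and batching, assumes a (C, H, W) input layout and a (K, C, kh, kw) filter layout, and performs no DSP-specific optimization. The function names `im2col` and `conv2d_matmul` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def im2col(x, kh, kw, stride=1):
    """Part 1 (input feature map transformation): unfold a (C, H, W)
    input into a matrix of shape (C*kh*kw, out_h*out_w)."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # All input pixels that meet filter tap (ci, i, j),
                # one per output position.
                patch = x[ci,
                          i:i + stride * out_h:stride,
                          j:j + stride * out_w:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, out_h, out_w


def conv2d_matmul(x, filters, stride=1):
    """Convolution (cross-correlation, as used in CNNs) expressed as a
    single matrix multiplication. No padding; assumed layouts only."""
    k, c, kh, kw = filters.shape
    cols, out_h, out_w = im2col(x, kh, kw, stride)  # 1. input transformation
    w_mat = filters.reshape(k, c * kh * kw)         # 2. filter transformation
    y_mat = w_mat @ cols                            # 3. matrix multiplication
    return y_mat.reshape(k, out_h, out_w)           # 4. output transformation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 8, 8))
    f = rng.standard_normal((4, 3, 3, 3))
    print(conv2d_matmul(x, f, stride=2).shape)
```

In this sketch the row order of the unfolded input (channel, then filter row, then filter column) matches the row-major flattening of each filter, which is what lets step 3 be a plain matrix product. In a parallel implementation each of the four steps could be partitioned across cores, but how ftmEConv does so is not described in the abstract.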

Keywords:	multi-core digital signal processors; convolutional neural networks; convolutional algorithms; algorithm optimization
