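Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors
Cite this article:WANG Qinglin,PEI Xiangdong,LIAO Linyu,WANG Haoxu,LI Rongchun,MEI Songzhu,LI Dongsheng. Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors[J]. Journal of National University of Defense Technology, 2023, 45(1): 86-94
Authors:WANG Qinglin  PEI Xiangdong  LIAO Linyu  WANG Haoxu  LI Rongchun  MEI Songzhu  LI Dongsheng
Affiliation:College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China
Funding:National Natural Science Foundation of China (62002365)
Abstract (translated from the Chinese):The matrix multiplication-based convolution algorithm provides a high-performance baseline implementation for a wide range of convolution configurations, and is the first choice when optimizing convolution performance for a given chip. Targeting the features of the Phytium (飞腾) heterogeneous multi-core digital signal processor (DSP) chip developed in-house by the National University of Defense Technology, together with the characteristics of the matrix multiplication-based convolution algorithm itself, a high-performance parallel implementation of this algorithm for multi-core DSP architectures, ftmEConv, is proposed. The algorithm consists of four parallelized parts (input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation), all of which run on the general-purpose multi-core DSP; the performance of each part is improved by effectively exploiting the potential of the functional units in the general-purpose DSP cores. Experimental results show that ftmEConv achieves a computational efficiency of up to 42.90% and a speedup of up to 7.79 times over other matrix multiplication-based convolution implementations on the chip.

Keywords:multi-core digital signal processors  convolutional neural networks  convolution algorithms  algorithm optimization
Received:2022-09-13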

Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors
WANG Qinglin,PEI Xiangdong,LIAO Linyu,WANG Haoxu,LI Rongchun,MEI Songzhu,LI Dongsheng. Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors[J]. Journal of National University of Defense Technology, 2023, 45(1): 86-94
Authors:WANG Qinglin  PEI Xiangdong  LIAO Linyu  WANG Haoxu  LI Rongchun  MEI Songzhu  LI Dongsheng
Affiliation:College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
Abstract:The matrix multiplication-based convolution algorithm, which can efficiently implement convolutions with different parameters, is the first choice for optimizing convolution performance on a given chip. Based on the architecture of the Phytium heterogeneous multi-core DSPs (digital signal processors) developed by the National University of Defense Technology and the characteristics of the matrix multiplication-based convolution algorithm, a parallel implementation of this algorithm (called ftmEConv) that supports different convolutions on multi-core DSPs is proposed. ftmEConv consists of four parallelized parts (input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation), all of which are optimized for multi-core DSPs; the performance of each part is improved by effectively exploiting the potential of all functional units in the DSP cores. Experimental results show that ftmEConv achieves a computational efficiency of up to 42.90% and a speedup of up to 7.79 times over other implementations of the matrix multiplication-based convolution algorithm on heterogeneous chips.
Keywords:multi-core digital signal processors   convolutional neural networks   convolutional algorithms   algorithm optimization
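
The four parts named in the abstract (input feature map transformation, filter transformation, matrix multiplication, output feature map transformation) follow the general im2col-plus-GEMM pattern. The sketch below illustrates that generic pattern in NumPy on a single image with no padding. It is not the authors' ftmEConv code and says nothing about how the DSP version is parallelized or tuned; the function names `im2col`, `conv2d_gemm`, and `conv2d_direct` are hypothetical, and the (channels, height, width) data layout is an assumption made for illustration.

```python
import numpy as np


def im2col(x, kh, kw, stride=1):
    """Input feature map transformation (assumed layout: C x H x W).

    Unfolds every kh x kw window into one column, producing a matrix of
    shape (C*kh*kw, OH*OW). No padding is applied in this sketch.
    """
    c, h, w = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # All input elements that meet kernel tap (ci, i, j).
                patch = x[ci, i:i + stride * oh:stride, j:j + stride * ow:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, oh, ow


def conv2d_gemm(x, weights, stride=1):
    """Convolution (CNN-style cross-correlation) as one matrix multiplication."""
    k, c, kh, kw = weights.shape
    cols, oh, ow = im2col(x, kh, kw, stride)   # 1. input transformation
    wmat = weights.reshape(k, c * kh * kw)     # 2. filter transformation
    out = wmat @ cols                          # 3. matrix multiplication
    return out.reshape(k, oh, ow)              # 4. output transformation


def conv2d_direct(x, weights, stride=1):
    """Naive sliding-window reference, used only to check the GEMM version."""
    k, c, kh, kw = weights.shape
    _, h, w = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    out = np.zeros((k, oh, ow), dtype=np.result_type(x, weights))
    for ko in range(k):
        for i in range(oh):
            for j in range(ow):
                window = x[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[ko, i, j] = np.sum(window * weights[ko])
    return out
```

Calling `conv2d_gemm(x, weights)` with `x` of shape (C, H, W) and `weights` of shape (K, C, kh, kw) returns an array of shape (K, OH, OW), which can be compared against `conv2d_direct` with `np.allclose`. The appeal of this formulation is that nearly all arithmetic lands in a single matrix multiplication, so one well-tuned GEMM kernel serves every convolution configuration.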
This article is indexed in Wanfang Data (万方数据) and other databases.