High-efficiency data loading and output buffering strategy for sparse convolutional computing
LIU Biao, CHEN Changlin, ZHANG Yufei, LIU Sitong, TANG Liqin, YU Hongqi. High-efficiency data loading and output buffering strategy for sparse convolutional computing[J]. Journal of National University of Defense Technology, 2023, 45(5): 212-221.
Authors: LIU Biao, CHEN Changlin, ZHANG Yufei, LIU Sitong, TANG Liqin, YU Hongqi
Institution: College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
Abstract: To address the problems of inefficient data loading, low utilization of multiply-accumulate resources, and complex output-buffer addressing logic that existing neural network accelerators face when processing sparse neural networks, a high-efficiency data loading and output buffering strategy for sparse convolutional computing is proposed. The strategy performs an all-to-all multiply-accumulate operation on the non-zero input feature map data and the non-zero weights belonging to the same input channel, which reduces the difficulty of pairing non-zero data and improves the utilization of the multiply-accumulate resources. By using input-stationary computation and dense cyclic loading of input feature map data, it significantly reduces the number of off-chip data fetches. It also optimizes the output buffer design, eliminating the output-buffer address contention and storage congestion present in existing solutions. Experimental results show that, compared with a fine-grained systolic accelerator of similar architecture, the processing element area of the proposed architecture is reduced by 21.45%, the data loading speed is increased by 117.71% on average, and the average multiplier utilization is raised by 11.25%, reaching 89%.
Keywords: neural network accelerator  sparse convolutional neural network  input stationary  all-to-all calculation
Foundation items: National Natural Science Foundation of China (61804181, 62074166); National Key Research and Development Program of China (2019YFB2205102)
Received: 2022-06-08
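
To make the all-to-all pairing idea in the abstract concrete, below is a minimal Python/NumPy sketch of Cartesian-product sparse convolution: within each input channel, every non-zero activation is multiplied by every non-zero weight of that channel, and each product is scattered to the output coordinate it contributes to. This is our own functional model for illustration, not the paper's hardware design; the function name, the stride-1/no-padding setting, and the single-output-channel simplification are all assumptions.

```python
import numpy as np

def sparse_conv_all_to_all(x, k):
    """All-to-all sparse convolution for one output channel (illustrative sketch).

    x: input feature maps, shape (C, H, W); k: kernel, shape (C, R, S).
    Stride 1, no padding, so the output is (H-R+1, W-S+1).
    Within each input channel, every non-zero activation is paired with
    every non-zero weight (all-to-all), and each product is scattered to
    its target output coordinate.
    """
    C, H, W = x.shape
    _, R, S = k.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for c in range(C):
        acts = list(zip(*np.nonzero(x[c])))  # coords of non-zero activations
        wts = list(zip(*np.nonzero(k[c])))   # coords of non-zero weights
        for ah, aw in acts:
            for kr, ks in wts:               # all-to-all pairing
                oh, ow = ah - kr, aw - ks    # target output coordinate
                if 0 <= oh < out.shape[0] and 0 <= ow < out.shape[1]:
                    out[oh, ow] += x[c, ah, aw] * k[c, kr, ks]
    return out

# Quick check against a dense sliding-window reference:
rng = np.random.default_rng(0)
x = rng.random((3, 8, 8)) * (rng.random((3, 8, 8)) > 0.7)  # ~70% zeros
k = rng.random((3, 3, 3)) * (rng.random((3, 3, 3)) > 0.5)
ref = np.array([[np.sum(x[:, i:i+3, j:j+3] * k) for j in range(6)]
                for i in range(6)])
assert np.allclose(sparse_conv_all_to_all(x, k), ref)
```

The sketch also shows why the abstract pairs this dataflow with a redesigned output buffer: all-to-all pairing keeps the multipliers busy (any non-zero activation can meet any non-zero weight of the same channel), but the products land at irregular output addresses, which is exactly the address-contention problem the proposed output buffering strategy targets.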