
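A diagnostic method for communication waiting problems in MPI parallel programs and its application
Cite this article: WU Linping, JING Cuiping, LIU Xu, TIAN Hongyun. A diagnostic method for communication waiting problems in MPI parallel programs and its application[J]. Journal of National University of Defense Technology, 2020, 42(2): 47-54
Authors: WU Linping  JING Cuiping  LIU Xu  TIAN Hongyun
Affiliation: Institute of Applied Physics and Computational Mathematics, Beijing
Funding: National Key R&D Program of China, "Global resource provisioning and usage supporting application communities" (2018YFB0204003); Young Scientists Fund project, "Research on a parallel repartitioning algorithm for structured meshes in large-scale numerical simulation applications" (11601034)
Abstract: As the scale of parallelism grows, existing methods for diagnosing communication waiting problems suffer from large memory overhead and long measurement time. Through an in-depth analysis of the existing diagnostic methods, and taking into account the practical requirement that measurement overhead be controllable, a diagnosis model for communication waiting problems based on hotspot functions is established. Based on this model, a more concise and more practical diagnostic method is derived. The method is applied to diagnosing communication waiting problems in large-scale MPI parallel programs, including the two-dimensional LARED integration program, LARED-S, and LAP3D. The results show that the method can precisely locate the key code segments that cause communication waiting, and that the optimization schemes and the estimated room for performance improvement it provides are a useful reference for subsequent program improvement. The LARED-S program optimized according to the diagnostic results achieves a 32% performance improvement and a 44% reduction in communication waiting time.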

Keywords: communication waiting  MPI parallel programs  load balance  performance diagnosis
Received: 2019-09-20
Revised: 2020-01-19

Diagnostic methods for communication waiting in MPI parallel programs and applications
WU Linping,JING Cuiping,LIU Xu,TIAN Hongyun. Diagnostic methods for communication waiting in MPI parallel programs and applications[J]. Journal of National University of Defense Technology, 2020, 42(2): 47-54
Authors:WU Linping  JING Cuiping  LIU Xu  TIAN Hongyun
Affiliation:Institute of Applied Physics and Computational Mathematics, Beijing 100094, China
Abstract:Communication overhead is an important factor restricting the scalability of MPI parallel programs, and "communication waiting time" is an important part of that overhead for large-scale parallel applications. Existing diagnostic methods can identify the root cause of communication waiting, but they incur large measurement cost and memory overhead when run on large-scale parallel systems. Through an in-depth analysis of the existing diagnostic methods, and considering the practical demand for controllable measurement overhead, this paper establishes a diagnosis model based on hotspot functions and presents a concise and practical diagnostic method built on that model. The method has been applied to the LARED integration program, the LARED-S program, and others. The application results show that the method can accurately identify the key code segments leading to communication waiting, and that the proposed optimization solutions and the estimated room for performance improvement are a useful reference for subsequent program improvement. The LARED-S program optimized according to the diagnostic results achieves an 18% performance increase.
Keywords:communication waiting  MPI parallel programs  load balance  performance diagnosis
This article is indexed in CNKI and other databases.