
Identifying causes of execution failure for parallel programs
Cite this article: LIU Yi, GAO Yulin, ZHANG Guozhen. Identifying causes of execution failure for parallel programs[J]. Journal of National University of Defense Technology, 2022, 44(5): 45-52
Authors: LIU Yi, GAO Yulin, ZHANG Guozhen
Affiliation: School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Funding: Project on Overall Technology and Evaluation Technology and System Research (2016YFB0200100)
Received: 2020-11-12

Abstract: As high-performance computing systems grow in scale and complexity, their mean time between failures keeps getting shorter, so parallel programs are increasingly likely to fail at run time because of hardware or software faults in the system. Programming errors (i.e., bugs) in the parallel programs themselves can also cause execution failures. Because these two classes of cause call for entirely different remedies, the user needs to know, when an execution failure occurs, whether it stems from a system fault or from a programming bug. To address this problem, a system for identifying the causes of execution failures of parallel programs was designed and implemented on top of the job management system Slurm. The system retains all the features of Slurm and extends it with the ability to monitor job status and to re-submit and re-run jobs. The class of failure cause is then distinguished according to the results of the job runs. Experiments conducted with fault injection show that the system achieves high identification accuracy.

Keywords: high-performance computing system; Slurm; execution failure; fault detection
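
The abstract states that the system re-runs failed jobs and distinguishes the cause class from the run results, but it does not spell out the decision rule. The following is only a minimal sketch under an assumed rule: a failure that disappears on a rerun is attributed to a system fault, while a failure that repeats on every rerun is attributed to a programming bug. The names `Cause`, `identify_cause`, `run_job`, and `max_reruns` are hypothetical and do not come from the paper, and `run_job` stands in for whatever submits a job to Slurm and waits for its result.

```python
from enum import Enum
from typing import Callable


class Cause(Enum):
    """Possible verdicts of the sketch (labels are illustrative)."""
    NONE = "no failure"
    SYSTEM_FAULT = "suspected system fault"
    PROGRAM_BUG = "suspected programming bug"


def identify_cause(run_job: Callable[[], bool], max_reruns: int = 2) -> Cause:
    """Classify a job failure by re-running the job.

    run_job is a hypothetical callable that submits the job, waits for it
    to finish, and returns True on success.

    Assumed decision rule (not taken from the paper): if any rerun
    succeeds, the original failure was probably caused by the system;
    if every rerun fails, the program itself is the likely culprit.
    """
    if run_job():
        return Cause.NONE
    for _ in range(max_reruns):
        if run_job():
            return Cause.SYSTEM_FAULT
    return Cause.PROGRAM_BUG
```

In practice the reruns would presumably be placed on different nodes so that a persistent hardware fault is not mistaken for a bug; the sketch leaves that choice to `run_job`.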