
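Fault tolerance in HPC scientific workflow application
Citation: LI Yufeng, MO Zeyao, XIAO Yonghao, ZHAO Shicao, DUAN Bowen. Fault tolerance in HPC scientific workflow application[J]. Journal of National University of Defense Technology, 2020, 42(6): 82-89
Authors: LI Yufeng  MO Zeyao  XIAO Yonghao  ZHAO Shicao  DUAN Bowen
Affiliation: Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang, Sichuan 621900, China; Institute of Applied Physics and Computational Mathematics, Beijing 100094, China
Funding: National Key R&D Program of China (2018YFB0703903)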
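Abstract (translated from the Chinese): Scientific workflow technology in supercomputing environments is widely used in scientific research and engineering simulation. Applications such as numerical simulation of complex multi-physics processes and multi-stage data processing often need several application software packages to cooperate, building automatically executed business processes to improve work efficiency. However, scientific workflow applications running in a supercomputing environment face exceptions such as resource failures and task configuration errors, which interrupt workflow execution and severely reduce completion efficiency. Fault tolerance is therefore important for the stable, continuous operation of supercomputing workflow applications. The paper introduces a classification of fault-tolerance designs for scientific workflows and analyzes and reviews the fault-tolerance designs of typical workflow systems. It proposes a decision-tree-based event-condition-action fault-tolerance model, designs a non-intrusive, extensible fault-tolerance architecture, and implements runtime-configurable fault-tolerance strategies for HSWAP, a self-developed scientific workflow application platform deployed in supercomputing environments. In real engineering simulation tasks, the fault-tolerance mechanism built on the proposed model and architecture played an important role in improving workflow execution efficiency.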

Keywords: fault tolerance  scientific workflow  decision tree model  workflow engine
Received: 2019-09-21

Fault tolerance in HPC scientific workflow application
LI Yufeng,MO Zeyao,XIAO Yonghao,ZHAO Shicao,DUAN Bowen. Fault tolerance in HPC scientific workflow application[J]. Journal of National University of Defense Technology, 2020, 42(6): 82-89
Authors:LI Yufeng  MO Zeyao  XIAO Yonghao  ZHAO Shicao  DUAN Bowen
Affiliation:Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang 621900, China;Institute of Applied Physics and Computational Mathematics, Beijing 100094, China
Abstract:Scientific workflow technologies in HPC are extensively applied in scientific research and engineering simulation. Applications such as numerical simulation of complex multi-physics problems and multi-stage data processing need several software packages composed into an automatically executable workflow to increase efficiency. Exceptions such as resource failures and task configuration errors may interrupt workflow execution, so robust and continuous execution is important for workflow applications. A taxonomy of fault tolerance in workflows is presented, and the fault tolerance techniques of some typical workflow systems are reviewed. A decision-tree-based event-condition-action fault tolerance model is proposed, and a non-intrusive, extensible framework is designed and implemented in the authors' HPC scientific workflow system HSWAP. Runtime-configurable error recovery strategies are also implemented in the fault tolerance software module. To validate the new model and framework, the fault tolerance functions were tested in a real engineering simulation project. Results show that fault tolerance plays an important role in increasing workflow execution efficiency.
Keywords:fault tolerance   scientific workflow   decision tree model   workflow engine
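
The abstract names a decision-tree-based event-condition-action model but does not describe its internals. The following is only a minimal sketch of the general idea, not HSWAP's implementation: a failure event enters at the root, each internal node tests one condition on the event, and each leaf names a recovery action. The event fields (`type`, `retries`), the retry limit of 3, and all action names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# Hypothetical event record, e.g. {"type": "resource_failure", "retries": 1}
Event = Dict[str, object]


@dataclass
class Action:
    """Leaf node: the name of the recovery action to take."""
    name: str


@dataclass
class Condition:
    """Internal node: evaluate a predicate on the event, then branch."""
    test: Callable[[Event], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Action, Condition]


def decide(node: Node, event: Event) -> str:
    """Walk the tree from the given node until a leaf (Action) is reached."""
    while isinstance(node, Condition):
        node = node.if_true if node.test(event) else node.if_false
    return node.name


# Hypothetical policy: all thresholds and action names are assumptions.
POLICY: Node = Condition(
    test=lambda e: e["type"] == "resource_failure",
    if_true=Condition(
        test=lambda e: e["retries"] < 3,
        if_true=Action("resubmit_task"),
        if_false=Action("migrate_to_other_resource"),
    ),
    if_false=Condition(
        test=lambda e: e["type"] == "config_error",
        if_true=Action("pause_and_notify_user"),
        if_false=Action("abort_workflow"),
    ),
)

if __name__ == "__main__":
    print(decide(POLICY, {"type": "resource_failure", "retries": 0}))
    print(decide(POLICY, {"type": "config_error"}))
```

Because the policy is plain data rather than logic embedded in the workflow engine, a tree like this could in principle be swapped or reloaded while workflows are running, which is one way to read the abstract's "runtime-configurable" and "non-intrusive" claims.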