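Prediction method of port blocking failure in high performance interconnection networks
Cite this article: XU Jiaqing, HU Xiaotao, YANG Hanzhi, WANG Qiang, ZHANG Lei, TANG Fuqiao. Prediction method of port blocking failure in high performance interconnection networks[J]. Journal of National University of Defense Technology, 2022, 44(5): 1-12
Authors: XU Jiaqing  HU Xiaotao  YANG Hanzhi  WANG Qiang  ZHANG Lei  TANG Fuqiao
Affiliation: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China
Funding: National Key Research and Development Program of China (2018YFB0204300); Foundation of the National Defense Science and Technology Key Laboratory of Parallel and Distributed Processing (6142110180101)
Abstract: As system scale, chip power consumption, and link rates grow, the overall failure rate of high-performance interconnection networks keeps rising and traditional operation and maintenance practices are becoming unsustainable, which poses a major challenge to the overall reliability and availability of high-performance computing systems. For port blocking, a severe class of network failure, a prediction model based on an unsupervised algorithm is proposed. The model mines symptomatic patterns from historical information to form new feature vectors, and applies the K-means clustering algorithm to learn and group those vectors. At prediction time, it combines the current port state with a double exponential smoothing forecast of the future state, and passes the resulting feature vector to K-means to judge in advance whether a blocking failure will occur. Using topology information, separate prediction sub-models are built for leaf switches and root switches, which improves prediction precision. Results show that the model reaches an accuracy of 65.2% while maintaining a recall of 88.2%, and can provide effective assistance to operation and maintenance personnel.

Keywords: interconnection network  failure prediction  machine learning
Received: 2020-11-08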
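
The forecasting step named in the abstract, double exponential smoothing, can be illustrated with a minimal sketch. The abstract does not say which variant the authors use or how they set the parameters, so this example assumes Brown's single-parameter form; the function name `des_forecast`, the smoothing factor `alpha`, and the sample counter values are all hypothetical.

```python
def des_forecast(series, alpha=0.5, horizon=1):
    """Forecast `horizon` steps ahead with Brown's double exponential smoothing.

    Assumption: Brown's variant with one smoothing factor; the paper may
    use a different variant or initialization.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")

    # Both smoothed series start at the first observation.
    s1 = s2 = float(series[0])
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2  # second pass, over s1

    level = 2 * s1 - s2                       # estimated current level
    trend = alpha / (1 - alpha) * (s1 - s2)   # estimated per-step trend
    return level + horizon * trend


# Hypothetical readings of one port status counter over time.
history = [3, 4, 6, 9, 13, 18]
print(des_forecast(history, alpha=0.5, horizon=1))
```

A forecast value like this would stand in for one component of the future-state feature vector; how the paper assembles the full vector from the registers is not stated in the abstract.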

Prediction method of port blocking failure in high performance interconnection networks
XU Jiaqing, HU Xiaotao, YANG Hanzhi, WANG Qiang, ZHANG Lei, TANG Fuqiao. Prediction method of port blocking failure in high performance interconnection networks[J]. Journal of National University of Defense Technology, 2022, 44(5): 1-12
Authors: XU Jiaqing  HU Xiaotao  YANG Hanzhi  WANG Qiang  ZHANG Lei  TANG Fuqiao
Affiliation: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Abstract: As system scale, chip power consumption, and link rate increase, the overall failure rate of high-performance interconnection networks continues to rise, and traditional operation and maintenance methods become difficult to sustain, which poses great challenges to the overall reliability and availability of HPC (high performance computing) systems. An unsupervised prediction model is proposed for port blocking, a serious class of network failure. The model extracts symptomatic patterns from the history of the switch port status registers and forms new feature vectors, which the K-means clustering algorithm learns and groups. At prediction time, the DES (double exponential smoothing) algorithm forecasts the future port state from the current state, and K-means is applied to the resulting feature vector to predict whether a port blocking failure will occur. Topology information is used to build independent prediction sub-models for ToR switch ports and Spine switch ports, which further improves prediction accuracy. Experimental results show that the model reaches an accuracy of 65.2% while maintaining a recall of 88.2%. It can provide effective early warning and guidance for operation and maintenance personnel in a real system.
Keywords:interconnection network   failure prediction   machine learning
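
The clustering step described in the abstract can be sketched in a few lines of plain Python: fit K-means centroids on historical feature vectors, then assign a newly forecast vector to its nearest centroid. This is only an illustration under stated assumptions. The two-dimensional feature vectors, the choice of `k=2`, and the helper names `kmeans` and `nearest` are hypothetical, and the abstract does not describe how the authors decide which cluster signals an impending blocking failure.

```python
import math
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns the list of k centroids."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical historical feature vectors (not data from the paper).
history = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
           (5.0, 4.8), (5.2, 5.1), (4.9, 5.0)]
centroids = kmeans(history, k=2)

# A hypothetical forecast vector is assigned to the nearest cluster.
forecast_vector = (4.7, 5.3)
print(nearest(forecast_vector, centroids))
```

In a setting like the one the abstract describes, one such model would be fitted per switch tier (ToR and Spine), so that each tier's ports are compared only against centroids learned from the same tier.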