Causal Graph Learning for Interpretable Fault Identification in Complex Backend Systems

Mahsa Panahandeh

Abstract

This paper proposes a Causal Graph Learning–based Fault Identification (CGL-BFI) method to address three challenges in backend systems: tightly coupled multidimensional metrics, dynamically changing structural dependencies, and fault propagation paths that are difficult to interpret. Built on multi-source monitoring data, the method identifies latent dependencies and explains anomaly propagation in three stages: feature causal encoding, structural inference modeling, and causal effect quantification. In the feature causal encoding stage, the model extracts and normalizes temporal features of monitoring indicators to construct a multivariate causal candidate space. In the structural inference stage, a differentiable Directed Acyclic Graph (DAG) constraint is introduced to learn directional dependencies among variables, which keeps the learned causal structure free of cycles and therefore logically consistent. In the causal effect quantification stage, the method combines Average Treatment Effect (ATE) estimation with causal scoring to quantify the influence strength of each metric node, which allows root causes and propagation paths to be localized. Experimental results show that the proposed method remains stable and accurate under complex network disturbances and in high-noise environments, outperforming the comparison models on AUC, F1-score, Recall, and Precision. By introducing causal structure learning, this study moves backend system anomaly detection from correlation-based analysis toward mechanism-level interpretable reasoning. This improves the accuracy, interpretability, and robustness of fault identification in complex systems and provides a theoretical and methodological foundation for intelligent, adaptive system operation and maintenance.
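The abstract does not say which differentiable DAG constraint or which ATE estimator the method uses. As a minimal sketch only, the Python below assumes a trace-exponential style acyclicity penalty (the form popularized by NOTEARS, truncated here at order d) and a naive difference-in-means ATE that is valid only if treatment is unconfounded. The function names, the three-metric graph (cpu, latency, errors), and all numbers are hypothetical illustrations, not taken from the paper.

```python
import math
import numpy as np


def acyclicity_penalty(W):
    """Truncated power series of tr(exp(W * W)) - d.

    W[i, j] is the weight of the edge i -> j. Squaring elementwise makes
    every entry nonnegative, so tr(A^k) > 0 exactly when the graph has a
    closed walk of length k. Orders 1..d are enough to detect any cycle in
    a d-node graph, so the result is 0 if and only if W is acyclic.
    """
    W = np.asarray(W, dtype=float)
    d = W.shape[0]
    A = W * W
    power = np.eye(d)
    h = 0.0
    for k in range(1, d + 1):
        power = power @ A
        h += np.trace(power) / math.factorial(k)
    return h


def naive_ate(outcome, treated):
    """Difference in mean outcome between treated and untreated samples.

    Assumption: treatment assignment is unconfounded; otherwise this is
    only an association, not a causal effect.
    """
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    return outcome[treated].mean() - outcome[~treated].mean()


# Hypothetical chain cpu -> latency -> errors (acyclic).
W_dag = np.array([[0.0, 0.8, 0.0],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])

# Adding errors -> cpu closes a loop, so the penalty becomes positive.
W_cyclic = W_dag.copy()
W_cyclic[2, 0] = 0.3

h_dag = acyclicity_penalty(W_dag)
h_cyclic = acyclicity_penalty(W_cyclic)

# Hypothetical latencies (ms) with and without a CPU-saturation "treatment".
ate = naive_ate([120.0, 130.0, 80.0, 90.0], [1, 1, 0, 0])
```

In a structure-learning loop, a penalty of this kind would typically be added to the data-fit loss and driven toward zero, so that the optimizer can search over continuous edge weights while still converging to a graph without cycles.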
