Research on Hierarchical Multi-Agent Reinforcement Learning Resource Orchestration for Large-Scale Heterogeneous Distributed Clusters with Dynamic Load Awareness
Abstract
To address the complex resource states, random task arrivals, large differences in node capabilities, and coupled scheduling objectives found in large-scale heterogeneous distributed clusters, a hierarchical multi-agent reinforcement learning resource orchestration method with dynamic load awareness is proposed. The method first extracts time-varying load information from machine resource occupancy, task queuing status, service deployment density, and node capacity constraints, and combines it with cluster topology to build a state representation with global context awareness, strengthening its ability to characterize complex operating environments. On this basis, a hierarchical collaborative decision-making mechanism divides resource orchestration into global coordination by an upper-level manager and local control by lower-level executors, with sub-goal propagation linking global planning to node-level execution. To handle the matching between heterogeneous nodes and task requirements, a compatibility-aware scoring mechanism is further introduced to improve the rationality of resource allocation, execution stability, and orchestration accuracy. The method balances resource utilization, task waiting control, load balancing, and scheduling success rate within a unified framework, thereby improving the overall quality of resource organization in complex cluster environments. Comparative experiments show that the proposed method performs well across multiple key evaluation metrics, indicating that it adapts effectively to dynamically loaded, large-scale heterogeneous distributed cluster resource orchestration scenarios and has strong practical value.
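The manager/executor split and the compatibility-aware scoring described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring formula, the greedy manager (standing in for the learned upper-level policy), and all names (`Node`, `Task`, `compatibility_score`, `manager_assign`, `executor_run`) are assumptions introduced for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Node:
    cpu_free: float   # free CPU cores
    mem_free: float   # free memory (GB)
    queue_len: int    # tasks already waiting on this node
    capacity: float   # relative node capability (models heterogeneity)

@dataclass
class Task:
    cpu_req: float
    mem_req: float

def compatibility_score(task: Task, node: Node) -> float:
    """Hypothetical compatibility-aware score: 0 if the node cannot host
    the task; otherwise higher for nodes with more resource headroom,
    higher capability, and shorter queues."""
    if node.cpu_free < task.cpu_req or node.mem_free < task.mem_req:
        return 0.0
    headroom = ((node.cpu_free - task.cpu_req) / max(node.cpu_free, 1e-9)
                + (node.mem_free - task.mem_req) / max(node.mem_free, 1e-9))
    return node.capacity * headroom / (1 + node.queue_len)

def manager_assign(task: Task, nodes: list[Node]) -> int:
    """Upper-level manager: propagate the task as a sub-goal to the
    executor of the best-scoring compatible node (greedy placeholder
    for a learned coordination policy). Returns -1 if no node fits."""
    scores = [compatibility_score(task, n) for n in nodes]
    best = max(range(len(nodes)), key=scores.__getitem__)
    return best if scores[best] > 0.0 else -1

def executor_run(task: Task, node: Node) -> None:
    """Lower-level executor: local admission on the chosen node."""
    node.cpu_free -= task.cpu_req
    node.mem_free -= task.mem_req
    node.queue_len += 1

nodes = [
    Node(cpu_free=8, mem_free=32, queue_len=0, capacity=1.0),
    Node(cpu_free=2, mem_free=4,  queue_len=3, capacity=0.5),
]
task = Task(cpu_req=4, mem_req=8)
chosen = manager_assign(task, nodes)
if chosen >= 0:
    executor_run(task, nodes[chosen])
print(chosen)  # the larger, idle node wins under this scoring
```

Under these assumptions, the hard feasibility check (zero score for incapable nodes) rules out infeasible placements, while the headroom and queue terms steer the manager toward load balancing and wait-time control; in the full method both levels would be learned policies rather than fixed rules.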