Cross-Domain Optimization of Large Language Models via Data-Centric Representation Alignment with Self-Supervised Learning

Rocco Hayes
Alaric Fenwick

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, including natural language understanding, generation, and reasoning. Their performance, however, is often constrained by data heterogeneity, distribution shifts, and inefficient utilization of large-scale corpora, particularly in real-world settings where data sources are diverse and continuously evolving. These limitations can lead to unstable training dynamics and reduced generalization when models are transferred to unseen domains. To address these challenges, this paper proposes a data-centric optimization framework that integrates adaptive representation alignment with self-supervised auxiliary tasks to improve the generalization and stability of LLM training. The framework improves the quality and consistency of learned representations by aligning feature distributions across domains while leveraging unlabeled data through self-supervised learning objectives. This dual strategy enables the model to capture both domain-invariant features and task-relevant semantics, improving robustness under distributional variation. The proposed method jointly optimizes task-specific objectives and distribution-alignment constraints within a unified multi-objective learning paradigm, coordinating performance optimization with cross-domain consistency. In addition, an adaptive weighting mechanism dynamically balances the multiple learning objectives during training, improving convergence efficiency and mitigating optimization conflicts. Extensive experiments on multiple benchmark datasets demonstrate that the proposed approach achieves improved accuracy, reduced divergence across data distributions, and better training efficiency than conventional fine-tuning strategies. These results highlight the effectiveness of the framework for scalable, robust, and generalizable LLM training in heterogeneous data environments.
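The abstract describes a multi-objective loss combining a task objective, a distribution-alignment term, and a self-supervised term, balanced by an adaptive weighting mechanism. The paper's abstract does not specify that mechanism, so the following is only a minimal sketch under one common assumption: weights are computed by a softmax over each objective's relative loss ratio, so that slower-converging objectives receive more weight. All function names here are hypothetical illustrations, not the authors' implementation.

```python
import math

def adaptive_weights(current_losses, initial_losses, temperature=1.0):
    """Hypothetical adaptive weighting for multiple training objectives.

    Each objective's progress is measured by the ratio of its current
    loss to its initial loss; objectives that have decreased least
    (slower-converging ones) get larger weights via a softmax.
    """
    ratios = [c / i for c, i in zip(current_losses, initial_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(losses, weights):
    """Weighted sum of task, alignment, and self-supervised losses."""
    return sum(w * l for w, l in zip(weights, losses))

# Illustrative use: three objectives (task, alignment, self-supervised),
# where the alignment loss (index 1) has decreased the least so far.
weights = adaptive_weights(
    current_losses=[0.2, 0.8, 0.5],
    initial_losses=[1.0, 1.0, 1.0],
)
total = combined_loss([0.2, 0.8, 0.5], weights)
```

In this sketch the alignment objective, having made the least relative progress, receives the largest weight; the temperature controls how sharply the weighting concentrates on lagging objectives.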
