Improving Model Training Efficiency through Data Harmonization in Autonomous Driving

Dustin Hargrove
Kaushik Bhattacharya

Abstract

Autonomous driving systems rely on large volumes of sensor data collected from distributed vehicle fleets. In practice, inconsistencies in data formats, calibration standards, and labeling protocols complicate centralized model training and system validation. This study analyzes multi-modal datasets collected from 1,180 test vehicles operating in three metropolitan areas over 14 months, examining differences in LiDAR resolution, camera exposure profiles, and GPS correction methods. Nearly 21% of collected samples required manual or semi-automatic correction before they could be integrated into training pipelines. A data harmonization framework was developed to normalize sensor metadata and align labeling standards across platforms. After deployment, model retraining cycles were shortened by approximately 26%, and cross-region performance variance was reduced. These results indicate that data governance mechanisms are as critical as perception algorithms in large-scale autonomous driving systems.
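
To make the idea of normalizing sensor metadata and aligning labels concrete, the following is a minimal Python sketch and not the framework described in the study. Every name in it is a hypothetical assumption introduced for illustration: the field names (`lidar_channels`, `lidar_beams`, `gps_correction`, `labels`), the label alias table, and the set of accepted GPS correction methods. It maps platform-specific records onto one schema and flags any record that cannot be harmonized automatically, so that it can be routed to manual or semi-automatic correction.

```python
from dataclasses import dataclass

# Hypothetical alias table: platform-specific label names -> shared taxonomy.
LABEL_ALIASES = {
    "car": "vehicle",
    "auto": "vehicle",
    "vehicle": "vehicle",
    "ped": "pedestrian",
    "person": "pedestrian",
    "pedestrian": "pedestrian",
    "bike": "cyclist",
    "cyclist": "cyclist",
}

# Hypothetical set of GPS correction methods accepted without review.
KNOWN_GPS_CORRECTIONS = {"rtk", "ppp", "dgps"}


@dataclass
class HarmonizedSample:
    vehicle_id: str
    lidar_channels: int
    gps_correction: str
    labels: list
    needs_review: bool


def harmonize(raw: dict) -> HarmonizedSample:
    """Map one raw record (assumed to be a dict) onto the shared schema."""
    needs_review = False

    # Platforms may report LiDAR resolution under different keys (assumed names).
    channels = raw.get("lidar_channels", raw.get("lidar_beams"))
    if channels is None:
        channels = 0
        needs_review = True

    # Normalize the case and whitespace of the GPS correction method.
    gps = str(raw.get("gps_correction", "")).strip().lower()
    if gps not in KNOWN_GPS_CORRECTIONS:
        needs_review = True

    # Translate labels to the shared taxonomy; keep unknown labels unchanged
    # and flag the record for review.
    labels = []
    for name in raw.get("labels", []):
        canonical = LABEL_ALIASES.get(name.strip().lower())
        if canonical is None:
            needs_review = True
            labels.append(name)
        else:
            labels.append(canonical)

    return HarmonizedSample(
        vehicle_id=str(raw.get("vehicle_id", "")),
        lidar_channels=int(channels),
        gps_correction=gps,
        labels=labels,
        needs_review=needs_review,
    )


def review_fraction(samples: list) -> float:
    """Share of harmonized samples flagged for manual or semi-automatic review."""
    if not samples:
        return 0.0
    return sum(s.needs_review for s in samples) / len(samples)
```

Under these assumptions, calling `review_fraction` on a batch of harmonized records gives a quantity analogous to the share of samples that required correction before integration. A real pipeline would also need to handle calibration data and camera exposure profiles, which this sketch omits.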
