System-Level Bottlenecks in Edge-Based Video Analytics Workloads
Abstract
AI acceleration in edge data centers is often evaluated in terms of model throughput alone. Yet practical deployments reveal that inference accounts for only a fraction of end-to-end latency: preprocessing, video decoding, storage I/O, and metadata synchronization introduce substantial overhead. We instrumented a face-recognition service pipeline deployed across a three-node edge cluster and analyzed resource utilization at millisecond granularity. As GPU inference latency decreased through hardware acceleration, network traffic and SSD queue depth emerged as the primary limiting factors, and this imbalance produced diminishing returns from further accelerator scaling. We introduce a redesigned storage-aware scheduling scheme that co-locates preprocessing and inference tasks while limiting redundant frame transfers. The resulting system achieved more stable throughput under burst workloads. These findings reinforce the view that accelerator-centric evaluation overlooks infrastructure coupling effects.
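The co-location idea underlying the storage-aware scheme can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name, node labels, and data structures are assumptions, and the sketch reduces the policy to its core placement rule, i.e. run preprocessing and inference on the node that already holds a frame, and fall back to the least-loaded node (accepting one transfer) only when no node has it.

```python
from collections import defaultdict

class StorageAwareScheduler:
    """Illustrative sketch (hypothetical API, not the paper's code):
    place tasks on the node already holding the frame to avoid
    redundant frame transfers across the edge cluster."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.load = defaultdict(int)   # outstanding tasks per node
        self.frame_location = {}       # frame_id -> node holding the frame

    def ingest(self, frame_id, node):
        # Record where a frame first landed (e.g. its capture/decode node).
        self.frame_location[frame_id] = node

    def schedule(self, frame_id):
        # Prefer the node that already has the frame: no network transfer.
        node = self.frame_location.get(frame_id)
        if node is None:
            # Unknown frame: pick the least-loaded node and transfer once.
            node = min(self.nodes, key=lambda n: self.load[n])
            self.frame_location[frame_id] = node
        self.load[node] += 1
        return node

sched = StorageAwareScheduler(["edge-0", "edge-1", "edge-2"])
sched.ingest("frame-42", "edge-1")
placed = sched.schedule("frame-42")   # co-located on "edge-1", no transfer
```

A real scheduler would also bound per-node load and evict location entries as frames age out of the SSD cache; the sketch shows only the locality-first placement decision.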