Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics
Many applications must ingest rapid streams of data and produce analytics results in near-real-time. Whether the input streams represent sensor data from smart homes, user interaction logs from streaming video clients, or server logs from a content delivery network (CDN), it is common for such streams to originate from geographically distributed sources. The typical infrastructure for processing these geo-distributed streams follows a hub-and-spoke model, where several edge resources perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Due to limited WAN bandwidth, it is not always possible to produce exact results in near-real-time. When this is the case, applications must either sacrifice timeliness by allowing delayed---and in turn stale---results, or sacrifice accuracy by allowing some error in final results. In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between the key metrics of staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy in the face of bandwidth limitations. Using a workload derived from a web analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through a trace-driven simulation. Our results show that our proposed algorithms outperform several baseline algorithms for a range of error and staleness bounds, for a variety of aggregation functions under different network bandwidth constraints.