Deterministic Time-Series Joins for Asynchronous High-Throughput Data Streams
Christoph Schranz, Peter Michael Jeremias (2020): Deterministic Time-Series Joins for Asynchronous High-Throughput Data Streams In: 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)
A variety of data stream problems that affect two or more data streams rely on joining them based on a common or similar timing attribute. With the advent of stream processing frameworks like Apache Spark and Apache Flink within the last years, processing of streamed data has become much easier. Repeated processing of relatively small data batches in so-called windows increases flexibility with respect to implementation and task distribution across multiple nodes. Using event times instead of ingestion times avoids, among other problems, incorrect joins. However, in this work we argue that batch-processing leads to a significant trade-off between increased computational complexity and latency of the resulting join pairs. A concept for time-series joins of streaming data is presented. This concept, which is built upon a resilient data stream framework, minimizes both the computational costs and latency times. It uses the guarantees associated with this underlying framework to join the data records deterministically according to event times instead of processing times. This work represents a work-in-progress paper, as detailed benchmarks are pending.