Over the last few decades we have seen an explosion in the volume of data generated at companies, governments, and even private households. Data is so readily available that analyzing it becomes a real challenge. The data volume is often too large for ordinary machines to process in a timely fashion, and although supercomputers could do it, they are simply too costly to use. To address this, new methods have emerged for cost-effectively processing large amounts of data on arrays of interconnected commodity hardware machines. This idea was first demonstrated by the MapReduce model developed by Google in the early 2000s. Only a couple of years later, in 2006, Yahoo replicated the MapReduce paradigm and released it to the open-source world as Apache Hadoop. In the 2010s, next-generation systems like Apache Spark and Apache Flink evolved both the programming model and the capabilities of the execution engines.

The Path from Batch Processing to Stream Processing

The first step in the evolution of processing large amounts of data on commodity hardware was batch processing, which is widely considered the starting point of the “big data” trend.

Batch processing

Batch processing originates from the early computer age, when a terminal could only be used by one person at a time. To let multiple users share the underlying computing resources, users submitted jobs, which were stored in a job queue. Batch processing meant running several of these jobs as a batch, typically sequentially, although later systems could also run multiple jobs at once. Much of this coordination was later taken over by operating systems, which allowed multiple programs to share the machine's resources.

Despite the evolution towards large-scale data processing, some of the most basic assumptions remained the same. The input to a processing job is finite (bounded) data, e.g. a file or the result of a database query. Because the input is bounded, the processing eventually finishes (if, for once, we disregard Alan Turing's halting problem). Conceptually, this is still similar to the early batch processing systems: we run a set of pre-programmed jobs and then return the output to the user.

Distributed Batch Processing

The distributed batch processing evolution is largely in the execution layer. Instead of running the processing on a single machine, we partition the input such that it can be processed in parallel by multiple machines. This approach to speeding up processing is referred to as horizontal scaling, as opposed to vertical scaling, where one would increase the processing power of the individual machines. Horizontal scaling can be an effective, fault-tolerant, and cost-efficient way to process data.

The Case for Stream Processing

While distributed batch processing is a leap forward, it is still a static, one-off process. The data must already have been produced before processing can start. If new data arrives during processing, it cannot be considered, because a fundamental assumption is that all input is available when processing starts. Parts of the processing, such as sorting, only work correctly when the entire input is available.

You might ask: why not simply run this process more frequently, e.g. every day, every hour, or every 30 minutes? But what if the processing itself is so involved that it takes several hours? Is it worth recomputing the entire result every time?

Imagine you want to maintain a live counter of the number of people who visited particular pages on your website. We could process the web server logs to compute these counters. To keep an hourly counter for the current day up to date, a batch processing job would need to read through all of the day's logs, and that process would have to be repeated as often as you want the counter refreshed. One way to mitigate this is to cache the already computed results for the day, even if that slightly complicates the process. But there might be an even better solution.

Stream processing lends itself naturally to this problem: we can keep the counter for each page in memory and update it as new visitors arrive. As we update the live counter, we never need to reprocess past events, as we would with batch processing. It is important to note that stream processing raises issues of its own, in particular how to handle late or out-of-order events. We will cover these issues later. For now, let's figure out when stream processing is a good fit.
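As a sketch of this idea, the live counter can be a plain in-memory map that is updated once per incoming event. All names here are illustrative, not any particular framework's API:

```python
# A minimal sketch of the live page-visit counter, assuming each event is
# a (page, timestamp) tuple arriving on an iterator.
from collections import defaultdict

def count_visits(events):
    """Incrementally update in-memory counters; no past events are reprocessed."""
    counters = defaultdict(int)
    for page, _ts in events:
        counters[page] += 1            # one cheap update per incoming event
        yield page, counters[page]     # emit the fresh count immediately

# Usage: after each event the emitted count is already up to date.
stream = [("/home", 1), ("/about", 2), ("/home", 3)]
results = list(count_visits(stream))
assert results[-1] == ("/home", 2)
```

Contrast this with the batch approach, which would re-read the full day's logs on every run.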

Here are some general criteria when evaluating whether to use stream processing:

  1. The application needs to provide real-time results or decision making.

  2. The application logic requires fresh data and long-lived state.

  3. Incoming data needs to be pre-aggregated or reduced before it is stored.

  4. The data consists of events which may arrive out of order.

Enter Stream Processing

Stream processing is designed to continuously process data and provide low-latency results. Unlike batch processing, it is always-on, allowing new results to be emitted at any desired time or in predefined intervals.

Stateful Stream Processing

The most challenging aspect of stream processing is its state, which needs to be persisted across failures and application updates. If we didn't persist the state, we would need to re-process all data up to the point of failure. To prevent this, stateful streaming applications hold and persist state, similarly to a database. But unlike in a database, read and write operations are fast because the state resides in process memory (with the option to offload to disk to avoid running out of memory).

State is checkpointed at regular intervals, which means we never have to re-process data from before the last completed checkpoint. Checkpointing involves writing the application state to external storage. In the event of a failure, the state is recovered from this storage and processing resumes from the point at which the last checkpoint was taken.
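A toy sketch of the checkpoint/restore cycle, assuming the application state is a plain dict and the "external storage" is a local pickle file. Real engines use durable, replicated stores; the file name and state keys here are made up:

```python
# Checkpointing writes state out; restoring reads it back after a failure.
import os
import pickle
import tempfile

def checkpoint(state, path):
    """Write the full application state to external storage."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore(path):
    """Recover the state written by the last completed checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
state = {"count": 42, "offset": 1337}   # counters plus our position in the input
checkpoint(state, path)
# After a failure we restore and resume reading from state["offset"]:
assert restore(path) == state
```

Note that storing the input position alongside the counters is what lets processing resume without re-reading everything from the beginning.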

Is Batch Processing a subset of Stream Processing?

Some argue that batch processing is just a special case of stream processing. In practice, however, batch processing implemented on top of the broader stream processing paradigm usually lacks batch-specific optimizations, such as operating on larger batches for higher throughput, efficient sorting algorithms that can spill to disk, or intelligent recovery using intermediate results. Furthermore, the application logic tends to differ: streaming use cases are defined in terms of groups of events in time and cannot perform certain batch operations like scanning through the entire dataset. Streams are by nature unbounded and continuous, which enforces a different programming model.

Batch and stream processing are two different approaches to data processing. Neither is inherently superior in terms of processing semantics; which one to use depends on how time-critical your data processing needs are.

Use cases

Many services are based on historic data and do not factor in recent data. With stream processing, we can provide real-time insights based on recent events generated by the user or the environment. The following examples illustrate this:

Monitoring & Observability

Nowadays almost every machine generates data about its condition, e.g. maintenance cycles, production speed, temperature, etc. If such data can be aggregated in real time, a broken or malfunctioning machine can be shut off or replaced before any damage occurs.

The same applies to any kind of (software) deployments. Stream processing can generate real-time alerts in case the application metrics are not within their desired bounds.

Fraud detection

For every bank customer, we want to instantly decide whether a login attempt, credit card charge, or wire transfer is legitimate. This can be done by analyzing the stream of events for a particular user: by combining recent events with summarized past behavior, we can score each action as it happens.

Taxi ride pricing

Fares for ride services like Lyft or Uber are often calculated dynamically. We can predict the price of a ride more accurately using traffic information, the number of available drivers, and how many users are requesting the service. Ideally, we want to do this as close to real time as possible.

Recommendation systems

We can provide better recommendations based on real-time data, for example recommending new songs based on the songs a user has just played. Users expect their recommendations to change based on recently played or liked songs.

Stream processing programming model

The Dataflow paper pointed out that stream processing can be viewed as merely a concern of the execution engine: the programming model exposed to the user can be designed independently of how it is executed. However, that is somewhat of a simplification. Many stream processors, such as Apache Spark or Apache Flink, have different programming interfaces for batch and streaming. While it is possible to have a unified API like Dataflow's, some streaming concepts are simply not useful in batch execution mode, even if they do not break batch processing semantics. It is worth going through the most important streaming-specific concepts below.


Windows

In batch, we can scan and crunch through all available data, which makes us very flexible with respect to how we aggregate it. In streaming, this is different because we are never guaranteed to see all available data. This is where windows come into play.

A window marks a time span with a start and an end timestamp. In streaming, data elements (events) carry a timestamp and can be assigned to a window based on it. For example, a 5-minute window could be [2:00pm, 2:05pm). Note that the square bracket means inclusive, while the round parenthesis means exclusive.

Windows can be tumbling or sliding.

Tumbling window

Tumbling means that the next window begins directly after the end of the previous one. For example, 5-minute tumbling windows:

... [2:00pm, 2:05pm) [2:05pm, 2:10pm) [2:10pm, 2:15pm) ...
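Assigning an event to its tumbling window is a simple floor operation. A minimal sketch, assuming integer timestamps (e.g. minutes past some origin) and a window size in the same unit; the function name is illustrative:

```python
# Floor the timestamp to its tumbling-window boundary.
def tumbling_window(ts, size=5):
    start = ts - (ts % size)       # floor to a multiple of the window size
    return (start, start + size)   # half-open interval [start, end)

# A timestamp of 3 ("2:03pm" as minutes past 2:00pm) lands in [2:00pm, 2:05pm):
assert tumbling_window(3) == (0, 5)
assert tumbling_window(7) == (5, 10)
```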

Sliding window

Sliding means that windows of a fixed length X begin every Y interval, so consecutive windows overlap. For example, 5-minute windows sliding every 1 minute:

... [2:00pm, 2:05pm) [2:01pm, 2:06pm) [2:02pm, 2:07pm) ...
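Under sliding windows, each event belongs to size/slide windows at once. A minimal sketch, again with integer timestamps and illustrative names (window starts may go negative near the origin in this toy version):

```python
# All sliding windows of length `size`, starting every `slide` units,
# that contain the timestamp `ts`.
def sliding_windows(ts, size=5, slide=1):
    last_start = ts - (ts % slide)                  # latest window starting at or before ts
    starts = range(last_start, ts - size, -slide)   # walk back while the window still covers ts
    return [(s, s + size) for s in starts]

# With size=5 and slide=1, every timestamp belongs to five overlapping windows:
assert (0, 5) in sliding_windows(2)
```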


Time

In stream processing, time does not strictly advance linearly as we would assume from a regular clock. There are two fundamentally different time schemes:

  1. Processing time: the wall-clock time of the machine that processes the events.
  2. Event time: a time associated with and derived from the processed events themselves.
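The distinction can be made concrete with a toy event, assuming events carry their own timestamp (set at the source) while processing time comes from the machine's wall clock; the event contents are made up:

```python
import time

event = {"page": "/home", "event_time": 1_000_000_000.0}  # assigned at the source

processing_time = time.time()        # when we happen to handle the event
event_time = event["event_time"]     # when the event actually occurred

# The two can differ arbitrarily, e.g. for events replayed from an old log:
lag = processing_time - event_time   # large and positive for "old" events
```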


Watermarks

Watermarks are used in conjunction with event time. Low watermarks are special timestamps which indicate the current minimum event time; similarly, high watermarks indicate the maximum event time seen so far. Watermarks are generated by a function which receives the event timestamps of the incoming events as input. A watermark function can be as simple as taking the latest observed timestamp. However, time must advance monotonically, i.e. we must never go back in time. Watermark functions may use custom logic to decide when it is safe to advance time. For example, a function could use a fixed offset from the high watermark as the low watermark, which tolerates some out-of-orderness in the arriving event timestamps. Let's see why this is important below.
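The fixed-offset strategy described above can be sketched as follows. This is a toy generator, not any engine's actual API:

```python
# A bounded-out-of-orderness watermark generator: the low watermark trails
# the highest observed event time by a fixed offset, and only moves forward.
class BoundedWatermark:
    def __init__(self, max_out_of_orderness):
        self.offset = max_out_of_orderness
        self.high = float("-inf")        # high watermark: max event time seen

    def on_event(self, event_time):
        self.high = max(self.high, event_time)   # time never moves backwards

    def current(self):
        return self.high - self.offset           # low watermark

wm = BoundedWatermark(max_out_of_orderness=2)
for ts in [10, 13, 11]:       # 11 arrives out of order but within the offset
    wm.on_event(ts)
assert wm.current() == 11     # 13 - 2; the out-of-order 11 is still covered
```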

Late or out of order events

Events are considered “late” when their timestamp lies before the current event time as determined by the latest emitted low watermark. Out-of-orderness is a common cause of late data: when events do not arrive in their expected order, a watermark may be emitted prematurely, advancing the event time past the timestamps of data that is still incoming.
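Given the low watermark, late-event detection itself is a simple comparison. A minimal sketch with an illustrative function name:

```python
# An event is late when its timestamp is behind the last emitted low watermark.
def is_late(event_time, low_watermark):
    return event_time < low_watermark

assert is_late(9, low_watermark=11)        # time already advanced past 9
assert not is_late(12, low_watermark=11)   # 12 is still on time
```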

Fault-tolerance & State

State is one of the most interesting and hardest parts of stream processing. Most applications have state of some sort: remembering when a user last logged on requires state, as does storing pending records within a system, or keeping track of the position in a file or log we are reading from. Whenever state is present, it has implications for the fault tolerance of the system. Stateless applications can simply be restarted after a failure. However, if we do have state, we need to re-initialize it after a failure in a way that keeps the processing semantics unchanged.

When do failures occur? Failures can happen due to hardware failures, network failures, external systems failing, application errors, malformed data, etc.

Stream processors must periodically externalize their state in order to recover it after a failure. This is done by writing the state to an external data store. Doing so in a way that leaves the processing semantics unchanged when restoring persisted state is not a trivial problem. Typically, we distinguish between three processing semantics (from least to most strict):

  1. at most once

    If a system guarantees at most once, every event is processed once or not at all. This is the weakest guarantee because events can be lost entirely when they are not processed.

  2. at least once

    If a system guarantees at least once, every record is processed one or more times. This requires some form of acknowledgment or checkpointing to persist the stream position. After restoring from a checkpoint, data that was already processed before the failure will be read again, which can lead to duplicate processing.

  3. exactly once

    Exactly once is the strongest but also the most difficult semantic to guarantee. It is especially difficult when writing to external systems that do not themselves support exactly-once semantics; we need some form of transactional support in the external system to ensure that a result is emitted only once.
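To see how at-least-once replay can duplicate work, and how deduplicating on an event id can restore effectively-once results, consider this toy sketch (the event ids and replay scenario are made up):

```python
# After a failure, at-least-once delivery re-sends already-processed events.
def process(events, seen=None):
    seen = set() if seen is None else seen
    count = 0
    for event_id in events:
        if event_id in seen:      # duplicate delivered by the replay
            continue
        seen.add(event_id)        # the seen-set is part of our persisted state
        count += 1
    return count, seen

# First attempt processes e1 and e2, then fails; the replay re-delivers them:
count1, seen = process(["e1", "e2"])
count2, _ = process(["e1", "e2", "e3"], seen)   # at-least-once delivery
assert count1 + count2 == 3                     # e1/e2 are not double-counted
```

Note that the deduplication set must itself survive failures (e.g. via checkpointing), otherwise the guarantee is lost.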

Stream processors

After describing the most important concepts in stream processing, it may be worth introducing some of the common stream processing engines. Here is a selection of open-source stream processing engines:

Apache Flink

Apache Flink is the de facto standard for stream processing in the open-source world. It fully supports stream processing as described in this article and is suitable for large-scale deployments with hundreds of processing nodes. Flink can be described as the Swiss army knife of stream processors. It comes with a wide range of connectors, bundles state backends for stateful applications whose memory demands exceed main memory and require spilling to disk, and has its own scheduler which integrates with Kubernetes and a number of other cluster management solutions.

Kafka Streams

Apache Kafka is often used together with Flink as a message queue and storage layer for events. Kafka also comes with a stream processing library called Kafka Streams, which lets you write stream processing applications without a dedicated runtime like Flink. Kafka Streams leverages the Kafka storage layer to shuffle and persist data, which doesn't always make it the most performant way to run stream processing pipelines. However, its operational simplicity is a huge advantage if you already run a Kafka cluster. Note that the Kafka cluster itself may become a bottleneck, which can also be operationally challenging.

Apache Spark

Apache Spark builds its stream processing around its core abstraction: RDDs (Resilient Distributed Datasets). RDDs are sets of data which can be processed in a fault-tolerant way. Because this fault tolerance incurs some overhead, the data is split into sufficiently large chunks. For stream processing, the chunk size is reduced further to achieve lower latencies (micro-batching). However, this means throughput is not as good as in systems like Flink, which stream individual events and use a less granular method (checkpointing) to ensure fault tolerance.

Apache Storm

Apache Storm is a legacy engine originally developed by Nathan Marz and one of the first open-source stream processing solutions. It performs per-event acknowledgments for fault tolerance, which comes with a big performance penalty. Due to its clunky API and slow execution engine, it is rarely used today.

Apache Beam / Google Cloud Dataflow

Google has its own stream processing product called Google Cloud Dataflow, with Apache Beam as its programming library. The programming model is very similar to Flink's, but the underlying execution engine is quite different. The most important difference is Dataflow's externally persisted state, which makes it much more elastic than Flink: in Dataflow it is possible to add or remove workers at runtime, whereas in Flink this would require an application restart.


Conclusion

We have outlined why stream processing is important and when it is a good fit, learned about the core concepts and problems in stream processing, and introduced some of the stream processing engines available.

There are many more things to learn about stream processing. I would encourage you to read more about stream processing in one of the following books:

  • Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri
  • Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax
  • Designing Data-Intensive Applications by Martin Kleppmann

This might only be the beginning of your stream processing journey.