Maximilian Michels

Why Stream Processing?

Sat, 26 Nov 2022 12:40:05 +0100

Over the last decades we have seen an explosion in the data volume generated at companies, governments, and even private households. Data is so readily available that analyzing it becomes a real challenge. The data volume is often too large for ordinary machines to process in a timely fashion, and although supercomputers could do it, it is simply too costly to use them. To address this, new methods have emerged for cost-effectively processing large amounts of data using arrays of interconnected, commodity hardware machines. This idea was first demonstrated by the MapReduce model developed by Google in the early 2000s. Only a couple of years later (2006), Yahoo replicated the MapReduce paradigm and released it to the open-source world as Apache Hadoop. In the 2010s, next generation systems like Apache Spark and Apache Flink evolved the programming model and the capabilities of the execution engines.

The Path from Batch Processing to Stream Processing

The first step in the evolution of processing large amounts of data using commodity hardware was found in batch processing which is considered to be the starting point of the “big data” trend.

Batch processing

Batch processing originates from the early computer age when a terminal could only be used by one person at a time. In order to allow multiple users to share the underlying computing resources, users could submit jobs, which were stored in a job queue. Batch processing means running several of these jobs in a batch, typically sequentially, although later systems could also run multiple jobs at once. A large part of this process was later taken over by operating systems, which allowed multiple programs to share the overall computing resources.

Despite the evolution towards large-scale data processing, some of the most basic assumptions remained the same. The input to a processing job is finite (bounded) data, e.g. we use a file or a database query as the input. Due to the bounded input, the processing eventually finishes (if, for once, we disregard Alan Turing’s halting problem). Conceptually, this is still similar to the early batch processing systems because we run a set of pre-programmed code (jobs) and then return the output to the user.

Distributed Batch Processing

The distributed batch processing evolution is largely in the execution layer. Instead of running the processing on a single machine, we partition the input such that it can be processed in parallel by multiple machines. This approach to speeding up the processing is also referred to as horizontal scaling, as opposed to vertical scaling where one would increase the processing power of the individual machines. Horizontal scaling can be an effective, fault-tolerant, and cost-efficient way to process data.
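To make the partitioning idea concrete, here is a minimal sketch. It assumes a toy input of page visits and uses threads as stand-ins for machines; the function names are made up for illustration and do not belong to any particular engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split the bounded input into roughly equal, independent chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_pages(chunk):
    """Work done by one (hypothetical) worker on its own partition."""
    return Counter(chunk)


def distributed_count(records, num_workers=4):
    """Process all partitions in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_pages, partition(records, num_workers)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


page_visits = ["/home", "/about", "/home", "/blog", "/home", "/blog"]
result = distributed_count(page_visits, num_workers=3)
```

Adding more workers (horizontal scaling) only changes `num_workers`; the per-partition logic stays the same.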

The Case for Stream Processing

While distributed batch processing is a leap forward, it is still a static one-off process. Data needs to be already produced for the processing to start. If new data arrives during processing, it can’t be considered because a fundamental assumption is that all the input is available when the processing starts. This is due to parts of the processing like sorting which only work correctly when the entire input is available.

You might say, why not run this process more frequently? Of course, there is still the option to schedule the processing more often, e.g. every day, every hour, every 30 minutes. But what if the processing itself is so involved that it takes several hours? Is it worth recomputing the entire result every time?

Imagine you wanted to maintain a live counter of the number of people who visited particular pages on your website. We could process the web server logs to compute these counters. To calculate an up-to-date hourly counter for the day, a batch processing job would need to read through all the logs of the current day. That process would be repeated as often as you wanted to get an up-to-date counter on a given day. One way to solve this problem would be to cache the already computed results for that day, even if that slightly complicates the process. But there might be an even better solution.

Stream processing lends itself naturally to this problem because we could store the counters for each page in memory and update them as new visitors arrive. As we update the live counter, we don’t need to reprocess any prior events from the past as we have to do with batch processing. It is important to note that other issues can arise with stream processing, in particular how to handle late or out of order events. We will cover these issues later. For now let’s figure out when stream processing would be a good fit.
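The difference between the two approaches can be sketched in a few lines. This is only an illustration with hypothetical names: the streaming counter updates in-memory state per event, while the batch-style function rescans the whole log on every query.

```python
from collections import defaultdict


class LivePageCounter:
    """Keeps one running counter per page in memory."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, page):
        # Only the new event is touched; earlier events are never re-read.
        self.counts[page] += 1
        return self.counts[page]


def batch_count(log, page):
    """Batch-style alternative: rescan the whole log for every query."""
    return sum(1 for visited in log if visited == page)


counter = LivePageCounter()
log = []
for page in ["/home", "/blog", "/home"]:
    log.append(page)
    counter.on_event(page)
```

Both approaches arrive at the same numbers, but the cost of a batch query grows with the size of the log, whereas the streaming update is constant work per event.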

Here are some general criteria when evaluating whether to use stream processing:

The application needs to provide real-time results or decision making.
The application logic requires fresh data and long-lived state.
Incoming data needs to be pre-aggregated or reduced before it is stored.
The data consists of events which may arrive out of order.

Enter Stream Processing

Stream processing is designed to continuously process data and provide low-latency results. Unlike batch processing, it is always-on, allowing new results to be emitted at any desired time or in predefined intervals.

Stateful Stream Processing

The most challenging aspect of stream processing is its state, which needs to be persisted across failures or application updates. If we don’t persist the state, we would need to re-process all data up until the point of failure. To prevent this, stateful streaming applications hold and persist state, similar to a database. But unlike a database, read and write operations are fast because the state resides in the process memory (with the option to offload to disk to prevent running out of memory).

State is checkpointed at regular intervals, which means we won’t have to re-process any data from before the last completed checkpoint. Checkpointing involves writing the application state to an external storage. In the event of a failure, the state will be recovered from this storage and the processing can resume from the point at which the last checkpoint was made.
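Here is a minimal sketch of checkpointing and recovery. It assumes a replayable input (a list indexed by offset) and uses a plain dictionary as a stand-in for the external storage; the class and parameter names are hypothetical, and real engines do considerably more work to make checkpoints consistent across many machines.

```python
import copy


class CheckpointedCounter:
    """Counts events per key and checkpoints state plus input position."""

    def __init__(self, storage, checkpoint_interval=3):
        self.storage = storage  # stand-in for external storage (assumption)
        self.checkpoint_interval = checkpoint_interval
        self.state = {}
        self.offset = 0  # position of the next event to read

    def restore(self):
        """Load the last completed checkpoint, if there is one."""
        snapshot = self.storage.get("latest")
        if snapshot is not None:
            self.state = copy.deepcopy(snapshot["state"])
            self.offset = snapshot["offset"]

    def run(self, events, fail_at=None):
        """Process events from the current offset; optionally simulate a crash."""
        while self.offset < len(events):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated failure")
            key = events[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_interval == 0:
                self.storage["latest"] = {
                    "state": copy.deepcopy(self.state),
                    "offset": self.offset,
                }
        return self.state


events = ["a", "b", "a", "c", "a", "b", "c"]
storage = {}

job = CheckpointedCounter(storage)
try:
    job.run(events, fail_at=5)  # crashes after the checkpoint at offset 3
except RuntimeError:
    pass

recovered = CheckpointedCounter(storage)
recovered.restore()  # resumes at offset 3, not at offset 0
final_state = recovered.run(events)
```

The events at offsets 3 and 4 are processed twice overall (once before the crash, once after recovery), but because the state is rolled back together with the offset, the final counts are still correct.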

Is Batch Processing a subset of Stream Processing?

Some argue that batch processing is just a special case of stream processing, but in practice, batch processing implemented in terms of the broader stream processing paradigm usually lacks the batch-specific optimizations, such as operating in larger batches for more throughput, efficient sorting algorithms which can spill to disk, or intelligent recovery using intermediate results. Further, the application logic tends to be different because streaming use cases are defined in terms of groups of events in time and can’t do certain batch processing operations like scanning through the entire data. Streams are by nature unbounded and continuous, which enforces a different programming model.

Batch and stream processing are two different approaches to data processing. Neither one is superior in terms of processing semantics; which one fits merely depends on how time-critical your data processing needs are.

Use cases

Many services are based on historic data and do not factor in recent data. With stream processing, we can provide real-time insights based on recent events generated by the user or the environment. The following examples illustrate that:

Monitoring & Observability

Nowadays almost every machine generates data about its condition, e.g. maintenance cycles, production speed, temperature, etc. If such data can be aggregated in real time, a broken or malfunctioning machine can be shut off or replaced before any damage occurs.

The same applies to any kind of (software) deployments. Stream processing can generate real-time alerts in case the application metrics are not within their desired bounds.

Fraud detection

For every bank user, we want to instantly decide whether a login attempt, credit card use, or a wire transfer is legitimate or not. This can be done by analyzing the stream of events for a particular user. By looking at recent events and summarized past events, we can decide whether a login attempt is legitimate or not.

Taxi ride pricing

Taxi rides for ride services like Lyft or Uber are often calculated dynamically. We can more accurately predict the price of taxi rides using traffic information, available drivers, and how many users request the service. Ideally, we want to do this as close to real time as possible.

Recommendation systems

We can provide better recommendations based on real-time data, for example by recommending new songs based on the last played songs. Users would like their recommendations to change based on the recently played or liked songs.

Stream processing programming model

The Dataflow paper pointed out that stream processing can be viewed as merely a concern of the execution engine. The programming model for the user can be designed independently of its execution. However, that is somewhat of a simplification. Many stream processors like Apache Spark or Apache Flink have different programming interfaces for batch and streaming. While it is possible to have a unified API like Dataflow’s, there are going to be streaming concepts in batch execution mode that aren’t going to be useful, even if they do not break the batch processing semantics. It is worth listing some of the streaming-specific concepts below.

Windows

In batch, we can scan and crunch through all available data. This allows us to be very flexible with respect to the type of aggregation of the data. In streaming, this is different because we are never guaranteed to see all available data. This is where windows come into play.

A window has a start and an end timestamp and marks a time span. In streaming, data elements (events) have a timestamp associated and can be associated with a window based on the timestamp. For example, a 5 minute window could be [2:00pm, 2:05pm). Note that the square bracket means inclusive while the rounded parenthesis means not inclusive.

Windows can be tumbling or sliding

Tumbling window

Tumbling means that the next window begins directly after the end of the old one. For example, 5 minute tumbling windows:

... [2:00pm, 2:05pm) [2:05pm, 2:10pm) [2:10pm, 2:15pm) ...
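Assigning an event to its tumbling window is simple arithmetic. The sketch below assumes timestamps are integers (here: seconds since midnight) and that windows are aligned to multiples of the window size; the function name is made up for illustration.

```python
def tumbling_window(timestamp, size):
    """Return the [start, end) window that contains the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


FIVE_MINUTES = 5 * 60
two_pm = 14 * 3600  # seconds since midnight

window = tumbling_window(two_pm + 7 * 60 + 30, FIVE_MINUTES)  # 2:07:30pm
```

An event at 2:07:30pm lands in [2:05pm, 2:10pm), and an event at exactly 2:05pm belongs to that same window because the start is inclusive.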

Sliding window

Sliding means that each window has a fixed length of X, but a new window begins every Y interval (the slide). If Y is smaller than X, consecutive windows overlap. For example, 5 minute windows sliding every 1 minute:

... [2:00pm, 2:05pm) [2:01pm, 2:06pm) [2:02pm, 2:07pm) ...
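With sliding windows, a single event can belong to several windows at once. Here is a small sketch, assuming integer timestamps (here: minutes since midnight) and windows whose start is a multiple of the slide; the function name is hypothetical.

```python
def sliding_windows(timestamp, size, slide):
    """Return all [start, end) windows of the given size that contain the timestamp."""
    windows = []
    # Start of the latest window that begins at or before the timestamp.
    start = timestamp - (timestamp % slide)
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


two_pm = 14 * 60  # minutes since midnight
windows = sliding_windows(two_pm + 4, size=5, slide=1)  # event at 2:04pm
```

An event at 2:04pm falls into five windows, from [2:00pm, 2:05pm) to [2:04pm, 2:09pm). When the slide equals the size, the function returns exactly one window, which is the tumbling case.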

Time

In stream processing, time does not strictly advance linearly like we would assume from a regular clock. There are two fundamentally different time schemes:

Processing time: The regular time we would use on a computer or a regular clock.
Event time: A time associated with and derived from the processed events.

Watermarks

Watermarks are used in conjunction with event time. Low watermarks are special timestamps which indicate the current minimum event time. Similarly, high watermarks indicate the maximum event time seen. Watermarks are generated by a function which receives the event timestamps of the inflowing events as input. Watermark functions can be as simple as taking the latest observed timestamp. However, time must advance monotonically, i.e. we must not go back in time. Watermark functions may use custom logic to decide when it is safe to advance time. For example, a function could use a fixed offset from the high watermark as the low watermark. This tolerates some out-of-orderness of the arriving event timestamps. Let’s see why this is important below.
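The fixed-offset strategy can be sketched as follows. The class name and the `max_delay` parameter are made up for this illustration and are not the API of any particular engine.

```python
class BoundedOutOfOrdernessWatermarks:
    """Low watermark = highest event time seen minus a fixed allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.high_watermark = None  # maximum event time seen so far

    def on_event(self, event_time):
        """Observe one event timestamp and return the current low watermark."""
        if self.high_watermark is None or event_time > self.high_watermark:
            self.high_watermark = event_time
        # The maximum over past timestamps never decreases, so neither does this value.
        return self.high_watermark - self.max_delay


generator = BoundedOutOfOrdernessWatermarks(max_delay=5)
watermarks = [generator.on_event(t) for t in [10, 12, 11, 20, 15]]
```

Even though the event timestamps go back and forth, the resulting low watermarks only ever stay the same or move forward.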

Late or out of order events

Events are considered “late” when their timestamp is before the current event time as determined by the latest emitted low watermark. Out-of-orderness is often the reason for late data: when events do not arrive in their expected order, a watermark may be emitted that is already past the event time of data which is still to arrive, advancing the event time prematurely.
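A short sketch shows how the allowed delay decides which events count as late. It assumes the low watermark is the maximum event time seen so far minus a fixed delay; the function name and the sample timestamps are made up.

```python
def split_late_events(event_times, max_delay):
    """Classify events as on time or late against a fixed-offset low watermark."""
    on_time, late = [], []
    watermark = float("-inf")
    for event_time in event_times:
        if event_time < watermark:
            late.append(event_time)
        else:
            on_time.append(event_time)
        # Advance the watermark after the check; it never moves backwards.
        watermark = max(watermark, event_time - max_delay)
    return on_time, late


arrivals = [10, 12, 11, 20, 13, 16]
on_time, late = split_late_events(arrivals, max_delay=5)
strict_on_time, strict_late = split_late_events(arrivals, max_delay=0)
```

With a delay of 5, only the event with timestamp 13 is late because it arrives after the watermark has already reached 15. With no tolerance at all, every out-of-order event is late.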

Fault-tolerance & State

State is one of the most interesting and hard parts of stream processing. Most applications have state of some sort: remembering when a user last logged on requires state, storing any pending records within a system requires state, and so does storing a position in a file or log which we are reading from. Whenever state is present, this has implications for the fault tolerance of the system. In the case of stateless applications, we can simply restart the job. However, if we do have state, we need to take care to re-initialize the state after a failure in a way that the processing semantics stay the same.

When do failures occur? Failures can happen due to hardware failures, network failures, external systems failing, application errors, malformed data, etc.

Stream processors must periodically externalize their state to be able to recover it in case of failures. This is done by writing their state to an external data store. It is not a trivial problem to do this in a way that the processing semantics remain unchanged when restoring the persisted state. Typically there are three semantics we distinguish between (from least to most strict):

at most once

If systems can guarantee at most once, they essentially guarantee that all events are processed once or not at all. This is the weakest guarantee because there can be data loss in the case when an event is not processed.

at least once

If systems guarantee at least once, they guarantee that a record is processed one or more times. This requires some form of acking or checkpointing to persist the stream state. There may be duplicate processing of data after restoring from a checkpoint because the same data will be read that was already processed before the failure which led to restoring the state from the checkpoint.

exactly once

Exactly once is the strongest but also most difficult semantic to guarantee. This is especially difficult when writing to external systems which might not support exactly once semantics. We need some form of support for transactional processing for external systems to ensure that we only yield a result once.
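The following sketch illustrates why duplicates matter when writing to an external system. Note that it uses idempotent writes (deduplication by event ID) rather than transactions, which is a different and simpler technique for getting an effectively-once result; both sink classes and the event IDs are hypothetical.

```python
class IdempotentSink:
    """External system stand-in that ignores writes it has already applied."""

    def __init__(self):
        self.applied_ids = set()
        self.total = 0

    def write(self, event_id, amount):
        if event_id in self.applied_ids:
            return False  # duplicate from a replay, skip it
        self.applied_ids.add(event_id)
        self.total += amount
        return True


class NaiveSink:
    """Applies every write, so replays are counted twice."""

    def __init__(self):
        self.total = 0

    def write(self, event_id, amount):
        self.total += amount
        return True


events = [(1, 10), (2, 20), (3, 30)]
# At-least-once delivery: after a failure, events 2 and 3 are read again.
delivered = events + events[1:]

naive, idempotent = NaiveSink(), IdempotentSink()
for event_id, amount in delivered:
    naive.write(event_id, amount)
    idempotent.write(event_id, amount)
```

The naive sink ends up with a total of 110 because the replayed events are counted twice, while the idempotent sink arrives at the correct total of 60.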

Stream processors

After describing the most important concepts in stream processing, it may be worth introducing some of the common stream processing engines. Here is a selection of open-source stream processing engines:

Apache Flink

Apache Flink is the de facto standard when it comes to stream processing in the open source world. It fully supports stream processing as described in this article. It is suitable for large-scale stream processing with hundreds of processing nodes. Flink can be described as the Swiss army knife of stream processors. It comes with a wide range of connectors. It bundles state backends for stateful stream processing applications, including ones whose state exceeds main memory and has to spill to disk. It also has its own scheduler which integrates with Kubernetes and a number of other cluster management solutions.

Kafka Streams

Apache Kafka is often used together with Flink as a message queue and storage layer for events. Kafka also comes with a stream processing library called Kafka Streams. Kafka Streams allows writing stream processing applications which do not require a dedicated runtime like Flink. Kafka Streams leverages the Kafka storage layer to shuffle and persist data. This doesn’t always make it the most performant solution for running stream processing pipelines. However, the operational simplicity is a huge advantage if the user already has a Kafka cluster. Note that the Kafka cluster might become a bottleneck, which can also be operationally challenging.

Apache Spark

Apache Spark builds its stream processing around its core abstraction: RDDs (Resilient Distributed Datasets). RDDs are sets of data which can be processed in a fault-tolerant way. This process comes with some overhead, which is why the data is split into large enough chunks. For stream processing, the chunk size is reduced further to achieve lower latencies (micro-batching). However, this means throughput is not as good as in systems like Flink which stream data and use a more coarse-grained method to ensure fault tolerance (checkpointing).

Apache Storm

Apache Storm is a legacy engine originally developed by Nathan Marz. It was one of the first open source stream processing solutions. It does per-event acknowledgments for fault tolerance, which comes with a big performance penalty. Due to its clunky API and slow execution engine, it’s not typically used anymore today.

Apache Beam / Google Cloud Dataflow

Google has its own stream processing product called Google Cloud Dataflow. Apache Beam is the programming library for Dataflow. The programming model is very similar to Flink’s. However, its underlying execution engine is quite different from Flink’s. The most important difference is its externally persisted state, which makes it much more elastic than Flink. For example, in Dataflow it is possible to add or remove workers during runtime, which would require an application restart in Flink.

Conclusion

We have outlined why stream processing is important and when stream processing is a good fit. We have learned about the core concepts and problems in stream processing. Finally, we have introduced some of the stream processing engines available.

There are many more things to learn about stream processing. I would encourage you to read more about stream processing in one of the following books:

Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri
Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax
Designing Data-Intensive Applications by Martin Kleppmann

This might only be the beginning of your stream processing journey.

A Brief History of Open Source

Mon, 31 May 2021 15:17:41 +0200

Open-source software (OSS) rules the world. Virtually any product, service, or platform is powered by or built with OSS. We carry OSS around in our pockets as Android or iPhone devices. Whenever we feel like it, we download OSS off the Internet to solve our everyday tasks like writing documents, listening to music, or accessing our email. It is fair to say that OSS conquered the world, but does that mean it always has?

Source vs binary

The term “open-source” refers to code for software being published in the open, as opposed to only having a “binary” which contains the compiled source code and may have an explicit owner (“proprietary”). Binaries contain opaque information which only computers understand. Source code, on the other hand, is the human-understandable format of the logic contained in computer programs. Computers run binaries; humans can only really reason about binaries by reading their source code.

Clearly, there is more to open-source than just the code being open. It is a mindset and a way to collaborate in the open. But where does the term open-source come from? Surprisingly, “open-source” was and to some degree still is a controversial term.

Open-Source History

50s

Starting out in the 50s, software was only developed at universities and corporate research centers. Software was not ubiquitous like in today’s world. Only top-notch experts knew how to develop and use software.

It may come as a surprise that in those days the source code was often the only artifact distributed. Nowadays software comes pre-compiled for the most popular computer architectures, but back then there was no standard computer architecture. If you wanted to use a program, you had to compile it first (i.e. build the source code), or even adjust the source code to be able to run it on your computer or mainframe.

During those times, companies like IBM even asked their users to send in source code suggestions.

60s

In the 60s, most software came bundled together with hardware - a trend that we are seeing again today (hello Apple), though for different reasons.

Software was entirely supported through a one-time payment for the hardware. There was also no Internet, so any updates (“patches”) had to be distributed via punched cards early on, later via floppy disks.

At the end of the 60s, with the invention of operating systems, databases, and high-level programming languages, software became increasingly complex. The development costs for software increased so much that it became hard to justify giving software away for free with the hardware. Eventually, companies started charging money for their software.

70s

In 1974, software became copyrightable (source), but that didn’t have a big impact because many companies had already stopped distributing source code to prevent copying of their software.

Companies like Microsoft and Apple were founded in the 70s. Clearly, those were for-profit companies which saw open or free software as a threat. Bill Gates famously wrote an “Open Letter to Hobbyists”, asking them to stop copying his company’s software.

It was only much later that companies would realize the potential of OSS.

80s

The 80s saw OSS on the rise.

In 1983, Richard Stallman created the GNU project, because he was frustrated with the proprietary nature of computer systems he worked on (particularly Unix). The GNU project contained open-source rewrites of closed-source software Stallman used.

In 1985, Stallman founded the Free Software Foundation (FSF), a foundation to support the free software movement. Despite the problematic statements Stallman has made from time to time, he is a true visionary and pioneer of open-source software. The FSF takes a radical stance as it demands that users have total control over software and its code.

90s

The term “open source” was first coined at the Foresight Institute (source). Computer security researchers wanted to promote the idea of “free software” but they were worried about people thinking it was merely software for free. In an act to promote free software to improve security, they looked for a term that would reflect the collaborative nature of OSS.

The term open-source was readily adopted at the end of the 90s by the Linux, Perl, and Python communities, but also by companies like Netscape and Red Hat.

The FSF does not like the term open-source because they think it undermines the freedom associated with open software: “Open source is a development methodology; free software is a social movement.” (source)

The Internet and the increasing presence of software in the world led to more developers sharing their code openly. Although the movement started from a non-commercial, idealistic movement, very soon companies adopted open source as a new way to promote their products or subscriptions, and to cut development costs by using existing open-source software.

In 1998, the non-profit Open Source Initiative was founded, inspired by Netscape which had just open-sourced their web browser Netscape Communicator (which later became Firefox).

In 1999, a group of developers of the Apache web server realized their methodology could be applied to other open-source projects as well. They proceeded to found the non-profit Apache Software Foundation.

In 1999, Sourceforge.com was launched which allowed developers to easily share and develop source code.

2000s

In 2000, the Linux Foundation was founded. It became one of the biggest and most influential open-source software foundations with $100 million in revenue (2018, source).

In 2001, the Python Software Foundation was founded. To this date, the foundation remains committed to developing Python in the open.

In 2005, Linus Torvalds created the initial version of Git, an open-source version control system to speed up the distributed development of the Linux kernel.

In 2008, GitHub.com was launched - a central platform for source code development based on Git.

Also in 2008, Google released the first version of Android, today’s most used mobile operating system.

2010s

Over the years, more and more companies started to embrace open-source development as they realized the benefits of developing in the open. Expertise in open-source has become a competitive advantage.

Since 2017, Microsoft has been one of the biggest open-source contributors in the world (source). In 2018, Microsoft acquired GitHub, the largest open-source development platform to this date.

We’ve really come a long way.

The Future of Open-Source

Open-source software has seen tremendous change:

In its early days it was a scientific practice to share code among other researchers. Source code used to be given away for free with computer hardware, but when companies realized they could make more money with software than hardware, they began to license their source code. The Free Software Foundation (FSF), the Apache Software Foundation (ASF), and the Linux Foundation became the largest foundations for open-source software. Even companies which fought open-source for decades started embracing open-source.

Open-Source Software clearly has conquered the world. Yet, there remain challenges around creating and maintaining open-source software, for example:

Legal issues

Open-source projects continue to see legal threats from companies who claim their intellectual property was violated. Without strong legal support, open-source can be a costly endeavour. Fortunately, open-source foundations can provide a legal umbrella against these threats.

Licensing

Licenses for OSS are fragmented.

On the one hand, there are Copyleft licenses which are more restrictive when it comes to using OSS commercially, for example by requiring that changes which are distributed elsewhere be contributed back. Yet, in the age of Software as a Service (SaaS) this can often be circumvented.

On the other hand, liberal licenses (e.g. the Apache license), which do not have the aforementioned restrictions, do not always promote the best behavior when it comes to contributing back to OSS.

Trademarks (branding)

Often open-source projects use trademarked brands. This puts the project at risk of losing its name if the company or person owning the trademark does not want the project to use their brand anymore.

Funding

Writing good software takes time and money. Many open-source projects continue to be underfunded.

Additional sources

https://en.wikipedia.org/wiki/Open-source_software#History

https://en.wikipedia.org/wiki/History_of_free_and_open-source_software

https://en.wikipedia.org/wiki/Timeline_of_free_and_open-source_software

Kubernetes in a Nutshell: 10 Things You Need to Know

Mon, 15 Mar 2021 14:26:44 +0100

Kubernetes changed everything about how we deploy applications. Yet many people struggle to understand the essence of Kubernetes. I’ve assembled the 10 most important things I believe everyone should know about Kubernetes.

1. Kubernetes vs. k8s

The cool kids abbreviate Kubernetes as “k8s”, which stands for, you might have guessed it, Kubernetes. Simply replace the eight letters between the first letter “K” and the last letter “s” with the number 8, et voilà.
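For the curious, the abbreviation rule fits into a tiny function. The function name is made up; the rule is the one described above.

```python
def numeronym(word):
    """Keep the first and last letter and replace the rest with its length."""
    if len(word) < 3:
        return word  # nothing to abbreviate
    return f"{word[0]}{len(word) - 2}{word[-1]}"


abbreviation = numeronym("kubernetes")
```

The same rule produces other well-known abbreviations such as “i18n” for “internationalization”.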

2. Google, Kubernetes, and the Cloud Native Computing Foundation

Google open-sourced Kubernetes in 2014. With the 1.0 release in 2015, Google partnered with the Linux Foundation to create the Cloud Native Computing Foundation (CNCF). Kubernetes was the first project at the CNCF. Kubernetes is licensed under the permissive Apache 2.0 license.

Kubernetes was not made out of thin air. Its design is based on a container orchestration technology called Borg, to this date being developed and used internally at Google.

3. Kubernetes killed YARN, Mesos, and Docker Swarm

Kubernetes was not the first of its kind. Before Kubernetes came out, there were other cluster management systems in the open-source world:

Apache Hadoop YARN (Yet Another Resource Negotiator)
Apache Mesos (incl. Marathon)
Docker Swarm

It is fair to say that Kubernetes superseded all of these systems. The reasons are manifold but to summarize: Mesos tried to be a platform for solving all kinds of problems including fine-grained resource allocation and non-containerized applications. YARN was too tightly integrated into the Hadoop ecosystem. On the other hand, Docker Swarm was much like Kubernetes in the sense that it focused on container deployments, but it lacked too many features that Kubernetes came with out-of-the-box.

4. Kubernetes manages containers

Kubernetes focuses on managing container deployments in a computer cluster, including their communication with each other. Think of a container as a portable and reproducible instance of a software environment including its dependencies.

Typically, the container format used is the Docker container format. Different container runtimes like containerd are supported and new ones can be plugged in as needed using Kubernetes’ Container Runtime Interface (CRI).

Kubernetes’ smallest operational unit is a Pod. Pods hold one or more containers. Usually Pods are not created by hand but by so-called Deployments.

5. Kubernetes is declarative

Kubernetes takes a different approach than many other systems when it comes to creating the desired deployments.

Instead of specifying how the application should be deployed, users specify what should be deployed. Kubernetes then ensures that the declared requirements are met. Some examples of what can be declared:

A container image and its startup arguments
Minimum / maximum resources such as CPU, memory
Number of instances to create
Volumes to be mounted
Environment variables or configuration
Ports to communicate with other services
Credentials or secrets to be loaded
etc.

All this is specified via YAML. There is no code involved. Previously, one had to write code to achieve this (infrastructure as code), but with Kubernetes infrastructure is data. We have shifted from “how” to “what” and leave the rest to Kubernetes.
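To give an idea of what such a declaration looks like, here is a sketch of a Deployment expressed as plain data. It is written as a Python dictionary only to keep the examples in this article in one language; it contains no deployment logic. The application name, image, and values are made up, while the field names follow the Kubernetes Deployment resource (apps/v1).

```python
import json

# Hypothetical application name and image; the structure mirrors a
# Kubernetes Deployment manifest (apps/v1).
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,  # number of instances to create
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "example.com/web:1.0",
                        "args": ["--port=8080"],
                        "ports": [{"containerPort": 8080}],
                        "env": [{"name": "LOG_LEVEL", "value": "info"}],
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "128Mi"},
                            "limits": {"cpu": "500m", "memory": "256Mi"},
                        },
                    }
                ]
            },
        },
    },
}

# Serialize the declaration; manifests can be written as JSON as well as YAML.
manifest = json.dumps(deployment, indent=2)
```

Nothing in this structure says how to start containers or on which machine. It only states what should exist.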

6. Kubernetes is fault-tolerant and self-healing

Over time, failures are inevitable in computer clusters. Failures can occur due to hardware issues but also due to software bugs or upgrades.

Kubernetes is designed to continue to work in the presence of failures. From Kubernetes’ point of view, a failure is just a deviation from the declared specification. Kubernetes will simply strive to restore the desired state.

To be able to do that, Kubernetes replicates its own state. By doing that, it can tolerate failures of its own nodes. It implements health checks on nodes and containers to be able to tell apart a healthy from an unhealthy entity. If a container is detected to be unhealthy, it will be removed and a new instance of the container will be started.
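The idea of restoring the desired state can be sketched as a comparison between what was declared and what is observed. This is a strongly simplified illustration with made-up names, not how Kubernetes’ controllers are actually implemented.

```python
def reconcile(desired_replicas, containers):
    """Compare declared and observed state and return the corrective actions."""
    actions = []
    healthy = [name for name, is_healthy in containers.items() if is_healthy]
    # Unhealthy containers are a deviation from the specification.
    for name, is_healthy in containers.items():
        if not is_healthy:
            actions.append(("remove", name))
    # Too few healthy instances: start replacements.
    missing = desired_replicas - len(healthy)
    for _ in range(max(missing, 0)):
        actions.append(("start", "new instance"))
    # Too many healthy instances: remove the surplus.
    for name in healthy[desired_replicas:]:
        actions.append(("remove", name))
    return actions


observed = {"web-1": True, "web-2": False, "web-3": True}
actions = reconcile(3, observed)
```

Here the unhealthy container is removed and one new instance is started. If the observed state already matches the declaration, the list of actions is empty.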

7. Kubernetes is ubiquitous

A major reason for the success of Kubernetes is its availability in the modern cloud. All the major cloud providers (Amazon AWS, Microsoft Azure, Google Cloud) have managed Kubernetes offerings. Kubernetes can easily be integrated with the storage and networking implementations of any cloud provider.

Since Kubernetes is available in many cloud offerings, there is little to no vendor lock-in.

8. Kubernetes comes with batteries included

Kubernetes comes with powerful abstractions but it’s not only a tool for experts. It has been built with decades of practical experience in cluster deployments in mind. It includes proven, easy-to-use recipes for working with containers, storage, configuration, secrets, service discovery, networking, etc.

9. Kubernetes is extensible

Besides the included resource types, Kubernetes allows creating custom resource types and custom operators which help to realize the resource specifications.

For example, if you were to run an application on Kubernetes that required custom state management which cannot be expressed by Kubernetes deployments, you could define your own resource type along with an operator which creates this custom resource. Oftentimes, the operator can compose this new resource in terms of the included Kubernetes resource types, which allows writing an operator with relatively little code.

10. Kubernetes is efficient

Kubernetes is great at ensuring efficient resource usage. It has built-in load balancing which is able to balance load across all containers associated with a deployment.

Kubernetes packs containers onto its computing nodes such that their combined computing needs make good use of each node. Note that the default scheduler only places containers when they are started; moving running containers to other nodes to reduce fragmentation requires an add-on such as the descheduler.
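How packing containers onto nodes raises utilization can be illustrated with a first-fit-decreasing heuristic. The Kubernetes scheduler uses its own filtering and scoring logic, so this is only a sketch of the general idea; the function `pack`, the container names, and the CPU numbers are assumptions made for the example.

```python
def pack(requests, node_capacity):
    """Assign CPU requests to nodes using first-fit decreasing.

    requests: mapping of container name -> requested CPU (millicores)
    Returns a list of nodes, each a dict of container name -> request.
    """
    nodes = []
    by_size = sorted(requests.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpu in by_size:
        if cpu > node_capacity:
            raise ValueError(f"{name} does not fit on any node")
        for node in nodes:
            if sum(node.values()) + cpu <= node_capacity:
                node[name] = cpu
                break
        else:
            # No existing node has room, so a new node is needed.
            nodes.append({name: cpu})
    return nodes


requests = {"a": 500, "b": 700, "c": 300, "d": 400, "e": 100}
nodes = pack(requests, node_capacity=1000)
utilization = [sum(n.values()) / 1000 for n in nodes]
```

Placing the largest requests first leaves the small ones to fill the remaining gaps, so the five containers fit on two fully utilized nodes.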

Kubernetes provides resource isolation and resource usage limitation by leveraging the container options for resource limits for CPU or memory (via Linux’s cgroups).

Final Thoughts

Thank you for reading this post. I hope I could shed some light on Kubernetes. If you found the article helpful, please send it to a friend or feel free to share it on social media.

If you want to learn more about Kubernetes, the official Kubernetes docs are a great place to start: https://kubernetes.io/docs/

The Significance of "Upstream First"

Fri, 22 Jan 2021 10:59:14 +0100

In open-source software, the term “upstream” refers to the main place of development. Many people talk about doing “upstream first” for open-source contributions, but what does that really mean?

Upstream

Upstream is the place where open-source software lives. It is where it is continuously improved, it is where it is maintained, it is where people come to ask questions or report bugs.

If you download the source code of an open-source project, then you have a copy of the upstream code, either a released version or a development (unreleased) version.

Forks

If you modify the source code, you have diverged from upstream. Only when you contribute your changes back to upstream have you resolved that discrepancy.

There are many reasons why you would modify the source code: adaptations for your system, infrastructure, internal processes. A copy of upstream with modifications is called a fork.

Most forks track upstream by regularly merging in the upstream changes and applying their small adaptations on top of the upstream code.

The more changes you make in your fork, the more expensive it can become to maintain. You will often find that the changes you are making are not exclusive to your company or infrastructure because the problems you solve with your changes are just as much of a problem to other people as they are to you.

Contributing Back

It can be frustrating that every time you merge in changes, it becomes more and more difficult to make them work with your adaptations. It would be so much easier if the upstream project knew about your problems and needs. Then you begin to realize that open-source software is only available and free of charge to you because other people took the time to publish their ideas and code in the open. You know that you could do the same and avoid the upstream changes working against you. You realize it would be smart to contribute back.

Upstream First

Upstream First means that whenever you solve a problem in your copy of the upstream code which others could benefit from, you contribute these changes back upstream, i.e. you send a patch or open a pull request to the upstream repository.

In the course of contributing back, you may have to discuss with the community members of the upstream project. You may find there is already a solution you were not aware of. You may find your solution can be improved. Most importantly, when the upstream project releases a new version, you can delete your modifications from your fork and stop worrying about the adaptation work you would have to do if you hadn’t contributed back.

Does Upstream First just mean being kind?

Upstream First is more than just “being kind”. It means you have a say in the project. It means predictability. It means you are in control. It means you act rather than react. It means you understand open-source.

The 4 Pillars of Successful Open-Source Communities

Thu, 31 Dec 2020 12:38:14 +0100

The community is the backbone of an open-source project. It establishes a framework for collaboration, innovation, growth, and sustainability. In order for an open-source project to be successful, it needs to develop a community.

Let’s have a look at what makes a good community, how communities are structured, and how to build them. I’ll also try to point out common pitfalls when developing communities.

Asking the Right Questions

There are pros and cons to building a community. Let’s have a look at why you should or shouldn’t invest in building a community:

Why should you build a community?

The following are good reasons to build an open-source community:

Attract attention to your personal self or your company

Believe it or not, selfish motives are a common theme in open-source. Open-source can enable you to be recognized and be seen as an industry leader. Open-source provides attention across organizations and corporate environments.

If you run your own company, you could build a community to attract attention, then offer services around the open-source product, e.g. selling subscriptions, consulting, or proprietary products related to the open-source project.
Innovation and ideas

Lively open-source communities allow for a rapid exchange of ideas. This drives innovation in open-source software. Releases can be made often and interaction with users is possible at any point in time. This can yield a very effective feedback loop which accelerates innovation.
Recruitment

Companies or organizations which have a good reputation in open-source are popular amongst many developers. This can be a factor for driving people to your project. Many communities are built on top of existing communities, or relate to existing open-source communities because they share code with each other. This allows bootstrapping a community more easily by collaborating with existing communities.
Saving costs

By building a community, you can share the costs for maintenance, testing, and innovation. In a perfect community you get all this from the community and you just have to invest a small part in it. Initially, building a community will be work but it can be very rewarding.
Forking

You may have forked a project whose community is dying or controlled by an actor which does not play fair. There are many cases where that’s not a good idea but it can make sense. A good example where forking worked is the CI server Hudson which was backed by Sun. After Sun was acquired by Oracle, the community forked the project and it became Jenkins. This was a success and allowed the project to grow independently of Oracle’s interests.
Open-source is fun.

Exchanging ideas, making new connections, working together towards a goal, sharing resources - all that is great fun. Being open is a modern mindset that makes solving many problems easier.

Why shouldn’t you build a community?

There are legitimate reasons why building a community might not be a good idea. Let’s have a look at some of the counter-arguments:

You think the community is going to build itself

Nope. Building a community will be a lot of work. How much precisely depends on the momentum of your project (demand) and how effectively you generate it (marketing). How many people you start with, and how well they are connected to the industry or to other open-source projects, will make a big difference.
The technology is not innovative or useful to people

It is going to be hard to build a community around software which people do not really want to use. You may ask yourself, “Who cares about my project?” If you can’t think of a handful of people who would find your software immensely useful, chances are that nobody does.
You want to offload the maintenance and development.

Things will likely not turn out the way you intended. If you do not want to participate in leadership of a project, it will likely not go anywhere, or you’ll lose influence over it.
(Hostile) Forking

People will not like forks unless there is a good reason for them. Consider OpenOffice, which was open-sourced by Sun. When Sun was acquired by Oracle, Oracle decided not to invest in OpenOffice anymore. This led to a fork called LibreOffice. The fork couldn’t use the name OpenOffice because Oracle owned the OpenOffice brand. Even though LibreOffice can be considered a successful fork, to this day it suffers from the popularity of the OpenOffice brand. And there are many examples of (hostile) forks which never went anywhere.
You think open-source itself is a business model

Through OSS, one can find a great way to build a business, but do not expect open-source to generate money if you haven’t developed a good business idea. Nowadays, many people talk about big players like Amazon offering open-source products as a service and not contributing back to them. Although one may criticize that, please take this possibility into account when you develop your business idea.
You are afraid of having IP stolen.

Generally speaking, you should be comfortable with sharing IP. Not sharing it usually leads to licensing problems, which can hinder adoption. Open-source is not a one-way street: you get something in return for your ideas, but you may have to start sharing first.
You are a control freak.

Just don’t.

What about existing communities?

Do you really need to start a new community? Maybe there is already a community that would fit your needs. Think about joining one. Building a community from scratch can be much harder than joining an existing one. You can start making an impact in the project and focus on the things that matter most to you.

The 4 Pillars of a Successful Open-Source Community

For a strong open-source community you need the following ingredients:

1. Code

Why code? A community can only evolve around a meaningful piece of software. If you don’t provide value to people, nobody is going to care about your project. So think about the existing open-source software and how you will provide value to motivate others to join and participate.

Code also attracts a certain type of community. Innovativeness, ease of use, complexity, and the programming language play an important part in what kind of community evolves around a project. For example, compare a Lisp project with a JavaScript project. Compare system software like a database to a web framework or a JS library. The type of code has a huge impact on the type of community that grows around a project.

You should ask yourself:

“Do I know the type of people that could contribute?”
“Is there a demand for a community around my code?”

2. People!

Code aside, of course a community is all about the people. If you look up who is part of an open-source community, you usually find this definition:

Users
Developers
Contributors (most of the time these are both Users and Developers)

In reality, things are much more complicated:

Take software developers as an example. They are part of the community but they are rarely employed by the project — the open-source funding dilemma. Instead, they work at companies whose corporate goals vary. As a project you need to acknowledge this.

Don’t expect that everyone can spend the same amount of time in the community, but still give everyone a chance to participate.

Do not rely on a single person to maintain a project. Build in some redundancy by having a second maintainer, or at least document things well. Otherwise you have a single point of failure for your project. Maintainer burnout is a very real thing, so acknowledge the work of developers and find a way to balance the work the developers have to do.

Recognize that your community can span outside the code domain. This makes your community more diverse and can significantly grow it. Consider the following roles:

Decision makers

Every project has people behind the scenes who vouch for open-source projects. Knowing who they are and finding a way to eventually recognize their contributions can be very valuable.
What about writers / bloggers / organizers / evangelists / influencers / enthusiasts?

Make it easy for non-developers to reach out. It is the stories that lead people to using your software or joining your community.
Think about all the roles that you need in your community and whether all of them currently exist:

What could be those roles? For example: Coders, architects, reviewers, leaders, organizers, supporters, helpers, questioners, explainers, moderators. Having a diverse set of roles is crucial for a functioning community.

The Critical Mass

Every community needs a group of people who take responsibility for a project. I call that the critical mass. What does the critical mass do? They are the project managers of the project. They are self-motivated with a long-term interest. They build structures and delegate responsibility. They promote the project in the industry and they are active in the day-to-day business of the project. They help to organize offline meetings, such as meetups and conferences, to foster relationships.

What would be good examples of things done by the critical mass?

Starting off the project
- Donating the code / licensing it appropriately
- Creating bylaws / code of conduct / contributions guidelines
Day to day business
- Foster open discussions (e.g. mailing list)
- Ensuring the project stays relevant (technology)
- Getting rid of technical debt
- Ensure project’s infrastructure works correctly
- Mediating conflicts
- Recognize contributions and recruit new community members (!)
- Highlight achievements, e.g. reaching milestones (1000 PRs), new committers, anniversaries
Outreach
- Organize meetups and conferences
- Write books or articles
- Promote on social media
- Being excited

How do you find a critical mass? To give you an idea, here’s a great quote by Jan Lehnardt:

I tried to be involved with every thread on the mailing list, showing exemplary behaviour, being nice to people, taking their issues seriously and trying to be helpful overall. After a while, people stuck around not to only ask questions, but to help with answering as well, and to my complete delight, they mimicked my style.

Quote taken from https://writing.jan.io/2015/11/20/sustainable-open-source.html

The critical mass is the role model of the community. It all starts with a few people but it grows over time.

3. Processes

Every community needs clearly defined processes for:

Workflows
- How do you contribute to the project?
- How are changes reviewed?
- How do we document changes?
- How do we test changes?
- How do we release software? How often?
- What tools do we rely on?
Communication
- How do we communicate with each other? (Tools, code of conduct, workflows)
- How do you earn merit?
- Who takes care of recognizing contributions?
Decision-making
- How do we decide?
- Consensus, majority-based?
- Who has the final say?
Legal
- What do we have to watch out for?
  - licensing,
  - use of libraries, logos
  - compliance

All these processes should be formalized and documented as much as possible. This not only helps the existing community but also allows others to understand how the community works and whether it is safe to rely on the produced software.

4. Ownership

Ownership is often associated with the license of the code, so let’s review some common open-source licenses:

Open-Source licenses

Public domain

Not actually a license but a statement which permits free use. This can be problematic in some countries because of the legal consequences if the statement does not declare that the software comes without warranties.
Permissive: Apache, BSD, MIT, etc.

This is still pretty much “do whatever you want” but with a legal framework for warranty, branding, distribution and attribution.
Copyleft: GPL, Eclipse Public, etc.

These require you to make your changes available under the same license if you ever distribute the Copyleft-licensed code.
Proprietary-ish: Open core and other custom licenses.

Licenses are important for communities because they set the legal framework for contributions. They also influence the type of communities which are built around them because they allow for different uses of the software.

For example, the Apache license doesn’t require you to contribute back if you make changes and distribute/sell them. That may seem bad for the community but it is also a great way to grow the adoption of the code and help the community.

For a community to grow, contributing back upstream and engaging with the community is necessary. There should be an incentive for individuals or companies to stay relevant in the project and participate in it. If you make it easy for people or companies to do that, the license may be secondary.

Beyond the license

Ownership is not only limited to the license. It also applies to:

Infrastructure (repository, mailing list, chat, servers)
Name (Trademark)
Decision-making / governance

The two most prominent models are:

Owned by a single person / company

Often these projects are owned and led by a “Benevolent dictator”. Prominent examples are Guido van Rossum (Python) and Linus Torvalds (Linux).

There are pros and cons to this model. For one, it’s easy to stay in control and keep the project focused. On the other hand, the dictator can become the bottleneck of the project which will slow down its growth. This happened to Linux in the early 2000s when Linus Torvalds was still reviewing and approving all incoming changes.

Foundation owned

Famous foundations for open-source are:

Apache Software Foundation
Python Software Foundation
Linux Foundation
Eclipse Foundation / Mozilla Foundation
Free Software Foundation

In my eyes, the ASF and the FSF are the most open although they are quite different in terms of defining freedom in code. They both accept monetary donations, but the only way to influence projects is by gaining merit in them.

What does that mean? It means that, based on your contributions to the project, you will be granted responsibility (committership) which gives you power in the project. At the ASF, there is an incubation process which teaches and probes new projects on good behavior inside Apache.

At the Linux Foundation and the Python Software Foundation there is also a merit-based model, but contrary to the ASF or FSF, you can also donate money to have a say in the project’s decisions.

In the end, it doesn’t matter what model you choose but it does look like the open governance model in software foundations has the most potential to grow a large community.

Conclusion

There is no easy recipe for open-source communities. This probably shouldn’t come as a big surprise. Building a community is not something that happens overnight. Rather, it takes a continuous effort to nourish a community. The level of effort may decline once the community is more mature, but there are inevitably going to be challenges which the community will face. To succeed with building a community you need to invest long-term.

A good community is one which develops a good standpoint in all four domains: Code, People, Processes, and Ownership (COPP).

The mindset for success

We have learned about the 4 pillars for a successful open-source community (COPP). In addition, here are 5 important principles which you should watch out for in the course of building a successful community:

Communicate openly
Document well
Innovate frequently
Recruit always
Foster relationships, e.g. via Meetups/Conferences

With a critical mass, you can build the foundation for your community. Your critical mass might be small to begin with, so keep an eye open for potential new members. Remember, relationships are the core of every strong community.

This post is based on talks given at Fossbackstage 2020 and Berlin Buzzwords 2020.

The Art of Debugging Distributed Systems

Wed, 07 Oct 2020 11:05:33 +0200

Debugging is the process of identifying the root cause of an unexpected behavior of a software program. In software development, bugs are inevitable, no matter how good the programmers are. Distributed systems are no exception in this regard, but they are often more difficult to debug.

Let’s take a look at why this is the case.

The Hierarchy of Complexity for Debugging

There is a hierarchy of complexity when it comes to debugging software:

Level 1: Nonconcurrency

The program to debug is strictly nonconcurrent, meaning there is a single program path which is executed.

These are single-threaded programs, i.e. programs running on a single machine, using only a single CPU core.

Level 2: Concurrency

The program utilizes concurrency in its execution paths. At any point in time there are multiple paths in the program which are executed without a guaranteed order.

Typically, these programs use one of the following:

Note that concurrency does not mean that multiple program paths execute at the same time. Rather, it means that the order in which the program paths execute cannot be guaranteed. If the execution paths have a dependency on each other, e.g. one thread needs to hand over a partial result to another thread, correct execution can only be ensured if the execution is equivalent to running the two paths serially, that is, if it yields the same result regardless of the order in which the execution paths are interleaved.
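The handover of a partial result between two threads can be sketched with Python's `threading` and `queue` modules. The functions `producer` and `consumer` and the computed values are made up for the example; the point is only that the queue enforces the dependency, so the result no longer depends on the scheduling order.

```python
import queue
import threading

handover = queue.Queue()
results = []


def producer():
    # Computes a partial result and hands it over to the consumer.
    handover.put(sum(range(10)))


def consumer():
    # Blocks until the partial result is available, whichever thread ran first.
    partial = handover.get()
    results.append(partial * 2)


# Start the consumer first on purpose: the queue enforces the dependency,
# so the outcome does not depend on how the threads are scheduled.
threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the queue, for example if both threads shared a plain variable, the consumer could read the value before the producer has written it, and the bug would only show up under some interleavings.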

Level 3: Distribution

The program runs in a distributed fashion, meaning it’s composed of multiple independent but interconnected computing nodes. The nodes exchange messages to coordinate the distributed computing. Regardless of whether the individual nodes use concurrency, the overall distributed application is automatically concurrent because each node has its own execution path.

Why debugging distributed systems is hard

Debugging distributed systems is hard because we operate on Level 3 which includes both concurrency and distributed execution as sources of errors. While it may be trivial to read the code to figure out a bug on Level 1, it can become challenging to figure out what different threads on Level 2 are doing. On Level 3 we have another dimension for bugs arising from the message exchange of nodes. It’s often non-trivial to capture the state in a distributed system, as we can’t attach a debugger to all nodes at the same time. Likewise, integration tests are harder to write because they require a test scenario as close to running the actual distributed system as possible.

To understand this better, let’s look at common debugging techniques and how they relate to debugging distributed systems:

Common Debugging Techniques

Let’s look at some of the common debugging techniques which can be applied at Level 1 and Level 2.

Understanding error messages

As basic as it sounds, understanding the error message or related messages is often enough to fix the cause of a bug. Usually, that requires good knowledge of the internals of the software which reports the error. However, the error message may not be related to the actual cause of the error and thereby distract from the real problem. In any case, the error message is usually the starting point of the investigation.

In a distributed system, obtaining an error message is not always trivial because the actual cause of the error might occur and get logged on a different machine. Error messages are not guaranteed to be propagated back to the client which initiated the request. Understanding the message flow between the nodes of a distributed system, as well as having an infrastructure to obtain metrics and logs from all nodes, is crucial here.

Reading the code

It may seem counter-intuitive, but going back to the code and verifying its desired behavior can be the most effective way to identify and fix bugs.

In a distributed system, the code paths are often spread across multiple modules which execute across many machines. The message exchange may not always be defined clearly and this makes debugging hard. Also, the error could depend on a non-obvious interleaving of messages.

If a bug cannot be properly identified by just reading the code, more information needs to be gathered by using one or more of the following methods.

Using code checkers

Code checkers such as Valgrind (C, C++) or FindBugs / SpotBugs (Java) use a set of rules to detect programming mistakes which can lead to bugs. They either analyze the code statically, without running it (e.g. FindBugs / SpotBugs), or dynamically at runtime (e.g. Valgrind). It makes sense to inspect the errors or warnings on a regular basis.

While code checkers are good at finding edge cases or memory leaks, they do not cover the whole spectrum of possible bugs. Also, they often just give a hint about an error and more debugging has to be performed afterwards.

Adding tests

By adding tests for the broken functionality, we can ensure that our assumptions gathered from reading the code hold true. Tests also allow us to check for edge cases in the program execution which are easily missed if checked by hand. Tests come in various forms:

unit tests
integration tests
end-to-end tests

The downside of using testing for debugging is that the test creation is usually biased. Programmers tend to only test for the cases they can imagine. Non-trivial bugs are often hard to find using this method. However, tests are a great way to ensure regressions do not occur once a bug has been identified and fixed.

In distributed systems, testing the actual distributed setup including simulation of real-world failure scenarios, such as machine failures or network partitions, is hard to do. Developing the proper testing utilities to build a distributed test environment locally can also be challenging.

Bisecting

In debugging, bisecting is the process of running a binary search on the commit log of a version control system such as Git. One starts by declaring a known healthy (correctly working, also called good) version and a known unhealthy (buggy, also called bad) version. Binary search then repeatedly picks a commit halfway between the two. That commit is built and a test is run which determines whether it is healthy or not; depending on the outcome, the commit becomes the new healthy or the new unhealthy boundary. Eventually, this finds the first commit which was unhealthy. Its change set can then reveal what introduced the bug.

On a single node, this process is usually efficient if the test runs fast. In a distributed system, this process can become very time-consuming due to the need to deploy a new version every time. Also, this only works efficiently if all tasks can be automated.
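The binary search behind bisecting can be sketched as follows. Git ships this procedure as the `git bisect` command; the Python below is only an illustration of the search itself, and the function `first_bad_commit`, the commit names, and the `is_healthy` test are hypothetical.

```python
def first_bad_commit(commits, is_healthy):
    """Return the first unhealthy commit using binary search.

    commits: list ordered from oldest to newest, where commits[0] is known
             to be healthy and commits[-1] is known to be unhealthy.
    is_healthy: callable that builds and tests one commit.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_healthy(commits[mid]):
            good = mid      # the bug was introduced after mid
        else:
            bad = mid       # the bug was introduced at or before mid
    return commits[bad]


# Hypothetical history in which commit "c5" introduced the bug.
history = [f"c{i}" for i in range(9)]
tested = []


def is_healthy(commit):
    tested.append(commit)
    return int(commit[1:]) < 5


culprit = first_bad_commit(history, is_healthy)
```

Only three of the seven unknown commits have to be built and tested here. The number of test runs grows logarithmically with the length of the history, which matters when every test run involves a deployment.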

Logging

logs

If logs are available, they can give an idea of what happened before the bug occurred. Logs are usually easily accessible on the same machine.

In a distributed system, logs are spread across multiple machines. Getting access to the right machine and searching the logs becomes much more involved.

If not enough output is available, we may have to generate some output:

printf debugging

printf is a function from the C standard library. In printf debugging, we output information about the state of the program to a console, file, or any other output method. This allows us to understand the behavior of the program better.

The drawback of this method is that it requires us to alter the program itself. In concurrent programs the added output changes the timing, which can lead to a bug not showing up anymore, so the method is not applicable in all scenarios.

In a distributed system, we face the difficulty of deployment costs and time. Deploying a new version of the software might take a considerable amount of time.

Tracing

Tracing tools allow us to understand the instruction flow of the program execution. This can involve the state changes in the program or system calls to the underlying operating system (e.g. strace).

In a distributed system, the tracing needs to be aware of the multiple instances of the software. Otherwise, the trace on a single node may not reveal the necessary information for debugging the problem.

Using a debugger

A debugger is a separate program which attaches to the program we want to debug. Debuggers can gather information about the state of the program at any point in time. They also allow us to set breakpoints at specific instructions or, if the source code is available, at lines of the source code.

Debuggers are very powerful. In the case of concurrent programs (Level 2) using multiple threads, we can halt the execution of all threads and inspect their state.

In distributed systems, there is the hurdle of using the debugger on the correct node. Halting the execution of the program may cause timeouts on other nodes and provoke an unwanted failure scenario in the distributed system.

Profiling

In profiling, we can sample CPU, memory, or disk usage. For statically compiled languages such as C/C++, we usually need to recompile to enable profiling information. For dynamically compiled languages with a runtime environment such as Java with its JVM, we can use a profiler such as JProfiler which gathers the information from the running program without having to recompile.

A profiler can be useful to detect memory leaks or performance-related bugs. Like the debugger, profiling will be performed per-node. This information has to be inspected individually, or we can collect and merge this information to be able to get an overview of all the nodes in a distributed system.

Debugging Techniques for Distributed Systems

We have already learned that the overall complexity in debugging distributed systems is higher. Here are a few techniques which, in addition to the debugging techniques already described, can help with debugging distributed systems:

Contracts and documentation

Without an understanding of the communication between nodes in a distributed system, it is often impossible to debug. Specifying the message protocol between the nodes of a distributed system provides a reference for debugging illegal message exchange. Modern distributed system architectures like the Actor model have made documenting message exchange easier, yet it is still up to developers to enforce and document contracts.

Defensive programming

Whenever contracts or assumptions the code makes are violated, we should print out a warning or fail with an appropriate error message. This provides feedback to the developers and users early on in the software cycle and allows us to fix bugs before they are shipped to the customer, and before they become hard to fix.
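A minimal sketch of such checks in Python could look like this. The `Transfer` message, the `ContractViolation` exception, and the account names are invented for the example and stand in for whatever contract a real system documents.

```python
from dataclasses import dataclass


class ContractViolation(Exception):
    """Raised when a message breaks the documented protocol."""


@dataclass(frozen=True)
class Transfer:
    # Hypothetical message exchanged between two nodes.
    sender: str
    receiver: str
    amount: int


def handle_transfer(balances, msg):
    """Apply a transfer after checking the assumptions the code relies on."""
    if msg.amount <= 0:
        raise ContractViolation(f"amount must be positive, got {msg.amount}")
    if msg.sender not in balances or msg.receiver not in balances:
        raise ContractViolation(f"unknown account in {msg}")
    if balances[msg.sender] < msg.amount:
        raise ContractViolation(f"insufficient funds for {msg}")
    balances[msg.sender] -= msg.amount
    balances[msg.receiver] += msg.amount


balances = {"alice": 10, "bob": 0}
handle_transfer(balances, Transfer("alice", "bob", 4))

try:
    handle_transfer(balances, Transfer("alice", "carol", 1))
    rejected = False
except ContractViolation:
    rejected = True
```

The invalid message is rejected at the node boundary with a message that names the violated assumption, instead of surfacing later as a `KeyError` or as silently corrupted state on some other node.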

Better tests

Instrumentation for integration tests

Code should be easy to test. If test utilities allow for an easy setup of local versions of the distributed systems, more tests will be written. This can be achieved by parameterizing components to allow them to run locally instead of fully distributed.
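One way to parameterize a component, sketched here under the assumption that nodes talk to each other through an injected transport object, is to provide an in-memory transport for tests. The classes `InMemoryTransport` and `CounterNode` are hypothetical and only illustrate the pattern.

```python
class InMemoryTransport:
    """Delivers messages between nodes inside one process (for tests)."""

    def __init__(self):
        self.nodes = {}

    def register(self, node):
        self.nodes[node.name] = node

    def send(self, target, message):
        self.nodes[target].receive(message)


class CounterNode:
    """A node that forwards increments to a peer through its transport."""

    def __init__(self, name, transport):
        self.name = name
        self.transport = transport   # injected: network client or in-memory
        self.total = 0
        transport.register(self)

    def receive(self, message):
        self.total += message

    def increment_peer(self, peer, amount):
        self.transport.send(peer, amount)


transport = InMemoryTransport()
a = CounterNode("a", transport)
b = CounterNode("b", transport)
a.increment_peer("b", 2)
a.increment_peer("b", 3)
```

In production the same `CounterNode` would receive a transport backed by the network. The node logic under test stays identical, while the test runs in a single process within milliseconds.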

End-to-end tests

End-to-end tests are a great way to see if the common code paths work correctly across all nodes. Compared to unit or integration tests, these are more expensive in terms of setup and computing resources but they also provide the most realistic test scenario. The downside is that end-to-end tests are not good at testing edge cases which can be more easily tested using unit or integration tests.

Remote Debugging

In remote debugging, we connect a locally running debugger to a remote node of the distributed system. This allows us to use the same features as if we were debugging a locally running program. The key problem here is to identify which node to connect to and to avoid network timeouts while debugging parts of the distributed system.

Distributed logging

Log collection and visualization

Log collection tools (e.g. Logstash) collect and transform log files, then insert them into a data store (e.g. Elasticsearch). Visualization tools (e.g. Kibana) allow you to query and visualize the data across all nodes of a distributed system or selectively for certain nodes. This can be very helpful when looking for error messages or certain outputs.
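What such tooling does conceptually can be sketched with Python's standard library. The log entries below are invented, and the sketch assumes that the timestamps of different nodes are comparable, which real clocks do not guarantee.

```python
import heapq

# Hypothetical per-node logs as (timestamp, node, message), sorted by time.
node_a = [(1.0, "a", "request received"), (4.0, "a", "ERROR timeout waiting for b")]
node_b = [(2.0, "b", "request forwarded"), (3.0, "b", "ERROR disk full")]


def collect(*node_logs):
    """Merge the logs of all nodes into one timeline ordered by timestamp."""
    return list(heapq.merge(*node_logs))


def search(timeline, text):
    """Return all entries whose message contains the given text."""
    return [entry for entry in timeline if text in entry[2]]


timeline = collect(node_a, node_b)
errors = search(timeline, "ERROR")
```

In the merged timeline, the error on node b appears before the timeout on node a, which hints at the actual cause. Looking at the log of node a alone would only show the symptom.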

Metrics

In addition to logs, distributed systems usually emit metrics which can be an indicator for what is happening during execution. Apart from application specific metrics, standard metrics like CPU usage, memory usage, and network saturation are typically available.

Distributed tracing

In distributed tracing, we extend the tracing for a single node to a distributed system. For example, we could store all messages exchanged such that we can replay the messages to reproduce the bug. Note that this way of replaying may not always work because the state changes within the nodes might also be important for reproducing the bug. This is where deterministic replay becomes interesting.

Distributed deterministic simulation and replay

It’s a typical scenario: we see a bug occur, even during testing, but we can’t reproduce it by running the test again. If only there was a way to deterministically replay the test run! It turns out this is possible by creating a simulation layer which abstracts the underlying hardware and the network interfaces to allow for an exact replay of all the state changes and messages exchanged between the nodes. To the best of my knowledge, this was pioneered by FoundationDB.
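The core trick can be sketched in a few lines: route every source of non-determinism (here, message ordering and duplication) through a single seeded random number generator. The toy two-node protocol below is made up for illustration and is not how FoundationDB implements its simulation:

```python
import random
from typing import List, Tuple

Message = Tuple[str, str, int]  # (sender, receiver, payload)


def simulate(seed: int, steps: int = 20) -> List[Message]:
    """Run a toy message exchange; all randomness comes from one seeded RNG."""
    rng = random.Random(seed)
    pending: List[Message] = [("a", "b", 0)]
    trace: List[Message] = []
    for _ in range(steps):
        if not pending:
            break
        # The simulated network decides which in-flight message arrives next.
        msg = pending.pop(rng.randrange(len(pending)))
        trace.append(msg)
        sender, receiver, payload = msg
        # The receiver replies; sometimes the network duplicates the reply.
        copies = 2 if rng.random() < 0.3 else 1
        for _ in range(copies):
            pending.append((receiver, sender, payload + 1))
    return trace


# The same seed always yields the same run, so a failing seed can be replayed.
assert simulate(42) == simulate(42)
```

In a test suite, one would log the seed of every run so that a failing run can be repeated exactly.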

Model checking

By building a model of the concurrent and distributed aspects of the systems, we can formally specify important aspects of it. Further, we can then run model checks to see if the system is guaranteed to run correctly, according to our model. One language which is used by Amazon and Microsoft is TLA+.
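To give a feel for what a model checker does, here is a brute-force sketch in plain Python (not TLA+). The model is a made-up example: two processes each increment a shared counter with a separate read and write step. The checker explores every interleaving and reports final states which violate the invariant that the counter ends up at 2:

```python
from typing import List, Tuple

# State: (counter, pc0, local0, pc1, local1)
# pc: 0 = about to read, 1 = about to write, 2 = done
State = Tuple[int, int, int, int, int]


def successors(state: State) -> List[State]:
    counter, pc0, l0, pc1, l1 = state
    result = []
    if pc0 == 0:
        result.append((counter, 1, counter, pc1, l1))  # process 0 reads
    elif pc0 == 1:
        result.append((l0 + 1, 2, l0, pc1, l1))        # process 0 writes
    if pc1 == 0:
        result.append((counter, pc0, l0, 1, counter))  # process 1 reads
    elif pc1 == 1:
        result.append((l1 + 1, pc0, l0, 2, l1))        # process 1 writes
    return result


def check() -> List[State]:
    """Explore all reachable states; return final states where counter != 2."""
    initial: State = (0, 0, 0, 0, 0)
    seen, stack, violations = {initial}, [initial], []
    while stack:
        state = stack.pop()
        next_states = successors(state)
        if not next_states and state[0] != 2:
            violations.append(state)
        for s in next_states:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return violations


print(check())  # [(1, 2, 0, 2, 0)]: both processes read 0, one update is lost
```

Real model checkers work on the same principle but handle far larger state spaces and richer properties.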

Conclusion

Debugging distributed systems is hard, but not impossible. With the right tools and practices it is a reasonable endeavour. Did I miss something here? Let me know via email or feel free to comment on the Twitter thread.

5 Steps to Get Started with Data Processing in Python Using Apache Beam

Fri, 18 Sep 2020 16:05:33 +0200

Over two years ago, Apache Beam introduced the portability framework which allows pipelines to be written in languages other than Java, e.g. Python and Go. Here’s how to get started writing Python pipelines in Beam.

1. Creating a virtual environment

Let’s first create a virtual environment for our pipelines. Note that we want to use Python 3 because Python 2 is now obsolete and won’t be supported in future Beam releases.

> virtualenv --python=python3 venv
> source venv/bin/activate

Now let’s install the latest version of Apache Beam:

> pip install apache_beam

2. Writing a Beam Python pipeline

Next, let’s create a file called wordcount.py and write a simple Beam Python pipeline. I recommend using PyCharm or IntelliJ with the PyCharm plugin, but for now a simple text editor will also do the job:

import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipeline():
  # Load pipeline options from the script's arguments
  options = PipelineOptions()
  # Create a pipeline and run it after leaving the 'with' block
  with beam.Pipeline(options=options) as p:
    # Wrap in parentheses to avoid Python indentation issues
    (p
     # Load some dummy data, this can be replaced with a proper source later on
     | 'Create words' >> beam.Create(['to be or not to be'])
     # Split the words into one element per word
     | 'Split words' >> beam.FlatMap(lambda words: words.split(' '))
     # We are assigning a count of 1 to every word (very relevant if we had more data)
     | 'Pair with 1' >> beam.Map(lambda word: (word, 1))
     # We are interested in 10 second periods of words
     | 'Window of 10 seconds' >> beam.WindowInto(window.FixedWindows(10))
     # Group all the values (counts) of each unique word
     | 'Group by key' >> beam.GroupByKey()
     # Sum the counts for each word and return the result
     | 'Sum word counts' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
     # Just print to the console for testing
     | 'Print to console' >> beam.Map(lambda wordcount: print(wordcount))
    )

if __name__ == '__main__':
  run_pipeline()

Please see the inline comments for an explanation of what the code does.

We can now run the pipeline:

> python wordcount.py
('to', 2)
('be', 2)
('or', 1)
('not', 1)

Admittedly, that’s a very simple pipeline, but you get the gist. Later on, we will change the data source to read from Kafka.
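If you are wondering what a fixed window of 10 seconds does conceptually: it buckets every element by its timestamp. The following is a plain Python sketch of that idea, not Beam’s implementation, and the sample events are made up:

```python
from collections import Counter
from typing import Tuple


def fixed_window(timestamp: int, size: int = 10) -> Tuple[int, int]:
    """Return the [start, end) window of the given size containing the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


# (event time in seconds, word)
events = [(1, "to"), (4, "be"), (12, "to"), (19, "be"), (19, "be")]

# Count per (window, word), which is what window + group + sum amounts to
counts = Counter((fixed_window(ts), word) for ts, word in events)
for (window, word), count in sorted(counts.items()):
    print(window, word, count)
# (0, 10) be 1
# (0, 10) to 1
# (10, 20) be 2
# (10, 20) to 1
```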

3. Choosing a Runner

By default, the so-called DirectRunner runs your pipeline. The DirectRunner is only intended for local development purposes. It’s very slow and does not support distributed execution.

Let’s run the same pipeline with the Flink Runner which runs the pipeline (you guessed it) on top of Apache Flink:

> python wordcount.py --runner=FlinkRunner

What happens when you run your script with the --runner argument? Beam will look up the Runner (FlinkRunner) and attempt to run the pipeline. By default, this will download the Flink Runner JAR which contains the Beam JobService. The JobService will receive the pipeline and submit the pipeline to a Flink cluster. If you do not specify a cluster address via --flink_master, a local Flink cluster will be started.

For more information, visit the Flink Runner page. The page also contains information on other Runners, such as Google Cloud Dataflow or Apache Spark.

4. Configuring the environment

By default, the Python code will run in a so-called LOOPBACK environment. That’s an environment intended for development and testing purposes. It’s called LOOPBACK because a local Python process is started which runs the Python code. However, if you submit to a cluster, the environment will default to DOCKER which will bring up Docker containers on each of the hosts.

If you want to test the Docker-based execution locally, you can specify the following:

> python wordcount.py --runner=FlinkRunner --environment_type=DOCKER

The Beam community publishes Docker images for all releases which are used by default. You can also build and specify a custom image.

See the environment documentation page for more information on environment configuration.

5. Cross-language pipelines

As a next step, let’s read some data using Beam’s KafkaIO. Oh no! Turns out, there is no native Kafka connector in the Python API. No problem, we can use KafkaIO from the Java SDK:

import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.external.kafka import ReadFromKafka, WriteToKafka

# Address of the Kafka broker(s); placeholder value, adjust to your setup
kafka_bootstrap = 'localhost:9092'

def run_pipeline():
  with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read from Kafka' >> ReadFromKafka(consumer_config={'bootstrap.servers': kafka_bootstrap,
                                                           'auto.offset.reset': 'latest'},
                                          topics=['demo'])
     | 'Pair with 1' >> beam.Map(lambda word: (word, 1))
     | 'Window of 10 seconds' >> beam.WindowInto(window.FixedWindows(10))
     | 'Group by key' >> beam.GroupByKey()
     | 'Sum word counts' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
     | 'Write to Kafka' >> WriteToKafka(producer_config={'bootstrap.servers': kafka_bootstrap},
                                        topic='demo-output')
    )

if __name__ == '__main__':
  run_pipeline()

The essential pipeline logic hasn’t changed, but we have swapped out the simple Create / Print transforms for transforms that read from and write to Kafka. Note that these transforms are not native Python transforms but so-called external transforms. External transforms are placeholders which get replaced by the actual transform when the pipeline is built.

To understand what will happen when we run this pipeline, have a look at this image:

When we run the Python script, the pipeline is constructed. During this, a lookup to the ExpansionService is performed to resolve ReadFromKafka / WriteToKafka. Once the pipeline has been assembled, it is submitted to the JobService which also receives any required artifacts (e.g. Python libraries). The Runner then submits the pipeline to a Flink cluster.

Native transforms like GroupByKey can be processed directly by Flink. Any language-specific code runs in a separate environment for the language. The environment contains the SDK Harness which is responsible for running the language-specific code.

The good news is that you normally do not have to worry about this process. FlinkRunner, as part of the Python SDK, abstracts away a lot of the complexity.

6. Reach out to the Beam community

That’s it! For more information, check out the Beam documentation. Still stuck? Feel free to reach out to the Beam community.

Apache Arrow: The Hidden Champion of Data Analytics

Tue, 15 Sep 2020 19:05:33 +0200

In today’s open-source software stack you can find many indispensable dependencies in the form of software libraries. They are logging frameworks, testing frameworks, HTTP libraries, or code style checkers. But it doesn’t happen often that a new library emerges which changes the way we think about computing.

One such library in the data processing and data science space is Apache Arrow. Arrow is used by open-source projects like Apache Parquet, Apache Spark, pandas, and many commercial or closed-source services. It provides the following functionality:

In-memory computing
A standardized columnar storage format
An IPC and RPC framework for data exchange between processes and nodes respectively

Why is this such a big deal?

In-Memory Computing

Let’s look at how things worked before Arrow existed:

We can see that in order for Spark to read data from a Parquet file, we needed to read and deserialize the data in the Parquet format. This requires us to make a full copy of the data by loading it into memory. First, we read the data into a memory buffer, then we use Parquet’s conversion methods to turn the data, e.g. a String or a number, into the representation of our programming language. This is necessary because Parquet represents a number differently from how the Python programming language represents it.

This is a pretty big deal for performance for a number of reasons:

We are copying the data and running conversion steps on it. Because the data is in a different format, we need to read all of it and convert it before doing any computation with the data.
The data we are loading has to fit into memory. Do you only have 8GB of RAM and your data is 10GB? You are out of luck!

Now let’s look at how Apache Arrow improves this:

Instead of copying and converting the data, Arrow understands how to read and operate on the data directly. For this to work, the Arrow community defined a new file format along with operations which work directly on the serialized data. This data format can be read directly from disk without the need to load it into memory and convert / deserialize the data. Of course, parts of the data are still going to be loaded into RAM but your data does not have to fit into memory. Arrow uses memory-mapping of its files to load only as much data into memory as necessary and possible.
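To illustrate the memory-mapping part, here is a small sketch using only Python’s standard library. It does not use Arrow, and the file layout (a flat sequence of 8-byte floats) is a made-up stand-in for a real columnar file. The point is that we can read a value from the middle of the file without first loading and converting the whole file:

```python
import mmap
import os
import struct
import tempfile

n = 100_000
path = os.path.join(tempfile.mkdtemp(), "prices.bin")

# Write n little-endian doubles to disk (stand-in for a large data set)
with open(path, "wb") as f:
    f.write(struct.pack(f"<{n}d", *map(float, range(n))))

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # The operating system only needs to page in the part of the file we touch.
    offset = 50_000 * 8
    (value,) = struct.unpack_from("<d", mm, offset)

print(value)  # 50000.0
```

Because the on-disk bytes already have the layout we compute on, there is no separate deserialization step here.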

Standardized Column Storage Format

The heart of Apache Arrow is its columnar data format. What does columnar data mean? In traditional file formats or databases, data is stored row-wise. For example, consider records with the fields product, quantity, and price:

Product	Quantity	Price
Banana	3	1.8
Apple	5	2.5

In row-wise storage the data would be stored on disk row by row. That makes a lot of sense. However, if you quickly want to sum up the total price of all the items, you would have to read all the records and extract the price column from them. Wouldn’t it be better if the data already came in a format that allowed us to read the columns efficiently?

Enter columnar storage.

For columnar storage we arrange the data in columnar format. In our example this would look like this:

Product	Banana	Apple
Quantity	3	5
Price	1.8	2.5

If we store the data like this, we have all the column data in one place and can iterate over it efficiently. Not only is this more efficient in terms of extracting values but we can also take advantage of modern CPU architecture which applies the same operation (e.g. summation) on a contiguous data segment in memory. This is also referred to as Single Instruction Multiple Data (SIMD). It is very efficient due to caching and pipelining at the processor level.
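The difference between the two layouts can be sketched in plain Python with the example records from above. This only illustrates the data layout. Python itself won’t give you SIMD, and Arrow’s actual memory format is more involved:

```python
from array import array

# Row-wise: one record per item
rows = [
    {"product": "Banana", "quantity": 3, "price": 1.8},
    {"product": "Apple", "quantity": 5, "price": 2.5},
]
# To sum the prices we have to visit every record and pick out one field
total_row_wise = sum(row["price"] for row in rows)

# Columnar: one contiguous, typed array per field
columns = {
    "product": ["Banana", "Apple"],
    "quantity": array("q", [3, 5]),   # 64-bit integers
    "price": array("d", [1.8, 2.5]),  # 64-bit floats
}
# To sum the prices we only touch the price column
total_columnar = sum(columns["price"])

assert total_row_wise == total_columnar
```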

How much faster is this? The simple answer: a lot faster! Here are some performance numbers from Dremio:

Parquet and C++: Reading data into Parquet from C++ at up to 4GB/s

Pandas: Reading into pandas up to 10GB/s

Clearly, the speedup depends on the application but there is no doubt that, besides its functional advantages, Arrow can provide a tremendous performance boost, to the point that it enables new applications which were not feasible before.

Supported languages

The following languages are supported by Apache Arrow:

C++
C#
Go
Java
JavaScript
Rust
Python (through the C++ library)
Ruby (through the C++ library)
R (through the C++ library)
MATLAB (through the C++ library).

Not just a more efficient file format

IPC (inter-process communication)

It is important to understand that Apache Arrow is not merely an efficient file format. The Arrow library also provides interfaces for communicating across processes or nodes. That means that processes, e.g. a Python and a Java process, can efficiently exchange data without copying it locally. Nodes in a computer network also benefit from this: while the data has to be transferred over the network, we only need to transfer the relevant columns. In both cases, we don’t have to deserialize data because Arrow understands how to operate directly on the data.

According to Dremio, the following speedup was achieved in PySpark:

IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark

RPC (remote procedure call)

Within Arrow there is a project called Flight which makes it easy to build Arrow-based data endpoints and interchange data between them. Flight is optimized in terms of parallel data access. It is possible to receive data from multiple endpoints in parallel and request data from a new endpoint while still reading from an endpoint. This highly parallel way of interacting with the services can provide a performance boost for network transfers.

Dremio observed with their Arrow Flight connector that you could achieve 20-50x better performance than ODBC over a TCP connection.

What’s next? Building a query engine on top of Arrow

As of now, to use Arrow you need to know how Arrow works and how the data is stored. Many projects such as pandas have taken advantage of that. However, the missing piece is a query engine on top of Arrow capabilities which would allow users to easily query and process data stored in the Arrow format. The Arrow community is working on that.

Post Mortem

There is/was a lively discussion on Twitter which brought up DuckDB and vaex as query engines built on top of Arrow. Also, it was mentioned that DataFusion, a Rust-based query engine, has been donated to Arrow.

Images taken from the Apache Arrow site.

10 Reasons for Choosing Apache Pulsar over Apache Kafka

Mon, 31 Aug 2020 09:05:33 +0200

I’ve been taking a closer look at Apache Pulsar and how it relates to Apache Kafka. In case you are curious, here are ten of my findings:

Pulsar’s brokers are stateless. The state is kept in a separate storage layer (Apache BookKeeper). This means you can leverage a new broker without the need to re-partition existing data, which is required by Kafka.
Pulsar’s storage layer is organized into segments which are spread across all storage nodes. Segments can be written to the main storage or off-loaded to a different type of storage. This allows Pulsar to offer tiered storage which Kafka does not support yet.
For replication, Pulsar uses a quorum-based algorithm, as opposed to a leader/follower-based approach in Kafka. The guarantees are the same, but the quorum approach tends to yield lower and more consistent latencies.
Pulsar includes support for multi-tenancy which allows multiple user groups to share the same cluster, either via access control, or in entirely different namespaces. In Kafka, this is still under discussion.
Pulsar offers full end-to-end encryption from the client to the storage nodes. Kafka currently does not have end-to-end encryption.
Pulsar speaks other protocols such as RabbitMQ, AMQP, or even Kafka (!) which makes it easy to integrate Pulsar with existing applications. Further, there is support for Presto.
Pulsar Functions is a way to do lightweight stream processing on top of Pulsar, conceptually similar to Kafka Streams. What I found interesting is that Pulsar’s functions are directly deployed on the broker nodes, whereas Kafka’s streams run as separate applications.
The Pulsar community has been very open about the limitations of Pulsar Functions, e.g. state management and DAG flows. In case Pulsar Functions doesn’t do it for you, there is an actively maintained Pulsar <> Apache Flink connector.
It’s not all sunshine and rainbows: Pulsar requires two systems, Apache BookKeeper and Apache ZooKeeper, whereas Kafka only requires ZooKeeper. More systems could increase the operational complexity. On the other hand, it’s also the reason why Pulsar provides additional flexibility.
Pulsar is not new. It was originally developed and used at Yahoo, later donated to the Apache Software Foundation in 2016. It’s used by Tencent, Splunk, and many others at large scale.
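To see why a quorum can yield lower and more consistent latencies, consider this back-of-the-envelope sketch. It is a simplification with made-up latency numbers and does not model either system’s actual replication protocol. A quorum write completes once enough replicas have acknowledged, while a write which waits for every replica is only as fast as the slowest one:

```python
from typing import List


def quorum_latency(ack_latencies_ms: List[int], quorum: int) -> int:
    """The write completes when the quorum-th fastest acknowledgement arrives."""
    return sorted(ack_latencies_ms)[quorum - 1]


def all_replicas_latency(ack_latencies_ms: List[int]) -> int:
    """The write completes when the slowest replica has acknowledged."""
    return max(ack_latencies_ms)


# Three replicas, one of them is having a slow moment
acks = [4, 5, 120]
print(quorum_latency(acks, quorum=2))  # 5
print(all_replicas_latency(acks))      # 120
```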

Obviously, this is not a full comparison of Apache Pulsar and Apache Kafka, but rather a compilation of the things I was surprised to find out about Pulsar, coming from the Kafka landscape.

Join the conversation on Twitter.

How Apache Beam Runs on Top of Apache Flink

Sat, 22 Feb 2020 08:05:33 +0100

Introduction

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

What is Apache Beam

Apache Beam is an open-source, unified model for defining batch and streaming data-parallel processing pipelines. It is unified in the sense that you use a single API, in contrast to using a separate API for batch and streaming as is the case in Flink. Beam was originally developed by Google which released it in 2014 as the Cloud Dataflow SDK. In 2016, it was donated to the Apache Software Foundation under the name Beam. It has been developed by the open-source community ever since. With Apache Beam, developers can write data processing jobs, also known as pipelines, in multiple languages, e.g. Java, Python, Go, SQL. A pipeline is then executed by one of Beam’s Runners. A Runner is responsible for translating Beam pipelines such that they can run on an execution engine. Every supported execution engine has a Runner. The following Runners are available: Apache Flink, Apache Spark, Apache Samza, Hazelcast Jet, Google Cloud Dataflow, and others.

The execution model, as well as the API of Apache Beam, are similar to Flink’s. Both frameworks are inspired by the MapReduce, MillWheel, and Dataflow papers. Like Flink, Beam is designed for parallel, distributed data processing. Both have similar transformations, support for windowing, event/processing time, watermarks, timers, triggers, and much more. However, Beam, not being a full runtime, focuses on providing the framework for building portable, multi-language batch and stream processing pipelines such that they can be run across several execution engines. The idea is that you write your pipeline once and feed it with either batch or streaming data. When you run it, you just pick one of the supported backends to execute it. A large integration test suite in Beam called “ValidatesRunner” ensures that the results will be the same, regardless of which backend you choose for the execution.

One of the most exciting developments in the Beam technology is the framework’s support for multiple programming languages including Java, Python, Go, Scala and SQL. Essentially, developers can write their applications in a programming language of their choice. Beam, with the help of the Runners, translates the program to one of the execution engines, as shown in the diagram below.

Reasons to use Beam with Flink

Why would you want to use Beam with Flink instead of directly using Flink? Ultimately, Beam and Flink complement each other and provide additional value to the user. The main reasons for using Beam with Flink are the following:

Beam provides a unified API for both batch and streaming scenarios.
Beam comes with native support for different programming languages, like Python or Go, with all their libraries like NumPy, pandas, TensorFlow, or TFX.
You get the power of Apache Flink like its exactly-once semantics, strong memory management and robustness.
Beam programs run on your existing Flink infrastructure or infrastructure for other supported Runners, like Spark or Google Cloud Dataflow.
You get additional features like side inputs and cross-language pipelines that are not supported natively in Flink but only supported when using Beam with Flink.

The Flink Runner in Beam

The Flink Runner in Beam translates Beam pipelines into Flink jobs. The translation can be parameterized using Beam’s pipeline options which are parameters for settings like configuring the job name, parallelism, checkpointing, or metrics reporting.

If you are familiar with a DataSet or a DataStream, you will have no problems understanding what a PCollection is. PCollection stands for parallel collection in Beam and is exactly what DataSet/DataStream would be in Flink. Due to Beam’s unified API, transformations only have one type of result: PCollection.

Beam pipelines are composed of transforms. Transforms are like operators in Flink and come in two flavors: primitive and composite transforms. The beauty of all this is that Beam only comes with a small set of primitive transforms which are:

Source (for loading data)
ParDo (think of a flat map operator on steroids)
GroupByKey (think of keyBy() in Flink)
AssignWindows (windows can be assigned at any point in time in Beam)
Flatten (like a union() operation in Flink)

Composite transforms are built by combining the above primitive transforms. For example, Combine = GroupByKey + ParDo.
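The composition can be sketched in plain Python. The functions below are simplified stand-ins and not Beam’s API: group_by_key collects the values per key, par_do applies a function which may emit any number of outputs per element, and the composite is just one applied after the other:

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, List[int]]]:
    """Primitive: collect all values which share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    yield from groups.items()


def par_do(elements: Iterable, fn: Callable) -> Iterator:
    """Primitive: apply fn to each element; fn may emit zero or more outputs."""
    for element in elements:
        yield from fn(element)


def combine_per_key(pairs, combine_fn: Callable[[List[int]], int]) -> Iterator:
    """Composite: GroupByKey followed by a ParDo which reduces each group."""
    return par_do(group_by_key(pairs), lambda kv: [(kv[0], combine_fn(kv[1]))])


pairs = [("to", 1), ("be", 1), ("or", 1), ("to", 1), ("be", 1)]
print(dict(combine_per_key(pairs, sum)))  # {'to': 2, 'be': 2, 'or': 1}
```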

Flink Runner Internals

Although using the Flink Runner in Beam does not require understanding its internals, we provide more details of how the Flink Runner works in Beam to share knowledge of how the two frameworks can integrate and work together to provide state-of-the-art streaming data pipelines.

The Flink Runner has two translation paths. Depending on whether we execute in batch or streaming mode, the Runner either translates into Flink’s DataSet or into Flink’s DataStream API. Since multi-language support has been added to Beam, another two translation paths have been added. To summarize the four modes:

The Classic Flink Runner for batch jobs: Executes batch Java pipelines
The Classic Flink Runner for streaming jobs: Executes streaming Java pipelines
The Portable Flink Runner for batch jobs: Executes Java as well as Python, Go and other supported SDK pipelines for batch scenarios
The Portable Flink Runner for streaming jobs: Executes Java as well as Python, Go and other supported SDK pipelines for streaming scenarios

The “Classic” Flink Runner in Beam

The classic Flink Runner was the initial version of the Runner, hence the “classic” name. Beam pipelines are represented as a graph in Java which is composed of the aforementioned composite and primitive transforms. Beam provides translators which traverse the graph in topological order. Topological order means that we start from all the sources first as we iterate through the graph. Presented with a transform from the graph, the Flink Runner generates the API calls as you would normally when writing a Flink job.
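A topological traversal can be sketched with Kahn’s algorithm. The pipeline graph below is a made-up example, and the code illustrates the ordering only, not the Runner’s actual Java translator:

```python
from collections import deque
from typing import Dict, List


def topological_order(graph: Dict[str, List[str]]) -> List[str]:
    """graph maps each transform to the transforms consuming its output."""
    indegree = {node: 0 for node in graph}
    for consumers in graph.values():
        for consumer in consumers:
            indegree[consumer] = indegree.get(consumer, 0) + 1
    # Start with all sources, i.e. transforms without inputs
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)  # this is where a translator would be invoked
        for consumer in graph.get(node, []):
            indegree[consumer] -= 1
            if indegree[consumer] == 0:
                ready.append(consumer)
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


pipeline = {
    "ReadA": ["Flatten"],
    "ReadB": ["Flatten"],
    "Flatten": ["GroupByKey"],
    "GroupByKey": ["Sum"],
    "Sum": [],
}
print(topological_order(pipeline))
# ['ReadA', 'ReadB', 'Flatten', 'GroupByKey', 'Sum']
```

Every transform is visited only after all of its inputs have been visited, which is what allows each translation step to refer to the already translated inputs.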

While Beam and Flink share very similar concepts, there are enough differences between the two frameworks that make Beam pipelines impossible to be translated 1:1 into a Flink program. In the following sections, we will present the key differences:

Serializers vs Coders

When data is transferred over the wire in Flink, it has to be turned into bytes. This is done with the help of serializers. Flink has a type system to instantiate the correct serializer for a given type, e.g. StringTypeSerializer for a String. Apache Beam also has its own type system which is similar to Flink’s but uses slightly different interfaces. Serializers are called Coders in Beam. In order to make a Beam Coder run in Flink, we have to make the two serializer types compatible. This is done by creating a special Flink type information that looks like the one in Flink but calls the appropriate Beam coder. That way, we can use Beam’s coders although we are executing the Beam job with Flink. Flink operators expect a TypeInformation, e.g. StringTypeInformation, for which we use a CoderTypeInformation in Beam. The type information returns the serializer for which we return a CoderTypeSerializer, which calls the underlying Beam Coder.
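This is essentially the adapter pattern. Here is a sketch in Python with hypothetical class names and a simplified length-prefixed wire format. The real classes are written in Java and have richer interfaces:

```python
import io


class Utf8Coder:
    """Stand-in for a Beam-style coder: value <-> bytes."""

    def encode(self, value: str) -> bytes:
        return value.encode("utf-8")

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8")


class CoderBackedSerializer:
    """Stand-in for the serializer interface the engine expects; delegates to a coder."""

    def __init__(self, coder) -> None:
        self.coder = coder

    def serialize(self, value, out: io.BytesIO) -> None:
        payload = self.coder.encode(value)
        out.write(len(payload).to_bytes(4, "big"))  # length prefix
        out.write(payload)

    def deserialize(self, inp: io.BytesIO):
        length = int.from_bytes(inp.read(4), "big")
        return self.coder.decode(inp.read(length))


class CoderTypeInfo:
    """Stand-in for the type information handed to the engine's operators."""

    def __init__(self, coder) -> None:
        self.coder = coder

    def create_serializer(self) -> CoderBackedSerializer:
        return CoderBackedSerializer(self.coder)


serializer = CoderTypeInfo(Utf8Coder()).create_serializer()
buffer = io.BytesIO()
serializer.serialize("to be or not to be", buffer)
buffer.seek(0)
print(serializer.deserialize(buffer))  # to be or not to be
```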

Read

The Read transform provides a way to read data into your pipeline in Beam. The Read transform is supported by two wrappers in Beam, the SourceInputFormat for batch processing and the UnboundedSourceWrapper for stream processing.

ParDo

ParDo is the Swiss Army knife of Beam and can be compared to a RichFlatMapFunction in Flink with additional features such as SideInputs, SideOutputs, State and Timers. For batch processing, the Flink Runner translates ParDo using the FlinkDoFnFunction or the FlinkStatefulDoFnFunction. For streaming scenarios, the translation uses the DoFnOperator, which takes care of checkpointing and buffering of data during checkpoints, watermark emissions, and maintenance of state and timers. This is all executed through Beam’s interface called the DoFnRunner, which encapsulates Beam-specific execution logic, like retrieving state, executing state and timers, or reporting metrics.

Side Inputs

In addition to the main input, ParDo transforms can have a number of side inputs. A side input can be a static set of data that you want to have available at all parallel instances. However, it is more flexible than that. You can have keyed and even windowed side input which updates based on the window size. This is a very powerful concept which does not exist in Flink but is added on top of Flink using Beam.

AssignWindows

In Flink, windows are assigned by the WindowOperator when you call window() in the API. In Beam, windows can be assigned at any point in time. Any element is implicitly part of a window. If no window is assigned explicitly, the element is part of the GlobalWindow. Window information is stored for each element in a wrapper called WindowedValue. The window information is only used once we issue a GroupByKey.

GroupByKey

Most of the time it is useful to partition the data by a key. In Flink, this is done via the keyBy() API call. In Beam the GroupByKey transform can only be applied if the input is of the form KV<Key, Value>. Unlike Flink where the key can even be nested inside the data, Beam enforces the key to always be explicit. The GroupByKey transform then groups the data by key and by window which is similar to what keyBy(..).window(..) would give us in Flink. Beam has its own set of libraries to do that because Beam has its own set of window functions and triggers. Essentially, GroupByKey is very similar to what the WindowOperator does in Flink.
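Grouping by key and by window can be sketched as follows. The WindowedValue class here is a simplified Python stand-in for illustration (the real wrapper carries more information), and triggers are left out entirely:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Window = Tuple[int, int]  # [start, end) in seconds


@dataclass(frozen=True)
class WindowedValue:
    value: Tuple[str, int]  # an explicit (key, value) pair
    timestamp: int
    window: Window


def group_by_key_and_window(
    elements: Iterable[WindowedValue],
) -> Dict[Tuple[str, Window], List[int]]:
    """Group values which share both the key and the window."""
    groups = defaultdict(list)
    for element in elements:
        key, value = element.value
        groups[(key, element.window)].append(value)
    return dict(groups)


elements = [
    WindowedValue(("to", 1), timestamp=1, window=(0, 10)),
    WindowedValue(("be", 1), timestamp=4, window=(0, 10)),
    WindowedValue(("to", 1), timestamp=12, window=(10, 20)),
    WindowedValue(("to", 1), timestamp=15, window=(10, 20)),
]
print(group_by_key_and_window(elements))
# {('to', (0, 10)): [1], ('be', (0, 10)): [1], ('to', (10, 20)): [1, 1]}
```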

Flatten

The Flatten operator takes multiple DataSet/DataStreams, called P[arallel]Collections in Beam, and combines them into one collection. This is equivalent to Flink’s union() operation.

The “Portable” Flink Runner in Beam

The portable Flink Runner in Beam is the evolution of the classic Runner. Classic Runners are tied to the JVM ecosystem, but the Beam community wanted to move past this and also execute Python, Go and other languages. This adds another dimension to Beam in terms of portability because, as previously mentioned, Beam already had portability across execution engines. It was necessary to change the translation logic of the Runner to be able to support language portability.

There are two important building blocks for portable Runners:

A common pipeline format across all the languages: The Runner API
A common interface during execution for the communication between the Runner and the code written in any language: The Fn API

The Runner API provides a universal representation of the pipeline as Protobuf which contains the transforms, types, and user code. Protobuf was chosen as the format because every language has libraries available for it. Similarly, for the execution part, Beam introduced the Fn API interface to handle the communication between the Runner/execution engine and the user code that may be written in a different language and executes in a different process. Fn API is pronounced “fun API”, you may guess why.

How Are Beam Programs Translated In Language Portability?

Users write their Beam pipelines in one language, but they may get executed in an environment based on a completely different language. How does that work? To explain that, let’s follow the lifecycle of a pipeline. Let’s suppose we use the Python SDK to write the pipeline. Before submitting the pipeline via the Job API to Beam’s JobServer, Beam converts it to the Runner API, the language-agnostic format we described before. The JobServer is also a Beam component that handles the staging of the required dependencies during execution. The JobServer will then kick off the translation which is similar to the classic Runner. However, an important change is the so-called ExecutableStage transform. It is essentially a ParDo transform that we already know but designed for holding language-dependent code. Beam tries to combine as many of these transforms as possible into one “executable stage”. The result again is a Flink program which is then sent to the Flink cluster and executed there. The major difference compared to the classic Runner is that during execution we will start environments to execute the aforementioned ExecutableStages. The following environments are available:

Docker-based (the default)
Process-based (a simple process is started)
Externally-provided (K8s or other schedulers)
Embedded (intended for testing and only works with Java)

Environments hold the SDK Harness which is the code that handles the execution and the communication with the Runner over the Fn API. For example, when Flink executes Python code, it sends the data to the Python environment containing the Python SDK Harness. Sending data to an external process involves a minor overhead which we have measured to be 5-10% slower than the classic Java pipelines. However, Beam uses a fusion of transforms to execute as many transforms as possible in the same environment which share the same input or output. That’s why in real-world scenarios the overhead could be much lower.
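Why fusion helps can be shown with a toy model. The Environment class below is a made-up stand-in which merely counts how often data crosses the boundary to the external process. The real Fn API streams data over gRPC and is considerably more involved:

```python
from typing import Callable, List


class Environment:
    """Stand-in for an external SDK environment; counts boundary crossings."""

    def __init__(self) -> None:
        self.round_trips = 0

    def run(self, fn: Callable, elements: List) -> List:
        self.round_trips += 1  # the batch is sent over and the results come back
        return [fn(element) for element in elements]


def fuse(*fns: Callable) -> Callable:
    """Chain several element-wise functions into a single function."""
    def fused(element):
        for fn in fns:
            element = fn(element)
        return element
    return fused


data = ["  To ", " BE", "or  "]
transforms = [str.strip, str.lower, len]

# Without fusion: every transform is a separate trip to the environment
unfused_env = Environment()
result = data
for fn in transforms:
    result = unfused_env.run(fn, result)

# With fusion: all three transforms run in one trip
fused_env = Environment()
fused_result = fused_env.run(fuse(*transforms), data)

print(result, unfused_env.round_trips)      # [2, 2, 2] 3
print(fused_result, fused_env.round_trips)  # [2, 2, 2] 1
```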

Environments can be present for many languages. This opens up an entirely new type of pipeline: cross-language pipelines. In cross-language pipelines we can combine transforms of two or more languages, e.g. a machine learning pipeline with the feature generation written in Java and the learning written in Python. All this can be run on top of Flink.

Conclusion

Using Apache Beam with Apache Flink combines (a.) the power of Flink with (b.) the flexibility of Beam. All it takes to run Beam is a Flink cluster, which you may already have. Apache Beam’s fully-fledged Python API is probably the most compelling argument for using Beam with Flink, but the unified API which allows you to “write once” and “execute anywhere” is also very appealing to Beam users. On top of this, features like side inputs and a rich connector ecosystem are also reasons why people like Beam.

With the introduction of schemas, a new format for handling type information, Beam is heading in a similar direction as Flink, whose type system is essential for the Table API and SQL. Speaking of which, the next Flink release will include a Python version of the Table API which is based on the language portability of Beam. Looking ahead, the Beam community plans to extend the support for interactive programs like notebooks. TFX, which is built with Beam, is a very powerful way to solve many problems around training and validating machine learning models.

For many years, Beam and Flink have inspired and learned from each other. With the Python support in Flink being based on Beam, they only seem to come closer to each other. That’s all the better for the community, and users also have more options and functionality to choose from.

This blog post is co-authored by Markos Sfikas from Ververica. It’s a summary of the talk “Beam on Flink: How Does It Actually Work?” which I gave at FlinkForward Berlin 2019. It has also been featured on the Flink blog.

The CAP Theorem's Common Misconception

Thu, 19 Jul 2018 15:10:12 +0200

A couple of days ago I tweeted about the CAP theorem, which sparked some interest and a small conversation on Twitter. Since tweets tend to disappear into the void pretty quickly, I thought I’d also put this here:

The Misconception around the CAP Theorem

The CAP theorem in distributed systems states that you cannot satisfy more than two of these three properties:

Consistency
Availability
Partition Tolerance

But here’s what’s commonly misunderstood about the CAP theorem:

Being Partition Tolerant or not is not really a choice. In distributed systems, partitions may occur at any time. It’s more of a choice what to give up during a network partition: Consistency or Availability.
Giving up Availability implies a part of your system will become unresponsive or return an error during a network partition.
Giving up Consistency implies your system might return stale data (stale reads) because data was written to some nodes but not yet to all due to a network partition.
Consistency according to CAP is Linearizability – not what the ‘C’ in ACID means. Linearizability enforces a single linear order on operations, e.g. read(key) should return the latest written value of key.
Consistency in A[C]ID is ensured through Serializability, which guarantees that the result of concurrent transactions in databases is equivalent to some serial execution of them. No particular order is required.
Whether a system is actually CP or AP is often very hard to tell. It depends on the configuration of the system, e.g. number of replicas, consensus algorithms, quorum sizes, bugs, etc.
The CAP theorem is a starting point for characterizing systems, but to properly understand guarantees of distributed systems, you have to dive deeper. Here’s an excellent article on this by @martinkl: martin.kleppmann.com/2015/05/11/ple…
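
The choice between Consistency and Availability during a partition can be sketched in a few lines of Java. This is a toy model of a single replica, not the behavior of any particular system; the class PartitionedReplica and its Mode enum are hypothetical names chosen for illustration.

```java
import java.util.Optional;

// Toy model of one replica that is cut off from the rest of the cluster.
// All names are hypothetical; real systems are far more subtle.
class PartitionedReplica {
    enum Mode { CP, AP }

    private final Mode mode;
    private final String localValue;   // last value this replica has seen
    private final boolean partitioned; // true if cut off from the majority

    PartitionedReplica(Mode mode, String localValue, boolean partitioned) {
        this.mode = mode;
        this.localValue = localValue;
        this.partitioned = partitioned;
    }

    // CP: refuse to answer while partitioned (gives up Availability).
    // AP: answer with the local, possibly stale value (gives up Consistency).
    Optional<String> read() {
        if (partitioned && mode == Mode.CP) {
            return Optional.empty();
        }
        return Optional.of(localValue);
    }

    public static void main(String[] args) {
        // Assume the other side of the partition already holds "v2",
        // while this replica has only seen "v1".
        System.out.println(new PartitionedReplica(Mode.CP, "v1", true).read()); // prints Optional.empty
        System.out.println(new PartitionedReplica(Mode.AP, "v1", true).read()); // prints Optional[v1]
    }
}
```

Without a partition both modes return the same value, which is why the trade-off only shows up once the network misbehaves.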

Hope you found this useful. Let me know on Twitter if you have any remarks.

Who Is Behind Maven Central?

Fri, 29 Dec 2017 11:15:14 +0100

Ever wondered who or what is behind the central repository which is used to download the dependencies in your Maven / Gradle / SBT builds? Did you specify the repository URL in your build file? You didn’t?

The actual URL of the repository is http://repo1.maven.org/maven2. You can browse all the artifacts by package name if you click the link.

This URL is hardcoded inside Maven and is used to look up and retrieve build artifacts.
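
To illustrate how an artifact lookup turns into a URL, here is a small Java sketch of the default repository layout. The class MavenCentralUrl and its method are hypothetical helpers, not Maven code, and the coordinate used in the example is only an illustration. The sketch ignores classifiers and snapshot versions.

```java
// Sketch of how a Maven coordinate (groupId, artifactId, version) maps to a
// path below the repository URL. Hypothetical helper, not part of Maven.
class MavenCentralUrl {

    static String artifactUrl(String repoUrl, String groupId, String artifactId,
                              String version, String extension) {
        // Dots in the groupId become directory separators.
        String groupPath = groupId.replace('.', '/');
        return repoUrl + "/" + groupPath + "/" + artifactId + "/" + version
            + "/" + artifactId + "-" + version + "." + extension;
    }

    public static void main(String[] args) {
        // Repository URL as quoted in the text above.
        String repoUrl = "http://repo1.maven.org/maven2";
        System.out.println(
            artifactUrl(repoUrl, "org.apache.commons", "commons-lang3", "3.12.0", "jar"));
    }
}
```

This is also why you can browse the repository by package name: the directory structure mirrors the groupId.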

In fact, when you build Maven itself, this URL is used to retrieve Maven’s dependencies from Maven Central (crazy, no?). Why? Because Maven is built with…Maven :)

Here is a snippet from Maven’s bootstrap code for constructing the initial Maven version which is then used to build Maven itself:

Hardcoded URL from ArtifactDownloader:

private static final String REPO_URL = "http://repo1.maven.org/maven2";

Note that this bootstrap process is pretty common in the compiler domain where you want to write the compiler itself in the language for which you build the compiler. Likewise, the authors of Maven wanted to use Maven to build…Maven.

The author already had some doubts whether hardcoding the URL would be a good idea and, thus, commented in the same file:

// TODO: use super POM?
Repository repository = new Repository( REPO_URL, Repository.LAYOUT_DEFAULT );

In fact, the Maven super POM sets the Maven Central URL. If you haven’t worked with Maven, the pom.xml (POM) is the build file where you define the dependencies and properties of your project. When you define your pom.xml, by default, you inherit from the super POM which defines the central repository in the <repositories> section:

<repositories>
    <repository>
      <id>central</id>
      <name>Central Repository</name>
      <url>https://repo.maven.apache.org/maven2</url>
      ...
    </repository>
</repositories>

However, it looks like it is not possible to go without the Maven Central repository because Maven contains code which checks if Maven Central is defined as a repository and, if not, re-adds it:

if ( !definedRepositories.contains( RepositorySystem.DEFAULT_REMOTE_REPO_ID ) ) {
    try {
        request.addRemoteRepository(
            repositorySystem.createDefaultRemoteRepository( request ) );
    } catch ( Exception e ) {
        throw new MavenExecutionRequestPopulationException(
            "Cannot create default remote repository.", e );
    }
}

Taken from DefaultMavenExecutionRequestPopulator.java.

And indeed, the above method createDefaultRemoteRepository uses the default central repository URL:

public ArtifactRepository createDefaultRemoteRepository( MavenExecutionRequest request ) throws Exception {
    return createRepository( RepositorySystem.DEFAULT_REMOTE_REPO_URL, RepositorySystem.DEFAULT_REMOTE_REPO_ID, true, ArtifactRepositoryPolicy.UPDATE_POLICY_DAILY, false, ArtifactRepositoryPolicy.UPDATE_POLICY_DAILY, ArtifactRepositoryPolicy.CHECKSUM_POLICY_WARN );
}

DEFAULT_REMOTE_REPO_URL is stored in the RepositorySystem class:

String DEFAULT_REMOTE_REPO_URL = "https://repo.maven.apache.org/maven2";

So you can add your own repositories through the <repository> XML tag but you can’t get rid of the default. That’s interesting.

Wait, now the URL is https://repo.maven.apache.org/maven2/?!

What is going on here?

Well, the truth is that Maven Central is a CDN. A CDN is basically a set of servers which distribute data (e.g. web pages, videos, Maven artifacts, etc.) in such a way that the data is reliably and quickly accessible throughout the internet.

Who operates the CDN? Surprisingly, this is a service of a company called Sonatype Inc. Sonatype offers commercial services for open-source software with products related to repositories, continuous integration, and security. Their non-commercial branch is sonatype.org which gives back some of their services to open-source projects. One of these services is Maven Central.

The Maven Central website lists “Producers” who publish artifacts to Maven Central. The following organizations are currently “Producers”:

Apache
Atlassian
eXo Platform
JBoss/RedHat
Liferay
Oracle / java.net

As we discovered, the default Maven Central URL has been changed from http://repo1.maven.org/maven2 to https://repo.maven.apache.org/maven2/. Keep in mind that maven.org is a domain owned by Sonatype, as opposed to apache.org which is owned by the Apache Software Foundation.

Maven Central is a crucial component for Apache projects. As we have learned, Apache projects publish their artifacts on Maven Central and almost all of their dependencies are hosted on Maven Central as well.

Some stats of the central repository:

Total number of artifacts indexed (GAV):	2,426,748
Total number of unique artifacts indexed (GA):	215,031

That’s a lot of artifacts. If one day, the central repository breaks for whatever reason, developers would be in for a treat :)

Luckily, thanks to the repository URL now defaulting to repo.maven.apache.org, we have the Apache Software Foundation to completely take over any traffic in case the Sonatype CDN shuts down. And I’m assuming they also have a copy of the artifacts.

Phew. So we are good after all.

How do other build systems serve their build dependencies? Is the situation any different there? Not really. Python build tools, for instance, use PyPI which is also backed by a central repository. Ruby Gemfiles use RubyGems. C/C++ build tools typically assume the dependencies have already been installed, e.g. via the package manager of the operating system. Of course, the package manager also accesses central repositories. I have yet to find out who runs all these repositories :)

Thanks for reading this post. It was fun to learn about the central repository that we mostly take for granted. Thank you Sonatype for your service!

If you don’t know Maven, it’s a great build system which powers a lot of open-source projects. Check it out on GitHub. Or check out The 5 Minute Guide to Maven.

EDIT: As pointed out by Robert Scholte, there are two Maven Central mirrors listed on the maven.org Central Repository page. One at ibiblio.org, the other hosted by Google.

So we are really good after all :)

An Introduction to Apache Software — What you need to know

Fri, 03 Feb 2017 09:05:33 +0100

A revised version of this post has been published on the blog of the Apache Software Foundation.

Introduction

If you’re reading this post, you have already been using Apache software. The Apache web server is used by about every second web page on the WWW, including this website. You could say, Apache software runs the WWW. But it doesn’t stop there. Apache is more than a web server. Apache software also runs on mobile devices. Apache software is part of enterprise and banking software. Apache software is literally everywhere in today’s software world.

Apache has become a powerful brand and a philosophy of software development which remains unmatched in the world of open-source. Although the Apache© trademark is a known term even among the less tech-savvy people, many people struggle to define what Apache software really is about, and what role it plays for today’s software development and businesses.

Over the last few years I’ve learned a lot about Apache through my work on Apache Flink and Apache Beam with Data Artisans. In this post I present some of the things I learned by giving an overview of the Apache Software Foundation and its history. Moreover, I want to show how the “Apache way” of software development shaped open-source software development as it is today.

The History of the Foundation

The Apache Software Foundation (ASF) was founded in 1999 by a group of open-source enthusiasts and some corporate entities which were eager to sponsor the foundation’s work. Among the first projects was the famous web server called Apache HTTP, which is also simply referred to as “Apache web server”. At that time, the Apache web server was already quite mature. In fact, not only did the Apache web server give the foundation its name but it became the role model for the “Apache way” of open and collaborative software development. To see how that took place, we have to go back a bit further in time.

A Web Server goes a long way

As early as 1994, Rob McCool at the National Center for Supercomputing Applications (NCSA) in Illinois created a simple web server which served pages using one of the early versions of today’s HTTP protocol. Web servers were not ubiquitous like they are today. In those days, the Web was still young and there was only one web browser, developed at CERN where the WWW had been invented only shortly before. Rob’s web server was widely adopted throughout the Web due to its extensible nature. When its source code spread, web page administrators around the world developed extensions for the web server and helped to fix errors. When Rob left the NCSA in late 1994, he left a void because nobody maintained the web server along with its extensions. It quickly became apparent that the group of existing users and developers needed to join forces to be able to maintain NCSA HTTP.

At the beginning of 1995, the Apache Group was formed to coordinate the development of the NCSA HTTP web server. This led to the first release of the Apache web server in April 1995. During the same time, development at NCSA started picking up again and the two teams were in lively exchange about future ideas to improve the web server. However, the Apache Group was able to develop its version of the web server much faster because of its structure, which encouraged worldwide collaboration. At the end of the year, the Apache server had its architecture redone to be modular and execute much faster.

Apache 1.0 was finally released on Dec 1, 1995. At the beginning of 1996, only one year after the Apache Group had formed, the Apache web server had already surpassed the popularity of NCSA HTTP, which had been the most popular web server on the Internet until then. The web server continued to thrive and is still the most widely used web server as of this writing.

The Rise of the Foundation

The team effort that led to the development and adoption of the Apache web server was a huge success. The Apache project kept receiving feedback and code changes (also called patches) from people all over the world. Could this be the development model for future software? It became apparent that more and more projects started to organize their groups similarly to the Apache Group. Out of this need, the Apache Software Foundation (ASF) was formed as a non-profit corporation in June 1999.

The ASF became a framework for open-source software development which, in its entirety, remains unmatched by other forms of open-source software development. The secret of its success is its unique approach to open-source software development where the foundation does not get in the way of the individual developers. Instead, it focuses on providing developers with the infrastructure and a minimal set of rules to manage their projects. The projects themselves remain relatively autonomous.

Apache Governance - How does the foundation work?

There are about 200 independent projects running under the Apache umbrella. The question may arise: how does the foundation govern its projects? First of all, the ASF is an organization that is run almost entirely by developers. Developers hate to spend too much time on administrative things (who doesn’t?), so the organization is structured in a way that requires little central control but favors autonomy of the projects which run under its umbrella.

Per-Project Entities

For every project (e.g. Apache HTTP, Apache Hadoop, Apache Commons, Apache Flink, Apache Beam, etc.), there is a Project Management Committee (PMC), Committers, and Users.

Project Management Committee (PMC)

The Project Management Committee (PMC) manages a project and decides on its development direction. In that sense it has a similar function as the original Apache Group which led the development of the Apache web server. When a new project is formed, the proposers constitute the initial PMC. Later on, new PMC members can be elected by the existing PMC. Note that this does not require the permission of the central instances of the foundation. PMC members are also committers (see below).

Committers

Committers can modify the code base of the project but they can’t make major project-changing decisions. They are trusted by the PMC to work in the interest of the project. When they contribute changes, they commit (thus, the name) these changes to the project. Committers don’t only change code but they can also update documentation or write blog posts on the project’s website. Committers are selected from the users of the project; more about this process in the Meritocracy section.

Users

Users are as important as the developers because they try out the project’s software, report bugs, and request new features. The term is slightly confusing because, in the Apache world, most users are actually developers themselves. They are users in the sense that they are using an Apache project for their own work; they are not actively developing the Apache software they are using. However, they may also provide patches to the Committers. Users who contribute to a project are called Contributors. Contributors may eventually become committers.

In the following, the per-project entities are represented as circles. They exist for every project. The larger the circles, the more people. The redder the background color, the more decisional power the group has. Note that the user group circle is too large to fit in the image which is an accurate depiction of the user/developer ratio :)

Foundation-Wide Entities

The ASF does not work without some central instances. Here are the most important entities:

Apache Members

Apache members are the heart of the foundation. A prerequisite to becoming a member is to be active in at least one project. To become a member, you have to show a deep interest in the foundation and try to promote its values. Existing members can then invite you to become a member. Becoming a member is not only an honor but it also provides the right to elect the Board.

The Board of Directors (Board)

The Board of Directors (Board) takes care of the overall governance of the foundation. In particular, it is concerned with legal and financial matters like brand/patent issues, fundraising, and financial planning. The board is elected annually and is composed of Apache members. The current board can be viewed here.

Again, we use circles to set the per-project and foundation-wide entities into relation. Note that there is only one central Board for the entire foundation but Board members can be PMC members in different projects.

Officers of the corporation

Officers of the corporation are the executive part of the administration. They execute the decisions of the board and take care of everyday business.

Infrastructure (INFRA)

The support and administration team (INFRA) is the team that runs the Apache infrastructure and provides tools and support for developers. This includes running the apache.org web site and the mailing lists which are Apache’s main way of communication. Over time, the need for various tools to assist developers became apparent. The main tools available which are used by almost all projects are:

Mailing lists, for discussing the roadmap of the project, exchanging ideas, or reporting bugs (unwanted software behavior). Typically the mailing lists are divided into a developer and a user mailing list.
Bug trackers, which help developers to keep track of new features or bugs.
Version control, which helps developers to keep track of the code changes.
Build servers, which help to integrate/test new code or changes to existing code.

The Incubator

The Incubator is a division of the foundation dedicated to forming (bootstrapping) new Apache projects. The process is the following. People (volunteers, enthusiasts, or company employees) make a proposal to the Incubator. The proposal contains the name, the list of initial PMC members, and the motivation and goals for a new project. When the standards of the Apache Software Foundation are fulfilled by the proposal, the project enters the incubation phase. In the incubation phase, projects carry “incubating” with their names, which is dropped once they graduate. To graduate, a project has to show that it adheres to the Apache standards and manages to develop a community. Formally, the project needs to prove that to the Incubator Project Management Committee (IPMC) which is comprised of Apache members. All existing work which is donated in the course of entering the incubator and, more importantly, all future work inside the project has to be licensed to the ASF under the Apache License. This ensures that development remains open source according to the Apache philosophy. More about incubation on the official website.

Meritocracy - How are decisions made?

The Apache Software Foundation uses the term “meritocracy” to describe how it governs itself. Going back to the ancient Greeks, meritocracy was a political system which put those people into power who had proved their ability and talent in the relevant field. The core of this philosophy can be found throughout history from ancient China to medieval Europe and is still present in many of today’s cultures in the sense that effort, increased responsibility, and service to a part of society ought to pay off in terms of power of decision, social status, or money.

Meritocracy in the Apache Software Foundation denotes that people who work in the interest of either the foundation or a project get promoted. Users who submit patches may be offered committer status. Committers who are driving the project constructively may gain PMC status. PMC members active across projects may earn the member status.

From there on, decision-making within the foundation and projects is typically performed using Lazy Consensus. Lazy consensus implies that even a few people can drive a discussion and make decisions for the entire community as long as nobody objects. The discussions have to be held in public on the mailing list. For instance, if a committer decides to introduce a new feature X, she may do so by proposing the feature on the mailing list. If nobody objects, she can go ahead and develop the feature. If lazy consensus does not work because an argument cannot be settled, a majority-based vote can be started.
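
The decision rule described above can be written down as a small Java sketch. It is a toy model, not an official ASF procedure: the class LazyConsensus, the +1/0/-1 style Vote enum, and the simple-majority fallback are assumptions for illustration, and the sketch collapses the discussion and the later vote into a single step.

```java
import java.util.List;

// Toy model of lazy consensus with a majority vote as fallback.
// Hypothetical names; individual projects may have more specific rules.
class LazyConsensus {
    enum Vote { PLUS_ONE, ZERO, MINUS_ONE }
    enum Outcome { ACCEPTED_BY_LAZY_CONSENSUS, ACCEPTED_BY_VOTE, REJECTED_BY_VOTE }

    static Outcome decide(List<Vote> replies) {
        long objections = replies.stream().filter(v -> v == Vote.MINUS_ONE).count();
        if (objections == 0) {
            // Silence counts as agreement: nobody objected, so go ahead.
            return Outcome.ACCEPTED_BY_LAZY_CONSENSUS;
        }
        // Assumption: once somebody objects, a simple majority decides.
        long approvals = replies.stream().filter(v -> v == Vote.PLUS_ONE).count();
        return approvals > objections ? Outcome.ACCEPTED_BY_VOTE : Outcome.REJECTED_BY_VOTE;
    }

    public static void main(String[] args) {
        // Nobody replied to the proposal at all.
        System.out.println(decide(List.of())); // prints ACCEPTED_BY_LAZY_CONSENSUS
        // One objection, outnumbered by two approvals.
        System.out.println(decide(List.of(Vote.PLUS_ONE, Vote.PLUS_ONE, Vote.MINUS_ONE))); // prints ACCEPTED_BY_VOTE
    }
}
```

Note how an empty list of replies already leads to acceptance. That is what makes the consensus “lazy”: the community only has to act when somebody disagrees.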

Meritocracy and Lazy Consensus are the core principles for governance within the Apache Software Foundation. On the one hand, Meritocracy ensures that new people can join those already in power. On the other hand, Lazy Consensus creates the opportunity to split up decision-making among the group such that it doesn’t always require the action of all members of the community.

The Apache License - A license for the world of open-source

With the incorporation of the foundation in 1999, a license had to be created to prevent conflicts with the intellectual property contributed by others to the ASF. Originally, the license was meant to be used exclusively by the ASF but it quickly became one of the most widely used software licenses for all kinds of open-source software development.

The Apache License is very liberal in the sense that source code modifications are not required to be open-sourced (= made publicly available) even when the source code is distributed or sold to other entities. This is in contrast to “Copyleft” licenses like the GNU General Public License (GPL) which, upon redistribution, requires public attribution and publication of changes made to the source code.

The current version of the Apache License is 2.0, released in January 2004. The changes made since the initial release are only minor but they set the prerequisite for its prevalence. Initially, the license was only available to Apache projects. Due to the success of the Apache model, people also wanted to use the license outside the foundation. This was made possible in version 2.0. Also, the new version made it possible to combine GPL code with Apache-licensed code. In this case, the resulting product would have to be licensed under the GPL to be compatible with the GPL license. The last minor change for version 2.0 was to make inclusion of the license easier and to require an explicit patent grant for patent-relevant parts.

Apache Today

The ASF today is no longer the small circle it used to be back in 1999. At the time of this writing, the Apache Software Foundation hosts 177 committees (same as PMCs) with close to 300 projects (latest statistics). Note that a PMC may decide to host multiple projects if necessary. For instance, the Apache Commons PMC has broken up the different parts of the Apache Commons library, e.g. CLI, Email, Daemon, etc. Also, about 25 of the 300 projects have been retired and about 60 are currently in the incubation phase. So realistically, the number of projects is about 200.

The Apache Software Foundation regularly organizes conferences around the world called ApacheCons. These conferences are dedicated to the Apache community or certain topics like Big Data or IoT. It is a place to meet the developers and learn about the latest ideas and trends within the global Apache community. Apart from the official conferences, there are conferences on Apache software organized by companies or external organizations, e.g. Strata, FlinkForward, Kafka Summit, Spark Summit, Elasticon.

Here’s a list of some projects that I have run across in the past. I grouped them into categories for a better overview. I realize you might not know a lot of the projects but maybe this list can be the starting point to discover more about these Apache projects :)

Big Data

Hadoop
Flink
Spark
Beam
Samza
Storm
NiFi
Kafka
Flume
Tez
Zeppelin

Database

CouchDB
HBase
ZooKeeper
Derby
Cassandra

Query Tools / APIs

Hive
Pig
Drill
Crunch
Ignite
Solr
Lucene

Programming Languages

Groovy

Distributions

Bigtop
Ambari

Cloud

Mesos
CloudStack
Libcloud

Machine Learning

Mahout
SAMOA

Office

OpenOffice

Libraries

Commons
Avro
Thrift
ActiveMQ
Parquet

Developer Tools

Ant
Maven
Ivy
Subversion

Web Servers

HTTP (the one!)
Tomcat

Web Frameworks

Cocoon
Struts
Sling

Apache - A Successful Open-Source Development Model

My first attempt to learn more about Apache goes back several years. I was using the Apache License while working on Scalaris at Zuse Institute Berlin. I realized that the license was somehow connected to the Apache Software Foundation but I didn’t really understand the depth of this relationship until I started working on Apache Flink with Data Artisans. Besides the official homepage of the foundation, relatively little information was available on the Internet about the foundation and its projects. In hindsight, the best source of information would have been to read the email archives, ask the developers, or become a developer yourself :)

Even today, I couldn’t find an introductory guide to the ASF. So I wrote this blog post. I hope that I could provide an overview of the ASF and show you how significant the foundation has been for open-source software development.

Thank you

Thank you for reading this article. Please drop me a message if I got something wrong or you would like to comment on anything.

Thank you to the Apache Flink project. Especially to Robert Metzger, Vasia Kalavri, Henry Saputra, Aljoscha Krettek, Matthias Sax, Ufuk Celebi, Till Rohrmann, Fabian Hueske, Stephan Ewen, and Kostas Tzoumas. You taught me a lot about the Apache way. Thanks also to the Apache Beam community which just graduated from the Incubator and has proven to be an excellent member of the Apache family.