Over two years ago, Apache Beam introduced the portability framework, which allows pipelines to be written in languages other than Java, e.g. Python and Go. Here’s how to get started writing Python pipelines in Beam.
1. Creating a virtual environment
Let’s first create a virtual environment for our pipelines. Note that we want to use Python 3 because Python 2 is now obsolete and won’t be supported in future Beam releases.
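On most systems, something like the following will do (the environment name beam-env is just an example):

```bash
# Create and activate a Python 3 virtual environment.
python3 -m venv beam-env
source beam-env/bin/activate
```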
Now let’s install the latest version of Apache Beam:
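```bash
# Installs the latest release from PyPI into the active environment.
pip install apache-beam
```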
2. Writing a Beam Python pipeline
Next, let’s create a file called wordcount.py and write a simple Beam Python pipeline. I recommend using PyCharm or IntelliJ with the PyCharm plugin, but for now a simple text editor will also do the job:
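```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A minimal word count sketch; the input sentences are made-up example data.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     # Create an in-memory PCollection of lines.
     | "Create" >> beam.Create(["hello world", "hello beam"])
     # Split each line into individual words.
     | "Split" >> beam.FlatMap(lambda line: line.split())
     # Pair each word with a count of 1, then sum the counts per word.
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "CountPerWord" >> beam.CombinePerKey(sum)
     # Print each (word, count) pair.
     | "Print" >> beam.Map(print))
```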
Please see the inline comments for an explanation of what the code does.
We can now run the pipeline:
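```bash
python wordcount.py
```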
Admittedly, that’s a very simple pipeline, but you get the gist. Later on, we will change the data source to read from Kafka.
3. Choosing a Runner
By default, the so-called DirectRunner runs your pipeline. The DirectRunner is intended only for local development purposes. It’s very slow and does not support distributed execution.
Let’s run the same pipeline with the Flink Runner, which runs the pipeline (you guessed it) on top of Apache Flink:
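```bash
# Runs the pipeline on an embedded local Flink cluster.
python wordcount.py --runner=FlinkRunner
```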
What happens when you run your script with the --runner argument? Beam will look up the Runner (FlinkRunner) and attempt to run the pipeline. By default, this will download the Flink Runner JAR, which contains the Beam JobService. The JobService receives the pipeline and submits it to a Flink cluster. If you do not specify a cluster address via --flink_master, a local Flink cluster will be started.
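If you already have a cluster running, you can point the Runner at it. The address below assumes a local Flink REST endpoint on the default port:

```bash
python wordcount.py --runner=FlinkRunner --flink_master=localhost:8081
```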
For more information, visit the Flink Runner page. The page also contains information on other Runners, such as Google Cloud Dataflow or Apache Spark.
4. Configuring the environment
By default, the Python code will run in a so-called LOOPBACK environment. That’s an environment intended for development and testing purposes. It’s called LOOPBACK because a local Python process is started which runs the Python code. However, if you submit to a cluster, the environment will default to DOCKER, which will bring up Docker containers on each of the hosts.
If you want to test the Docker-based execution locally, you can specify the following:
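```bash
# Assumes the wordcount.py script from above.
python wordcount.py --runner=FlinkRunner --environment_type=DOCKER
```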
The Beam community publishes Docker images for all releases, and these are used by default. You can also build and specify a custom image.
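For example, assuming you have pushed a custom image to a registry (the image name below is hypothetical), it can be passed via the environment configuration:

```bash
python wordcount.py \
    --runner=FlinkRunner \
    --environment_type=DOCKER \
    --environment_config=registry.example.com/my-beam-python:latest
```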
See the environment documentation page for more information on environment configuration.
5. Cross-language pipelines
As a next step, let’s read some data using Beam’s KafkaIO. Oh no! It turns out there is no native Kafka connector in the Python API. No problem, we can use KafkaIO from the Java SDK:
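```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# A sketch only: the broker address (localhost:9092) and the topic names
# (beam-input, beam-output) are assumptions.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     # ReadFromKafka/WriteToKafka are cross-language transforms backed by
     # the Java SDK's KafkaIO.
     | "ReadFromKafka" >> ReadFromKafka(
           consumer_config={"bootstrap.servers": "localhost:9092"},
           topics=["beam-input"])
     # Kafka records arrive as (key, value) pairs of bytes; keep the value.
     | "DecodeValue" >> beam.Map(lambda record: record[1].decode("utf-8"))
     | "Split" >> beam.FlatMap(lambda line: line.split())
     # The Kafka stream is unbounded, so window before aggregating.
     | "Window" >> beam.WindowInto(FixedWindows(10))
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "CountPerWord" >> beam.CombinePerKey(sum)
     # Re-encode the counts as (key, value) byte pairs for Kafka.
     | "Encode" >> beam.Map(
         lambda wc: (wc[0].encode("utf-8"), str(wc[1]).encode("utf-8")))
     | "WriteToKafka" >> WriteToKafka(
           producer_config={"bootstrap.servers": "localhost:9092"},
           topic="beam-output"))
```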
The essential pipeline logic hasn’t changed, but we have swapped out the simple in-memory source for Beam’s Kafka connector.
To understand what will happen when we run this pipeline, have a look at this image:
When we run the Python script, the pipeline is constructed. During construction, a lookup to the ExpansionService is performed to resolve WriteToKafka. Once the pipeline has been assembled, it is submitted to the JobService, which also receives any required artifacts (e.g. Python libraries). The Runner then submits the pipeline to a Flink cluster.
Native transforms like GroupByKey can be processed directly by Flink. Any language-specific code runs in a separate environment for the language. The environment contains the SDK Harness which is responsible for running the language-specific code.
The good news is that you normally do not have to worry about this process. The FlinkRunner, as part of the Python SDK, abstracts away a lot of the complexity.