Starter kit for apache beam

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.

Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

It is Data Processing platform mainly used fro analytics or for ETL which is execution platform agnostics , data agnostics , and also support programming languages like python , java , go , scala also apache beam support cross language pipelines

Note version compatibility java is jdk 8 for python is 3.6, 3.7, and 3.8, 2.7 for Go version 1.10

Below is basic data pipeline for apache beam

Now lets understand some terminology for the apache beam

1) Pipeline

  • A pipeline is a graph of transformations that a user constructs that defines the data processing they want to do.
  • A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline

inputs = [0, 1, 2, 3]

# Create a pipeline.

with beam.Pipeline() as pipeline:

# Feed it some input elements with `Create`.

outputs = pipeline| ‘Create initial values’ >> beam.Create(inputs)

2) PCollection -

  • A PCollection is immutable
  • Data being processed in a pipeline is part of a PCollection.
  • A PCollection can be either bounded or unbounded in size

3) PTransforms -

  • Transforms are the operations in your pipeline, and provide a generic processing framework
  • The operations executed within a pipeline. These are best thought of as operations on PCollections.
  • The Beam SDKs contain a number of different transforms that you can apply to your pipeline’s PCollections.
  • Map: one-to-one
  • Let’s say we have some elements and we want to do something with each element.
  • We want to map a function to each element of the collection.
  • map takes a function that transforms a single input a into a single output b.

inputs = [0, 1, 2, 3]

with beam.Pipeline() as pipeline:

outputs = (pipeline| ‘Create values’ >> beam.Create(inputs)| ‘Multiply by 2’ >> beam.Map(lambda x: x * 2))

outputs | beam.Map(print)

  • FlatMap
  • Applies a simple 1-to-many mapping function over each element in the collection. The many elements are flattened into the resulting collection

import apache_beam as beam

with beam.Pipeline() as pipeline:

plants = (pipeline | ‘Gardening plants’ >> beam.Create([‘🍓Strawberry 🥕Carrot 🍆Eggplant’,‘🍅Tomato 🥔Potato’])| ‘Split words’ >> beam.FlatMap(str.split)| beam.Map(print))

  • ParDo
  • ParDo is a Beam transform for generic parallel processing. The ParDo processing paradigm is similar to the “Map” phase of map/reduce/shuffle
  • ParDo is useful for a variety of common data processing operations, including:
  • Filtering a data set.
    Formatting or type-converting each element in a data set.
    Extracting parts of each element in a data set.
  • Performing computations on each element in a data set.
  • Words = [‘apple’,’mango’,’orange’]
    class ComputeWordLengthFn(beam.DoFn):

def process(self, element):

return [len(element)]

# Apply a ParDo to the PCollection “words” to compute lengths for each word.

word_lengths = words | beam.ParDo(ComputeWordLengthFn())