Package and import transforms

Apache Beam YAML lets you package and reuse transforms through Beam YAML providers. Providers allow you to encapsulate transforms into a reusable unit that you can then import in your Beam YAML pipelines. YAML, Python, and Java Apache Beam transforms can all be packaged in this way.

With the job builder, you can load providers from Cloud Storage to use them in your job.

Writing providers

Beam YAML providers are defined in YAML files. These files specify the implementation and configuration of the provided transforms. Individual provider listings are expressed as YAML list items with type and transforms keys. Java and Python providers also have a config key that specifies the transform implementation. YAML-defined provider implementations are expressed inline.

YAML providers

YAML providers define new YAML transforms as a map of names to transform definitions. For example, this provider defines a transform that squares a field from its input:

    - type: yaml
      transforms:
        SquareElement:
          body:
            type: chain
            transforms:
              - type: MapToFields
                config:
                  language: python
                  append: true
                  fields:
                    power: "element ** 2"
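As a rough sketch of how such a transform might be consumed, the pipeline below lists the provider inline under a top-level providers key and then calls SquareElement by name. The Create input values and the LogForTesting step are assumptions added for illustration; they are not part of the provider itself.

```yaml
pipeline:
  type: chain
  transforms:
    # Illustrative input: each row gets a single field named "element".
    - type: Create
      config:
        elements: [1, 2, 3]
    # Transform supplied by the provider listed below.
    - type: SquareElement
    # Assumed debugging step that logs each output row.
    - type: LogForTesting

providers:
  - type: yaml
    transforms:
      SquareElement:
        body:
          type: chain
          transforms:
            - type: MapToFields
              config:
                language: python
                append: true
                fields:
                  power: "element ** 2"
```

Because append is set to true, each output row should keep its original element field alongside the new power field.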

YAML providers can also declare transform parameters with a config_schema key in the transform definition and reference those parameters in the body through Jinja2 templating:

    - type: yaml
      transforms:
        RaiseElementToPower:
          config_schema:
            properties:
              n: {type: integer}
          body:
            type: chain
            transforms:
              - type: MapToFields
                config:
                  language: python
                  append: true
                  fields:
                    power: "element ** {{n}}"
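To make the substitution concrete, here is a minimal sketch of a pipeline step that calls RaiseElementToPower, assuming the provider has already been loaded. The value 3 is arbitrary and chosen only for illustration.

```yaml
# Pipeline step that supplies the parameter declared in config_schema.
- type: RaiseElementToPower
  config:
    n: 3
---
# With n set to 3, the templated field in the provider body would render as:
fields:
  power: "element ** 3"
```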

If a provided transform functions as a source, it must set requires_inputs: false:

    - type: yaml
      transforms:
        CreateTestElements:
          requires_inputs: false
          body: |
            type: Create
            config:
              elements: [1,2,3,4]
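A source transform of this kind sits at the start of a pipeline and takes no upstream input. The sketch below assumes the provider is already loaded; the LogForTesting step is an assumption added only to give the pipeline a visible sink.

```yaml
pipeline:
  type: chain
  transforms:
    # First step in the chain: needs no input because the provider
    # declares requires_inputs: false.
    - type: CreateTestElements
    # Assumed debugging step that logs each element.
    - type: LogForTesting
```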

It is also possible to define composite transforms:

    - type: yaml
      transforms:
        ConsecutivePowers:
          config_schema:
            properties:
              end: {type: integer}
              n: {type: integer}
          requires_inputs: false
          body: |
            type: chain
            transforms:
              - type: Range
                config:
                  end: {{end}}
              - type: RaiseElementToPower
                config:
                  n: {{n}}
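One way to read a composite transform is to look at what its body becomes once the parameters are filled in. The sketch below assumes arbitrary values of 5 for end and 2 for n, and shows the invocation followed by the body as it would look after Jinja2 substitution.

```yaml
# Pipeline step that invokes the composite (values are illustrative).
- type: ConsecutivePowers
  config:
    end: 5
    n: 2
---
# Body after substituting end and n into the template.
type: chain
transforms:
  - type: Range
    config:
      end: 5
  - type: RaiseElementToPower
    config:
      n: 2
```

The parameters passed to the composite are forwarded to the inner transforms, so callers only deal with a single named transform.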

Python providers

Python transforms can be provided using the following syntax:

    - type: pythonPackage
      config:
        packages:
          - pypi_package>=version
      transforms:
        MyCustomTransform: "pkg.module.PTransformClassOrCallable"
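As an illustration of the syntax with the placeholders filled in, the sketch below uses an entirely hypothetical package (my_beam_transforms), module path, and transform name (ToUpperCase). None of these refer to a published library; substitute your own package and class.

```yaml
# Hypothetical provider listing: package, module, and class names are
# placeholders for illustration only.
- type: pythonPackage
  config:
    packages:
      - my_beam_transforms>=1.0.0
  transforms:
    # YAML transform name mapped to the fully qualified Python implementation.
    ToUpperCase: "my_beam_transforms.text.ToUpperCase"
---
# A pipeline step would then reference the transform by its YAML name.
- type: ToUpperCase
```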

For an in-depth example, see the Python provider starter project on GitHub.

Java providers

Java transforms can be provided using the following syntax:

    - type: javaJar
      config:
        jar: gs://your-bucket/your-java-transform.jar
      transforms:
        MyCustomTransform: "urn:registered:in:transform"

For an in-depth example, see the Java provider starter project on GitHub.

Using providers in the job builder

Transforms defined in providers can be imported from Cloud Storage and used in the job builder. To use a provider in the job builder:

  1. Save a provider as a YAML file in Cloud Storage.

  2. Go to the Jobs page in the Google Cloud console.

  3. Click Create job from builder.

  4. Locate the YAML Providers section. You might need to scroll.

  5. In the YAML provider path box, enter the Cloud Storage location of the provider file.

  6. Wait for the provider to load. If the provider is valid, the transforms that it defines appear in the Loaded transforms section.

  7. Locate your transform's name in the Loaded transforms section and click the button to insert the transform in your job.

  8. If your transform requires parameters, define them in the YAML transform configuration editor for your transform. Define the parameters as a YAML object that maps parameter names to parameter values.
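For the last step, a parameter object for a transform that declares integer parameters named end and n in its config_schema (as ConsecutivePowers does earlier on this page) might look like the following sketch. The values are arbitrary.

```yaml
# Parameter names map directly to values; names must match the
# properties declared in the transform's config_schema.
end: 5
n: 2
```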

What's next