Apache Beam YAML lets you package and reuse transforms through Beam YAML providers. Providers allow you to encapsulate transforms into a reusable unit that you can then import in your Beam YAML pipelines. YAML, Python, and Java Apache Beam transforms can all be packaged in this way.
With the job builder, you can load providers from Cloud Storage to use them in your job.
Writing providers
Beam YAML providers are defined in YAML files. These files specify the implementation and configuration of the provided transforms. Individual provider listings are expressed as YAML list items with type and transforms keys. Java and Python providers also have a config key that specifies the transform implementation. YAML-defined provider implementations are expressed inline.
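To make the overall file shape concrete, here is a minimal sketch of a provider file that holds two listings. The transform names AddDoubled and MyCustomTransform, the package my_package, and the class path are hypothetical placeholders, not names from any real project.

```yaml
# Hypothetical provider file with two listings. All transform names,
# the package name, and the class path are placeholders.

# Listing 1: a YAML provider. The implementation is written inline
# under the transforms key.
- type: yaml
  transforms:
    AddDoubled:
      body:
        type: chain
        transforms:
          - type: MapToFields
            config:
              language: python
              append: true
              fields:
                doubled: "element * 2"

# Listing 2: a Python provider. The config key says where the
# implementation comes from; transforms maps names to implementations.
- type: pythonPackage
  config:
    packages:
      - my_package>=1.0
  transforms:
    MyCustomTransform: "my_package.module.MyTransform"
```

Each list item is an independent provider, so a single file can mix provider types.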
YAML providers
YAML providers define new YAML transforms as a map of names to transform definitions. For example, this provider defines a transform that squares a field from its input:
    - type: yaml
      transforms:
        SquareElement:
          body:
            type: chain
            transforms:
              - type: MapToFields
                config:
                  language: python
                  append: true
                  fields:
                    power: "element ** 2"

YAML providers can also declare transform parameters with a config_schema key in the transform definition and reference these parameters with Jinja2 templating:
    - type: yaml
      transforms:
        RaiseElementToPower:
          config_schema:
            properties:
              n: {type: integer}
          body:
            type: chain
            transforms:
              - type: MapToFields
                config:
                  language: python
                  append: true
                  fields:
                    power: "element ** {{n}}"

If a provided transform functions as a source, it must set requires_inputs: false:
    - type: yaml
      transforms:
        CreateTestElements:
          requires_inputs: false
          body: |
            type: Create
            config:
              elements: [1,2,3,4]

It is also possible to define composite transforms:
    - type: yaml
      transforms:
        ConsecutivePowers:
          config_schema:
            properties:
              end: {type: integer}
              n: {type: integer}
          requires_inputs: false
          body: |
            type: chain
            transforms:
              - type: Range
                config:
                  end: {{end}}
              - type: RaiseElementToPower
                config:
                  n: {{n}}

Python providers
Python transforms can be provided using the following syntax:
    - type: pythonPackage
      config:
        packages:
          - pypi_package>=version
      transforms:
        MyCustomTransform: "pkg.module.PTransformClassOrCallable"

For an in-depth example, see the Python provider starter project on GitHub.
Java providers
Java transforms can be provided using the following syntax:
    - type: javaJar
      config:
        jar: gs://your-bucket/your-java-transform.jar
      transforms:
        MyCustomTransform: "urn:registered:in:transform"

For an in-depth example, see the Java provider starter project on GitHub.
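Whichever provider type you write, a provided transform is referenced by name like a built-in transform. The following is a minimal sketch of a standalone Beam YAML pipeline, assuming the provider is declared inline under a top-level providers key instead of being loaded through the job builder. It reuses the RaiseElementToPower definition from the YAML providers section; the Create elements, the value of n, and the LogForTesting step are illustrative choices.

```yaml
# Sketch of a pipeline that uses a provided transform by name.
pipeline:
  type: chain
  transforms:
    # Illustrative input: three rows, each with a single "element" field.
    - type: Create
      config:
        elements: [1, 2, 3]
    # The provided transform; n fills the {{n}} placeholder in its body.
    - type: RaiseElementToPower
      config:
        n: 3
    # Illustrative sink that logs each resulting row.
    - type: LogForTesting

# Inline provider listing, copied from the YAML providers section.
providers:
  - type: yaml
    transforms:
      RaiseElementToPower:
        config_schema:
          properties:
            n: {type: integer}
        body:
          type: chain
          transforms:
            - type: MapToFields
              config:
                language: python
                append: true
                fields:
                  power: "element ** {{n}}"
```

Because the transform sets append: true, each output row should keep its original element field and gain a power field.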
Using providers in the job builder
Transforms defined in providers can be imported from Cloud Storage and used in the job builder. To use a provider in the job builder:
Save a provider as a YAML file in Cloud Storage.
Go to the Jobs page in the Google Cloud console.
Click Create job from builder.
Locate the YAML Providers section. You might need to scroll.
In the YAML provider path box, enter the Cloud Storage location of the provider file.
Wait for the provider to load. If the provider is valid, the transforms that it defines appear in the Loaded transforms section.
Locate your transform's name in the Loaded transforms section and click the button to insert the transform in your job.
If your transform requires parameters, define them in the YAML transform configuration editor for your transform. Define the parameters as a YAML object that maps parameter names to parameter values.
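As an illustration, the ConsecutivePowers transform defined earlier declares two parameters, end and n, so its configuration in the editor might look like the following sketch. The values are arbitrary examples.

```yaml
# Example parameter values for ConsecutivePowers (arbitrary choices).
end: 5   # value for the "end" parameter, forwarded to the Range step
n: 2     # value for the "n" parameter, forwarded to RaiseElementToPower
```

The keys must match the property names declared in the transform's config_schema.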
What's next
- Learn more about Beam YAML providers.