Bigtable HBase Beam connector
To help you use Bigtable in a Dataflow pipeline, two open source Bigtable Beam I/O connectors are available.
If you are migrating from HBase to Bigtable or your application calls the HBase API, use the Bigtable HBase Beam connector (CloudBigtableIO) discussed on this page.
In all other cases, you should use the Bigtable Beam connector (BigtableIO) in conjunction with the Cloud Bigtable client for Java, which works with the Cloud Bigtable APIs. To get started using that connector, see Bigtable Beam connector.
For more information on the Apache Beam programming model, see the Beam documentation.
Get started with HBase
The Bigtable HBase Beam connector is written in Java and is built on the Bigtable HBase client for Java. It's compatible with the Dataflow SDK 2.x for Java, which is based on Apache Beam. The connector's source code is on GitHub in the repository googleapis/java-bigtable-hbase.
This page provides an overview of how to use Read and Write transforms.
Set up authentication
To use the Java samples on this page in a local development environment, install and initialize the gcloud CLI, and then set up Application Default Credentials with your user credentials.
Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
If you're using a local shell, then create local authentication credentials for your user account:
gcloud auth application-default login
You don't need to do this if you're using Cloud Shell.
If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.
For more information, see Set up authentication for a local development environment.
For information about setting up authentication for a production environment, see Set up Application Default Credentials for code running on Google Cloud.
Add the connector to a Maven project
To add the Bigtable HBase Beam connector to a Maven project, add the Maven artifact to your pom.xml file as a dependency:
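For example, a dependency entry along the following lines can be added. The bigtable-hbase-beam coordinates are shown for illustration, and the version is a placeholder; check Maven Central for the current release.

<dependency>
  <groupId>com.google.cloud.bigtable</groupId>
  <artifactId>bigtable-hbase-beam</artifactId>
  <!-- Placeholder version; use the latest release from Maven Central. -->
  <version>2.x.x</version>
</dependency>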
Specify the Bigtable configuration
Create an options interface to allow inputs for running your pipeline:
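A minimal sketch of such an interface, assuming Dataflow pipeline options and illustrative option names (bigtableProjectId, bigtableInstanceId, bigtableTableId):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Description;

// Hypothetical options interface used by the later sketches on this page.
public interface BigtableOptions extends DataflowPipelineOptions {
  @Description("The Bigtable project ID.")
  String getBigtableProjectId();
  void setBigtableProjectId(String projectId);

  @Description("The Bigtable instance ID.")
  String getBigtableInstanceId();
  void setBigtableInstanceId(String instanceId);

  @Description("The Bigtable table ID.")
  String getBigtableTableId();
  void setBigtableTableId(String tableId);
}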
When you read from or write to Bigtable, you must provide a CloudBigtableConfiguration configuration object. This object specifies the project ID and instance ID for your table, as well as the name of the table itself:
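For a specific table, the configuration can be built with the table-scoped CloudBigtableTableConfiguration subclass. The sketch below assumes the hypothetical BigtableOptions interface shown above.

import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

CloudBigtableTableConfiguration bigtableTableConfig =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .build();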
For reading, provide a CloudBigtableScanConfiguration configuration object, which lets you specify an Apache HBase Scan object that limits and filters the results of a read. See Read from Bigtable for details.
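For example, here is a sketch of a scan-scoped configuration that restricts the read to row keys with a given prefix; the prefix value and the options interface are assumptions for illustration.

import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Limit the read to row keys that start with "phone#".
Scan scan = new Scan();
scan.setRowPrefixFilter(Bytes.toBytes("phone#"));

CloudBigtableScanConfiguration scanConfig =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withScan(scan)
        .build();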
Read from Bigtable
To read from a Bigtable table, you apply a Read transform to the result of a CloudBigtableIO.read operation. The Read transform returns a PCollection of HBase Result objects, where each element in the PCollection represents a single row in the table.
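As a sketch, and assuming the options and scanConfig objects built in the previous section, the read can be applied like this:

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.hbase.client.Result;

Pipeline p = Pipeline.create(options);

// Each Result in the PCollection represents one row of the table.
PCollection<Result> rows = p.apply(Read.from(CloudBigtableIO.read(scanConfig)));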
By default, a CloudBigtableIO.read operation returns all of the rows in your table. You can use an HBase Scan object to limit the read to a range of row keys within your table, or to apply filters to the results of the read. To use a Scan object, include it in your CloudBigtableScanConfiguration.

For example, you can add a Scan that returns only the first key-value pair from each row in your table, which is useful when counting the number of rows in the table:
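Here is a sketch of that pattern, pairing an HBase FirstKeyOnlyFilter with Beam's Count transform; the configuration fields again come from the hypothetical options interface, and p is the pipeline created above.

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

// Return only the first key-value pair from each row.
Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());

CloudBigtableScanConfiguration scanConfig =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withScan(scan)
        .build();

// Count the rows by counting the elements that the read produces.
PCollection<Long> rowCount =
    p.apply(Read.from(CloudBigtableIO.read(scanConfig)))
        .apply(Count.<Result>globally());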
Write to Bigtable
To write to a Bigtable table, you apply a CloudBigtableIO.writeToTable operation. You'll need to perform this operation on a PCollection of HBase Mutation objects, which can include Put and Delete objects.
The Bigtable table must already exist and must have the appropriate column families defined. The Dataflow connector does not create tables and column families on the fly. You can use the cbt CLI to create a table and set up column families, or you can do this programmatically.
Before you write to Bigtable, you must create your Dataflow pipeline so that puts and deletes can be serialized over the network:
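A minimal sketch of that setup, assuming the hypothetical BigtableOptions interface defined earlier:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Build the pipeline options from command-line arguments and create the pipeline.
BigtableOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(BigtableOptions.class);
Pipeline p = Pipeline.create(options);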
In general, you'll need to perform a transform, such as a ParDo, to format your output data into a collection of HBase Put or Delete objects. The following example shows a DoFn transform that takes the current value and uses it as the row key for a Put. You can then write the Put objects to Bigtable.
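A sketch of such a DoFn follows; the column family and qualifier names (cf, value) are placeholders for illustration.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

static final DoFn<String, Mutation> MUTATION_TRANSFORM =
    new DoFn<String, Mutation>() {
      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<Mutation> out) {
        // Use the incoming string as the row key and write it back as a single cell.
        Put put = new Put(Bytes.toBytes(element));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(element));
        out.output(put);
      }
    };

The full example at the end of this section applies this DoFn with ParDo and passes the resulting mutations to CloudBigtableIO.writeToTable.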
To enable batch write flow control, set BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL to true. This feature automatically rate-limits traffic for batch write requests and lets Bigtable autoscaling add or remove nodes automatically to handle your Dataflow job.
Here is the full writing example, including the variation that enables batch write flow control.
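The sketch below pulls the earlier pieces together. It assumes the hypothetical BigtableOptions interface and MUTATION_TRANSFORM DoFn shown above, uses placeholder row keys in place of a real data source, and assumes the flow-control setting is supplied through the builder's withConfiguration method using the BigtableOptionsFactory constant of the same name.

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import com.google.cloud.bigtable.hbase.BigtableOptionsFactory;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.ParDo;

public class HelloWorldWrite {
  public static void main(String[] args) {
    BigtableOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigtableOptions.class);
    Pipeline p = Pipeline.create(options);

    CloudBigtableTableConfiguration bigtableTableConfig =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId())
            // Variation: enable batch write flow control for this job.
            .withConfiguration(
                BigtableOptionsFactory.BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL, "true")
            .build();

    p.apply(Create.of("rowkey-1", "rowkey-2")) // Placeholder row keys; replace with your data source.
        .apply(ParDo.of(MUTATION_TRANSFORM))
        .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));

    p.run().waitUntilFinish();
  }
}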