Optimize BigQuery DataFrames performance
BigQuery DataFrames helps you analyze and transform data in BigQuery using a pandas-compatible API. To make your data processing faster and more cost-effective, you can use several techniques to improve performance.
This document describes the following ways to optimize performance:
- Use partial ordering mode.
- Cache results after expensive operations.
- Preview data by using the
peek()
method. - Defer the
repr()
data retrieval.
Use partial ordering mode
BigQuery DataFrames has an ordering mode feature, which enforces a specific row order for operations like window functions and joins. You can specify the ordering mode by setting the ordering_mode
property to either strict
(known as strict ordering mode, which is the default) or partial
(known as partial ordering mode). Using the partial
setting can make your queries more efficient.
Partial ordering mode is different from strict ordering mode. Strict ordering mode arranges all rows in a specific order. This total ordering makes BigQuery DataFrames work better with pandas, letting you access rows by their order using the DataFrame.iloc
property. However, total ordering and its default sequential index prevent filters on columns or rows from reducing the amount of data scanned. This prevention occurs unless you apply those filters as parameters to the read_gbq
and read_gbq_table
functions. To order all the rows in the DataFrame, BigQuery DataFrames creates a hash of all the rows. This operation can cause a full data scan that ignores row and column filters.
Partial ordering mode stops BigQuery DataFrames from creating a total order for all rows and turns off features that need a total order, like the DataFrame.iloc
property. Partial ordering mode also sets the DefaultIndexKind
class to a null index, instead of to a sequential index.
When you filter a DataFrame
object using partial ordering mode, BigQuery DataFrames doesn't calculate which rows are missing in the sequential index. Partial ordering mode also doesn't automatically combine data based on index. These approaches can increase the efficiency of your queries. However, whether you use the default strict ordering mode or partial ordering mode, the BigQuery DataFrames API works like the familiar pandas API.
With both partial and strict ordering modes, you pay for the BigQuery resources you use. However, using partial ordering mode can lower costs when working with large clustered and partitioned tables. This cost reduction occurs because row filters on cluster and partition columns reduce the amount of data processed.
Enable partial ordering mode
To use partial ordering, set the ordering_mode
property to partial
before performing any other operation with BigQuery DataFrames, as shown in the following code sample:
Partial ordering mode prevents implicit joins of unrelated BigQuery DataFrames objects because it lacks a sequential index. Instead, you must explicitly call the DataFrame.merge
method to join two BigQuery DataFrames objects that derive from different table expressions.
The Series.unique()
and Series.drop_duplicates()
features don't work with partial ordering mode. Instead, use the groupby
method to find unique values, as shown in the following example:
With partial ordering mode, the output of the DataFrame.head(n)
and Series.head(n)
functions might not be the same every time you run them. To download a small, random sample of the data, use the DataFrame.peek()
or Series.peek()
methods.
For a detailed tutorial in which you use the ordering_mode = "partial"
property, see Analyzing package downloads from PyPI with BigQuery DataFrames.
Troubleshooting
Because BigQuery DataFrames in partial ordering mode sometimes lacks an ordering or index, you might encounter the following issues when using some pandas-compatible methods.
Order required error
Some features, like the DataFrame.head()
and DataFrame.iloc
functions, need an ordering. For a list of features that require ordering, see the Requires ordering column in Supported pandas APIs.
When an object has no ordering, the operation fails with an OrderRequiredError
message like the following: OrderRequiredError: Op iloc requires an ordering. Use .sort_values or .sort_index to provide an ordering.
As the error message states, you can provide an ordering using the DataFrame.sort_values()
method to sort by one or more columns. Other methods, such as DataFrame.groupby()
, implicitly provide a total ordering based on the group by keys.
If the ordering isn't a completely stable total ordering for all rows, later operations might show an AmbiguousWindowWarning
message like the following: AmbiguousWindowWarning: Window ordering may be ambiguous, this can cause unstable results.
If your work can handle results that aren't always the same, or if you can manually check that your ordering is a total ordering, you can filter out the AmbiguousWindowWarning
message in this way:
Null index error
Some features, like the DataFrame.unstack()
and Series.interpolate()
properties, need an index. For a list of features that require an index, see the Requires index column in Supported pandas APIs.
When you use an operation that requires an index with partial ordering mode, the operation raises a NullIndexError
message like the following: NullIndexError: DataFrame cannot perform interpolate as it has no index. Set an index using set_index.
As the error message states, you can provide an index using the DataFrame.set_index()
method to sort by one or more columns. Other methods, such as DataFrame.groupby()
, implicitly provide an index based on the group by keys, unless the as_index=False
parameter is set.
Cache results after expensive operations
BigQuery DataFrames stores operations locally and defers running queries until certain conditions are met. This can cause the same operations to run multiple times across different queries.
To avoid repeating costly operations, save intermediate results with the cache()
method, as shown in the following example:
This method creates a temporary BigQuery table to store your results. You are charged for the storage of this temporary table in BigQuery.
Preview your data with the peek()
method
BigQuery DataFrames offers two API methods to preview data:
peek(n)
returnsn
rows of data, wheren
is the number of rows.head(n)
returns the firstn
rows of data, depending on the context, wheren
is the number of rows.
Use the head()
method only when the order of data is important, for example, when you want the five largest values in a column. In other cases, use the peek()
method for more efficient data retrieval, as shown in the following code sample:
You can also use the peek()
method to download a small, random sample of data while using partial ordering mode.
Defer the repr()
data retrieval
You can call the repr()
method in BigQuery DataFrames with notebooks or your IDE debugger. This call triggers the head()
call that retrieves the actual data. This retrieval can slow down your iterative coding and debugging process and also incur costs.
To prevent the repr()
method from retrieving data, set the repr_mode
attribute to "deferred"
, as shown in the following example:
In the deferred mode, you can only preview your data with explicit peek()
and head()
calls.
What's next
- Learn how to use BigQuery DataFrames.
- Learn how to visualize BigQuery DataFrames.
- Explore the BigQuery DataFrames API reference.
- View BigQuery DataFrames source code, sample notebooks, and samples on GitHub.