Transform SQL translations using configuration YAML files
This document shows you how to use configuration YAML files to transform SQL code while migrating it to BigQuery. It provides guidelines for creating your own configuration YAML files and examples of the translation transformations that this feature supports.

When you use the BigQuery interactive SQL translator, the BigQuery Migration API, or batch SQL translation, you can provide configuration YAML files to modify a SQL query translation. Configuration YAML files allow for further customization when translating SQL queries from your source database.
You can specify a configuration YAML file to use in a SQL translation in the following ways:
- If you are using the interactive SQL translator, specify the file path to the configuration file or batch translation job ID in the translation settings.
- If you are using the BigQuery Migration API, place the configuration YAML in the same Cloud Storage bucket as the input SQL files.
- If you are performing a batch SQL translation, place the configuration YAML in the same Cloud Storage bucket as the input SQL files.
- If you are using the batch translation Python client, place the configuration YAML file in the local translation input folder.
The interactive SQL translator, the BigQuery Migration API, the batch SQL translator, and the batch translation Python client all support the use of multiple configuration YAML files in a single translation job. See Applying multiple YAML configurations for more information.
Configuration YAML file requirements
Before creating a configuration YAML file, review the following information to ensure that your YAML file is compatible with the BigQuery Migration Service:
- You must upload the configuration YAML files in the directory of the Cloud Storage bucket that contains your SQL translation input files. For information on how to create buckets and upload files to Cloud Storage, see Create buckets and Upload objects from a filesystem.
- The file size for a single configuration YAML file must not exceed 1 MB.
- The total file size of all configuration YAML files used in a single SQL translation job must not exceed 4 MB.
- If you are using regex syntax for name matching, use RE2/J.
- All configuration YAML file names must include a `.config.yaml` extension—for example, `change-case.config.yaml`. `.config.yaml` alone is not a valid name for the configuration file.
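The naming and size limits above can be checked locally before upload. The following is a minimal sketch in Python; the file names and sizes are hypothetical, and the real service enforces these limits server-side:

```python
# Pre-flight check for configuration YAML limits: each file at most
# 1 MB, all files together at most 4 MB, and every name must end in
# ".config.yaml" without being ".config.yaml" alone.
MAX_FILE_BYTES = 1 * 1024 * 1024
MAX_TOTAL_BYTES = 4 * 1024 * 1024

def check_config_files(sizes_by_name):
    """sizes_by_name maps a file name to its size in bytes; returns a list of problems."""
    problems = []
    for name, size in sorted(sizes_by_name.items()):
        if not name.endswith(".config.yaml") or name == ".config.yaml":
            problems.append(f"{name}: name must end in '.config.yaml' (and have a prefix)")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds the 1 MB per-file limit")
    if sum(sizes_by_name.values()) > MAX_TOTAL_BYTES:
        problems.append("total size exceeds the 4 MB per-job limit")
    return problems

check_config_files({"change-case.config.yaml": 2048})  # []
```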
Guidelines to create a configuration YAML file
This section provides some general guidelines to create a configuration YAML file:
Header
Each configuration file must contain a header specifying the type of configuration. The object_rewriter type is used to specify SQL translations in a configuration YAML file. The following example uses the object_rewriter type to transform a name case:
```yaml
type: object_rewriter
global:
  case:
    all: UPPERCASE
```

Entity selection
To perform entity-specific transformations, specify the entity in the configuration file. All match properties are optional; use only the match properties needed for a transformation. The following configuration YAML shows the properties that can be matched to select specific entities:
```yaml
match:
  database: <literal_name>
  schema: <literal_name>
  relation: <literal_name>
  attribute: <literal_name>
  databaseRegex: <regex>
  schemaRegex: <regex>
  relationRegex: <regex>
  attributeRegex: <regex>
```

Description of each match property:
- `database` or `db`: the project_id component.
- `schema`: the dataset component.
- `relation`: the table component.
- `attribute`: the column component. Only valid for attribute selection.
- `databaseRegex` or `dbRegex`: matches a `database` property with a regular expression (Preview).
- `schemaRegex`: matches a `schema` property with a regular expression (Preview).
- `relationRegex`: matches a `relation` property with a regular expression (Preview).
- `attributeRegex`: matches an `attribute` property with a regular expression. Only valid for attribute selection (Preview).
For example, the following configuration YAML specifies the match properties to select the testdb.acme.employee table for a temporary table transformation.
```yaml
type: object_rewriter
relation:
  - match:
      database: testdb
      schema: acme
      relation: employee
    temporary: true
```

You can use the `databaseRegex`, `schemaRegex`, `relationRegex`, and `attributeRegex` properties to specify regular expressions that select a subset of entities. The following example makes all relations in the `tmp_schema` schema of `testdb` temporary, as long as their names start with `tmp_`:
```yaml
type: object_rewriter
relation:
  - match:
      schema: tmp_schema
      relationRegex: "tmp_.*"
    temporary: true
```

Both literal and regex properties are matched in a case-insensitive manner. You can enforce case-sensitive matching by using a regex with a disabled `i` flag, as seen in the following example:
```yaml
match:
  relationRegex: "(?-i:<actual_regex>)"
```

You can also specify fully-qualified entities using an equivalent short-string syntax. A short string expects exactly three (for relation selection) or four (for attribute selection) name segments delimited with dots, as in `testdb.acme.employee`. The segments are internally interpreted as if they were passed as `database`, `schema`, `relation`, and `attribute` respectively. This means that names are matched literally; regular expressions are not allowed in short syntax. The following example shows the use of short-string syntax to specify a fully-qualified entity in a configuration YAML file:
```yaml
type: object_rewriter
relation:
  - match: "testdb.acme.employee"
    temporary: true
```

If a table name contains a dot, you cannot specify the name using short syntax. In this case, you must use an object match. The following example changes the `testdb.acme.stg.employee` table to temporary:
```yaml
type: object_rewriter
relation:
  - match:
      database: testdb
      schema: acme
      relation: stg.employee
    temporary: true
```

The configuration YAML accepts `key` as an alias for `match`.
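The matching semantics described above can be illustrated with a short Python sketch. Python's `re` scoped-flag syntax is close enough to RE2/J for this illustration; the patterns and names are hypothetical:

```python
import re

def name_matches(pattern, name):
    # The service compares both literal and regex properties
    # case-insensitively by default; a (?-i:...) group re-enables
    # case sensitivity for the part of the pattern it wraps.
    return re.fullmatch(pattern, name, re.IGNORECASE) is not None

name_matches("tmp_.*", "TMP_ORDERS")        # True: case-insensitive default
name_matches("(?-i:tmp_).*", "TMP_ORDERS")  # False: prefix must be lowercase
```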
Default database
Some input SQL dialects, notably Teradata, do not support a database name in qualified names. In this case, the easiest way to match entities is to omit the `database` property in `match`.
However, you can set the default_database property of the BigQuery Migration Service and use that default database in the match.
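For example, the following match (a sketch reusing names from the earlier examples) selects the table by schema and relation only, which also works for dialects that have no database component in qualified names:

```yaml
type: object_rewriter
relation:
  - match:
      schema: acme
      relation: employee
    temporary: true
```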
Supported target attribute types
You can use the configuration YAML file to perform attribute type transformations, where you transform the data type of a column from the source type to a target type. The configuration YAML file supports the following target types:
- `BOOLEAN`
- `TINYINT`
- `SMALLINT`
- `INTEGER`
- `BIGINT`
- `FLOAT`
- `DOUBLE`
- `NUMERIC` (supports optional precision and scale, such as `NUMERIC(18, 2)`)
- `TIME`
- `TIMETZ`
- `DATE`
- `DATETIME`
- `TIMESTAMP`
- `TIMESTAMPTZ`
- `CHAR` (supports optional precision, such as `CHAR(42)`)
- `VARCHAR` (supports optional precision, such as `VARCHAR(42)`)
Configuration YAML examples
This section provides examples to create various configuration YAML files to use with your SQL translations. Each example outlines the YAML syntax to transform your SQL translation in specific ways, along with a brief description. Each example also provides the contents of a teradata-input.sql or hive-input.sql file and a bq-output.sql file so that you can compare the effects of a configuration YAML on a BigQuery SQL query translation.
The following examples use Teradata or Hive as the input SQL dialect and BigQuery SQL as the output dialect. The following examples also use testdb as the default database, and testschema as the schema search path.
Change object-name case
The following configuration YAML changes object names to uppercase or lowercase:
```yaml
type: object_rewriter
global:
  case:
    all: UPPERCASE
    database: LOWERCASE
    attribute: LOWERCASE
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int);
select * from x;
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.TESTSCHEMA.X ( a INT64 ) ;
SELECT X.a FROM testdb.TESTSCHEMA.X ;
```
Make table temporary
The following configuration YAML changes a regular table to a temporary table:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    temporary: true
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int);
```

`bq-output.sql`:

```sql
CREATE TEMPORARY TABLE x ( a INT64 ) ;
```
Make table ephemeral
The following configuration YAML changes a regular table to an ephemeral table with a 60 second expiration.
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    ephemeral:
      expireAfterSeconds: 60
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int);
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.testschema.x ( a INT64 ) OPTIONS( expiration_timestamp=timestamp_add(current_timestamp(), interval 60 SECOND) );
```
Set partition expiration
The following configuration YAML changes the expiration of a partitioned table to 1 day:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    partitionLifetime:
      expireAfterSeconds: 86400
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int, b int) partition by (a);
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.testschema.x ( a INT64, b INT64 ) CLUSTER BY a OPTIONS( partition_expiration_days=1 );
```
Change external location or format for a table
The following configuration YAML changes the external location and format for a table:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    external:
      locations: "gs://path/to/department/files"
      format: ORC
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int);
```

`bq-output.sql`:

```sql
CREATE EXTERNAL TABLE testdb.testschema.x ( a INT64 ) OPTIONS( format='ORC', uris=[ 'gs://path/to/department/files' ] );
```
Set or change table description
The following configuration YAML sets the description of a table:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    description:
      text: "Example description."
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int);
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.testschema.x ( a INT64 ) OPTIONS( description='Example description.' );
```
Set or change table partitioning
The following configuration YAML changes the partitioning scheme of a table:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    partition:
      simple:
        add: [a]
  - match: "testdb.testschema.y"
    partition:
      simple:
        remove: [a]
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a date, b int);
create table y(a date, b int) partition by (a);
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.testschema.x ( a DATE, b INT64 ) PARTITION BY a;
CREATE TABLE testdb.testschema.y ( a DATE, b INT64 ) ;
```
Set or change table clustering
The following configuration YAML changes the clustering scheme of a table:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    clustering:
      add: [a]
  - match: "testdb.testschema.y"
    clustering:
      remove: [b]
```

A SQL translation with this configuration YAML file might look like the following:
`hive-input.sql`:

```sql
create table x(a int, b int);
create table y(a int, b int) clustered by (b) into 16 buckets;
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.testschema.x ( a INT64, b INT64 ) CLUSTER BY a;
CREATE TABLE testdb.testschema.y ( a INT64, b INT64 ) ;
```
Change type of a column attribute
The following configuration YAML changes the data type of a column attribute:
```yaml
type: object_rewriter
attribute:
  - match:
      database: testdb
      schema: testschema
      attributeRegex: "a+"
    type:
      target: NUMERIC(10,2)
```

You can transform the source data type to any of the supported target attribute types.
A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int, b int, aa int);
```

`bq-output.sql`:

```sql
CREATE TABLE testdb.testschema.x ( a NUMERIC(31, 2), b INT64, aa NUMERIC(31, 2) ) ;
```
Add connection to external data lake
The following configuration YAML marks the source table as being an external table that points to data stored in an external data lake, specified by a data lake connection.
```yaml
type: object_rewriter
relation:
  - key: "testdb.acme.employee"
    external:
      connection_id: "connection_test"
```

A SQL translation with this configuration YAML file might look like the following:
`hive-input.sql`:

```sql
CREATE TABLE x ( a VARCHAR(150), b INT );
```

`bq-output.sql`:

```sql
CREATE EXTERNAL TABLE x ( a STRING, b INT64 ) WITH CONNECTION `connection_test` OPTIONS( );
```
Change the character encoding of an input file
By default, the BigQuery Migration Service attempts to automatically detect the character encoding of input files. In cases where BigQuery Migration Service might misidentify the encoding of a file, you can use a configuration YAML to specify the character encoding explicitly.
The following configuration YAML specifies the explicit character encoding of the input file as ISO-8859-1.
```yaml
type: experimental_input_formats
formats:
  - source:
      pathGlob: "*.sql"
    contents:
      raw:
        charset: iso-8859-1
```

Global type conversion
The following configuration YAML converts one data type to another across all scripts, which is useful when you want to avoid a given source data type in the transpiled script. This is different from the Change type of a column attribute configuration, which changes the data type of a single attribute only.
BigQuery supports the following data type conversions:
- `DATETIME` to `TIMESTAMP`
- `TIMESTAMP` to `DATETIME` (accepts optional time zone)
- `TIMESTAMP WITH TIME ZONE` to `DATETIME` (accepts optional time zone)
- `CHAR` to `VARCHAR`
In the following example, the configuration YAML converts a TIMESTAMP data type to DATETIME.
```yaml
type: experimental_object_rewriter
global:
  typeConvert:
    timestamp: DATETIME
```

In dialects like Teradata, datetime-related functions such as `current_date`, `current_time`, or `current_timestamp` return timestamps based on the configured time zone, either local or session. BigQuery, on the other hand, always returns timestamps in UTC. To ensure consistent behavior between the two dialects, it is necessary to configure the time zone accordingly.
In the following example, the configuration YAML converts a TIMESTAMP and a TIMESTAMP WITH TIME ZONE data type to DATETIME, with the target time zone set to Europe/Paris.
```yaml
type: experimental_object_rewriter
global:
  typeConvert:
    timestamp:
      target: DATETIME
      timezone: Europe/Paris
    timestamptz:
      target: DATETIME
      timezone: Europe/Paris
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a timestamp);
select a from x where a > current_timestamp(0);
```

`bq-output.sql`:

```sql
CREATE TABLE x ( a TIMESTAMP ) ;
SELECT x.a FROM test.x WHERE x.a > datetime_trunc(current_datetime('Europe/Paris'), SECOND) ;
```
Select statement modification
The following configuration YAML changes the star projection, GROUP BY, and ORDER BY clauses in SELECT statements.
starProjection supports the following configurations:
- `ALLOW`
- `PRESERVE` (default)
- `EXPAND`
groupBy and orderBy support the following configurations:
- `EXPRESSION`
- `ALIAS`
- `INDEX`
In the following example, the configuration YAML configures the star projection to EXPAND.
```yaml
type: experimental_statement_rewriter
select:
  starProjection: EXPAND
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a int, b TIMESTAMP);
select * from x;
```

`bq-output.sql`:

```sql
CREATE TABLE x ( a INT64, b DATETIME ) ;
SELECT x.a, x.b FROM x ;
```
UDF specification
The following configuration YAML specifies the signature of user-defined functions (UDFs) that are used in the source scripts. Much like metadata zip files, UDF definitions can help to produce a more accurate translation of input scripts.
```yaml
type: metadata
udfs:
  - "date parse_short_date(dt int)"
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(dt int);
select parse_short_date(dt) + 1 from x;
```

`bq-output.sql`:

```sql
CREATE TABLE x ( dt INT64 ) ;
SELECT date_add(parse_short_date(x.dt), interval 1 DAY) FROM x ;
```
Setting decimal precision strictness
By default, BigQuery Migration Service increases numeric precision to the highest precision available for a given scale. The following configuration YAML overrides this behavior by configuring the precision strictness to retain the decimal precision of the source statement.
```yaml
type: experimental_statement_rewriter
common:
  decimalPrecision: STRICT
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a decimal(3,0));
```

`bq-output.sql`:

```sql
CREATE TABLE x ( a NUMERIC(3) ) ;
```
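The NUMERIC(31, 2) results shown in other examples in this document come from this default widening. As a sketch, assuming BigQuery's parameterized NUMERIC allows at most 29 integer digits (precision at most scale + 29):

```python
def widened_numeric_precision(scale):
    # Without STRICT precision handling, the service raises precision
    # to the maximum available for the declared scale (assumed here
    # to be scale + 29 for BigQuery parameterized NUMERIC).
    return scale + 29

widened_numeric_precision(2)  # 31, so NUMERIC(10, 2) is emitted as NUMERIC(31, 2)
```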
Setting string precision strictness
By default, BigQuery Migration Service omits string precision when translating CHAR and VARCHAR columns. This can help prevent truncation errors when values are written. Some SQL dialects, such as Teradata, truncate values that exceed the maximum precision on write, while BigQuery returns an error in this scenario.
If your application doesn't rely on the source dialect's truncation behavior, consider preserving the column's precision in the translated type definition.
The following configuration YAML overrides this behavior by configuring the precision strictness to retain the string precision of the source statement.
```yaml
type: experimental_statement_rewriter
common:
  stringPrecision: STRICT
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table x(a varchar(3));
```

`bq-output.sql`:

```sql
CREATE TABLE x ( a STRING(3) ) ;
```
Output name mapping
You can use configuration YAML to map SQL object names. You can change different parts of the name depending on the object being mapped.
Static name mapping
Use static name mapping to map the name of an entity. If you only want to change specific parts of the name while keeping other parts of the name the same, then only include the parts that need to change.
The following configuration YAML changes the name of the table from my_db.my_schema.my_table to my_new_db.my_schema.my_new_table.
```yaml
type: experimental_object_rewriter
relation:
  - match: "my_db.my_schema.my_table"
    outputName:
      database: "my_new_db"
      relation: "my_new_table"
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table my_db.my_schema.my_table(a int);
```

`bq-output.sql`:

```sql
CREATE TABLE my_new_db.my_schema.my_new_table ( a INT64 )
```
You can use static name mapping to update the region used by names in the public user-defined functions.
The following example changes the names in the bqutil.fn UDF from using the default us multi-region to using the europe_west2 region:
```yaml
type: experimental_object_rewriter
function:
  - match:
      database: bqutil
      schema: fn
    outputName:
      database: bqutil
      schema: fn_europe_west2
```

Dynamic name mapping
Use dynamic name mapping to change several objects at the same time, and create new names based on the mapped objects.
The following configuration YAML changes the name of all tables by adding the prefix stg_ to those that belong to the staging schema, and then moves those tables to the production schema.
```yaml
type: experimental_object_rewriter
relation:
  - match:
      schema: staging
    outputName:
      schema: production
      relation: "stg_${relation}"
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table staging.my_table(a int);
```

`bq-output.sql`:

```sql
CREATE TABLE production.stg_my_table ( a INT64 ) ;
```
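The `${relation}` placeholder behaves like simple template substitution. The following is a minimal Python sketch of that renaming; the expansion rules are an assumption based on the example, not a documented API:

```python
from string import Template

def output_relation_name(template, source_parts):
    # Expand ${relation}-style placeholders from the matched entity's
    # name parts, e.g. {"relation": "my_table"}.
    return Template(template).substitute(source_parts)

output_relation_name("stg_${relation}", {"relation": "my_table"})  # 'stg_my_table'
```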
Specifying default database and schema search path
The following configuration YAML specifies a default database and schema search path.
```yaml
type: environment
session:
  defaultDatabase: myproject
  schemaSearchPath: [myschema1, myschema2]
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
SELECT * FROM database.table
SELECT * FROM table1
```

`bq-output.sql`:

```sql
SELECT * FROM myproject.database.table
SELECT * FROM myproject.myschema1.table1
```
Global output name rewrite
The following configuration YAML changes the output names of all objects (database, schema, relation, and attributes) in the script according to the configured rules.
```yaml
type: experimental_object_rewriter
global:
  outputName:
    regex:
      - match: '\s'
        replaceWith: '_'
      - match: '>='
        replaceWith: 'gte'
      - match: '^[^a-zA-Z_].*'
        replaceWith: '_$0'
```

A SQL translation with this configuration YAML file might look like the following:
`teradata-input.sql`:

```sql
create table "test special chars >= 12"("42eid" int, "custom column" varchar(10));
```

`bq-output.sql`:

```sql
CREATE TABLE test_special_chars_gte_12 ( _42eid INT64, custom_column STRING ) ;
```
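The rule list behaves like a sequence of regex substitutions applied in order. The following Python approximation uses the stdlib `re` module to stand in for RE2/J, with `\g<0>` replacing the `$0` back-reference syntax:

```python
import re

RULES = [
    (r"\s", "_"),                   # replace whitespace with underscores
    (r">=", "gte"),                 # spell out the comparison operator
    (r"^[^a-zA-Z_].*", r"_\g<0>"),  # prefix names starting with an invalid character
]

def rewrite_name(name):
    # Apply each rule to the result of the previous one.
    for pattern, replacement in RULES:
        name = re.sub(pattern, replacement, name)
    return name

rewrite_name("custom column")  # 'custom_column'
rewrite_name("42eid")          # '_42eid'
```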
Optimize and improve the performance of translated SQL
You can apply optional transformations to translated SQL to introduce changes that can improve the query in terms of performance or cost. These optimizations are strictly case-dependent and should be evaluated against the unmodified SQL output to assess their actual effect on performance.
The following configuration YAML enables optional transformations. The configuration accepts a list of optimizations and, for optimizations which accept parameters, a section with optional parameter values.
```yaml
type: optimizer
transformations:
  - name: PRECOMPUTE_INDEPENDENT_SUBSELECTS
  - name: REWRITE_CTE_TO_TEMP_TABLE
    parameters:
      threshold: 1
```

| Optimization | Optional parameter | Description |
|---|---|---|
| PRECOMPUTE_INDEPENDENT_SUBSELECTS | scope: [PREDICATE, PROJECTION] | Rewrites the query by adding a DECLARE statement to replace an expression in either PREDICATE clauses or the PROJECTION with a precomputed variable. The variable is identified as a static predicate, allowing for a reduction of the amount of data read. If the scope is omitted, the default value is PREDICATE (that is, WHERE and JOIN-ON clauses). Extracting a scalar subquery to a DECLARE statement makes the original predicate static and therefore eligible for improved execution planning. This optimization introduces new SQL statements. |
| REWRITE_CTE_TO_TEMP_TABLE | threshold: N | Rewrites common table expressions (CTEs) to temporary tables when there are more than N references to the same CTE. This reduces query complexity and forces single execution of the CTE. If N is omitted, the default value is 4. We recommend using this optimization when non-trivial CTEs are referenced multiple times. Introducing temporary tables has an overhead that might be larger than eventual multiple executions of a low-complexity or low-cardinality CTE. This optimization introduces new SQL statements. |
| REWRITE_ZERO_SCALE_NUMERIC_AS_INTEGER | bigint: N | Rewrites zero-scale NUMERIC/BIGNUMERIC attributes to the INT64 type if the precision is within N. If N is omitted, the default value is 18. We recommend using this optimization when translating from source dialects that don't have integer types. Changing column types requires reviewing all downstream uses for type compatibility and semantic changes, for example, fractional divisions becoming integer divisions, or code expecting numeric values. |
| DROP_TEMP_TABLE | | Adds DROP TABLE statements for all temporary tables created in a script and not dropped by the end of it. This reduces the storage billing period for the temporary table from 24 hours to the script running time. We recommend using this optimization when temporary tables are not accessed for any further processing after the end of script execution. This optimization introduces new SQL statements. |
| REGEXP_CONTAINS_TO_LIKE | | Rewrites some categories of REGEXP_CONTAINS matching patterns to LIKE expressions. We recommend using this optimization when no other process, such as macro replacement, relies on the regular expression pattern literals being preserved unchanged in the output SQL. |
| ADD_DISTINCT_TO_SUBQUERY_IN_SET_COMPARISON | | Adds a DISTINCT clause to subqueries used as the value set for the [NOT] IN operator. We recommend using this optimization when the cardinality (distinct number of values) of the subquery result is significantly lower than the number of values. When this precondition is not met, this transformation can have negative effects on performance. |
Create a Gemini-based configuration YAML file
To generate AI output, the source directory containing your SQL translation input must include a configuration YAML file.
Requirements
The configuration YAML file for AI outputs must have a suffix of .ai_config.yaml. For example, rules_1.ai_config.yaml.
Supported fields
You can use the following fields to configure your AI translation output:
- `suggestion_type` (optional): Specify the type of AI suggestion to be generated. The following suggestion types are supported:
  - `QUERY_CUSTOMIZATION` (default): Generates AI suggestions for SQL code based on the translation rules specified in the configuration YAML file.
  - `TRANSLATION_EXPLANATION`: Generates text that includes a summary of the translated GoogleSQL query and the differences and inconsistencies between the source SQL query and the translated GoogleSQL query.
- `rewrite_target` (optional): Specify `SOURCE_SQL` if you want to apply the translation rule to your input SQL, or `TARGET_SQL` (default) if you want to apply the translation rule to your output SQL.
- `instruction` (optional): In natural language, describe a change to the target SQL. The Gemini-enhanced SQL translation assesses the request and makes the specified change.
- `examples` (optional): Provide SQL examples of how you want the SQL pattern to be modified.
You can add additional translation_rules and additional examples as necessary.
Examples
The following examples create Gemini-based configuration YAML files which you can use with your SQL translations.
Remove the upper function in the default translation output query
```yaml
translation_rules:
  - instruction: "Remove upper() function"
    examples:
      - input: "upper(X)"
        output: "X"
```

Create multiple translation rules to customize the translation output
```yaml
translation_rules:
  - instruction: "Remove upper() function"
    suggestion_type: QUERY_CUSTOMIZATION
    rewrite_target: TARGET_SQL
    examples:
      - input: "upper(X)"
        output: "X"
  - instruction: "Insert a comment at the head that explains each statement in detail."
    suggestion_type: QUERY_CUSTOMIZATION
    rewrite_target: TARGET_SQL
```

Remove SQL comments from the translation input query
```yaml
translation_rules:
  - instruction: "Remove all the sql comments in the input sql query."
    suggestion_type: QUERY_CUSTOMIZATION
    rewrite_target: SOURCE_SQL
```

Generate translation explanations using default LLM prompt
This example uses the default LLM prompts provided by the translation service to generate text explanations:
```yaml
translation_rules:
  - suggestion_type: "TRANSLATION_EXPLANATION"
```

Generate translation explanations using your own natural language prompts
```yaml
translation_rules:
  - suggestion_type: "TRANSLATION_EXPLANATION"
    instruction: "Explain the syntax differences between the source Teradata query and the translated GoogleSQL query."
```

Multiple suggestion types in a single configuration YAML file
```yaml
translation_rules:
  - suggestion_type: "TRANSLATION_EXPLANATION"
    instruction: "Explain the syntax differences between the source Teradata query and the translated GoogleSQL query."
  - instruction: "Remove upper() function"
    suggestion_type: QUERY_CUSTOMIZATION
    rewrite_target: TARGET_SQL
    examples:
      - input: "upper(X)"
        output: "X"
  - instruction: "Remove all the sql comments in the input sql query."
    suggestion_type: QUERY_CUSTOMIZATION
    rewrite_target: SOURCE_SQL
```

Applying multiple YAML configurations
When specifying a configuration YAML file in a batch or interactive SQL translation, you can select multiple configuration YAML files in a single translation job to reflect multiple transformations. If multiple configurations conflict, one transformation might override another. We recommend using different types of configuration settings in each file to avoid conflicting transformations in the same translation job.
The following example lists two separate configuration YAML files that were provided for a single SQL translation job, one to change a column's data type, and the other to set the table as temporary:
change-type-example.config.yaml:
```yaml
type: object_rewriter
attribute:
  - match: "testdb.testschema.x.a"
    type:
      target: NUMERIC(10,2)
```

make-temp-example.config.yaml:
```yaml
type: object_rewriter
relation:
  - match: "testdb.testschema.x"
    temporary: true
```

A SQL translation with these two configuration YAML files might look like the following:
`teradata-input.sql`:

```sql
create table x(a int);
```

`bq-output.sql`:

```sql
CREATE TEMPORARY TABLE x ( a NUMERIC(31, 2) ) ;
```