# Added runtime validation of SQL expressions (#625)
## Changes

* Added runtime validation of expressions in the `sql_expression` check. Note that this covers only a subset of error conditions; some issues surface only during a Spark action.
* Fixed tests:
  * Improved clean-up of orphaned Lakebase instances.
  * Fixed tests for storing checks in a volume after the SDK upgrade.
* Removed the test for listing tables with a pattern matching all catalogs and schemas, as it caused failures when other tests run in parallel.

### Tests

- [ ] manually tested
- [ ] added unit tests
- [x] added integration tests
- [ ] added end-to-end tests
- [ ] added performance tests
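For context, a minimal sketch of the kind of check this PR validates at runtime, assuming the metadata-based API shown in the DQX docs; the DataFrame and columns `a`, `b` are illustrative:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# Hypothetical input data; `spark` is provided by the Databricks runtime.
df = spark.createDataFrame([("str1", "str0"), ("str2", "str3")], "a string, b string")

checks = [
    {
        "criticality": "error",
        "check": {
            "function": "sql_expression",
            "arguments": {
                "expression": "a = 'str1' and a > b",
                "msg": "a must be 'str1' and greater than b",
            },
        },
    }
]

dq_engine = DQEngine(WorkspaceClient())
# With the runtime validation added here, an expression referencing
# unresolvable columns is reported as a check failure in the results
# rather than surfacing later during a Spark action.
checked_df = dq_engine.apply_checks_by_metadata(df, checks)
```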
`docs/dqx/docs/guide/quality_checks_apply.mdx` (1 addition, 1 deletion)
@@ -32,7 +32,7 @@ Checks can be applied to the input data by one of the following methods of the `
 You can see the full list of `DQEngine` methods [here](/docs/reference/engine/#dqx-engine-methods).

-The engine ensures that the specified `column`, `columns`, or `filter` fields can be resolved in the input DataFrame. If any of these fields are invalid, the check evaluation is skipped, and the results include the check failure with a message identifying the invalid fields.
+The engine ensures that the specified `column`, `columns`, `filter`, or SQL `expression` fields can be resolved in the input DataFrame. If any of these fields are invalid, the check evaluation is skipped, and the results include a check failure with a message identifying the invalid fields.

 The engine will raise an error if you try to apply checks with an invalid definition (e.g. wrong syntax).
 In addition, you can also perform a standalone syntax validation of the checks as described [here](/docs/guide/quality_checks_definition#validating-syntax-of-quality-checks).
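To illustrate the standalone syntax validation mentioned above, a minimal sketch assuming `DQEngine.validate_checks` as documented for metadata-defined checks; the misspelled function name is a deliberately invalid example:

```python
from databricks.labs.dqx.engine import DQEngine

# Hypothetical check with a definition error (misspelled function name).
checks = [
    {
        "criticality": "error",
        "check": {"function": "sql_expresion", "arguments": {"expression": "a > b"}},
    }
]

# Syntax validation runs without any input DataFrame; applying such a
# check would instead raise an error, as the guide describes.
status = DQEngine.validate_checks(checks)
if status.has_errors:
    print(status)
```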
`docs/dqx/docs/reference/quality_checks.mdx` (1 addition, 1 deletion)
@@ -48,7 +48,7 @@ You can also define your own custom checks (see [Creating custom checks](#creati
 |`is_ipv4_address_in_cidr`| Checks whether the values in the input column have valid IPv4 address format and fall within the given CIDR block. |`column`: column to check (can be a string column name or a column expression); `cidr_block`: CIDR block string |
 |`is_valid_ipv6_address`| Checks whether the values in the input column have valid IPv6 address format. |`column`: column to check (can be a string column name or a column expression) |
 |`is_ipv6_address_in_cidr`| Checks whether the values in the input column have valid IPv6 address format and fall within the given CIDR block. |`column`: column to check (can be a string column name or a column expression); `cidr_block`: CIDR block string |
-|`sql_expression`| Checks whether the values meet the condition provided as an SQL expression, e.g. `a = 'str1' and a > b`. SQL expressions are evaluated at runtime, so ensure that the expression is safe and that functions used within it (e.g. h3_ischildof, division) do not throw exceptions. You can achieve this by validating input arguments or columns beforehand using guards such as CASE WHEN, IS NOT NULL, RLIKE, or type try casts. |`expression`: sql expression to check on a DataFrame (fail the check if expression evaluates to False, pass if it evaluates to True); `msg`: optional message to output; `name`: optional name of the resulting column (it can be overwritten by `name` specified at the check level); `negate`: if the condition should be negated; `columns`: optional list of columns used in the sql expression to validate they can be resolved in the input DataFrame, also used for reporting purposes and as a name prefix when a check name is not provided |
+|`sql_expression`| Checks whether the values meet the condition provided as an SQL expression, e.g. `a = 'str1' and a > b`. If the SQL expression is invalid (for example, it references non-existent columns), the check evaluation is skipped and the results include a check failure with a message identifying the invalid columns. However, functions used within the SQL expression are generally opaque to Spark, so you must ensure they do not raise exceptions (for instance, pass valid arguments to `h3_ischildof`); such errors are not handled automatically. Some errors only surface during a Spark action (e.g. when displaying or saving the results), so validate the input arguments using guards such as CASE WHEN, IS NOT NULL, RLIKE, or type try-casts. |`expression`: SQL expression to check on a DataFrame (fail the check if the expression evaluates to False, pass if it evaluates to True); `msg`: optional message to output; `name`: optional name of the resulting column (it can be overwritten by `name` specified at the check level); `negate`: whether the condition should be negated; `columns`: optional list of columns used in the SQL expression to validate that they can be resolved in the input DataFrame, also used for reporting purposes and as a name prefix when a check name is not provided |
 |`is_data_fresh`| Checks whether the values in the input timestamp column are not older than the specified number of minutes from the base timestamp column. This is useful for identifying stale data due to delayed pipelines and helps catch upstream issues early. |`column`: column of type timestamp/date to check (can be a string column name or a column expression); `max_age_minutes`: maximum age in minutes before data is considered stale; `base_timestamp`: optional base timestamp column from which the stale check is calculated. This can be a string, column expression, datetime value, or literal value, e.g. `F.lit(datetime(2024,1,1))`. If not provided, the `current_timestamp()` function is used |
 |`does_not_contain_pii`| Checks whether the values in the input column contain Personally Identifiable Information (PII). Uses Microsoft Presidio to detect various named entities (e.g. PERSON, ADDRESS, EMAIL_ADDRESS). Requires installation of PII detection extras: `pip install 'databricks-labs-dqx[pii-detection]'`. See more details [here](#detecting-personally-identifiable-information-pii). |`column`: column to check (can be a string column name or a column expression); `threshold`: confidence threshold for PII detection (0.0 to 1.0, default: 0.7); `language`: optional language of the text (default: 'en'); `entities`: optional list of entities to detect; `nlp_engine_config`: optional dictionary configuring the NLP engine used for PII detection, see the [Presidio documentation](https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/) for more information |
 |`is_latitude`| Checks whether the values in the input column are valid latitude values (i.e. between -90 and 90 degrees). |`column`: column to check (can be a string column name or a column expression) |
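To make the guarding advice in the `sql_expression` row concrete, a sketch of a guarded expression using Spark SQL's `try_cast`; the `amount` column and the threshold are hypothetical:

```python
# A guarded sql_expression check: try_cast returns NULL instead of throwing
# on non-numeric input, so rows with bad values fail the check (the
# expression evaluates to false) rather than raising during a Spark action.
guarded_check = {
    "criticality": "warn",
    "check": {
        "function": "sql_expression",
        "arguments": {
            "expression": (
                "try_cast(amount AS double) IS NOT NULL "
                "AND try_cast(amount AS double) > 0"
            ),
            "msg": "amount must be a positive number",
            # Lets the engine verify the column resolves in the input DataFrame.
            "columns": ["amount"],
        },
    },
}
```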