
Commit 118ae92

Added runtime validation of sql expressions (#625)
## Changes

* Added runtime validation of expressions in the `sql_expression` check. Note that this covers only a subset of error conditions; some issues only surface during a Spark action.
* Fixed tests:
  * Improved cleanup of orphaned Lakebase instances.
  * Fixed tests for storing checks in a volume after the SDK upgrade.
* Removed the test for listing tables with a pattern matching all catalogs and schemas, as it caused failures when other tests run in parallel.

### Tests

- [ ] manually tested
- [ ] added unit tests
- [x] added integration tests
- [ ] added end-to-end tests
- [ ] added performance tests
1 parent 6fbf5ae commit 118ae92

File tree

10 files changed: +160 −58 lines

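For context, a minimal sketch of the behavior this commit adds, pieced together from the rule and result shapes visible in the test diff further down. The `DQEngine` and `DQRowRule` import paths, the workspace client `ws`, and the input DataFrame `test_df` are assumptions/placeholders, not taken from the commit:

```python
# Sketch only: illustrates the new runtime validation of sql_expression checks.
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine  # assumed import path
from databricks.labs.dqx.rule import DQRowRule   # assumed import path

checks = [
    DQRowRule(
        name="invalid_col_sql_expression",
        criticality="error",
        check_func=check_funcs.sql_expression,
        check_func_args=["missing_col > 0"],  # "missing_col" does not exist in the input DataFrame
        check_func_kwargs={"msg": "missing_col is less than 0"},
    ),
]

# ws = WorkspaceClient(); test_df = spark.createDataFrame(...)  # placeholders for this sketch
dq_engine = DQEngine(ws)
checked = dq_engine.apply_checks(test_df, checks)
# The check is skipped rather than raising; its result row carries a message like:
#   "Check evaluation skipped due to invalid sql expression: 'missing_col > 0'"
```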

docs/dqx/docs/guide/quality_checks_apply.mdx

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ Checks can be applied to the input data by one of the following methods of the `

 You can see the full list of `DQEngine` methods [here](/docs/reference/engine/#dqx-engine-methods).

-The engine ensures that the specified `column`, `columns`, or `filter` fields can be resolved in the input DataFrame. If any of these fields are invalid, the check evaluation is skipped, and the results include the check failure with a message identifying the invalid fields.
+The engine ensures that the specified `column`, `columns`, `filter`, or SQL `expression` fields can be resolved in the input DataFrame. If any of these fields are invalid, the check evaluation is skipped, and the results include the check failure with a message identifying the invalid fields.
 The engine will raise an error if you try to apply checks with invalid definition (e.g. wrong syntax).
 In addition, you can also perform a standalone syntax validation of the checks as described [here](/docs/guide/quality_checks_definition#validating-syntax-of-quality-checks).

docs/dqx/docs/reference/quality_checks.mdx

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ You can also define your own custom checks (see [Creating custom checks](#creati
 | `is_ipv4_address_in_cidr` | Checks whether the values in the input column have valid IPv4 address format and fall within the given CIDR block. | `column`: column to check (can be a string column name or a column expression); `cidr_block`: CIDR block string |
 | `is_valid_ipv6_address` | Checks whether the values in the input column have valid IPv6 address format. | `column` to check (can be a string column name or a column expression) |
 | `is_ipv6_address_in_cidr` | Checks whether the values in the input column have valid IPv6 address format and fall within the given CIDR block. | `column`: column to check (can be a string column name or a column expression); `cidr_block`: CIDR block string |
-| `sql_expression` | Checks whether the values meet the condition provided as an SQL expression, e.g. `a = 'str1' and a > b`. SQL expressions are evaluated at runtime, so ensure that the expression is safe and that functions used within it (e.g. h3_ischildof, division) do not throw exceptions. You can achieve this by validating input arguments or columns beforehand using guards such as CASE WHEN, IS NOT NULL, RLIKE, or type try casts. | `expression`: sql expression to check on a DataFrame (fail the check if expression evaluates to False, pass if it evaluates to True); `msg`: optional message to output; `name`: optional name of the resulting column (it can be overwritten by `name` specified at the check level); `negate`: if the condition should be negated; `columns`: optional list of columns used in the sql expression to validate they can be resolved in the input DataFrame, also used for reporting purposes and as a name prefix when a check name is not provided |
+| `sql_expression` | Checks whether the values meet the condition provided as an SQL expression, e.g. `a = 'str1' and a > b`. If the SQL expression is invalid (for example, it references non-existent columns), the check evaluation is skipped and the results include a check failure with a message identifying the invalid columns. However, when the expression uses functions that Spark cannot validate up front, you must ensure they do not raise exceptions (for instance, pass valid arguments to `h3_ischildof`), as such errors are not handled automatically. Some errors only appear during a Spark action (e.g. when displaying or saving the results), so validate the inputs beforehand using guards such as `CASE WHEN`, `IS NOT NULL`, `RLIKE`, or type try casts. | `expression`: sql expression to check on a DataFrame (fail the check if expression evaluates to False, pass if it evaluates to True); `msg`: optional message to output; `name`: optional name of the resulting column (it can be overwritten by `name` specified at the check level); `negate`: if the condition should be negated; `columns`: optional list of columns used in the sql expression to validate they can be resolved in the input DataFrame, also used for reporting purposes and as a name prefix when a check name is not provided |
 | `is_data_fresh` | Checks whether the values in the input timestamp column are not older than the specified number of minutes from the base timestamp column. This is useful for identifying stale data due to delayed pipelines and helps catch upstream issues early. | `column`: column of type timestamp/date to check (can be a string column name or a column expression); `max_age_minutes`: maximum age in minutes before data is considered stale; `base_timestamp`: optional base timestamp column from which the stale check is calculated. This can be a string, column expression, datetime value or literal value ex:F.lit(datetime(2024,1,1)). If not provided current_timestamp() function is used |
 | `does_not_contain_pii` | Checks whether the values in the input column contain Personally Identifiable Information (PII). Uses Microsoft Presidio to detect various named entities (e.g. PERSON, ADDRESS, EMAIL_ADDRESS). Requires installation of PII detection extras: `pip install 'databricks-labs-dqx[pii-detection]'`. See more details [here](#detecting-personally-identifiable-information-pii). | `column`: column to check (can be a string column name or a column expression); `threshold`: confidence threshold for PII detection (0.0 to 1.0, default: 0.7); `language`: optional language of the text (default: 'en'); `entities`: optional list of entities to detect; `nlp_engine_config`: optional dictionary configuring the NLP engine used for PII detection, see the [Presidio documentation](https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/) for more information |
 | `is_latitude` | Checks whether the values in the input column are valid latitude values (i.e. between -90 and 90 degrees). | `column`: column to check (can be a string column name or a column expression) |
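As a concrete illustration of the guards recommended in the `sql_expression` row above, here is a hedged sketch of a metadata-style check whose expression only calls the function when its inputs are safe. The column names (`cell_id`, `parent_id`) and the guard semantics are illustrative only, not from the commit:

```python
# Sketch: guard a function call inside a sql_expression check so it cannot throw
# at Spark action time. Column names (cell_id, parent_id) are hypothetical.
guarded_check = {
    "name": "h3_child_of_parent",
    "criticality": "warn",
    "check": {
        "function": "sql_expression",
        "arguments": {
            # Only evaluate h3_ischildof when both inputs are non-null and castable to BIGINT;
            # otherwise treat the row as passing the guard (adjust to your own semantics).
            "expression": """
                CASE
                    WHEN try_cast(cell_id AS BIGINT) IS NOT NULL
                         AND try_cast(parent_id AS BIGINT) IS NOT NULL
                    THEN h3_ischildof(try_cast(cell_id AS BIGINT), try_cast(parent_id AS BIGINT))
                    ELSE TRUE
                END
            """,
            "msg": "cell_id is not a child of parent_id",
        },
    },
}
```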

src/databricks/labs/dqx/config.py

Lines changed: 20 additions & 1 deletion
@@ -2,6 +2,8 @@
 from functools import cached_property
 from datetime import datetime, timezone
 from dataclasses import dataclass, field
+
+from databricks.labs.dqx.checks_serializer import FILE_SERIALIZERS
 from databricks.labs.dqx.errors import InvalidConfigError, InvalidParameterError

 __all__ = [
@@ -277,7 +279,24 @@ class VolumeFileChecksStorageConfig(BaseChecksStorageConfig):

     def __post_init__(self):
         if not self.location:
-            raise InvalidConfigError("The Unity Catalog volume file path ('location' field) must not be empty or None.")
+            raise InvalidParameterError(
+                "The Unity Catalog volume file path ('location' field) must not be empty or None."
+            )
+
+        # Expected format: /Volumes/{catalog}/{schema}/{volume}/{path/to/file}
+        if not self.location.startswith("/Volumes/"):
+            raise InvalidParameterError("The volume path must start with '/Volumes/'.")
+
+        parts = self.location.split("/")
+        # After split need at least: ['', 'Volumes', 'catalog', 'schema', 'volume', optional 'dir', 'file']
+        if len(parts) < 3 or not parts[2]:
+            raise InvalidParameterError("Invalid path: Path is missing a catalog name")
+        if len(parts) < 4 or not parts[3]:
+            raise InvalidParameterError("Invalid path: Path is missing a schema name")
+        if len(parts) < 5 or not parts[4]:
+            raise InvalidParameterError("Invalid path: Path is missing a volume name")
+        if len(parts) < 6 or not parts[-1].lower().endswith(tuple(FILE_SERIALIZERS.keys())):
+            raise InvalidParameterError("Invalid path: Path must include a file name after the volume")


 @dataclass
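A hedged sketch of how the new `__post_init__` validation above behaves, assuming `location` is the only required field of `VolumeFileChecksStorageConfig` and that the `FILE_SERIALIZERS` keys cover the yaml/json extensions used elsewhere in this commit (`checks.yaml` / `checks.json`); the catalog/schema/volume names are hypothetical:

```python
# Sketch: volume paths are now validated eagerly at construction time.
from databricks.labs.dqx.config import VolumeFileChecksStorageConfig
from databricks.labs.dqx.errors import InvalidParameterError

# Valid: catalog, schema, volume and a serializable file name are all present.
VolumeFileChecksStorageConfig(location="/Volumes/main/dq/checks_volume/checks.yaml")

# Invalid examples now fail fast instead of on first use:
for bad in [
    "dbfs:/Volumes/main/dq/checks_volume/checks.yaml",  # does not start with /Volumes/
    "/Volumes/main/dq",                                  # missing volume name
    "/Volumes/main/dq/checks_volume",                    # missing file name after the volume
]:
    try:
        VolumeFileChecksStorageConfig(location=bad)
    except InvalidParameterError as e:
        print(f"{bad!r}: {e}")
```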

src/databricks/labs/dqx/manager.py

Lines changed: 28 additions & 1 deletion
@@ -7,6 +7,7 @@
 from pyspark.errors import AnalysisException
 from pyspark.sql import DataFrame, Column, SparkSession

+from databricks.labs.dqx import check_funcs
 from databricks.labs.dqx.executor import DQCheckResult, DQRuleExecutorFactory
 from databricks.labs.dqx.rule import (
     DQRule,
@@ -94,6 +95,23 @@ def has_invalid_filter(self) -> bool:
         """
         return self._is_invalid_column(self.filter_condition)

+    @cached_property
+    def invalid_sql_expression(self) -> str | None:
+        """
+        Returns an invalid expression for sql expression check.
+        """
+        if self.check.check_func is check_funcs.sql_expression:
+            if "expression" in self.check.check_func_kwargs:
+                field_value = self.check.check_func_kwargs["expression"]
+            elif self.check.check_func_args:
+                field_value = self.check.check_func_args[0]
+            else:
+                return None  # should never happen, as it is validated for correct args when building rules
+
+            if self._is_invalid_column(field_value):
+                return field_value
+        return None
+
     def process(self) -> DQCheckResult:
         """
         Process the data quality rule (check) and return results as DQCheckResult containing:
@@ -151,6 +169,14 @@ def _get_invalid_cols_message(self) -> str:
                 f"Check evaluation skipped due to invalid check filter: '{self.check.filter}'"
             )

+        if self.invalid_sql_expression:
+            logger.warning(
+                f"Skipping check '{self.check.name}' due to invalid sql expression: '{self.invalid_sql_expression}'"
+            )
+            invalid_cols_message_parts.append(
+                f"Check evaluation skipped due to invalid sql expression: '{self.invalid_sql_expression}'"
+            )
+
         invalid_cols_message = "; ".join(invalid_cols_message_parts)

         return invalid_cols_message
@@ -164,7 +190,8 @@ def _is_invalid_column(self, column: str | Column) -> bool:
             col_expr = F.expr(column) if isinstance(column, str) else column
             _ = self.df.select(col_expr).schema  # perform logical plan validation without triggering computation
         except AnalysisException as e:
-            # if column is not accessible or column expression cannot be evaluated, an AnalysisException is thrown
+            # If column is not accessible or column expression cannot be evaluated, an AnalysisException is thrown.
+            # Note: This does not cover all error conditions. Some issues only appear during a Spark action.
             logger.debug(
                 f"Invalid column '{column}' provided in the check '{self.check.name}'",
                 exc_info=e,
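The `_is_invalid_column` change above relies on Spark's analysis phase: resolving the schema of a `select()` forces logical-plan validation without running a job, but it cannot catch every failure. A standalone sketch of that distinction (the DataFrame and expressions are illustrative only):

```python
# Sketch: analysis-time vs. action-time errors for SQL expressions.
from pyspark.errors import AnalysisException
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], "id INT, name STRING")

try:
    # Unresolved column -> AnalysisException during analysis, no job is triggered.
    _ = df.select(F.expr("missing_col > 0")).schema
except AnalysisException:
    print("caught at analysis time: expression references a non-existent column")

# Not everything is caught this way: some failures only surface during an action.
_ = df.select(F.expr("assert_true(id > 1)")).schema   # analysis succeeds
# df.select(F.expr("assert_true(id > 1)")).collect()  # would fail only when the action runs
```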

tests/conftest.py

Lines changed: 6 additions & 5 deletions
@@ -1,5 +1,6 @@
 import os
 from datetime import timedelta
+from io import BytesIO
 from typing import Any
 import re
 import logging
@@ -32,7 +33,7 @@

 @pytest.fixture(scope="session")
 def debug_env_name():
-    return "ws2"  # Specify the name of the debug environment from ~/.databricks/debug-env.json
+    return "ws"  # Specify the name of the debug environment from ~/.databricks/debug-env.json


 @pytest.fixture
@@ -613,7 +614,7 @@ def create(**kwargs):
         folder = make_directory()
         volume_file_path = str(folder.absolute()) + "/checks.yaml"

-        ws.files.upload(volume_file_path, checks_yaml_content.encode(), overwrite=True)
+        ws.files.upload(volume_file_path, BytesIO(checks_yaml_content.encode()), overwrite=True)

         return volume_file_path
@@ -632,7 +633,7 @@ def create(**kwargs):
         folder = make_directory()
         volume_file_path = str(folder.absolute()) + "/checks.json"

-        ws.files.upload(volume_file_path, checks_json_content.encode(), overwrite=True)
+        ws.files.upload(volume_file_path, BytesIO(checks_json_content.encode()), overwrite=True)

         return volume_file_path
@@ -651,7 +652,7 @@ def create(**kwargs):
         folder = make_directory()
         volume_file_path = str(folder.absolute()) + "/checks.yaml"

-        ws.files.upload(volume_file_path, checks_yaml_invalid_content.encode(), overwrite=True)
+        ws.files.upload(volume_file_path, BytesIO(checks_yaml_invalid_content.encode()), overwrite=True)

         return volume_file_path
@@ -670,7 +671,7 @@ def create(**kwargs):
         folder = make_directory()
         volume_file_path = str(folder.absolute()) + "/checks.json"

-        ws.files.upload(volume_file_path, checks_json_invalid_content.encode(), overwrite=True)
+        ws.files.upload(volume_file_path, BytesIO(checks_json_invalid_content.encode()), overwrite=True)

         return volume_file_path
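A brief note on the `BytesIO` change above: the workspace Files API `upload` takes a binary stream for its contents argument rather than raw `bytes`, which is presumably what broke after the SDK upgrade. A minimal hedged sketch (the volume path and YAML payload are illustrative, and `WorkspaceClient()` assumes credentials are already configured):

```python
# Sketch: uploading serialized checks to a UC volume via the workspace Files API.
from io import BytesIO
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
checks_yaml_content = (
    "- criticality: error\n"
    "  check:\n"
    "    function: is_not_null\n"
    "    arguments:\n"
    "      column: id\n"
)
volume_file_path = "/Volumes/main/dq/checks_volume/checks.yaml"  # hypothetical path

# Wrap the bytes in a stream, matching the calls in the conftest diff above.
ws.files.upload(volume_file_path, BytesIO(checks_yaml_content.encode()), overwrite=True)
```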

tests/integration/test_apply_checks.py

Lines changed: 43 additions & 2 deletions
@@ -7899,6 +7899,16 @@ def test_apply_checks_skip_checks_with_missing_columns(ws, spark):
             columns=["missing_col"],
             filter="missing_col > 0",
         ),
+        # invalid sql expression column
+        DQRowRule(
+            name="invalid_col_sql_expression",
+            criticality="error",
+            check_func=check_funcs.sql_expression,
+            check_func_args=["missing_col > 0"],  # verify validation works when using args
+            check_func_kwargs={
+                "msg": "missing_col is less than 0",
+            },
+        ),
     ]

     checked = dq_engine.apply_checks(test_df, checks)
@@ -7933,7 +7943,8 @@ def test_apply_checks_skip_checks_with_missing_columns(ws, spark):
            },
            {
                "name": "missing_col_sql_expression",
-               "message": "Check evaluation skipped due to invalid check columns: ['missing_col']",
+               "message": "Check evaluation skipped due to invalid check columns: ['missing_col']; "
+               "Check evaluation skipped due to invalid sql expression: 'missing_col > 0'",
                "columns": ["missing_col"],
                "filter": None,
                "function": "sql_expression",
@@ -7950,6 +7961,15 @@ def test_apply_checks_skip_checks_with_missing_columns(ws, spark):
                "run_time": RUN_TIME,
                "user_metadata": {},
            },
+           {
+               "name": "invalid_col_sql_expression",
+               "message": "Check evaluation skipped due to invalid sql expression: 'missing_col > 0'",
+               "columns": None,
+               "filter": None,
+               "function": "sql_expression",
+               "run_time": RUN_TIME,
+               "user_metadata": {},
+           },
        ],
        [
            {
@@ -8050,6 +8070,17 @@ def test_apply_checks_by_metadata_skip_checks_with_missing_columns(ws, spark):
                "arguments": {"columns": ["missing_col"]},
            },
        },
+       {
+           "name": "invalid_col_sql_expression",
+           "criticality": "error",
+           "check": {
+               "function": "sql_expression",
+               "arguments": {
+                   "expression": "missing_col > 0",
+                   "msg": "missing_col is less than 0",
+               },
+           },
+       },
    ]

    checked = dq_engine.apply_checks_by_metadata(test_df, checks)
@@ -8084,7 +8115,8 @@ def test_apply_checks_by_metadata_skip_checks_with_missing_columns(ws, spark):
            },
            {
                "name": "missing_col_sql_expression",
-               "message": "Check evaluation skipped due to invalid check columns: ['missing_col']",
+               "message": "Check evaluation skipped due to invalid check columns: ['missing_col']; "
+               "Check evaluation skipped due to invalid sql expression: 'missing_col > 0'",
                "columns": ["missing_col"],
                "filter": None,
                "function": "sql_expression",
@@ -8101,6 +8133,15 @@ def test_apply_checks_by_metadata_skip_checks_with_missing_columns(ws, spark):
                "run_time": RUN_TIME,
                "user_metadata": {},
            },
+           {
+               "name": "invalid_col_sql_expression",
+               "message": "Check evaluation skipped due to invalid sql expression: 'missing_col > 0'",
+               "columns": None,
+               "filter": None,
+               "function": "sql_expression",
+               "run_time": RUN_TIME,
+               "user_metadata": {},
+           },
        ],
        [
            {
