Conversation

@mwojtyczka mwojtyczka commented Oct 20, 2025

Changes

  • Added runtime validation of expressions in the sql_expression check. Note that this covers only a subset of error conditions; some issues surface only during a Spark action.
  • Fixed tests:
    • Improved cleanup of orphaned Lakebase instances
    • Fixed tests for storing checks in a volume after the SDK upgrade
    • Removed the test for listing tables with a pattern matching all catalogs and schemas, as it causes failures when other tests run in parallel

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

github-actions bot commented Oct 20, 2025

✅ 406/406 passed, 12 flaky, 1 skipped, 4h45m0s total

Flaky tests:

  • 🤪 test_profile_tables_for_patterns_with_exclude_patterns (80ms)
  • 🤪 test_uninstallation (86ms)
  • 🤪 test_quality_checker_workflow_for_multiple_run_configs (83ms)
  • 🤪 test_uninstallation_with_custom_folder (82ms)
  • 🤪 test_profile_tables_for_patterns_with_common_opts (83ms)
  • 🤪 test_quality_checker_workflow_with_custom_install_folder (383ms)
  • 🤪 test_profile_tables_for_patterns_with_different_opts (83ms)
  • 🤪 test_quality_checker_workflow_with_custom_check_func (91ms)
  • 🤪 test_quality_checker_workflow_streaming (366ms)
  • 🤪 test_quality_checker_workflow_with_quarantine (79ms)
  • 🤪 test_col_sql_expression_long_name (78ms)
  • 🤪 test_e2e_workflow (3.632s)

Running from acceptance #2909

@mwojtyczka mwojtyczka changed the title Added validation of expression in sql expression check Added runtime validation of expression in sql expression check Oct 20, 2025
@mwojtyczka mwojtyczka requested review from alexott, Copilot and ghanse and removed request for grusin-db October 20, 2025 13:50
Copilot AI left a comment


Pull Request Overview

This PR adds runtime validation for SQL expressions in the sql_expression check to improve error handling when expressions reference non-existent columns or contain invalid syntax. The enhancement ensures that invalid SQL expressions are detected early and checks are skipped gracefully with descriptive error messages.

Key changes:

  • Implemented validation logic to detect invalid SQL expressions before execution
  • Enhanced error messages to include SQL expression validation failures
  • Updated documentation to clarify validation behavior and limitations
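The skip-gracefully behavior described above can be sketched in plain Python. The types and names below (`CheckResult`, `run_checks`) are illustrative, not DQX's API:

```python
# Illustrative sketch: evaluate each check independently and record a
# "skipped" result with a descriptive message when a check's expression
# cannot be resolved, instead of aborting the whole run.
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    status: str  # "passed", "failed", or "skipped"
    message: str = ""


def run_checks(rows: list[dict], checks: dict[str, Callable]) -> list[CheckResult]:
    results: list[CheckResult] = []
    for name, predicate in checks.items():
        try:
            ok = all(predicate(row) for row in rows)
        except KeyError as exc:  # analogous to an unresolvable column
            results.append(
                CheckResult(name, "skipped", f"invalid expression: unknown column {exc}")
            )
            continue
        results.append(CheckResult(name, "passed" if ok else "failed"))
    return results


rows = [{"id": 1}, {"id": 2}]
results = run_checks(rows, {
    "id_positive": lambda r: r["id"] > 0,
    "bad_column": lambda r: r["missing"] > 0,  # references a non-existent column
})
assert [r.status for r in results] == ["passed", "skipped"]
```

The key design point mirrored here is that one invalid check never prevents the remaining checks from producing results.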

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Files reviewed:

  • src/databricks/labs/dqx/manager.py: Added invalid_sql_expression property and validation logic for SQL expressions
  • tests/integration/test_apply_checks.py: Added test cases for SQL expression validation with missing columns
  • docs/dqx/docs/reference/quality_checks.mdx: Updated documentation for the sql_expression check to explain validation behavior
  • docs/dqx/docs/guide/quality_checks_apply.mdx: Updated guide to mention SQL expression validation alongside other field validations


@mwojtyczka mwojtyczka changed the title Added runtime validation of expression in sql expression check Added runtime validation of sql expressions Oct 20, 2025

@ghanse ghanse left a comment


LGTM

@mwojtyczka mwojtyczka merged commit 118ae92 into main Oct 21, 2025
17 of 18 checks passed
@mwojtyczka mwojtyczka deleted the sql_expr_validate branch October 21, 2025 15:42
mwojtyczka added a commit that referenced this pull request Nov 6, 2025
  • Added Data Quality Summary Metrics ([#553](#553)). The data quality engine can now track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class manages Spark observations and tracks summary metrics on datasets checked with the engine. The `DQEngine` class can optionally return the Spark observation associated with a given run, letting users access and save summary metrics. The engine also supports writing summary metrics to a table via the `metrics_config` parameter, and a new `save_summary_metrics` method saves data quality summary metrics to a table. The detailed per-row quality results now include a unique `run_id` field, enabling cross-referencing with summary metrics, and the configuration file supports the storage of summary metrics.
  • LLM assisted rules generation ([#577](#577)). Data quality rules can now be generated with the assistance of large language models (LLMs). The `DQGenerator` class includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and optionally a schema from an input table, generates data quality rules, and validates them for correctness. AI-assisted rules generation supports both programmatic and no-code approaches, allows different LLM models, and can use custom check functions. Documentation, configuration files, and the testing framework have been updated to support the feature.
  • Added Lakebase checks storage backend ([#550](#550)). Data quality rules can now be stored and managed in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution has been updated to accommodate Lakebase, supporting both table and file storage with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode.
  • Added runtime validation of sql expressions ([#625](#625)). Quality checks now verify that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Configuration validation for Unity Catalog volume file paths now enforces a specific format, preventing invalid configurations and providing more informative error messages.
  • Fixed docs ([#598](#598)). The documentation build process has been reworked for efficiency and maintainability.
  • Improved Config Serialization ([#676](#676)). Configuration loading has been refactored to use the `ConfigSerializer` class, which handles serialization and deserialization of workspace and run configurations.
  • Restore use of `hatch-fancy-pypi-readme` to fix images in PyPi ([#601](#601)). The image source path for the logo in the README has been corrected so the logo renders properly, particularly on PyPi.
  • Skip check evaluation if columns or filter cannot be resolved in the input DataFrame ([#609](#609)). DQX now skips check evaluation when columns or filters are invalid, allowing other checks to proceed even if one rule fails. The engine validates the specified `column`, `columns`, and `filter` fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
  • Updated user guide docs ([#607](#607)). The documentation for quality checking and integration options now accurately describes the supported types and approaches: quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
  • Improved build process ([#618](#618)). Hatch has been updated to 1.15.0 to avoid compatibility issues with click 8.3 and later, which introduced a bug affecting hatch. Dependencies have been updated, including bumping `databricks-labs-pytester` from 0.7.2 to 0.7.4. Integration tests now share a single Lakebase instance, with retry logic to handle cases where the workspace quota for Lakebase instances is exceeded, improving test reliability. Documentation has also been clarified on applying quality checks to data using DQX.
@mwojtyczka mwojtyczka mentioned this pull request Nov 6, 2025
mwojtyczka added a commit that referenced this pull request Nov 7, 2025