sodadata/soda-core

Soda Core

Data quality testing for SQL-, Spark-, and Pandas-accessible data.

License: Apache 2.0


Be the first to try Soda’s new AI-powered metrics observability and collaborative data contracts — all in Soda Cloud. Request a demo!


✔ An open-source CLI tool and Python library for data quality testing
✔ Compatible with the Soda Checks Language (SodaCL)
✔ Enables data quality testing both in and out of your data pipelines and development workflows
✔ Integrates with your stack so you can run a Soda scan in a data pipeline, or run programmatic scans on a time-based schedule

Soda Core is a free, open-source, command-line tool and Python library that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries.

When it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as bad-quality.
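To make the idea of aggregated SQL queries more concrete, the sketch below uses Python's built-in sqlite3 module to compute several metrics over a small table and compare them with thresholds. It is an illustration only, not the SQL that Soda Core generates: the table contents, the query text, and the reading of duplicate_count as "distinct values that appear more than once" are all assumptions made for this example.

```python
import sqlite3

# Hypothetical in-memory table standing in for a dim_customer dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (phone TEXT, birth_date TEXT, number_cars_owned INTEGER)"
)
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [
        ("555-0100", "1980-01-01", 2),
        ("555-0101", None, 1),          # missing birth_date
        ("555-0100", "1975-06-30", 9),  # duplicate phone, cars above valid max
        ("555-0102", "1990-12-12", 0),  # cars below valid min
    ],
)

# A single aggregated query computes several metrics in one pass over the table.
row_count, missing_birth_date, invalid_cars = conn.execute(
    """
    SELECT
        COUNT(*),
        SUM(CASE WHEN birth_date IS NULL THEN 1 ELSE 0 END),
        SUM(CASE WHEN number_cars_owned NOT BETWEEN 1 AND 6 THEN 1 ELSE 0 END)
    FROM dim_customer
    """
).fetchone()

# Assumption: count distinct phone values that occur more than once.
duplicate_phones = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT phone FROM dim_customer GROUP BY phone HAVING COUNT(*) > 1
    )
    """
).fetchone()[0]

# Compare each metric with the threshold written in the check.
results = {
    "row_count between 10 and 1000": 10 <= row_count <= 1000,
    "missing_count(birth_date) = 0": missing_birth_date == 0,
    "invalid_count(number_cars_owned) = 0": invalid_cars == 0,
    "duplicate_count(phone) = 0": duplicate_phones == 0,
}
for check, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {check}")
```

With this sample data every check fails, which is the situation in which a scan would surface the offending rows.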

Soda Library

Consider migrating to Soda Library, an extension of Soda Core that offers more features and functionality, and enables you to connect to a Soda Cloud account to collaborate with your team on data quality.

Install Soda Library and get started with a 45-day free trial.


Get started

Soda Core currently supports connections to several data sources. See Compatibility for a complete list.

Requirements

  • Python 3.8 or greater
  • Pip 21.0 or greater

Install and run

  1. To get started, use the install command, replacing soda-core-postgres with the package that matches your data source. See Install Soda Core for a complete list.

    pip install soda-core-postgres

  2. Prepare a configuration.yml file to connect to your data source. Then, write data quality checks in a checks.yml file. See Configure Soda Core.

  3. Run a scan, then review which checks passed, failed, or warned. See Run a Soda Core scan.

    soda scan -d your_datasource -c configuration.yml checks.yml
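The steps above can also be scripted. The following Python sketch writes a minimal configuration.yml and checks.yml, builds the same CLI command, and defines a function that runs the scan through the library's Scan class. The connection values (host, username, password, database, schema) are placeholders, and the configuration keys are an assumption about the PostgreSQL layout; confirm them against Configure Soda Core before relying on them.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Placeholder connection details (assumptions); replace with your own.
CONFIGURATION_YML = """\
data_source your_datasource:
  type: postgres
  host: localhost
  port: "5432"
  username: your_username
  password: your_password
  database: your_database
  schema: public
"""

# Two checks taken from the example checks in this README.
CHECKS_YML = """\
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
"""


def write_scan_files(directory: Path) -> Tuple[Path, Path]:
    """Write configuration.yml and checks.yml into the given directory."""
    config_path = directory / "configuration.yml"
    checks_path = directory / "checks.yml"
    config_path.write_text(CONFIGURATION_YML)
    checks_path.write_text(CHECKS_YML)
    return config_path, checks_path


def build_scan_command(data_source: str, config_path: Path, checks_path: Path) -> List[str]:
    """Return the argument list for the `soda scan` command shown above."""
    return ["soda", "scan", "-d", data_source, "-c", str(config_path), str(checks_path)]


def run_programmatic_scan(data_source: str, config_path: Path, checks_path: Path) -> int:
    """Run the scan from Python; needs a soda-core package and a reachable database."""
    from soda.scan import Scan  # imported lazily so this file loads without Soda installed

    scan = Scan()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file(file_path=str(config_path))
    scan.add_sodacl_yaml_file(str(checks_path))
    exit_code = scan.execute()
    print(scan.get_logs_text())
    return exit_code


workdir = Path(tempfile.mkdtemp())
config_path, checks_path = write_scan_files(workdir)
command = build_scan_command("your_datasource", config_path, checks_path)
print(" ".join(command))
```

Once soda-core-postgres is installed and the database is reachable, calling run_programmatic_scan("your_datasource", config_path, checks_path) should run the same scan as the CLI command and return its exit code.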

Example checks

    # Checks for basic validations
    checks for dim_customer:
      - row_count between 10 and 1000
      - missing_count(birth_date) = 0
      - invalid_percent(phone) < 1 %:
          valid format: phone number
      - invalid_count(number_cars_owned) = 0:
          valid min: 1
          valid max: 6
      - duplicate_count(phone) = 0

    # Checks for schema changes
    checks for dim_product:
      - schema:
          name: Find forbidden, missing, or wrong type
          warn:
            when required column missing: [dealer_price, list_price]
            when forbidden column present: [credit_card]
            when wrong column type:
              standard_cost: money
          fail:
            when forbidden column present: [pii*]
            when wrong column index:
              model_name: 22

      # Check for freshness
      - freshness(start_date) < 1d

    # Check for referential integrity
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name)

Documentation