

data-diff: Compare datasets fast, within or across SQL databases



Make sure to join us at our virtual hands-on lab series, where our team walks through live how to get set up with data-diff!

Use Cases

Data Migration & Replication Testing

Compare source to target and check for discrepancies when moving data between systems:

  • Migrating to a new data warehouse (e.g., Oracle > Snowflake)
  • Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
  • Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
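As a rough illustration of replication testing, the sketch below compares an orders table in a MySQL source to its Redshift replica. All connection details, table names, columns, and the filter are placeholders invented for this example, not values from this project:

# hypothetical replication check: MySQL (OLTP source) vs. Redshift (OLAP target)
data-diff \
  mysql://<user>:<password>@<mysql-host>:3306/<database> \
  orders \
  redshift://<user>:<password>@<redshift-host>:5439/<database> \
  orders \
  -k id \
  -c status \
  -w "created_at >= '2024-01-01'"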

Data Development Testing

Test SQL code and preview changes by comparing development/staging environment data to production:

  1. Make a change to some SQL code
  2. Run the SQL code to create a new dataset
  3. Compare the dataset with its production version or another iteration
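For step 3, a minimal sketch of such a comparison is shown below, assuming the development dataset lives in a separate schema of the same Snowflake account; every identifier is a placeholder:

# hypothetical dev-vs-prod comparison within one Snowflake account
data-diff \
  "snowflake://<user>:<password>@<account>/<DATABASE>/ANALYTICS?warehouse=<WAREHOUSE>&role=<ROLE>" \
  DIM_ORGS \
  "snowflake://<user>:<password>@<account>/<DATABASE>/DEV_ANALYTICS?warehouse=<WAREHOUSE>&role=<ROLE>" \
  DIM_ORGS \
  -k ORG_ID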

dbt

data-diff integrates with dbt Core to seamlessly compare local development to production datasets.
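As a rough sketch of that workflow (the --dbt flag and the required project setup are covered in the data-diff + dbt documentation; the model selector here is a placeholder):

# hypothetical dbt development loop: build the changed model, then diff dev vs. prod
dbt run --select <modified_model>
data-diff --dbt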


dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration

👀 Watch 4-min demo video

Get started with data-diff & dbt

Reach out on the dbt Slack in #tools-datafold for advice and support

How it works

When comparing the data, data-diff utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

joindiff

  • Recommended for comparing data within the same database
  • Uses the outer join operation to diff the rows as efficiently as possible within the same database
  • Fully relies on the underlying database engine for computation
  • Requires both datasets to be queryable with a single SQL query
  • Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
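A minimal sketch of running this mode is below, assuming the CLI's -a/--algorithm option to select joindiff explicitly and using placeholder PostgreSQL identifiers:

# hypothetical joindiff run: two tables in the same PostgreSQL database
data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  public.orders \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  public.orders_new \
  -a joindiff \
  -k id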

hashdiff

  • Recommended for comparing datasets across different databases
  • Can also be helpful in diffing very large tables with few expected differences within the same database
  • Employs a divide-and-conquer algorithm based on hashing and binary search
  • Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
  • Time complexity approximates COUNT(*) operation when there are few differences
  • Performance degrades when datasets have a large number of differences
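The sketch below forces hashdiff across two engines; the -a/--algorithm, --bisection-factor (segments per pass), and --bisection-threshold (segment size below which rows are downloaded and compared directly) options are assumed here, and all identifiers are placeholders:

# hypothetical cross-database hashdiff with tuned bisection parameters
data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  public.orders \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  ORDERS \
  -a hashdiff \
  -k id \
  --bisection-factor 32 \
  --bisection-threshold 100000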

More information about the algorithm and performance considerations can be found here

Get started

Validating dbt model changes between dev and prod

⚡ Looking to use data-diff in dbt development? Head over to our data-diff + dbt documentation to get started!

Compare data tables between databases

🔀 To compare data between databases, install data-diff with specific database adapters, e.g.:

pip install data-diff 'data-diff[postgresql,snowflake]' -U 
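The bracketed extras choose which database drivers get installed, so adjust them to match your source and target; for instance, a MySQL-to-DuckDB comparison might use (the extra names here are assumed to follow the same pattern as above):

pip install 'data-diff[mysql,duckdb]' -U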

Run data-diff with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>

Run data-diff with a TOML configuration file. In the following example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:

## DATABASE CONNECTION ##
[database.duckdb_connection]
driver = "duckdb"
# filepath = "datafold_demo.duckdb" # local duckdb file example
# filepath = "md:" # default motherduck connection example
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
database = "datafold_demo"

[database.snowflake_connection]
driver = "snowflake"
database = "DEV"
user = "sung"
password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
# the info below is only required for snowflake
account = "${ACCOUNT}" # by33919
schema = "DEVELOPMENT"
warehouse = "DEMO"
role = "DEMO_ROLE"

## RUN PARAMETERS ##
[run.default]
verbose = true

## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
# Source 1 ("left")
1.database = "duckdb_connection"
1.table = "development.raw_orders"
# Source 2 ("right")
2.database = "snowflake_connection"
2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive
verbose = false

# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>

# run the configured data-diff job
data-diff --conf datadiff.toml \
  --run demo_xdb_diff \
  -k "id" \
  -c status

# output example
- 1, completed
+ 1, returned

Check out the documentation for the full command reference.

Supported databases

Database Status Connection string
PostgreSQL >=10 🟢 postgresql://<user>:<password>@<host>:5432/<database>
MySQL 🟢 mysql://<user>:<password>@<hostname>:3306/<database>
Snowflake 🟢 "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"
BigQuery 🟢 bigquery://<project>/<dataset>
Redshift 🟢 redshift://<username>:<password>@<hostname>:5439/<database>
DuckDB 🟢 duckdb://<dbname>@<filepath>
MotherDuck 🟢 duckdb://<dbname>@<filepath>
Oracle 🟡 oracle://<username>:<password>@<hostname>/<service_or_sid>
Presto 🟡 presto://<username>:<password>@<hostname>:8080/<database>
Databricks 🟡 databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>
Trino 🟡 trino://<username>:<password>@<hostname>:8080/<database>
Clickhouse 🟡 clickhouse://<username>:<password>@<hostname>:9000/<database>
Vertica 🟡 vertica://<username>:<password>@<hostname>:5433/<database>
ElasticSearch 📝
Planetscale 📝
Pinot 📝
Druid 📝
Kafka 📝
SQLite 📝
  • 🟢: Implemented and thoroughly tested.
  • 🟡: Implemented, but not thoroughly tested yet.
  • ⏳: Implementation in progress.
  • 📝: Implementation planned. Contributions welcome.

Your database not listed here?


Contributors

We thank everyone who contributed so far!


Analytics


License

This project is licensed under the terms of the MIT License.