Name	Name	Last commit message	Last commit date
Latest commit History 714 Commits
.github	.github
data_diff	data_diff
dev	dev
docs	docs
tests	tests
.dockerignore	.dockerignore
.editorconfig	.editorconfig
.gitignore	.gitignore
CODE_OF_CONDUCT.md	CODE_OF_CONDUCT.md
CONTRIBUTING.md	CONTRIBUTING.md
Dockerfile	Dockerfile
LICENSE	LICENSE
README.md	README.md
debug.py	debug.py
docker-compose.yml	docker-compose.yml
poetry.lock	poetry.lock
pyproject.toml	pyproject.toml
readthedocs.yml	readthedocs.yml

data-diff

🐞Bugs? 💡Issues?
- Please open an issue!
💬 Prefer to chat live?
- Find us in #tools-data-diff in the Locally Optimistic Slack or
- Please reach out to the product team share any product feedback or feature requests!
💸💸 Looking for paid contributors! 💸💸
- We're looking for developers with a deep understanding of databases and solid Python knowledge. Apply here!

data-diff enables data professionals to detect differences in values between any two tables. It's fast, easy to use, and reliable. Even at massive scale.

But first, what's a diff?

Diffing compares two files and tells you how they're different. It's there for you when your test suite doesn't cover every edge case.

You already know about git diff, which helps code reviewers see what's changed. A data-diff does the same thing, but between two data tables.

This is a `git diff`, as you know.

This is a `data-diff`! Real code and output examples below.

Cool, but I already have a test suite.

You sure do! And that will catch any errors that you've written a test for. data-diff is different because it will alert you to any changes in the data.

For example, if you write a test to check if a primary key is unique and not_null, that won't catch if your code change or data migration has caused certain primary keys to disappear💨, or to be slightly altered😱. That's where data-diff comes in.

`data-diff` is useful for:

Validation of data integrity in replication, migration, and pipelines
Comparing tables within one database to validate successful transformations
Searching for changes between two data sets in any context

Oh, you'd like a list of amazing things `data-diff` can do?

⇄ Verifies across many different databases (e.g. PostgreSQL ⇄ Snowflake) or within a database
🔍 Outputs diff of rows in detail
🚨 Simple CLI/API to create monitoring and alerts
🔁 Bridges column types of different formats and levels of precision (e.g. Double ⇆ Float ⇆ Decimal)
🔥 Verify 25M+ rows in <10s, and 1B+ rows in ~5min
♾️ Works for tables with 10s of billions of rows

Getting started

Install `data-diff` and database-specific drivers

First, install data-diff using pip.

pip install data-diff

Note: Once you've installed Python 3.7+, it's most likely that pip and pip3 can be used interchangeably.

Then, install one or more driver(s) specific to the database(s) you want to connect to.

pip install 'data-diff[mysql]'
pip install 'data-diff[postgresql]'
pip install 'data-diff[snowflake]'
pip install 'data-diff[presto]'
pip install 'data-diff[oracle]'
pip install 'data-diff[trino]'
pip install 'data-diff[clickhouse]'
pip install 'data-diff[vertica]'
For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/

Run your first diff

Once you've installed data-diff, you can run it from the command line:

data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]

We've included examples here for PostgreSQL and Snowflake. Additional database configurations and examples are available in the (docs TODO.)[link]

Comparing the same table in Snowflake vs Postgres

Here's an example comparing two versions of a table with 829,615 rows in two different databases. The code here has <> carrots 🥕 around variables in place of content that you will replace with your own information.

The -k flag is used to specify a primary key. Otherwise, data-diff will assume the primary key is named id.

$ data-diff \ postgresql://<YOUR_USERNAME>:<your_postgres_password>@<your_hostname>:5432/<your_database_name> <table_2_name> \ "snowflake://<YOUR_USERNAME>:<your_snowflake_password>@<YOUR_ACCOUNT>/SNOWFLAKE_DB/<YOUR_DATABASE>?warehouse=<YOUR_WAREHOUSE>&role=<YOUR_ROLE>" <TABLE_1_NAME> \ -k <primary_key>

Here's what the command looks like when you replace the carrots with real values and see the results:

We see that is one primary key that exists in the origin Postgres database, but is missing from the destination Snowflake warehouse.

If you want to see a summary of the results, you can use the -s flag. Here's what that looks like:

$ data-diff \ postgresql://<YOUR_USERNAME>:<your_postgres_password>@<your_hostname>:5432/<your_database_name> <table_2_name> \ "snowflake://<YOUR_USERNAME>:<your_snowflake_password>@<YOUR_ACCOUNT>/SNOWFLAKE_DB/<YOUR_DATABASE>?warehouse=<YOUR_WAREHOUSE>&role=<YOUR_ROLE>" <TABLE_1_NAME> \ -k <primary_key> -s

With the -c flag, you can specify additional columns to analyze for differences between the two tables.

$ data-diff \ postgresql://<YOUR_USERNAME>:<your_postgres_password>@<your_hostname>:5432/<your_database_name> <table_2_name> \ "snowflake://<YOUR_USERNAME>:<your_snowflake_password>@<YOUR_ACCOUNT>/SNOWFLAKE_DB/<YOUR_DATABASE>?warehouse=<YOUR_WAREHOUSE>&role=<YOUR_ROLE>" <TABLE_1_NAME> \ -k <primary_key> -c activity

We see that one row has a conflicting value in the activity column. Further investigation of this row revealed that there is extra trailing whitespace in the origin Postgres database.

Comparing tables within a database

In this example, we'll run a similar command comparing two tables within Snowflake. This could help you out when reviewing a PR and comparing the development vs production version of a table.

$ data-diff \ "snowflake://<YOUR_USERNAME>:<your_snowflake_password>@<YOUR_ACCOUNT>/SNOWFLAKE_DB/<YOUR_DATABASE>?warehouse=<YOUR_WAREHOUSE>&role=<YOUR_ROLE>" <TABLE_1_NAME> \ "snowflake://<YOUR_USERNAME>:<your_snowflake_password>@<YOUR_ACCOUNT>/SNOWFLAKE_DB/<YOUR_DATABASE>?warehouse=<YOUR_WAREHOUSE>&role=<YOUR_ROLE>" <TABLE_2_NAME> \ -k <primary_key> \ -s

We see that 7 primary keys exist only in the ANALYTICS_DEV schema, and 23 primary keys exist only in the ANALYTICS schema. Depending on the nature of the PR, this could be expected, or it could be a red flag indicating there's something wrong with the updated code.

[TODO want to learn more? Dive into the Datafold Documentation (link).]

Usage Analytics & Data Privacy

data-diff collects anonymous usage data to help our team improve the tool and to apply development efforts to where our users need them most. [TODO Read more about this and how to opt out in the documentation (link).]

Details for Developers and Contributors

The development setup centers around using docker-compose to boot up various databases, and then inserting data into them.

For Mac for performance of Docker, we suggest enabling in the UI:

Use new Virtualization Framework
Enable VirtioFS accelerated directory sharing

1. Install Data Diff

When developing/debugging, it's recommended to install dependencies and run it directly with poetry rather than go through the package.

$ brew install mysql postgresql # MacOS dependencies for C bindings $ apt-get install libpq-dev libmysqlclient-dev # Debian dependencies $ pip3 install poetry # Python dependency isolation tool $ poetry install # Install dependencies

2. Start Databases

Install docker-compose if you haven't already.

$ docker-compose up -d mysql postgres # run mysql and postgres dbs in background

3. Run Unit Tests

There are more than 1000 tests for all the different type and database combinations, so we recommend using unittest-parallel that's installed as a development dependency.

$ poetry run unittest-parallel -j 16 # run all tests $ poetry run python -m unittest -k <test> # run individual test

4. Seed the Database(s) (optional)

First, download the CSVs of seeding data:

$ curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv # For a larger data-set (but takes 25x longer to import): # - curl https://datafold-public.s3.us-west-2.amazonaws.com/25m.csv -o dev/ratings.csv

Now you can insert it into the testing database(s):

# It's optional to seed more than one to run data-diff(1) against. $ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql $ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres # Cloud databases $ poetry run preql -f dev/prepare_db.pql snowflake://<uri> $ poetry run preql -f dev/prepare_db.pql mssql://<uri> $ poetry run preql -f dev/prepare_db.pql bigquery:///<project>

5. Run data-diff against seeded database (optional)

poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose

6. Run benchmarks (optional)

$ dev/benchmark.sh # runs benchmarks and puts results in benchmark_<sha>.csv $ poetry run python3 dev/graph.py # create graphs from benchmark_*.csv files

You can adjust how many rows we benchmark with by passing N_SAMPLES to dev/benchmark.sh:

$ N_SAMPLES=100000000 dev/benchmark.sh # 100m which is our canonical target

License

MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

data-diff

But first, what's a diff?

This is a `git diff`, as you know.

This is a `data-diff`! Real code and output examples below.

Cool, but I already have a test suite.

`data-diff` is useful for:

Oh, you'd like a list of amazing things `data-diff` can do?

Getting started

Install `data-diff` and database-specific drivers

Run your first diff

Comparing the same table in Snowflake vs Postgres

Comparing tables within a database

Usage Analytics & Data Privacy

Details for Developers and Contributors

License

About

Uh oh!

Releases 63

Uh oh!

Contributors 48

Languages

License

datafold/data-diff

Folders and files

Latest commit

History

Repository files navigation

data-diff

But first, what's a diff?

This is a git diff, as you know.

This is a data-diff! Real code and output examples below.

Cool, but I already have a test suite.

data-diff is useful for:

Oh, you'd like a list of amazing things data-diff can do?

Getting started

Install data-diff and database-specific drivers

Run your first diff

Comparing the same table in Snowflake vs Postgres

Comparing tables within a database

Usage Analytics & Data Privacy

Details for Developers and Contributors

License

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 63

Uh oh!

Contributors 48

Languages

This is a `git diff`, as you know.

This is a `data-diff`! Real code and output examples below.

`data-diff` is useful for:

Oh, you'd like a list of amazing things `data-diff` can do?

Install `data-diff` and database-specific drivers