Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

Conversation

@dlawin
Copy link
Contributor

@dlawin dlawin commented Jan 5, 2023

Adds options to datadiff: data-diff --dbt and data-diff --dbt --cloud

With some suboptions:
--dbt-profiles-dir
--dbt-project-dir

The main options parse a dbt project's artifacts and profiles.yml in order to easily run multiple diffs without specifying the typical connection strings and table names.

Currently, it will run a diff for each successful model in the project's last run.

The main configuration step is to add one or both of the following to the vars section of the dbt_project.yml:

vars: data_diff: prod_database: PROD_DATABASE_NAME prod_schema: PROD_SCHEMA_NAME 

If a dbt project is organized at a schema level (e.g. all models write to a single schema), both are needed. If the dbt project writes to multiple schemas, just the database var is needed.

If you're using the --dbt-cloud option, that requires a Datafold api key on the env var DATAFOLD_API_KEY. Additionally a datasource_id in the vars section:

vars: data_diff: prod_database: PROD_DATABASE_NAME prod_schema: PROD_SCHEMA_NAME datasource_id: 123 

Caveats:

  • Limited database support (Snowflake, BigQuery so far)
  • Dbt version >= 1.0.0
    • Other versions may be fine if the format of the artifacts don't change dramatically. I have not tested < 1.0.0
  • I've only added unit tests for the most part. There are two integration tests that can be used to debug a "real" dbt diff session, but you need to set an env variable to enable them.
  • Cloud diffs just provide a link to the diff running on Datafold (no results)
  • Local diffs run sequentially, could improve performance by running in parallel
@dlawin dlawin self-assigned this Jan 5, 2023
@erezsh
Copy link
Contributor

erezsh commented Jan 5, 2023

This change of syntax is a breaking change for every user of data-diff, not to mention our own documentation and videos.

Why not add a --dbt flag instead?

@dlawin
Copy link
Contributor Author

dlawin commented Jan 5, 2023

This change of syntax is a breaking change for every user of data-diff, not to mention our own documentation and videos.

Why not add a --dbt flag instead?

Mainly to support flags for the dbt functionality. I think it would be confusing to have flags that only apply when the --dbt flag is present. This would be odd for example:
data-diff --dbt --cloud

What would the behavior of data-diff --cloud or data-diff -proj /path be?

@leoebfolsom
Copy link
Contributor

This is very minor, but we should be consistent around saying database/table 1/2 vs database/table a/b. We should use a/b instead of 1/2.

Erez will recall that I was very annoying about this when he was naming columns for materialized tables. While I don't feel strongly about 1 vs a, we settled on a, and that is reflected in this guide: https://docs.datafold.com/guides/os_diff

🙏

@erezsh
Copy link
Contributor

erezsh commented Jan 12, 2023

@dlawin

I think it would be confusing to have flags that only apply when the --dbt flag is present

We already have this kind of behavior for hashdiff vs joindiff. It might be confusing, but I don't know of an easy way to avoid it.

Also, I believe we can allow the data-diff dbt ... syntax without making any of these changes.

We already do a bit of syntax shenanigans here to allow the db table1 table2 syntax. It shouldn't be hard to tweak it, so it tests if database1 == "dbt" and acts accordingly.

The last alternative is to create a new entry-point, like data-diff-dbt ..., which I think is a bit of an overkill, but might be a bit less confusing.

Either way, breaking the API is a definite no-no.

@dlawin
Copy link
Contributor Author

dlawin commented Jan 26, 2023

@dlawin

I think it would be confusing to have flags that only apply when the --dbt flag is present

We already have this kind of behavior for hashdiff vs joindiff. It might be confusing, but I don't know of an easy way to avoid it.

Also, I believe we can allow the data-diff dbt ... syntax without making any of these changes.

We already do a bit of syntax shenanigans here to allow the db table1 table2 syntax. It shouldn't be hard to tweak it, so it tests if database1 == "dbt" and acts accordingly.

The last alternative is to create a new entry-point, like data-diff-dbt ..., which I think is a bit of an overkill, but might be a bit less confusing.

Either way, breaking the API is a definite no-no.

Switched to option flags, dbt specific functionality is prefixed by --dbt-

@dlawin
Copy link
Contributor Author

dlawin commented Jan 26, 2023

I did move some of the option functionality up to the main method when I thought it made sense (--debug, --no-tracking, etc.)

For now, the majority of the other options are ignored when using --dbt or --dbt --cloud, but we can add that over time

@dlawin dlawin force-pushed the dbt branch 3 times, most recently from 4ce9f39 to 27e0a87 Compare January 27, 2023 18:00
@williebsweet williebsweet requested a review from nolar January 27, 2023 18:43
@dlawin
Copy link
Contributor Author

dlawin commented Feb 1, 2023

tested compatibility with dbt:
1.0
1.1
1.2
1.3
1.4

I didn't cover every adapter (just snowflake), but that should not affect the dbt-core artifacts being parsed.

@dlawin
Copy link
Contributor Author

dlawin commented Feb 1, 2023

Hey @williebsweet @nolar, I think I've addressed all of the feedback. Re-requested review

@dlawin dlawin requested a review from erezsh February 3, 2023 18:22
Copy link
Contributor

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough for me

@erezsh erezsh merged commit 13f9beb into datafold:master Feb 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

5 participants