Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

Data-diff fails for Motherduck connections when using multiple databases as environments #858

@rob-teeuwen

Description

@rob-teeuwen

Describe the bug
When connecting to Motherduck, query_table_schema() throws an assertion error if the table path exists in more than 1 database.

I have a motherduck project with multiple databases, 1 per environment. I use these environments to materialize dbt models. In this example, I have a schema called dbt_marts with a table called orders. This exists in 3 databases, namely production, development, and local_test. The schema in all 3 tables is exactly the same.

I came across this bug when experimenting with data-diff.

data-diff-config.toml:

[database.motherduck] driver = "duckdb" filepath = "md:production?motherduck_token=${MOTHERDUCK_TOKEN}" [run.test] # Source 1 ("left") 1.database = "motherduck" 1.table = "dbt_marts.orders" # Source 1 ("right") 2.database = "motherduck" 2.table = "dbt_marts.orders" 

then I ran the command: data-diff --conf data-diff-config.toml --run test2

which throws:

Traceback (most recent call last): File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/bin/data-diff", line 8, in <module> sys.exit(main()) ^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 344, in main return _data_diff( ^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 482, in _data_diff schemas = list(differ._thread_map(_get_schema, safezip(dbs, table_paths))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator yield _result_or_cancel(fs.pop()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel return fut.result(timeout) ^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 449, in result return self.__get_result() ^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 77, in _get_schema return db.query_table_schema(table_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/databases/base.py", line 1042, in query_table_schema assert len(d) == len(rows) AssertionError 

I traced it back to this line. It takes the output of the information_schema query over here and converts it to a dictionary. This line verifies that the number of rows in the dictionary (representing columns in the schema) is the same as the number of rows in the original result.

This fails in my case, because the information_schema returns the columns from all that meet WHERE table_name = '{name}' AND table_schema = '{schema}', regardless of the database in which they're in. During the dictionary conversion however, the set of rows is reduced to unique ones, so the assertion fails.

Not sure how this should be fixed. Not sure if it's expected that the information_schema returns columns from tables in other databases than the one specified in the connection parameters. One way to avoid this error is modifying the where clause as follows:

WHERE table_name = '{name}' AND table_schema = '{schema}' AND table_catalog = '[database-name]'

Describe the environment

Running MacOS with data-diff version v0.10.1 and duckdb 0.9.2

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingtriage

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions