Skip to content

Performance issue with large tables #28

@ramyamasani

Description

@ramyamasani

Noticed data-diff taking long time for large data sets.

Ex: We have a table where we are diffing 2 days worth of data consisting of 6500577 records and data-diff took about 8398 secs to complete.

Expected: As per this technical doc a table with 100M rows will finish in 20 mins so a table with 6M rows should finish in less than 2 mins.
Actual: In the above case it is taking 8398 secs to finish.

Here are the data-diff parameters that we used.

import data_diff segment_kwargs = { "key_columns": table_info['primary_keys'], "extra_columns": table_info['extra_columns'], "table_path": (table_info["table_schema"], table_info["table_name"]), } if table_info['bookmark_column']: segment_kwargs['update_column'] = table_info['bookmark_column'] segment_kwargs['min_update'] = table_info['min_update'] segment_kwargs['max_update'] = table_info['max_update'] diff_mysql = connect( db_conf={ "database": table_info['table_schema'], "driver": "mysql", "host": table_info['hostname'], "port": 3306, "password": mysql_password, "username": mysql_user, } ) diff_bigquery = connect( db_conf={ "driver": "bigquery", "bigquery_credentials": bq_credentials, "project": bq_project, "dataset": table_info['table_schema'], } ) for difference in data_diff.diff_tables( table1=table_segment.TableSegment(diff_mysql, **segment_kwargs), table2=table_segment.TableSegment(diff_bigquery, **segment_kwargs), algorithm=Algorithm.HASHDIFF, threaded=False, auto_bisection_factor=True, ): 

Table details:
Total rows to process between min and max bookmark_column's timestamp : 6500577

Table: {'hostname': 'xxxxxxxx.com', 'table_schema': 'aaaaaa', 'table_name': 'xxxxreport', 'bookmark_column': 'bbbb_timestamp', 'primary_keys': ('idxxxxxreport_int',), 'extra_columns': ('id1_int', 'id2_int', 'id3_int', 'metric_value'), 'min_update': '2025-05-07 23:24:37', 'max_update': '2025-05-09 23:24:37'} 

Metadata

Metadata

Assignees

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions