This repository was archived by the owner on May 17, 2024. It is now read-only.
Merged
85 commits
a9fac08
move technical comments to technical explanation section
leoebfolsom Sep 27, 2022
1d2c6f7
Updating Leo's fork to reflect recent changes to datafold master branch.
leoebfolsom Sep 27, 2022
d350522
reorganize table of contents a bit
leoebfolsom Sep 27, 2022
1eb566a
a bit of formatting and wordsmithing
leoebfolsom Sep 27, 2022
8b7f10b
rephrase use case - debugging complex data pipelines
leoebfolsom Sep 27, 2022
12e109b
pare down text in header, add emoji
leoebfolsom Oct 7, 2022
aba16b1
features and common use cases to be rearranged WIP
leoebfolsom Oct 7, 2022
f02e9e6
Merge branch 'master' into readme-ideas
Oct 7, 2022
69470c0
clean up WIP notes
leoebfolsom Oct 7, 2022
4a5dcb3
update pip to pip3
leoebfolsom Oct 7, 2022
b901767
numerous formatting, wording, structure updates
leoebfolsom Oct 8, 2022
bbec148
note to clarify pip v pip3
leoebfolsom Oct 8, 2022
cc23c30
add a table formatting option for intro text
leoebfolsom Oct 8, 2022
2624c98
tidying up pip and pip3
leoebfolsom Oct 10, 2022
45dc579
add postgres example from dvdrental dataset
leoebfolsom Oct 10, 2022
63fe65e
remove leo-specific info from dvdrental example
leoebfolsom Oct 10, 2022
545db75
Add structure/outline for various db examples
leoebfolsom Oct 10, 2022
9b2b9eb
restructure examples
leoebfolsom Oct 10, 2022
1e0795a
minor annotation
leoebfolsom Oct 10, 2022
f92e75e
resolve merge conflicts
leoebfolsom Oct 11, 2022
e19db01
slightly improve formatting of markdown table
leoebfolsom Oct 11, 2022
580bb0f
put emoji back for bugs and issues
leoebfolsom Oct 11, 2022
7e440e3
refocus on migrations
leoebfolsom Oct 11, 2022
59d8e3e
tidy up db uri table
leoebfolsom Oct 11, 2022
db6ac0e
clarify sso vs pw auth for snowflake
leoebfolsom Oct 11, 2022
d32e752
create snowflake cli example
leoebfolsom Oct 11, 2022
e086647
capitalization in snowflake uri example
leoebfolsom Oct 11, 2022
9e31855
rewrite snowflake cli example for clarity
leoebfolsom Oct 11, 2022
0acb8df
lite formatting
leoebfolsom Oct 11, 2022
56de02f
moved options
leoebfolsom Oct 12, 2022
33f2a3b
remove large sections and restructre examples
leoebfolsom Oct 12, 2022
c70169e
remove table at the top
leoebfolsom Oct 12, 2022
aa2596f
restructure the examples and sections
leoebfolsom Oct 12, 2022
f96e6ce
jai feedback
leoebfolsom Oct 12, 2022
78f01fc
add vertical white space
leoebfolsom Oct 12, 2022
6ffd598
fix typo and white space
leoebfolsom Oct 12, 2022
e83ec7f
updated the image to include logos
leoebfolsom Oct 12, 2022
dce6097
make a space for the screenshot example
leoebfolsom Oct 13, 2022
a9fed7d
overhaul
leoebfolsom Oct 13, 2022
7f9ff3d
typo
leoebfolsom Oct 13, 2022
e81f8e9
rephrase capabilities and add vertical whitespace
leoebfolsom Oct 13, 2022
5f145ab
add images
leoebfolsom Oct 13, 2022
5cd4c99
concision
leoebfolsom Oct 13, 2022
2fa1d98
tearing it all down
leoebfolsom Oct 14, 2022
9da9349
simplify driver installation explanation
leoebfolsom Oct 14, 2022
6099651
small update
leoebfolsom Oct 15, 2022
092ad56
add loom and screenshot to readme
leoebfolsom Oct 15, 2022
bfb5944
target blank
leoebfolsom Oct 15, 2022
10d596d
typo
leoebfolsom Oct 15, 2022
82f7859
refresh example images
leoebfolsom Oct 17, 2022
f0bf0ea
better scaled image
leoebfolsom Oct 17, 2022
15a9b9b
Merge branch 'master' into readme-ideas
leoebfolsom Oct 17, 2022
1cc8363
add development details to contributing.md
leoebfolsom Oct 17, 2022
0f125cc
updated image
leoebfolsom Oct 18, 2022
866e59a
update with TODOs around links
leoebfolsom Oct 18, 2022
1270be7
add todo around code example
leoebfolsom Oct 18, 2022
ccf6f40
free and open source
leoebfolsom Oct 18, 2022
dd817f7
update we're hiring
leoebfolsom Oct 18, 2022
c999f30
update documentation and hiring links
leoebfolsom Oct 18, 2022
c3549b3
clean up images
leoebfolsom Oct 18, 2022
421ea2c
two user persona version of readme
leoebfolsom Oct 18, 2022
9938138
remove todo
leoebfolsom Oct 18, 2022
265db4e
here to help
leoebfolsom Oct 18, 2022
61698cc
typo
leoebfolsom Oct 18, 2022
cd311fc
version without python section
leoebfolsom Oct 18, 2022
18fd58b
version with information in CONTRIBUTING.md
leoebfolsom Oct 18, 2022
c981a86
gleb feedback
leoebfolsom Oct 18, 2022
3c446a9
update explainer image
leoebfolsom Oct 18, 2022
4e2d12d
update url
leoebfolsom Oct 18, 2022
2c92021
try to add iframe
leoebfolsom Oct 19, 2022
eb7f026
update video
leoebfolsom Oct 19, 2022
613bbb2
add details that had been moved to docusaurus
leoebfolsom Oct 19, 2022
735b7d2
Merge branch 'readme-ideas' of https://github.com/leoebfolsom/data-di…
leoebfolsom Oct 19, 2022
a680676
add information from docusaurus to readme
leoebfolsom Oct 19, 2022
dacb25e
update image, include How to Use table of contents
leoebfolsom Oct 19, 2022
a554a96
add notes about beta
leoebfolsom Oct 19, 2022
782a3a9
change beta to pre
leoebfolsom Oct 19, 2022
55fe3ca
give config file its own heading
leoebfolsom Oct 19, 2022
e752756
update contributing.md to not link to docs
leoebfolsom Oct 19, 2022
bd2764e
update Contributing.md
leoebfolsom Oct 19, 2022
9b1eb20
add technical explanation and benchmarking
leoebfolsom Oct 19, 2022
9712d6c
re-add technical explanation and benchmarking
leoebfolsom Oct 19, 2022
7cb4ca6
change 'diff' to 'diff tables' in headers
leoebfolsom Oct 19, 2022
9923f8e
again, diff to diff tables
leoebfolsom Oct 19, 2022
d45eb80
Merge branch 'master' into readme-ideas
leoebfolsom Oct 19, 2022
83 changes: 83 additions & 0 deletions CONTRIBUTING.md
@@ -79,3 +79,86 @@ New databases should be added as a new module in the `data-diff/databases/` fold
If possible, please also add the database setup to `docker-compose.yml`, so that we can run and test it for ourselves. If you do, also update the CI (`ci.yml`).

Guide to implementing a new database driver: https://data-diff.readthedocs.io/en/latest/new-database-driver-guide.html

## Development Setup

The development setup centers around using `docker-compose` to boot up various
databases, and then inserting data into them.

On macOS, for better Docker performance, we suggest enabling the following in the Docker UI:

* Use new Virtualization Framework
* Enable VirtioFS accelerated directory sharing

**1. Install Data Diff**

When developing or debugging, we recommend installing the dependencies and
running data-diff directly with `poetry` rather than through the installed package.

```shell-session
$ brew install mysql postgresql # macOS dependencies for C bindings
$ apt-get install libpq-dev libmysqlclient-dev # Debian dependencies
$ pip install poetry # Python dependency isolation tool
$ poetry install # Install dependencies
```

**2. Start Databases**

[Install **docker-compose**][docker-compose] if you haven't already.

```shell-session
$ docker-compose up -d mysql postgres # run mysql and postgres dbs in background
```

[docker-compose]: https://docs.docker.com/compose/install/

**3. Run Unit Tests**

There are more than 1000 tests covering the different type and database
combinations, so we recommend running them with `unittest-parallel`, which is
installed as a development dependency.

```shell-session
$ poetry run unittest-parallel -j 16 # run all tests
$ poetry run python -m unittest -k <test> # run individual test
```
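
The `-j 16` above is a fixed worker count. As a minimal sketch, assuming you would rather match the number of CPUs on your machine, the count can be derived with `getconf`. The `JOBS` variable name and the fallback of 4 are illustrative choices, not project conventions, and the command is only printed here rather than executed:

```shell
# Number of online CPUs; fall back to 4 (an arbitrary default) if getconf cannot report it.
JOBS="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)"

# Print the test command instead of running it; drop the leading `echo` to execute it.
echo poetry run unittest-parallel -j "$JOBS"
```

More workers are not always faster, since the tests also compete for the local database containers, so treat the CPU count as a starting point.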

**4. Seed the Database(s) (optional)**

First, download the CSVs of seeding data:

```shell-session
$ curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv
# For a larger dataset (takes about 25x longer to import):
# - curl https://datafold-public.s3.us-west-2.amazonaws.com/25m.csv -o dev/ratings.csv
```

Now you can insert it into the testing database(s):

```shell-session
# Seed at least one database; seeding more than one to run data-diff(1) against is optional.
$ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql
$ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres
# Cloud databases
$ poetry run preql -f dev/prepare_db.pql snowflake://<uri>
$ poetry run preql -f dev/prepare_db.pql mssql://<uri>
$ poetry run preql -f dev/prepare_db.pql bigquery:///<project>
```

**5. Run data-diff against the seeded database (optional)**

```bash
poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose
```
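
The command above repeats the same connection URI twice, which makes it hard to read. A minimal sketch, assuming the local Postgres credentials used in this guide, builds the URI once from its parts. The `DB_*` variable names are hypothetical, and the final command is only printed so the snippet is safe to paste:

```shell
# Connection parts for the local Postgres container (values taken from the command above).
DB_USER=postgres
DB_PASS=Password1
DB_HOST=localhost
DB_NAME=postgres

# Assemble the URI: scheme://user:password@host/database
DB_URI="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}/${DB_NAME}"

# Print the data-diff invocation; drop the leading `echo` to run it.
echo poetry run python3 -m data_diff "$DB_URI" rating "$DB_URI" rating_del1 --verbose
```

Both sides of the diff use the same database here because `rating` and `rating_del1` live side by side in the seeded database.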

**6. Run benchmarks (optional)**

```shell-session
$ dev/benchmark.sh # runs benchmarks and puts results in benchmark_<sha>.csv
$ poetry run python3 dev/graph.py # create graphs from benchmark_*.csv files
```

You can adjust the number of rows benchmarked by setting the `N_SAMPLES` environment variable when running `dev/benchmark.sh`:

```shell-session
$ N_SAMPLES=100000000 dev/benchmark.sh # 100M rows, our canonical target
```
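
Writing `N_SAMPLES=... command` sets the variable for that one command only. The sketch below illustrates this with a hypothetical one-line stand-in script; it is not the contents of `dev/benchmark.sh`, and the default of 1000000 is a placeholder rather than the script's real default:

```shell
# Hypothetical stand-in for a script that reads N_SAMPLES (NOT the real dev/benchmark.sh).
# It falls back to a placeholder default when the variable is unset.
script='echo "rows: ${N_SAMPLES:-1000000}"'

# Run once with whatever the environment provides, once with a per-command override.
default_out="$(sh -c "$script")"
override_out="$(N_SAMPLES=100000000 sh -c "$script")"

echo "$default_out"
echo "$override_out"
```

Because the override applies only to the command it prefixes, later benchmark runs in the same shell are unaffected unless you `export` the variable.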