This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit 53b6a3c

Merge pull request #241 from leoebfolsom/readme-ideas
README updates for coalesce and pre release
2 parents 83d9c51 + d45eb80 commit 53b6a3c

2 files changed: +289 −331 lines


CONTRIBUTING.md

Lines changed: 83 additions & 0 deletions
@@ -79,3 +79,86 @@ New databases should be added as a new module in the `data-diff/databases/` folder

If possible, please also add the database setup to `docker-compose.yml`, so that we can run and test it for ourselves. If you do, also update the CI (`ci.yml`).

Guide to implementing a new database driver: https://data-diff.readthedocs.io/en/latest/new-database-driver-guide.html

## Development Setup

The development setup centers around using `docker-compose` to boot up various databases, and then inserting data into them.

On macOS, for better Docker performance, we suggest enabling the following in the Docker Desktop UI:

* Use the new Virtualization framework
* Enable VirtioFS accelerated directory sharing

**1. Install Data Diff**

When developing or debugging, it's recommended to install the dependencies and run the tool directly with `poetry`, rather than going through the installed package.

```shell-session
$ brew install mysql postgresql                 # macOS dependencies for C bindings
$ apt-get install libpq-dev libmysqlclient-dev  # Debian dependencies
$ pip install poetry                            # Python dependency isolation tool
$ poetry install                                # Install dependencies
```

**2. Start Databases**

[Install **docker-compose**][docker-compose] if you haven't already.

```shell-session
$ docker-compose up -d mysql postgres  # run the MySQL and PostgreSQL databases in the background
```

[docker-compose]: https://docs.docker.com/compose/install/
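
Before seeding or running tests, it can help to confirm that both containers started cleanly. A quick check using docker-compose's built-in subcommands (assuming the service names used above):

```shell-session
$ docker-compose ps          # both services should report a running state
$ docker-compose logs mysql  # inspect the logs if a container failed to start
```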

**3. Run Unit Tests**

There are more than 1,000 tests for all the different type and database combinations, so we recommend using `unittest-parallel`, which is installed as a development dependency.

```shell-session
$ poetry run unittest-parallel -j 16       # run all tests
$ poetry run python -m unittest -k <test>  # run an individual test
```

**4. Seed the Database(s) (optional)**

First, download the CSVs of seeding data:

```shell-session
$ curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv

# For a larger data-set (but one that takes 25x longer to import):
# curl https://datafold-public.s3.us-west-2.amazonaws.com/25m.csv -o dev/ratings.csv
```

Now you can insert it into the testing database(s):

```shell-session
# Seeding more than one database is optional; it lets you run data-diff(1) across them.
$ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql
$ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres

# Cloud databases
$ poetry run preql -f dev/prepare_db.pql snowflake://<uri>
$ poetry run preql -f dev/prepare_db.pql mssql://<uri>
$ poetry run preql -f dev/prepare_db.pql bigquery:///<project>
```
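
If you want to sanity-check the seeding, you can query a seeded table directly. A possible spot-check, assuming the compose service is named `postgres` (as in step 2) and that `dev/prepare_db.pql` creates the `rating` table used in the next step:

```shell-session
$ docker-compose exec postgres psql -U postgres -c 'SELECT COUNT(*) FROM rating;'
```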

**5. Run `data-diff` against the seeded database (optional)**

```shell-session
$ poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose
```

**6. Run Benchmarks (optional)**

```shell-session
$ dev/benchmark.sh                 # runs the benchmarks and puts the results in benchmark_<sha>.csv
$ poetry run python3 dev/graph.py  # create graphs from the benchmark_*.csv files
```

You can adjust how many rows to benchmark by passing `N_SAMPLES` to `dev/benchmark.sh`:

```shell-session
$ N_SAMPLES=100000000 dev/benchmark.sh  # 100m rows, which is our canonical target
```
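
To compare several data-set sizes, one convenient pattern is to sweep `N_SAMPLES` in a loop and then graph the resulting CSVs with `dev/graph.py` (this loop is our suggestion, not part of the repo's scripts):

```shell-session
$ for n in 1000000 10000000 100000000; do N_SAMPLES=$n dev/benchmark.sh; done
```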
