Merge sample datasets, examples with docs into dev #48

lrog · 2018-08-08T09:50:16Z

Created a set of examples for running CogStack out of the box; they can be reused as a starting point for future deployments.

The datasets (in examples/rawdata) include:

syn -- a synthetic structured patient dataset generated using [Synthea](https://github.com/synthetichealth/synthea).
mt -- free-text medical reports from MTSamples.

The CogStack pipeline examples include:

1. Running CogStack on a subset of the syn dataset, using a single DB source and a single CogStack engine instance, and storing the processed records in ElasticSearch.
2. Example 1 extended with the parsed mt dataset. Used as the example in the CogStack quickstart.
3. Running CogStack on the full syn dataset together with mt, with the datasets stored in separate DBs. Uses multiple DBs as sources, multiple CogStack engine instances and a single ES sink.
4. Running CogStack with Apache Tika to process (parsing + OCR) mt documents embedded with syn records in the DB. Covers the DOCX, PDF and JPG document formats.
5. Running a 2-step CogStack data processing pipeline: first processing the mt documents using Tika, then ingesting the syn records enriched with the parsed documents into the ES sink.
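To make the idea of the 2-step pipeline concrete, here is a minimal in-memory sketch in Python. It is not CogStack code: the `Encounter` class, the field names and both functions are hypothetical, and the Tika step is replaced by a stub that only decodes bytes. It only illustrates the order of operations: parse the documents first, then enrich the structured records with the parsed text before they are sent to the sink.

```python
from dataclasses import dataclass


@dataclass
class Encounter:
    """Hypothetical stand-in for one structured syn record."""
    encounter_id: int
    patient_id: int
    document_id: int


def parse_documents(raw_docs):
    """Step 1: turn raw binary documents into plain text.

    A real deployment would call Tika here; this stub only decodes bytes.
    """
    return {
        doc_id: blob.decode("utf-8", errors="replace").strip()
        for doc_id, blob in raw_docs.items()
    }


def enrich_records(encounters, parsed_docs):
    """Step 2: attach the parsed text to each structured record."""
    return [
        {
            "encounter_id": e.encounter_id,
            "patient_id": e.patient_id,
            # Records without a parsed document get an empty text field.
            "document_text": parsed_docs.get(e.document_id, ""),
        }
        for e in encounters
    ]


# Made-up sample data, for illustration only.
raw_docs = {10: b"  Chest pain, resolved.  ", 11: b"Follow-up in 2 weeks."}
encounters = [Encounter(1, 100, 10), Encounter(2, 101, 11), Encounter(3, 102, 99)]

parsed = parse_documents(raw_docs)
sink_ready = enrich_records(encounters, parsed)
```

In this sketch `sink_ready` is the list of enriched records that a second step would index into the ES sink.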

All the examples can be deployed using Docker Compose; the YAML configuration files are provided in examples/example*/docker. They are based on /docker-cogstack/compose-ymls/cogstack-clust/docker-compose.yml, but use only a single ES node with X-Pack security disabled.
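As a rough illustration of how one of these examples might be started, the Python sketch below builds the `docker-compose` command for a given example directory. The helper names are hypothetical, and the compose file name `docker-compose.yml` inside each `docker` directory is an assumption; check the actual file names in examples/example*/docker before using it.

```python
from pathlib import Path


def compose_up_command(example_dir, compose_name="docker-compose.yml"):
    """Build the `docker-compose up` argument list for one example.

    Assumes the compose file lives in <example_dir>/docker/, following the
    examples/example*/docker layout; the file name itself is an assumption.
    """
    compose_file = Path(example_dir) / "docker" / compose_name
    return ["docker-compose", "-f", compose_file.as_posix(), "up", "-d"]


def list_example_dirs(examples_root):
    """Return the example directories (example1, example2, ...) in sorted order."""
    return sorted(p for p in Path(examples_root).glob("example*") if p.is_dir())


cmd = compose_up_command("examples/example2")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the containers in the background.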

When the sample databases are deployed, pre-generated DB dumps are loaded during container initialisation. These DB dumps can be either:

downloaded from Amazon S3 cogstack bucket (examples/download_db_dumps.sh)
generated locally (examples/prepare_docs.sh, examples/prepare_db_dumps.sh) from the provided raw datasets, using the predefined DB SQL schemas available in each example's directory (examples/example*/extra)
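The two options above can be summarised in a short Python sketch that returns which scripts to run. The script names are taken from the description; the function name, the assumption that the scripts take no arguments, and the assumption that prepare_docs.sh has to run before prepare_db_dumps.sh are mine and should be checked against the scripts themselves.

```python
from pathlib import Path


def db_dump_steps(examples_root, download=True):
    """Return the scripts to run, in order, to obtain the DB dumps.

    Script names come from the PR description; running them without
    arguments and in this order is an assumption.
    """
    root = Path(examples_root)
    if download:
        # Option 1: fetch the pre-generated dumps from the S3 bucket.
        scripts = ["download_db_dumps.sh"]
    else:
        # Option 2: prepare the documents, then build the dumps locally.
        scripts = ["prepare_docs.sh", "prepare_db_dumps.sh"]
    return [["bash", (root / name).as_posix()] for name in scripts]


steps = db_dump_steps("examples", download=False)
```

Each entry in `steps` is an argument list that could be run with `subprocess.run(step, check=True)`.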

A more detailed description of the examples (preparing the data, deployment, running, etc.) can be found in the accompanying documentation (Jekyll-based):

quickstart tutorial in docs/quickstart,
a more exhaustive description of the examples in docs/examples.


afolarin · 2018-08-15T14:49:02Z

LGTM

* creating a sample postgres database from csv files for testing cogstack
* minor reformatting
* minor cleanup in the sample job properties file
* added support for MTSamples; postgres sample db initializes now directly from db dump; example of cogstack ingesting from 2 data sources
* added db dump for synthetic and mtsamples data
* removed the cogsack-sample from docker-compose ymls directory and created explicitly examples directory with data, scripts and docker-compose files
* added raw datasets to re-generate db dumps for examples
* added db dump for example 1
* added db dump for example 2
* added db dump for example 3
* added quickstart based on Example 2 and using Jekyll static website generator (GH-pages compatible)
* minor comment in the db schema
* minor fixes in setup.sh scripts (typo + gen .htpasswd)
* example2: changed paritioner.gridSize: 3 --> 1
* missing minor nginx fix in examples setup.sh scripts
* added script to generate a static websites from a Jekyll-based GH pages
* added .gitignore to ignore _www dir
* added getting cogstack + setup parts for quickstart; minor refactoring
* added .gitignore for quickstart Jekyll
* added DCT field in examples schemas for more convenient (and explicit) handling by cogstack
* added script to automatically generate sample db dumps for all samples
* updated quickstart with changes in examples DB schemas
* re-generated DB dumps
* added ex.4 as extended ex.2 with PDFs instead of text documents
* working on mtsamples pdf and jpg versions -- example 4-*
* added example4 -- a set of examples for processing documents in DB using Tika; includes: pdf-text, pdf-img, doc, jpg
* removed db dumps from GH and moved them to S3 cogstack bucket. updated the examples data preparation scripts. added script for downloading db dumps directly from s3 bucket
* finished example4. minor updates to examples 1-3 changing the default docker deployment folder -> creation of temporary __deploy dir
* minor update to db dump creation/download scripts -- check for db_dump dir exists
* updated the quickstart covering the update with deployment: __deploy dir
* added example 5 -- a 2-step data ingestion (WIP)
* minor refactoring + fix for using dockerhub version of cogstack image
* minor refactoring of example cogstack *.properties files
* updated quickstart documentation with minor corrections (removed unnecessary postgres profile)
* minor update of quickstart, missing bits
* added a preliminary documentation for all the available examples
* minor cleanup in examples documentation yml configuration file

lrog added 30 commits July 4, 2018 13:39

creating a sample postgres database from csv files for testing cogstack

0ea2126

minor reformatting

a041de5

minor cleanup in the sample job properties file

128bc4f

added support for MTSamples; postgres sample db initializes now directly from db dump; example of cogstack ingesting from 2 data sources

1b9f276

added db dump for synthetic and mtsamples data

58533d0

removed the cogsack-sample from docker-compose ymls directory and created explicitly examples directory with data, scripts and docker-compose files

e6c0a52

added raw datasets to re-generate db dumps for examples

a6af2e3

added db dump for example 1

31e687b

added db dump for example 2

ace3341

added db dump for example 3

8ae572f

added quickstart based on Example 2 and using Jekyll static website generator (GH-pages compatible)

830b4a6

minor comment in the db schema

35d3f1a

minor fixes in setup.sh scripts (typo + gen .htpasswd)

b703467

example2: changed paritioner.gridSize: 3 --> 1

f552a82

missing minor nginx fix in examples setup.sh scripts

866f720

added script to generate a static websites from a Jekyll-based GH pages

0242b10

added .gitignore to ignore _www dir

05776b2

added getting cogstack + setup parts for quickstart; minor refactoring

88d3dee

added .gitignore for quickstart Jekyll

f9df970

added DCT field in examples schemas for more convenient (and explicit) handling by cogstack

47cc9de

added script to automatically generate sample db dumps for all samples

1d1b55b

updated quickstart with changes in examples DB schemas

cf23c85

re-generated DB dumps

1fc5b6a

added ex.4 as extended ex.2 with PDFs instead of text documents

f84d84a

working on mtsamples pdf and jpg versions -- example 4-*

c1bd979

added example4 -- a set of examples for processing documents in DB using Tika; includes: pdf-text, pdf-img, doc, jpg

6d59298

removed db dumps from GH and moved them to S3 cogstack bucket. updated the examples data preparation scripts. added script for downloading db dumps directly from s3 bucket

bd2c223

finished example4. minor updates to examples 1-3 changing the default docker deployment folder -> creation of temporary __deploy dir

4e73285

minor update to db dump creation/download scripts -- check for db_dump dir exists

96cbd2e

updated the quickstart covering the update with deployment: __deploy dir

a149369

lrog added 7 commits August 3, 2018 16:57

added example 5 -- a 2-step data ingestion (WIP)

7efdf47

minor refactoring + fix for using dockerhub version of cogstack image

bff7973

minor refactoring of example cogstack *.properties files

518b369

updated quickstart documentation with minor corrections (removed unnecessary postgres profile)

256328a

minor update of quickstart, missing bits

e9f905f

added a preliminary documentation for all the available examples

2483a4d

minor cleanup in examples documentation yml configuration file

a306780

lrog requested a review from afolarin August 8, 2018 09:50

afolarin merged commit 559a3d2 into dev Aug 15, 2018

lrog deleted the sample_data branch September 12, 2018 21:01
