This repository was archived by the owner on Jan 13, 2023. It is now read-only.
Merge sample datasets, examples with docs into dev #48
Merged
LGTM
vladd-bit pushed a commit that referenced this pull request Nov 10, 2021
* creating a sample postgres database from csv files for testing cogstack
* minor reformatting
* minor cleanup in the sample job properties file
* added support for MTSamples; postgres sample db initializes now directly from db dump; example of cogstack ingesting from 2 data sources
* added db dump for synthetic and mtsamples data
* removed the cogsack-sample from docker-compose ymls directory and created explicitly examples directory with data, scripts and docker-compose files
* added raw datasets to re-generate db dumps for examples
* added db dump for example 1
* added db dump for example 2
* added db dump for example 3
* added quickstart based on Example 2 and using Jekyll static website generator (GH-pages compatible)
* minor comment in the db schema
* minor fixes in setup.sh scripts (typo + gen .htpasswd)
* example2: changed paritioner.gridSize: 3 --> 1
* missing minor nginx fix in examples setup.sh scripts
* added script to generate a static websites from a Jekyll-based GH pages
* added .gitignore to ignore _www dir
* added getting cogstack + setup parts for quickstart; minor refactoring
* added .gitignore for quickstart Jekyll
* added DCT field in examples schemas for more convenient (and explicit) handling by cogstack
* added script to automatically generate sample db dumps for all samples
* updated quickstart with changes in examples DB schemas
* re-generated DB dumps
* added ex.4 as extended ex.2 with PDFs instead of text documents
* working on mtsamples pdf and jpg versions -- example 4-*
* added example4 -- a set of examples for processing documents in DB using Tika ; includes: pdf-text, pdf-img, doc, jpg
* removed db dumps from GH and moved them to S3 cogstack bucket. updated the examples data preparation scripts. added script for downloading db dumps directly from s3 bucket
* finished example4. minor updates to examples 1-3 changing the default docker deployment folder -> creation of temporary __deploy dir
* minor update to db dump creation/download scripts -- check for db_dump dir exists
* updated the quickstart covering the update with deployment: __deploy dir
* added example 5 -- a 2-step data ingestion (WIP)
* minor refactoring + fix for using dockerhub version of cogstack image
* minor refactoring of example cogstack *.properties files
* updated quickstart documentation with minor corrections (removed unnecessary postgres profile)
* added a preliminary documentation for all the available examples
* minor cleanup in examples documentation yml configuration file
Created a set of examples for running CogStack out-of-the-box that can potentially be reused in future deployments.

The datasets (in `examples/rawdata`) include:

- `syn` -- synthetic structured patient data generated using [Synthea](https://github.com/synthetichealth/synthea),
- `mt` -- free-text medical reports from MTSamples.

The CogStack pipeline examples include:

1. `syn` dataset. Uses a single DB source and a single CogStack engine instance, storing the processed records in Elasticsearch.
2. `mt` dataset. Used as the example in the CogStack quickstart.
3. `syn` dataset with `mt`. The datasets are stored in separate DBs. Uses multiple DBs as sources, multiple CogStack engine instances, and a single ES sink.
4. `mt` documents data embedded with `syn` records in the DB. Uses DOCX, PDF and JPG document formats.
5. `mt` documents parsed using Tika, followed by ingesting the `syn` records enriched with the parsed docs to the ES sink.

All the examples can be deployed using Docker Compose, for which YAML configuration files are provided (`examples/example*/docker`). The YAML files are based on `/docker-cogstack/compose-ymls/cogstack-clust/docker-compose.yml`, but use only a single ES node with X-Pack security disabled.

When the sample databases are deployed, pre-generated DB dumps are loaded during initialisation of the container. These DB dumps can be either:

- downloaded (`examples/download_db_dumps.sh`), or
- generated (`examples/prepare_docs.sh`, `examples/prepare_db_dumps.sh`) from the provided raw datasets, using the predefined DB SQL schemas available in each example's directory (`examples/example*/extra`).

A more detailed description of the examples (preparing the data, deployment, running, etc.) can be found in the accompanying Jekyll-based documentation: `docs/quickstart`, `docs/examples`.
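
The deployment flow described in this PR (obtain the DB dumps, then start each example with Docker Compose) could be planned with a small helper along these lines. This is only a minimal sketch, not part of the PR: the function names `dump_scripts`, `compose_commands` and `deployment_plan` are hypothetical, the compose file name `docker-compose.yml` inside `examples/example*/docker` is an assumption, and the sketch ignores any staging into a temporary `__deploy` directory that the examples' own setup scripts may perform. It only builds the command lists and does not run anything.

```python
from pathlib import Path


def dump_scripts(examples_dir, regenerate=False):
    """Return the commands that obtain the sample DB dumps.

    By default the dumps are downloaded; with regenerate=True they are
    rebuilt from the raw datasets. Script names come from the PR text;
    invoking them via `bash` with no arguments is an assumption.
    """
    examples = Path(examples_dir)
    if regenerate:
        names = ["prepare_docs.sh", "prepare_db_dumps.sh"]
    else:
        names = ["download_db_dumps.sh"]
    return [["bash", str(examples / name)] for name in names]


def compose_commands(examples_dir):
    """Return one `docker-compose ... up -d` command per example found.

    Looks for <examples_dir>/example*/docker/docker-compose.yml
    (the file name is assumed, not confirmed by the PR).
    """
    examples = Path(examples_dir)
    commands = []
    for docker_dir in sorted(examples.glob("example*/docker")):
        compose_file = docker_dir / "docker-compose.yml"
        if compose_file.is_file():
            commands.append(
                ["docker-compose", "-f", str(compose_file), "up", "-d"]
            )
    return commands


def deployment_plan(examples_dir, regenerate=False):
    """Dump preparation commands first, then the compose commands."""
    return dump_scripts(examples_dir, regenerate) + compose_commands(examples_dir)
```

Calling `deployment_plan("examples")` from the repository root returns a list of argument lists that can be printed for review or passed one by one to `subprocess.run`. Keeping planning separate from execution makes it easy to inspect the commands before anything touches Docker.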