This repository was archived by the owner on Jan 13, 2023. It is now read-only.


@lrog
Contributor

@lrog lrog commented Aug 8, 2018

Created a set of examples for running CogStack out-of-the-box, which can be re-used for future deployments.

The datasets (in examples/rawdata) include:

The CogStack pipeline examples include:

  1. Running CogStack on a subset of the syn dataset, using a single DB source and a single CogStack engine instance, and storing the processed records in Elasticsearch.
  2. Example 1 extended with the parsed mt dataset. Used as the example in the CogStack quickstart.
  3. Running CogStack on the full syn dataset with mt, with the datasets stored in separate DBs: multiple DBs as sources, multiple CogStack engine instances, and a single ES sink.
  4. Running CogStack with Apache Tika to process (parse and OCR) mt documents embedded with syn records in the DB, using the DOCX, PDF and JPG document formats.
  5. Running a two-step CogStack data processing pipeline: first processing the mt documents with Tika, then ingesting the syn records enriched with the parsed documents into the ES sink.
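
The two-step flow of example 5 can be pictured with a small, purely conceptual Python sketch. It does not use CogStack, Tika or Elasticsearch: `parse_document`, the record fields and the in-memory `sink` list are hypothetical stand-ins for the Tika step, the DB rows and the ES sink, and the toy data is invented for illustration.

```python
from typing import Dict, List


def parse_document(raw: bytes) -> str:
    """Hypothetical stand-in for the Tika parsing/OCR step."""
    return raw.decode("utf-8", errors="replace").strip()


def step_one(documents: Dict[int, bytes]) -> Dict[int, str]:
    """Step 1: turn each binary document into plain text, keyed by document id."""
    return {doc_id: parse_document(raw) for doc_id, raw in documents.items()}


def step_two(records: List[dict], parsed: Dict[int, str], sink: List[dict]) -> None:
    """Step 2: attach the parsed text to each record and write it to the sink."""
    for record in records:
        enriched = dict(record)  # copy so the source record is left untouched
        enriched["document_text"] = parsed.get(record["doc_id"], "")
        sink.append(enriched)


# Toy data, invented for illustration only.
documents = {1: b"  Discharge summary  ", 2: b"Clinic letter"}
records = [{"patient_id": 10, "doc_id": 1}, {"patient_id": 11, "doc_id": 2}]

sink: List[dict] = []  # stands in for the ES sink
step_two(records, step_one(documents), sink)
print(sink)
```

The point of the split is that the expensive document parsing runs once in the first step, so the second step only has to join already-parsed text onto the records before indexing.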

All the examples can be deployed using Docker Compose; the YAML configuration files are provided in examples/example*/docker. These files are based on /docker-cogstack/compose-ymls/cogstack-clust/docker-compose.yml, but use only a single ES node with X-Pack security disabled.
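
As a rough illustration of the deployment step, the sketch below builds a `docker-compose` command line for one example directory. The compose file name `docker-compose.yml` and the helper `compose_command` are assumptions made for this sketch, and the commit log also mentions a temporary `__deploy` directory, so the quickstart documentation remains the authoritative description of how to start an example.

```python
from pathlib import Path
from typing import List


def compose_command(example_dir: Path, compose_file: str = "docker-compose.yml") -> List[str]:
    """Build a `docker-compose up` command for one example's docker directory.

    Assumption: the compose file is named docker-compose.yml and sits in
    <example_dir>/docker, following the examples/example*/docker layout.
    """
    yml = example_dir / "docker" / compose_file
    # -f selects the compose file; `up -d` starts the services in the background.
    return ["docker-compose", "-f", str(yml), "up", "-d"]


cmd = compose_command(Path("examples") / "example2")
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root once the paths have been checked against the actual checkout.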

When the sample databases are deployed, pre-generated DB dumps are loaded during container initialisation. These DB dumps can be either:

  • downloaded from the Amazon S3 cogstack bucket (examples/download_db_dumps.sh), or
  • generated locally (examples/prepare_docs.sh, examples/prepare_db_dumps.sh) from the provided raw datasets and the predefined DB SQL schemas available in each example's directory (examples/example*/extra)
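
The two ways of obtaining the DB dumps can be summarised in a short Python sketch. The script paths are the ones named above; running `prepare_docs.sh` before `prepare_db_dumps.sh`, invoking the scripts with `bash` from the repository root without arguments, and the helper names `dump_steps` and `prepare_dumps` are assumptions of this sketch.

```python
import subprocess
from typing import List

# Script paths as named in this PR description.
DOWNLOAD_STEPS = ["examples/download_db_dumps.sh"]
# Assumption: documents are prepared first, then the DB dumps are built from them.
GENERATE_STEPS = ["examples/prepare_docs.sh", "examples/prepare_db_dumps.sh"]


def dump_steps(mode: str) -> List[str]:
    """Return the scripts to run for 'download' or 'generate' mode."""
    if mode == "download":
        return list(DOWNLOAD_STEPS)
    if mode == "generate":
        return list(GENERATE_STEPS)
    raise ValueError(f"unknown mode: {mode!r}")


def prepare_dumps(mode: str, dry_run: bool = True) -> List[str]:
    """Run (or, with dry_run, only list) the scripts for the chosen mode."""
    steps = dump_steps(mode)
    if not dry_run:
        for script in steps:
            # Assumption: scripts are invoked with bash from the repository root.
            subprocess.run(["bash", script], check=True)
    return steps


print(prepare_dumps("generate"))
```

With the default `dry_run=True` nothing is executed, which makes it easy to check the planned steps before touching the network or regenerating data.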

A more detailed description of the examples (preparing the data, deployment, running, etc.) can be found in the accompanying documentation (Jekyll-based):

  • quickstart tutorial in docs/quickstart,
  • a more detailed description of the examples in docs/examples.
lrog added 30 commits July 4, 2018 13:39
…tly from db dump; example of cogstack ingesting from 2 data sources
…eated explocitly examples directory with data, scripts and docker-compose files
…ing Tika ; includes: pdf-text, pdf-img, doc, jpg
…d the examples data preparation scripts. added script for downloading db dumps directly from s3 bucket
… docker deployment folder -> creation of temporary __deploy dir
@lrog lrog requested a review from afolarin August 8, 2018 09:50
@afolarin
Contributor

LGTM

@afolarin afolarin merged commit 559a3d2 into dev Aug 15, 2018
@lrog lrog deleted the sample_data branch September 12, 2018 21:01
vladd-bit pushed a commit that referenced this pull request Nov 10, 2021
* creating a sample postgres database from csv files for testing cogstack
* minor reformatting
* minor cleanup in the sample job properties file
* added support for MTSamples; postgres sample db initializes now directly from db dump; example of cogstack ingesting from 2 data sources
* added db dump for synthetic and mtsamples data
* removed the cogsack-sample from docker-comppose ymls directory and created explocitly examples directory with data, scripts and docker-compose files
* added raw datasets to re-generate db dumps for examples
* added db dump for example 1
* added db dump for example 2
* added db dump for example 3
* added quickstart based on Example 2 and using Jekyll static website generator (GH-pages compatible)
* minor comment in the db schema
* minor fixes in setup.sh scripts (typo + gen .htpasswd)
* example2: changed paritioner.gridSize: 3 --> 1
* missing minor nginx fix in examples setup.sh scripts
* added script to generate a static websites from a Jekyll-based GH pages
* added .gitignore to ignore _www dir
* added getting cogstack + setup parts for quickstart; minor refactoring
* added .gitignore for quickstart Jekyll
* added DCT field in examples schemas for more convenient (and explicit) handling by cogstack
* added script to automatically generate sample db dumps for all samples
* updated quickstart with changes in examples DB schemas
* re-generated DB dumps
* added ex.4 as extended ex.2 with PDFs instead of text documents
* working on mtsamples pdf and jpg versions -- example 4-*
* added example4 -- a set of examples for processing documents in DB using Tika ; includes: pdf-text, pdf-img, doc, jpg
* removed db dumps from GH and moved them to S3 cogstack bucket. updated the examples data preparation scripts. added script for downloading db dumps directly from s3 bucket
* finished example4. minor updates to examples 1-3 changing the default docker deployment folder -> creation of temporary __deploy dir
* minor update to db dump creation/download scripts -- check for db_dump dir exists
* updated the quickstart covering the update with deployment: __deploy dir
* added example 5 -- a 2-step data ingestion (WIP)
* minor refactoring + fix for using dockerhub vrrsion of cogstack image
* minor refactoring of example cogstack *.properties files
* updated quickstart documentation with minor corrections (removed unnecessary postgres profile)
* minor update of quickstart, missing bits
* added a preliminary documentation for all the available examples
* minor cleanup in examples documentation yml configuration file