Description
When building integration packages, sample data is important to develop ingest pipelines and build dashboards. Unfortunately, in most cases real sample data is limited and often tricky to produce. This issue proposes a tool as part of elastic-package that can generate and load sample data.
Important: The following is only an initial proposal to better explain the problem and share existing ideas. A proper design is still required.
Why part of elastic-package
Generating sample data is not a new problem, and several tools already provide partial solutions. A tool to generate sample data in elastic-package makes it available in a simple way to every package developer. How sample data should look and be generated becomes part of the package spec. This way, anyone building a package also gets the ability to generate sample data and use it as part of the developer experience.
Data generation - metrics / logs
For data generation, two different types of data exist. Metrics and traces are mostly already in the format that is ingested into Elasticsearch and require very little processing. Logs, on the other hand, often come as raw messages and require ingest pipelines or runtime fields to structure the data. The goal is for the tool to generate both types of data, but this can happen in iterations.
Metrics generation
For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool built by @aspacca. Instead of having to build separate config files, the config params for each field would live directly in the fields.yml of each data stream. The definition could look similar to the following:
```yaml
- name: kubernetes.pod.network.rx.bytes
  type: long
  format: bytes
  unit: byte
  metric_type: counter
  description: |
    Received bytes
  _data_generation: # Discuss exact name
    fuzziness: 1000
    range: 10000
```

The exact syntax for each field type still needs to be defined.
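To make the idea more tangible, below is a minimal Go sketch of how a generator might interpret such a config for a counter field. The semantics assumed here (fuzziness as the maximum random increment per sample, range as the bound at which the counter resets) are assumptions, not defined behavior:

```go
package main

import (
	"fmt"
	"math/rand"
)

// genConfig mirrors the hypothetical _data_generation block above.
type genConfig struct {
	Fuzziness float64 // assumed: max random increment between consecutive samples
	Range     float64 // assumed: bound at which the counter wraps back to zero
}

// nextCounter advances a monotonically increasing counter by a random
// step of up to Fuzziness, resetting to zero once Range is exceeded,
// the way a real counter does after a process restart.
func nextCounter(prev float64, cfg genConfig) float64 {
	next := prev + rand.Float64()*cfg.Fuzziness
	if next > cfg.Range {
		return 0 // simulated counter reset
	}
	return next
}

func main() {
	cfg := genConfig{Fuzziness: 1000, Range: 10000}
	value := 0.0
	for i := 0; i < 5; i++ {
		value = nextCounter(value, cfg)
		fmt.Printf("kubernetes.pod.network.rx.bytes: %.0f\n", value)
	}
}
```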
Logs generation
For logs generation, inspiration can be taken from the spigot tool by @leehinman. Ideally we could simplify this by allowing users to specify message patterns, something like {@timestamp} {source.ip}, and then specify what the values for these fields should be. The tool would then take over the generation of sample logs.
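As an illustration, such a pattern-based config could look like the following. All keys and generator types here are hypothetical and only meant to make the idea concrete:

```yaml
# Hypothetical log generation config for a data stream
pattern: "{@timestamp} {source.ip} {http.request.method} {url.path} {http.response.status_code}"
fields:
  "@timestamp":
    type: date
    format: "2006-01-02T15:04:05Z07:00"
  source.ip:
    type: ip
    cidr: 10.0.0.0/8
  http.request.method:
    type: enum
    values: [GET, POST, PUT, DELETE]
```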
Importantly, the log generation must output the raw message fields as they look before the ingest pipeline runs.
Generated data format
The proposed data structure generated by the tool is the one used by esrally: one JSON document per line, with all fields inside. This makes it simple to deliver the data to Elasticsearch and makes it possible to potentially reuse some of the generated data in Rally tracks.
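For illustration, the output would be newline-delimited JSON along these lines (field values are invented):

```json
{"@timestamp":"2022-06-01T00:00:00.000Z","kubernetes":{"pod":{"network":{"rx":{"bytes":1833}}}}}
{"@timestamp":"2022-06-01T00:00:10.000Z","kubernetes":{"pod":{"network":{"rx":{"bytes":2476}}}}}
```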
Non goals
A non-goal of the data generation and loading is to replace Rally. Rally measures exact performance and builds reproducible benchmarks. Generating and loading data with elastic-package is about testing ingest pipelines, dashboards, and queries on larger sets of data in an easy way. The focus is on package development.
Another non-goal is generating events that are related to each other. For some solutions it is important that when a host.name shows up, other parts of the data contain the same host.name, so it is possible to browse through the solution. This might be added at a later point but is not part of the scope.
Sample data storage
As the sample data can always be generated on the fly, it is not required to store it. If some sample datasets should be stored for later use, package-spec should provide a schema to reference them.
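If storing datasets is supported, referencing them from a package could look roughly like this; the schema and paths below are purely illustrative and would need to be defined in package-spec:

```yaml
# Hypothetical entry in a data stream's dev config
sample_datasets:
  - name: pod-baseline
    path: _dev/sample_data/pod-baseline.ndjson
```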
Command line
Command line arguments must be available to generate sample data for a dataset or a whole package and load it into Elasticsearch. Ideally package-spec allows storing config files describing which datasets can be generated, so a package developer can share these configs as part of the package.
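To make this concrete, the invocation could look something like the following. Command and flag names are placeholders, not an agreed-upon interface:

```shell
# Hypothetical commands; names and flags are not final
elastic-package generate-data --data-stream pod --docs 10000 > pod-sample.ndjson
elastic-package generate-data --data-stream pod --docs 10000 --load
```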
Initial packages to start with
I recommend picking 2 initial packages to start with for the data generation. As k8s and AWS are both more complex packages that also generate lots of data, these could be a good start, focusing on the metrics part.
Future ideas
- Use the data generation to test expected storage use per dataset. This can be used to compare storage use across versions but also help users predict how much storage will be needed.
- Load sample data, run a report on the dashboards, and export performance metrics as part of a pull request. The report would also help to spot whether parts of a dashboard are broken.
- Real-time event generation: instead of pre-generating sample data, elastic-package could continuously ship events to Elasticsearch.