Data Streams #53100

@martijnvg

Description

Update: this description is outdated; for more information, take a look at the unreleased data streams docs.

This meta issue tracks the development of a new feature named data streams.

Background

Data streams are targeted at time-based data sources and enable us to solve the bootstrap problems that arise when indexing via a write alias (logstash-plugins/logstash-output-elasticsearch#858). Today aliases have some deficiencies in how they are implemented in Elasticsearch; namely, they are not a first-class concept, which makes them confusing to use. Aliases serve a broad range of use cases, whereas data streams will be a solution focused on time-based data sources. Data streams should be a first-class concept, but should also be non-intrusive.

Concept

A data stream formalizes the notion of a stream of data for time-based data sources. A data stream groups indices from the same time-based data source together into an opaque container. A data stream keeps track of a list of indices ordered by generation. The generation starts at 0 and each time the stream is rolled over, the generation is incremented by one. Writes are forwarded to the index with the highest generation (the last index). Searches and other multi-index APIs forward requests to all indices that are part of a data stream (similar to how aliases are resolved in these APIs).

Because data streams are aimed at time-series data sources, a date field is required and must be identified as the "timestamp" for the documents in the data stream. This enables Kibana to detect automatically that it is dealing with time-series data, and allows us to apply some optimizations internally (e.g., automatically sorting on the timestamp field).
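To make this concrete, the mapping of a backing index would contain a date field that the stream designates as its timestamp. A minimal sketch, assuming the field is called @timestamp (the field name is illustrative, not mandated by this proposal):

{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" }
    }
  }
}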

Indices that are contained within a data stream are hidden. The idea is that users interact with the data stream as much as possible and not directly with its backing indices.

Data streams only accept append-only writes (index requests with op_type=create). Deletes and updates are rejected. If specific documents need to be updated or deleted, these operations should happen via the index in which those documents reside. The reason these write operations are rejected when issued via a data stream is that they would work as expected only until the data stream is rolled over, after which they would result in 404 errors. It is therefore better to reject these operations consistently.
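For example, assuming an illustrative data stream named logs-foo:

# accepted: an append-only write via the data stream
PUT /logs-foo/_doc/1?op_type=create
{ "@timestamp": "2020-03-01T12:00:00Z", "message": "some log line" }

# rejected: deletes (and updates) must target the backing index the document resides in
DELETE /logs-foo/_doc/1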

The rollover API needs to understand how to handle a data stream. The data stream’s generation needs to be incremented and a new hidden index needs to be created atomically. The name of the index is based on the name of the data stream and its current generation, which looks like this: [datastream-name]-[datastream-generation].
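A sketch with the illustrative logs-foo stream at generation 0, whose write index under the naming scheme above is logs-foo-0:

POST /logs-foo/_rollover

# the generation is incremented to 1, the hidden index logs-foo-1 is created
# atomically, and subsequent writes to logs-foo are forwarded to logs-foo-1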

A data stream can be created with the create data stream API and removed via the delete data stream API.

It should also be possible to reserve a namespace as a data stream before actually creating the data stream. For this, data streams will depend on index templates v2. Index templates will get an additional setting, named data_stream; when it is set, auto-creation creates a data stream named after what would otherwise have been the concrete index, plus a hidden backing index (see the sketch after the list below):

  • A user creates an index template (in v2 format) with a desired index pattern, mappings, and index settings, and sets the data_stream setting to true.
  • The user starts ingesting data and the auto-create-index functionality kicks in. The previously created template matches, but instead of a regular index, the following is created:
    • A data stream with the targeted index name as name.
    • A hidden index with the following name: [data-stream-name]-0.
    • The hidden index is added to the list of indices of the data stream and the current generation is set to 0.
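A sketch of what such a template could look like. The exact shape of the data_stream setting was still in flux at the time of writing, so everything below (names, settings, mappings) is illustrative:

PUT /_index_template/logs-foo-template
{
  "index_patterns": ["logs-foo"],
  "data_stream": true,
  "template": {
    "settings": { "number_of_shards": 1 },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }
      }
    }
  }
}

With this template in place, the first index request that targets logs-foo creates the data stream logs-foo and the hidden backing index logs-foo-0 instead of a regular index.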

Additionally, data streams will be integrated into Kibana. For example, Kibana can automatically generate index patterns based on data streams and identify the timestamp field.

Data Streams and ILM

The main change is that it will no longer be required to configure the index.lifecycle.rollover_alias setting for ILM. ILM can automatically figure out whether an index is part of a data stream and act accordingly. An index can only be part of a single data stream, the user doesn't create the backing indices, and those indices are hidden; because of this clear structure, ILM can make assumptions and doesn't need additional configuration. With aliases (even with alias templates) none of this can be assumed, and that is a big upside of data streams.
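For example, a policy like the following would be all that is needed; no index.lifecycle.rollover_alias has to be configured on the backing indices (the policy name and thresholds are illustrative):

PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}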

ILM should also be able to update data streams atomically. For example, in the context of ILM's shrink action, the data stream should atomically stop referring to the original index and start referring to the shrunken index. For the rest, ILM should be able to work as it does today.

Integration with security

TBD

APIs

The APIs, and the way data streams are used from other APIs, are not final and may change in the future.

Index expressions and data streams in APIs

Data streams should be taken into account when resolving an index expression. Which API resolves the expression also matters: if a multi-index API resolves the name of a data stream, all the indices of the stream should be resolved, whereas if a write API resolves the name of a data stream, only the latest index should be resolved.

The following write APIs will resolve a data stream to its latest index: the bulk API and the index API.
The following multi-index APIs should resolve a data stream to all the indices it contains: search, msearch, field caps, and EQL search.
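For example, given the illustrative data stream logs-foo with backing indices logs-foo-0 and logs-foo-1:

# multi-index API: resolves to all backing indices (logs-foo-0 and logs-foo-1)
GET /logs-foo/_search
{ "query": { "match_all": {} } }

# write API: resolves to the latest backing index only (logs-foo-1)
POST /logs-foo/_bulk
{ "create": {} }
{ "@timestamp": "2020-03-01T12:00:00Z", "message": "appended to logs-foo-1" }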

Single-document read APIs should fail when used via a data stream. The following APIs fall into this category: explain, get, mget, and termvectors.

There are many admin APIs that are multi-index. These APIs should be able to resolve data streams, resolving a data stream to its latest hidden backing index. Examples of these APIs are: put mapping, get mapping, get settings, and get index.
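For instance, under this design a put mapping request addressed to the illustrative logs-foo stream would be applied to its latest backing index:

PUT /logs-foo/_mapping
{
  "properties": {
    "level": { "type": "keyword" }
  }
}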

The rollover API accepts both a write alias and a data stream.

Get index API

If an index is part of a data stream, the get index API should include which data stream it is part of.
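A sketch of what that could look like; the data_stream field in the response is illustrative, not a finalized format:

GET /logs-foo-1

{
  "logs-foo-1": {
    "data_stream": "logs-foo",
    "settings": { ... },
    "mappings": { ... }
  }
}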

Create data stream API

Request:

PUT /_data_stream/[name]
{ "timestamp_field": ... }

The create data stream API allows creating a new data stream and its first backing index. The API creates a new data stream with the provided name and the provided timestamp field. The generation is set to 0. The backing index is created with the following name: '[data-stream-name]-000000'. The settings and mappings originate from any index templates that match.

If a data stream, index, or alias already exists with the same name as the provided data stream name, the create data stream API returns an error. An error is also returned if indices or aliases exist whose names share the data stream name as a prefix.
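A concrete (illustrative) instance of the request above:

PUT /_data_stream/logs-foo
{
  "timestamp_field": "@timestamp"
}

This creates the data stream logs-foo at generation 0, backed by the hidden index logs-foo-000000.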

Get data streams API

Request:

GET /_data_streams/[name] 

Returns the data streams that match the specified name; for each data stream, additional metadata is included (for example, the list of backing indices and the current generation). If no name is provided, all data streams are returned. Wildcard expressions are also supported.
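The response format was not final; a sketch of what it could look like for the illustrative logs-foo stream after one rollover:

GET /_data_streams/logs-*

[
  {
    "name": "logs-foo",
    "timestamp_field": "@timestamp",
    "generation": 1,
    "indices": ["logs-foo-000000", "logs-foo-000001"]
  }
]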

Delete data stream API

Request:

DELETE /_data_stream/[name] 

Deletes the specified data stream. The indices that are part of the data stream are removed as well.

Updating a data stream

TBD

A data stream cannot be updated to include system indices or indices that are already part of another data stream.
