s1monw commented Aug 14, 2018

This change adds a `_source`-only snapshot repository that wraps any existing
repository as a backend and snapshots only the `_source` part of an index,
including live docs markers. Snapshots taken with the source repository
won't include any index structures, doc values or points. The snapshot is reduced in size and
functionality, so the index requires a full reindex after it has been successfully restored.

The restore process copies the `_source` data locally and starts a special shard and engine
that allow `match_all` scrolls and searches. Any other query or get call will fail with an unsupported operation exception. The restored index is also marked as read-only.

This feature is aimed mainly at disaster-recovery use cases where snapshot size is
a concern or where time to restore is less of an issue.

NOTE: The snapshot produced by this repository is still a valid Lucene index. This change does not enable longer retention policies; that is out of scope here.

This change adds a `_source` only snapshot repository that wraps any existing repository as a _backend_ and snapshots only the `_source` part, including live docs markers. Snapshots taken with the `source` repository won't include any index structures. The snapshot is reduced in size and functionality, so it requires in-place reindexing during restore. The restore process copies the `_source` data locally and reindexes all data during the recovery-from-snapshot phase. Users have two options for reindexing:

* full reindex: the data is reindexed with the original mapping
* minimal reindex: the data is reindexed with a disabled mapping, which results in an index that can only be accessed via `_id`

Both options allow using and updating the index; the latter is mainly for scan/scroll purposes and for reindexing after the fact. This feature is aimed mainly at disaster-recovery use cases where snapshot size is a concern or where time to restore is less of an issue.

s1monw requested a review from bleskes August 14, 2018 13:10
@@ -0,0 +1,39 @@
[[repository-src-only]]

@clintongormley @debadair I'd love to get input on where this should be linked from and where it should live. At this point it's stand-alone.

import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_EXTENSION;
import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_INDEX_EXTENSION;

public final class SourceOnlySnapshot {

This could be generally useful and could be moved into Lucene. I will do that after the fact.

public static final Setting<Boolean> RESTORE_MINIMAL = Setting.boolSetting("restore_minimal",
false, Setting.Property.NodeScope);

public static final String SNAPSHOT_DIR_NAME = "_snapshot";

@bleskes we use tmp dirs for restore and snapshot. I wonder if that is ok or if there are any concerns. Our shard deletion mechanism should take care of cleaning up.

BytesReference source = rootFieldsVisitor.source();
if (source != null) { // nested fields don't have source. in this case we should be fine.
// TODO we should have a dedicated origin for this LOCAL_TRANSLOG_RECOVERY is misleading.
Engine.Result result = shard.applyTranslogOperation(new Translog.Index(uid.type(), uid.id(),

This is a pretty terrible abuse of the Translog phase. We should rename it to Ops phase or so. I spoke with @bleskes about this. It is a shortcut and needs some discussion, also with @ywelsch.

s1monw commented Aug 15, 2018

I ran some initial benchmarks using the geonames and http_logs datasets from our benchmark suite:

| repository type | dataset | snapshot time taken | snapshot size |
|---|---|---|---|
| fs | geonames | 82 sec | 3.2 GB |
| source delegate to fs | geonames | 28 sec | 922 MB |
| fs | http_logs | 7 min 6 sec | 16 GB |
| source delegate to fs | http_logs | 3 min 44 sec | 7.5 GB |

The snapshots are all taken to a local disk, i.e. no network is involved. I will follow up with restore times, which I expect to be much better for full backups (fs) since source needs to reindex. I already have some numbers for the geonames dataset:

| repository type | dataset | restore time taken | snapshot size | num docs reindexed | num shards |
|---|---|---|---|---|---|
| fs | geonames | 1 min 10 sec | 3.2 GB | 0 | 5 |
| source full reindex | geonames | 3 min 15 sec | 922 MB | 11396505 | 5 |
| source minimal reindex | geonames | 1 min 24 sec | 922 MB | 11396505 | 5 |

s1monw commented Aug 16, 2018

here are some updated numbers:

| repository type | dataset | restore time taken | snapshot size | num docs reindexed | num shards |
|---|---|---|---|---|---|
| fs | http_logs | 6.7 min | 16 GB | 0 | 4 |
| source full reindex | http_logs | 33.6 min | 7.1 GB | 181463624 | 4 |
| source minimal reindex | http_logs | 16.9 min | 7.1 GB | 181463624 | 4 |

s1monw commented Sep 12, 2018

@bleskes @debadair I pushed changes. Thanks for the reviews.

bleskes left a comment

LGTM

* Queries other than `match_all` will return no results.
* `_get` requests are not supported.
* Queries other than `match_all` and `_get` requests are not supported.

This can be read two ways: it could also be read as saying that `_get` works (if you don't realize that we don't consider `_get` a query).


@Override
public Terms terms(String field) {
throw new UnsupportedOperationException("_source only indices can't be searched or filtered");

++
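
The `terms` override quoted above follows a fail-fast pattern: the data needed for a `match_all` scan stays readable, while anything that would need index structures throws instead of silently returning nothing. A minimal, dependency-free sketch of that idea follows. It is an illustration only, not the PR's implementation: the class `SourceOnlyReaderSketch` and its methods `matchAll` and `terms` are hypothetical names, and a plain in-memory map stands in for the stored `_source` fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a reader over a _source-only index (not a Lucene or Elasticsearch class).
final class SourceOnlyReaderSketch {
    // Maps document id to its raw _source JSON; assumed layout for illustration only.
    private final Map<String, String> sourceById;

    SourceOnlyReaderSketch(Map<String, String> sourceById) {
        // Defensive copy that keeps insertion order, standing in for document order.
        this.sourceById = new LinkedHashMap<>(sourceById);
    }

    // Rough equivalent of a match_all scan: return every stored _source in order.
    List<String> matchAll() {
        return new ArrayList<>(sourceById.values());
    }

    // Anything that would need index structures fails loudly rather than returning empty results.
    Object terms(String field) {
        throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
    }
}
```

Throwing here, rather than returning an empty result, makes an unsupported query visible to the caller, which matches the behaviour the PR description states for queries other than `match_all`.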

s1monw merged commit c783488 into elastic:master Sep 12, 2018
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Sep 13, 2018
s1monw added a commit that referenced this pull request Sep 13, 2018
@jhalterman

@s1monw What's the meaning of minimal vs full reindex in this comment?

s1monw commented Sep 13, 2018

@jhalterman that was an early version of the change that I reverted. These numbers are meaningless now.

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Sep 17, 2018
We can't rely on the leaf reader ordinal in a wrapped reader since it might not correspond to the ordinal in the SegmentInfos for its SegmentCommitInfo. Relates to elastic#32844 Closes elastic#33689
s1monw added a commit that referenced this pull request Sep 17, 2018
We can't rely on the leaf reader ordinal in a wrapped reader since it might not correspond to the ordinal in the SegmentInfos for its SegmentCommitInfo. Relates to #32844 Closes #33689 Closes #33755
s1monw added a commit that referenced this pull request Sep 18, 2018
s1monw added a commit that referenced this pull request Sep 18, 2018