Skip to content

Conversation

@tlrx
Copy link
Member

@tlrx tlrx commented May 10, 2023

Before #95891 a file was considered as reused in recovery if
it was fully cached in the cold persistent cache. Otherwise
the full file length was reported as recovered from the
blob store during prewarming.

While working on #95891 I changed the
CachedBlobContainerIndexInput.prefetchPart()
method to be more precise on the number of bytes that were
effectively read from the blob store during prewarming.

But this obviously broke some tests (#95970) because a file
cannot be partially recovered from disk and from remote. This
change restores the previous behavior with one adjustment:
the file is considered as reused if the prewarm method
effectively prefetched 0 bytes.

Closes #95970
Closes #95994

…d from cache Before elastic#95891 a file was considered as reused in recovery if it was fully cached in the cold persistent cache. Otherwise the full file length was reported as recovered from the blob store during prewarming. While working on elastic#95891 I changed the CachedBlobContainerIndexInput.prefetchPart() method to be more precise on the number of bytes that were effectively read from the blob store during prewarming. But this obviously broke some tests (elastic#95970) because a file cannot be partially recovered from disk and from remote. This change restores the previous behavior with one adjustment: the file is considered as reused if less than 50% of the file is prewarmed from the blob store. Relates elastic#95891
@tlrx tlrx added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.8.1 v8.9.0 labels May 10, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @tlrx, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 10, 2023
// we can't report a file as being partially recovered from disk and partially from the blob store, but this is
// something that can happen for fully mounted searchable snapshots. So we make a choice here: if less than 50%
// of the file is prewarmed from the blob store then it is marked as reused from cache.
if ((prefetchedBytes.get() / (double) file.length()) < 0.5d) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance this could break for a very small segment files?
I am thinking about example where 1/3 parts were in disk already and we only downloaded 2/3.

I wonder if we should consider changins boolean reused to long reused in FileDetail and consider a file fully recovered when recovered < length && reused + recovered == length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand why it would break for small files? We're comparing the number of bytes prewarmed with the total file length so the number of parts do not really matter. Also, most small files are likely to be either fully prewarmed or fully reused.

I wonder if we should consider changins boolean reused to long reused in FileDetail and consider a file fully recovered when recovered < length && reused + recovered == length?

We could but this is a larger change that I don't think is really worth doing since it only makes sense for cold cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand if there is something that guarantee that the following would not happen:

  • 2/3 are already cached locally and would not be downloaded
  • 1/3 is downloaded updating a stats so that it expects something would complete download before reporting it as finished.
@idegtiarenko
Copy link
Contributor

Is it worth adding an assertion somewhere around


that would verify that for each index for each FileDetail we have assert reused ^ recovered == length (possibly in a separate pr)?

@bpintea bpintea added v8.8.2 and removed v8.8.1 labels Jun 6, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @tlrx, I've updated the changelog YAML for you.

@tlrx tlrx added the v8.9.1 label Jun 23, 2023
@tlrx
Copy link
Member Author

tlrx commented Jun 23, 2023

@elasticmachine retest this please

@tlrx
Copy link
Member Author

tlrx commented Jun 23, 2023

Thanks @idegtiarenko for your feedback. I reworked this change to revert to the previous behavior.

Is it worth adding an assertion somewhere around [..] that would verify that for each index for each FileDetail we have assert reused ^ recovered == length (possibly in a separate pr)?

I suppose we can yes. We don't have must assertions around recovery file details but I can add one as a follow up.

@joegallo joegallo added v8.8.3 and removed v8.8.2 labels Jun 23, 2023
@tlrx tlrx added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Jun 26, 2023
@tlrx
Copy link
Member Author

tlrx commented Jun 26, 2023

@elasticmachine run elasticsearch-ci/part-1

@tlrx tlrx merged commit cab2192 into elastic:main Jun 26, 2023
@tlrx tlrx deleted the fix-recovered-bytes-cold-cache branch June 26, 2023 11:28
@tlrx
Copy link
Member Author

tlrx commented Jun 26, 2023

Thanks Ievgen!

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 26, 2023
…d from cache (elastic#95987) Before elastic#95891 a file was considered as reused in recovery if it was fully cached in the cold persistent cache. Otherwise the full file length was reported as recovered from the blob store during prewarming. While working on elastic#95891 I changed the CachedBlobContainerIndexInput.prefetchPart() method to be more precise on the number of bytes that were effectively read from the blob store during prewarming. But this obviously broke some tests (elastic#95970) because a file cannot be partially recovered from disk and from remote. This change restores the previous behavior with one adjustment: the file is considered as reused if the prewarm method effectively prefetched 0 bytes. Closes elastic#95970 Closes elastic#95994
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.9
8.8
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 26, 2023
…d from cache (elastic#95987) Before elastic#95891 a file was considered as reused in recovery if it was fully cached in the cold persistent cache. Otherwise the full file length was reported as recovered from the blob store during prewarming. While working on elastic#95891 I changed the CachedBlobContainerIndexInput.prefetchPart() method to be more precise on the number of bytes that were effectively read from the blob store during prewarming. But this obviously broke some tests (elastic#95970) because a file cannot be partially recovered from disk and from remote. This change restores the previous behavior with one adjustment: the file is considered as reused if the prewarm method effectively prefetched 0 bytes. Closes elastic#95970 Closes elastic#95994
elasticsearchmachine pushed a commit that referenced this pull request Jun 26, 2023
…d from cache (#95987) (#97103) Before #95891 a file was considered as reused in recovery if it was fully cached in the cold persistent cache. Otherwise the full file length was reported as recovered from the blob store during prewarming. While working on #95891 I changed the CachedBlobContainerIndexInput.prefetchPart() method to be more precise on the number of bytes that were effectively read from the blob store during prewarming. But this obviously broke some tests (#95970) because a file cannot be partially recovered from disk and from remote. This change restores the previous behavior with one adjustment: the file is considered as reused if the prewarm method effectively prefetched 0 bytes. Closes #95970 Closes #95994
elasticsearchmachine pushed a commit that referenced this pull request Jun 26, 2023
…d from cache (#95987) (#97102) Before #95891 a file was considered as reused in recovery if it was fully cached in the cold persistent cache. Otherwise the full file length was reported as recovered from the blob store during prewarming. While working on #95891 I changed the CachedBlobContainerIndexInput.prefetchPart() method to be more precise on the number of bytes that were effectively read from the blob store during prewarming. But this obviously broke some tests (#95970) because a file cannot be partially recovered from disk and from remote. This change restores the previous behavior with one adjustment: the file is considered as reused if the prewarm method effectively prefetched 0 bytes. Closes #95970 Closes #95994
@rjernst rjernst added v8.9.0 and removed v8.9.1 labels Jun 28, 2023
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 30, 2023
tlrx added a commit that referenced this pull request Jul 4, 2023
…97278) Fix in #95987 wasn't complete, I did not restore all the previous behavior. Closes #95994
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jul 4, 2023
…lastic#97278) Fix in elastic#95987 wasn't complete, I did not restore all the previous behavior. Closes elastic#95994
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jul 4, 2023
…lastic#97278) Fix in elastic#95987 wasn't complete, I did not restore all the previous behavior. Closes elastic#95994
elasticsearchmachine pushed a commit that referenced this pull request Jul 4, 2023
…97278) (#97348) Fix in #95987 wasn't complete, I did not restore all the previous behavior. Closes #95994
elasticsearchmachine pushed a commit that referenced this pull request Jul 4, 2023
…97278) (#97347) Fix in #95987 wasn't complete, I did not restore all the previous behavior. Closes #95994
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.8.3 v8.9.0 v8.10.0

7 participants