Wait for all shards to be available in synonyms test #117281

bcully · 2024-11-21T20:09:22Z

Now that fast refresh searches go to search nodes, we need to wait for replicas to be available on them before we hit them. See #117217 for context, it's the same issue.

Now that fast refresh searches go to search nodes, we need to wait for replicas to be available on them before we hit them. See elastic#117217 for context, it's the same issue.

elasticsearchmachine · 2024-11-21T20:09:46Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

bcully · 2024-11-21T20:17:16Z

This is an upgrade of #114641

bcully · 2024-11-21T23:50:35Z

hitting timeouts only in v8.18.0 mixedClusterTest, but they look like they might be legitimate. Typical failure (2 active shards, 2 unassigned after 30 seconds):

[2024-11-22T09:04:15,230][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] Stash dump on test failure [{ "stash" : { "body" : { "cluster_name" : "v8.18.0", "status" : "yellow", "timed_out" : true, "number_of_nodes" : 4, "number_of_data_nodes" : 4, "active_primary_shards" : 1, "active_shards" : 2, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 2, "unassigned_primary_shards" : 0, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 50.0 } } }] [2024-11-22T09:04:15,236][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] Waiting for all cluster updates up to this moment to be processed [2024-11-22T09:04:15,814][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [health-node[c]]. [2024-11-22T09:04:15,814][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test] [p0=synonyms/40_synonyms_sets_get/List synonyms set] after test REPRODUCE WITH: ./gradlew ":qa:mixed-cluster:v8.18.0#mixedClusterTest" -Dtests.class="org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=synonyms/40_synonyms_sets_get/List synonyms set}" -Dtests.seed=5C3E518684DC6E0E -Dtests.bwc=true -Dtests.locale=sr-Cyrl-BA -Dtests.timezone=Asia/Vladivostok -Druntime.java=23

Not sure exactly what this test is doing yet, but maybe this is expected during an upgrade scenario.

kingherc · 2024-11-22T08:51:23Z

Hi @bcully ! Yes this one is probably more complex than the one you solved. We had some discussions with @carlosdelest who lastly did #114641 . The problem with the synonyms tests is that indeed they have unassigned shards but the tests still work in stateful because they go to the primary.

@carlosdelest , what @bcully recently figured out is that wait_for_no_initializing_shards: true may not always work out, because there's a chance the query goes out before the search shard appears as initializing and the request failing :(

I think by the way this is not a Distributed Indexing ticket, but should be for :Search Relevance/Analysis to resolve. It's not intrinsic to the fast refresh changes, meaning there are not actual bugs for the synonyms, but rather because of the test infrastructure (especially because it involves unassigned shards). There needs to be a good way found for waiting for at least one search shard to be fully recovered and at this point I'm running out of ideas :/

elasticsearchmachine · 2024-11-22T08:51:59Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

kingherc · 2024-11-22T09:03:41Z

I have forgotten the test details, but I wonder if maybe a combination of wait_for_active_shards: (e.g., 2) and wait_for_no_initializing_shards: true might work.

carlosdelest · 2024-11-22T09:04:28Z

@carlosdelest , what @bcully recently figured out is that wait_for_no_initializing_shards: true may not always work out, because there's a chance the query goes out before the search shard appears as initializing and the request failing :(

Thanks for the ping @kingherc - that is effectively what we're seeing in some test failures as you mention.

You're right that this is partly a test limitation - BwC updates nodes, but the shards cannot be assigned to them. As synonyms index has 0-all for auto-expand replicas, this results in the index not being green.

A way around that would be to update the synonyms system index to only change auto-expand replicas, but this is something the current system index infrastructure does not provide out of the box. I need to look into this to find a way forward with these test failures.

carlosdelest · 2024-11-22T09:21:38Z

This solution unfortunately won't work. There will always be unassigned shards on BwC due to synonyms system index having auto-expand replicas 0-all.

carlosdelest · 2024-11-22T10:35:44Z

I have forgotten the test details, but I wonder if maybe a combination of wait_for_active_shards: (e.g., 2) and wait_for_no_initializing_shards: true might work.

The problem is, we don't know how many shards there are as the number varies for YAML regular tests vs BwC vs serverless.

We could get that information from the mapping, but unf it is returned as a string and thus we can't use it as an input to the wait_for_active_shards option, which expects an integer.

We could of course add the test infra for checking equality between string and int, but in my mind the best way to solve this would be to change the auto expand replicas for Synonyms API, which is something we want to do anyway to align it with other system indices.

kingherc · 2024-11-22T10:44:23Z

change the auto expand replicas for Synonyms API, which is something we want to do anyway to align it with other system indices.

@carlosdelest that sounds like a good idea. And I guess then we'd know the number of shards and be able to use wait_for_active_shards?

Since it might take you a bit of time to get there, I remember I had a crude proposal to make the tests pass and that would be to do instead:

 - do: cluster.health: index: .synonyms-2 timeout: 1m wait_for_status: green ignore: 408

This means that in the BWC tests with unassigned shards, we'll make the tests wait 1m longer unfortunately, but it should give sufficient time to any search shard to recover so that subsequent synonyms search APIs work.
Just a crude proposal to avoid muting the tests.

carlosdelest · 2024-11-22T11:18:18Z

I guess then we'd know the number of shards and be able to use wait_for_active_shards?

We would just wait for green, as in BwC tests there would be 2 unassigned shards and 2 assigned ones, enough for an auto-expand of 0-1.

This means that in the BWC tests with unassigned shards, we'll make the tests wait 1m longer unfortunately, but it should give sufficient time to any search shard to recover so that subsequent synonyms search APIs work.

@kingherc I think you are right. I don't like the idea of making the tests run longer, but this should be an interim solution until we get the auto expand replicas fix sorted out.

We should apply the ignore timeout fix until we get this working.

carlosdelest · 2024-11-22T17:20:31Z

@kingherc , I would prefer not to add a long timeout that adds up to the suite execution time. We have 13 wait_for_no_initializing_shards on the tests - we need to ensure the timeout is something reasonable.

As this check is only needed for serverless, what would be a reasonable amount of time to wait for the search shard to be available? I would expect something less than 5 seconds?

kingherc · 2024-11-22T17:23:47Z

@carlosdelest you're right. I think a few seconds should be more than enough, assuming these indices are quite light in data.

carlosdelest · 2024-11-25T14:28:22Z

I've done the changes we discussed in #117486. Crossed fingers! 🤞

bcully · 2024-11-25T17:50:56Z

Superseded by #117486

Wait for all shards to be available in synonyms test

f2eda1f

Now that fast refresh searches go to search nodes, we need to wait for replicas to be available on them before we hit them. See elastic#117217 for context, it's the same issue.

bcully added >test Issues or PRs that are addressing/adding tests :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v9.0.0 labels Nov 21, 2024

bcully requested a review from kingherc November 21, 2024 20:09

bcully self-assigned this Nov 21, 2024

elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Nov 21, 2024

elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Nov 21, 2024

kingherc requested a review from carlosdelest November 22, 2024 08:51

kingherc added the :Search Relevance/Analysis How text is split into tokens label Nov 22, 2024

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Nov 22, 2024

carlosdelest mentioned this pull request Nov 25, 2024

Fix and unmute synonyms tests #117486

Merged

bcully closed this Nov 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Wait for all shards to be available in synonyms test #117281

Wait for all shards to be available in synonyms test #117281

Uh oh!

bcully commented Nov 21, 2024

elasticsearchmachine commented Nov 21, 2024

bcully commented Nov 21, 2024

bcully commented Nov 21, 2024

kingherc commented Nov 22, 2024

elasticsearchmachine commented Nov 22, 2024

kingherc commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

kingherc commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

kingherc commented Nov 22, 2024

carlosdelest commented Nov 25, 2024

bcully commented Nov 25, 2024

Labels

4 participants

Wait for all shards to be available in synonyms test #117281

Wait for all shards to be available in synonyms test #117281

Uh oh!

Conversation

bcully commented Nov 21, 2024

elasticsearchmachine commented Nov 21, 2024

bcully commented Nov 21, 2024

bcully commented Nov 21, 2024

kingherc commented Nov 22, 2024

elasticsearchmachine commented Nov 22, 2024

kingherc commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

kingherc commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

carlosdelest commented Nov 22, 2024

kingherc commented Nov 22, 2024

carlosdelest commented Nov 25, 2024

bcully commented Nov 25, 2024

Labels

4 participants