Wait for all shards to be available in synonyms test #117281
Now that fast refresh searches go to search nodes, we need to wait for replicas to be available on them before we hit them. See elastic#117217 for context, it's the same issue.
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
This is an upgrade of #114641
Hitting timeouts only in the v8.18.0 mixedClusterTest, but they look like they might be legitimate. Typical failure (2 active shards, 2 unassigned after 30 seconds):
Not sure exactly what this test is doing yet, but maybe this is expected during an upgrade scenario.
Hi @bcully! Yes, this one is probably more complex than the one you solved. We had some discussions with @carlosdelest, who most recently worked on #114641. The problem with the synonyms tests is that they do have unassigned shards, but the tests still work in stateful because they go to the primary. @carlosdelest, what @bcully recently figured out is that I think by the way this is not a Distributed Indexing ticket, but should be for :Search Relevance/Analysis to resolve. The failures are not intrinsic to the fast refresh changes, meaning there are no actual bugs in synonyms; they come from the test infrastructure (especially because it involves unassigned shards). We need a good way to wait for at least one search shard to be fully recovered, and at this point I'm running out of ideas :/
Pinging @elastic/es-search-relevance (Team:Search Relevance)
I have forgotten the test details, but I wonder if maybe a combination of
Thanks for the ping @kingherc - that is effectively what we're seeing in some test failures, as you mention. You're right that this is partly a test limitation - BwC updates nodes, but the shards cannot be assigned to them. As the synonyms index has 0-all for auto-expand replicas, this results in the index not being green. A way around that would be to update the synonyms system index to only change auto-expand replicas, but this is something the current system index infrastructure does not provide out of the box. I need to look into this to find a way forward with these test failures.
This solution unfortunately won't work. There will always be unassigned shards on BwC due to the synonyms system index having auto-expand replicas 0-all.
The problem is, we don't know how many shards there are, as the number varies between regular YAML tests, BwC, and serverless. We could get that information from the mapping, but unfortunately it is returned as a string and thus we can't use it as an input to the We could of course add the test infra for checking equality between string and int, but in my mind the best way to solve this would be to change the auto-expand replicas for the Synonyms API, which is something we want to do anyway to align it with other system indices.
@carlosdelest that sounds like a good idea. And I guess then we'd know the number of shards and be able to use wait_for_active_shards? Since it might take you a bit of time to get there, I remember I had a crude proposal to make the tests pass and that would be to do instead:
This means that in the BwC tests with unassigned shards, we'll unfortunately make the tests wait 1m longer, but it should give any search shard sufficient time to recover so that subsequent synonyms search APIs work.
We would just wait for green, as in BwC tests there would be 2 unassigned shards and 2 assigned ones, enough for an auto-expand of 0-1.
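To make the shard arithmetic in this thread concrete, here is a minimal Python sketch of how the auto-expand range could translate into index health. It is a simplified model, not the Elasticsearch implementation: the function names are hypothetical, a single primary shard is assumed, and the 4-node cluster with 2 assignable nodes is an assumption chosen to match the "2 active, 2 unassigned" failure reported above.

```python
def expanded_replicas(auto_expand: str, data_nodes: int) -> int:
    """Replica count for a range such as "0-1" or "0-all" (simplified model)."""
    low, high = auto_expand.split("-")
    # "all" means one replica on every data node other than the primary's node.
    upper = data_nodes - 1 if high == "all" else int(high)
    # Clamp the candidate count (data_nodes - 1) into the [low, upper] range.
    return max(int(low), min(upper, data_nodes - 1))


def index_health(auto_expand: str, data_nodes: int, assignable_nodes: int) -> str:
    """Health of a single-primary index when only some nodes can hold copies."""
    wanted = 1 + expanded_replicas(auto_expand, data_nodes)  # primary + replicas
    assigned = min(wanted, assignable_nodes)
    if assigned == 0:
        return "red"
    return "green" if assigned == wanted else "yellow"


# Assumed mixed cluster: 4 data nodes, only 2 of which can receive copies.
print(index_health("0-all", data_nodes=4, assignable_nodes=2))
print(index_health("0-1", data_nodes=4, assignable_nodes=2))
```

Under these assumptions, 0-all asks for 4 copies and only 2 can be placed, so the index never reaches green, while 0-1 asks for 2 copies and both fit.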
@kingherc I think you are right. I don't like the idea of making the tests run longer, but this should be an interim solution until we get the auto-expand replicas fix sorted out. We should apply the ignore-timeout fix until we get this working.
@kingherc, I would prefer not to add a long timeout that adds to the suite execution time. We have 13 As this check is only needed for serverless, what would be a reasonable amount of time to wait for the search shard to be available? I would expect something less than 5 seconds?
@carlosdelest you're right. I think a few seconds should be more than enough, assuming these indices are quite light in data.
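The "wait a few seconds, then carry on regardless" idea discussed here can be sketched as a bounded poll. This is an illustrative Python sketch only: `wait_for_search_shard` and its parameters are hypothetical names, the 5-second default mirrors the upper bound suggested above, and the actual change in #117486 may look quite different.

```python
import time
from typing import Callable


def wait_for_search_shard(
    is_ready: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the shard became ready and False on timeout. The
    caller decides whether a timeout is fatal; for an interim test fix
    it would simply be ignored and the test would proceed.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting. A light index should return on one of the first polls, so the timeout only costs time in the BwC case where shards stay unassigned.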
I've made the changes we discussed in #117486. Fingers crossed! 🤞
Superseded by #117486