- Notifications
You must be signed in to change notification settings - Fork 25.6k
EQL: Add circuit breaker to sampling #87545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EQL: Add circuit breaker to sampling #87545
Conversation
astefan left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if another code line is a better place for calculating the memory usage. Added a comment related to that.
| if (match != null) { | ||
| finalSamples.add(match); | ||
| samples.add(new Sample(sampleKeys.get(responseIndex / maxCriteria), match)); | ||
| addSample(new Sample(sampleKeys.get(responseIndex / maxCriteria), match)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be the best place to compute the memory usage... At this point in code, one Sample is found and added to the final results list. I'm worried that the algorithm will run multiple cycles, not finding any samples, but still use memory to go over pages and pages of results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only place where a Page is added to the stack. The page size depends on how many join keys are returned and its max size is configurable (defaults to 1000). I'm wondering if this shouldn't be a better place for calculating the memory usage. We add a new page elements counter and every 1000 (?) new elements in pages, we calculate the memory usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes sense, I'm adding a memory check there as well.
As a side effect, the page size becomes a constraint for circuit breaker precision (ie. the bigger the pages, the less precise the circuit breaker becomes), limiting our ability of fine tuning.
To mitigate this problem, probably it's worth having the memory check both on stack push and on sample add (the samples collection is monotonically growing and potentially unlimited, so it's a hotspot for memory usage)
costin left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that computing memory for every sample is expensive however I'm not sure why there are two sampling sizes.
I would pick only one strategy and see how well it works, for example pick one of:
a. check the memory at the end of every listener call; essentially after the results have been parsed
b. check the memory every X samples added to the stack
| protected void pushToStack(Page nextPage) { | ||
| stack.push(nextPage); | ||
| totalItemsAddedToStack += nextPage.size; | ||
| if ((totalItemsAddedToStack - lastStackSizeWhenCalculatedMemory) / CB_STACK_SIZE_PRECISION > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do division just substraction since this checks if a threshold was met, it doesn't matter by how much:
if (totalItemsAddedToStack - lastStackSizeWhenCalculatedMemory >= CB_STACK_SIZE_PRECISION)...
| private long stackRamBytesUsed = 0; | ||
| private long totalRamBytesUsed = 0; | ||
| private long totalItemsAddedToStack = 0; | ||
| private long lastStackSizeWhenCalculatedMemory = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite the long name I find the variable name non-descriptive. How about: samplesCount and lastSampleCount?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really samples count (it's intermediate data, not samples), but I see your point.
Probably we can go with totalPageSize and lastTotalPageSize, WDYT?
| Thanks @costin, I think your comment (better having only one strategy to about memory calculation) together with what @astefan suggested (use the stack to decide when to execute the memory check) leads us to a simplification: I'm removing the check on the samples for now, and leaving only the one on the stack insertions. |
- simplify circuit breaker logic - variable naming
bpintea left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| | ||
| int initialSize = samples.size(); | ||
| client.multiQuery(searches, ActionListener.wrap(r -> { | ||
| List<List<SearchHit>> finalSamples = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
| private long samplesRamBytesUsed = 0; | ||
| private long stackRamBytesUsed = 0; | ||
| private long totalRamBytesUsed = 0; | ||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this might be a practice by now, with ES's heap pages being locked in RAM, but just xxxMemSize might have been a bit less assuming of where that memory is.
| | ||
| private void testMemoryCleared(boolean fail) { | ||
| List<BreakerSettings> eqlBreakerSettings = Collections.singletonList( | ||
| new BreakerSettings( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it'd be worth making EqlPlugin#getCircuitBreaker() static and feed it Settings.EMPTY to get this value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, having a helper method for that in the test suite would save some code.
Unfortunately we cannot make EqlPlugin#getCircuitBreaker() static though, since it's part of CircuitBreakerPlugin interface contract...
| ESMockClient esClient = new ESMockClient(service.getBreaker(CIRCUIT_BREAKER_NAME)); | ||
| ) { | ||
| CircuitBreaker eqlCircuitBreaker = service.getBreaker(CIRCUIT_BREAKER_NAME); | ||
| EqlConfiguration eqlConfiguration = new EqlConfiguration( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here, wondering if it'd be worth adding a EqlTestUtils method to configure a configuration, since most of this is contained in the TEST_CFG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what EqlTestUtils#randomConfiguration() already does, actually. Double-checking to see if there are significant differences, but I guess we can just use that.
| Sample sample = new Sample(new SequenceKey(randomAlphaOfLength(10)), searchHits); | ||
| return sample; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Local variable 'sample' is redundant"
| | ||
| @Override | ||
| public void readBytes(byte[] b, int offset, int len) throws IOException { | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could these empty methods be collapsed on one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately our style check rules complain about that
astefan left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Left two minor comments.
| /** | ||
| * total number of hits (ie. sum of page sizes) added to the stack when last memory check was executed | ||
| */ | ||
| private long lastTotalPageSize = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previousTotalPageSize is a better name?
| | ||
| CIRCUIT_BREAKER.startBreaking(); | ||
| iterator.pushToStack(new SampleIterator.Page(CB_STACK_SIZE_PRECISION / 2)); | ||
| iterator.pushToStack(new SampleIterator.Page(CB_STACK_SIZE_PRECISION - (CB_STACK_SIZE_PRECISION / 2) - 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CB_STACK_SIZE_PRECISION - (CB_STACK_SIZE_PRECISION / 2) meaning CB_STACK_SIZE_PRECISION / 2. Why showing it like that? I think it's relevant enough that the first push adds half of the limit, the second push adds another half minus one, and the last step it's triggering the circuit breaker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's due to rounding (should the constant be changed to an odd number). But yes, I also thought it could be simplified (smth like a push with CB_STACK_SIZE_PRECISION, then +1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really needed indeed, I'm simplifying it
costin left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
No description provided.