I have a couple of questions regarding LogsDB, and more specifically optimized routing. It is generally recommended to use low-cardinality fields for LogsDB index sorting, but there is little guidance on what cardinality is considered best practice.
Since the default is host.name, one might assume the best sort field would be whatever identifies the data source for each data stream integration. For example, for system.syslog it would be host.name, but for network-related integrations like fortigate it would be observer.name.
However, for datasets with fewer sources, like fortigate, the storage savings are between 40 and 50 percent on our clusters, compared to about 20% for datasets with more sources, like system, which is installed on many hosts/agents.
Is it then recommended to change those defaults for the system integration as well, and if so, what would be an example field to use for sorting?
This is especially important if we want to implement optimized routing, which requires a combination of at least two low-cardinality fields.
First, what level of license do you have: Basic or Enterprise?
I ask because the majority of the disk-size savings typically comes from synthetic _source, which LogsDB uses but which requires an Enterprise license. Sorting has a much lower, though not nonexistent, impact on storage savings compared to synthetic _source.
That's why I'm a bit confused by your results. With synthetic _source, the savings are related to the size and complexity of the source documents. You can use the _disk_usage API to validate where the savings are coming from.
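For example (the data stream name here is just an illustration; substitute your own), the disk usage API breaks the storage footprint down per field, so you can see whether the savings come from _source or from elsewhere:

```
POST /logs-system.syslog-default/_disk_usage?run_expensive_tasks=true
```

Note that `run_expensive_tasks=true` is required, and the analysis itself costs some I/O, so run it off-peak on busy clusters.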
Regarding sort fields, I'm not entirely sure your assumptions are accurate. In short, I would not change the default sorting unless there's a clear reason to do so. I would consider what the most likely and consistently present filter will be on the search side.
If you have an Enterprise license, you can route on the sort fields, but this is an "expert only" feature, as misapplying it can lead to negative consequences like hot-spotting on ingest.
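For reference, and as my understanding of the setting rather than a recommendation: routing on the sort fields is enabled per index via an index setting, typically through a `@custom` component template. The template name and sort fields below are purely illustrative:

```
PUT _component_template/logs-fortinet.fortigate@custom
{
  "template": {
    "settings": {
      "index.sort.field": ["observer.name", "host.name", "@timestamp"],
      "index.sort.order": ["asc", "asc", "desc"],
      "index.logsdb.route_on_sort_fields": true
    }
  }
}
```

When this is enabled, documents are routed to shards based on the sort field values, which is exactly why skewed values (one dominant observer.name, for instance) translate directly into ingest hot-spotting.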
...
EDIT: Sorry, re-reading... What are you actually trying to accomplish with the routing? Do you already have synthetic source enabled? Are you just trying to squeeze out more disk space? If you don't get it right and test thoroughly, you can introduce negative consequences.
Thanks for the detailed response. We're on Enterprise and are using synthetic source, although only a couple of days have passed, so maybe more time needs to pass for the savings to become noticeable.
We are trying to squeeze out more disk space as a lot of our installations are being expanded, but we’re limited on storage.
Yeah, we don't want to cause shard hotspots, and we do realise we need to test everything, but there are no references or guidelines for doing so; hence the post here.
We'll definitely experiment ourselves and see if, and by how much, datasets change based on the routing fields with and without optimised routing.
For reference, our clusters have 2 nodes per data tier, with a replica. Not sure if that sort of info is relevant for optimised routing.
One thing we've noticed is that on datasets that store the event.original field the savings are about 20%, while on those that do not the savings are between 60 and 80 percent. Does this make sense? If we're on to something, the docs should be updated to reflect this.
Not sure which docs you are referring to, but if you keep event.original, LogsDB or not, it's going to be a much larger storage footprint. It's basically a copy of the whole original event.
Typically, people leave it in for two reasons: initially while they're debugging, or for compliance reasons that require keeping the original event.
If you don't need it, most integrations have a simple flag to drop it. If not, you can simply drop it in an @custom ingest pipeline at the end.
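As a sketch, assuming a Fleet-managed data stream whose custom pipeline follows the `logs-<dataset>@custom` naming convention (adjust the name to your dataset), dropping event.original could look like:

```
PUT _ingest/pipeline/logs-system.syslog@custom
{
  "processors": [
    {
      "remove": {
        "field": "event.original",
        "ignore_missing": true
      }
    }
  ]
}
```

Obviously, only do this if you have no debugging or compliance need for the original event; once dropped at ingest, it cannot be recovered.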
We went a bit off topic; the thread was originally meant to get more info on advanced LogsDB routing, as the docs are scarce on the required information. Is there any additional information we can get about how it works, what the optimal cardinality is, what would create shard hotspots, etc.?
I do not have exact guidance on cardinality... in the end, you want whatever routing key you use to be pretty evenly distributed. You have 2 nodes, which usually means a low number of shards... which means if you mess up the routing you may end up with unbalanced shards.
I would ask what the most common search partition is; that might be a good approach.
host.name? service.name?
Cardinality: not 1, not 1M.
Also note:
If you apply custom sort settings, the @timestamp field is injected into the mappings but is not automatically added to the list of sort fields. For best results, include it manually as the last sort field, with desc ordering.
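So a custom sort spec should end up looking something like this (the leading field is just an example; pick whatever matches your search patterns):

```
"index.sort.field": ["service.name", "@timestamp"],
"index.sort.order": ["asc", "desc"]
```

Descending @timestamp as the last sort field keeps the most recent documents first within each sort group, which is what most log searches want.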