Skip to content

Conversation

@tvernum
Copy link
Contributor

@tvernum tvernum commented Mar 31, 2023

When parsing role descriptors, we ensure that the FieldPermissions ("field_security":{ "grant":[ ... ], "except":[ ... ] }) are valid - that is that any patterns compile correctly, and the "except" is a subset of the "grant".

However, the previous implementation would not use the FieldPermissionsCache for this, so it would compile (union, intersect & minimize) automatons every time a role was parsed.

This was particularly an issue when parsing roles (from the security index) in the GET /_security/role/ endpoint. If there were a large number of roles with field level security the automaton parsing could have significant impact on the performance of this API.

When parsing role descriptors, we ensure that the FieldPermissions (`"field_security":{ "grant":[ ... ], "except":[ ... ] }`) are valid - that is that any patterns compile correctly, and the "except" is a subset of the "grant". However, the previous implementation would not use the FieldPermissionsCache for this, so it would compile (union, intersect & minimize) automatons every time a role was parsed. This was particularly an issue when parsing roles (from the security index) in the GET /_security/role/ endpoint. If there were a large number of roles with field level security the automaton parsing could have significant impact on the performance of this API.
@tvernum tvernum requested a review from ywangd March 31, 2023 07:32
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Mar 31, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine
Copy link
Collaborator

Hi @tvernum, I've created a changelog YAML for you.

Comment on lines 71 to 72
if (cache == null) {
return new FieldPermissions(definition);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't allow this cache to be disabled. So this should never be null. We should either assert that or throw error. Also this constructor is currently only used in tests because that's where cache may not exist. With this new method, it is possible to replace the constructor usages with init and this method and drop this constructor entirely and I think that would be a good thing.

final AuthorizationService authzService = new AuthorizationService(
settings,
allRolesStore,
fieldPermissionsCache,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make this change? The cache used by RBACEngine is techinically cross different index groups and the combination depends on the requested indices patterns, e.g. if we have

"indices": [ { "name": ["logs-*"], "privileges": ["read"], "field_security": {"grant": ["xxx"] } }, { "name": ["*-staging"], "privileges": ["read"], "field_security": {"grant": ["yyy"] } }, ] 

If the requested index pattern is logs-nginx-staging, both groups will match and the cache key will be

new FieldPermissionsDefinition( Set.of( new FieldGrantExcludeGroup(["xxx"], null), new FieldGrantExcludeGroup(["yyy"], null) ) ) 

While the cache used in CompositeRolesStore is most likely keyed by a single index group. With the above example, the two groups will be cached separately. If we combine the two caches, it is possible that more entries would be competing for cache usages.

Along the same line of reasoning, the cache used by RoleDescriptor is purely for a single index definition. Strictly speaking, it can also be different from the cache usage by CompositeRolesStore which can combine index groups if they have exactly the same names definition.

In summary, the cache is keyed slightly different with the 3 use cases:

  1. RoleDescriptor - keyed strictly by a single index group's FLS definition
  2. CompositeRolesStore - keyed by FLS combination of multiple index groups that share the same names definition
  3. RBACEngine - keyed by FLS combination of mulitple index groups that have names matching the requested index.

That said, I do think having a single cache for them is reasonable because we don't really want to differentiate clearly between the different type of cache keys. My only concern is whether combining them could add extra noticeable pressure on this single cache.

Comment on lines 48 to 50
final FieldPermissionsCache cache = new FieldPermissionsCache(settings);
INSTANCE.set(cache);
return cache;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit odd to have both a staticly shared instance and a public constructor.

@tvernum
Copy link
Contributor Author

tvernum commented Apr 3, 2023

@ywangd I'll respond to your comments collectively because I think they're centred around 2 issues:

  1. How much refactoring do we want to do in this PR?
  2. To what extent should we share caches?

On point 1 we have the issues of:

  • FieldPermissionsCache.resolve works correctly even if the static cache has not been configured.
  • The FieldPermissionsCache constructor is public.

In both of those cases the reason is that a large number of tests assume the old behaviour of either (1) constructing a cache inside the test or (2) being able to parse a role without configuring a cache.
I can fix them - it's a fairly mechanical issue, but it would change this PR from being 9 modified files (as it is right now) to potentially hundreds. Do we want to do that in this PR? My feeling is that we'd prefer to backport a smaller PR to 7.17, and follow up the PR with a separate cleanup.
However, if you're happy to review a very large change then I can make it here.

To be honest, I hate the static cache, but RoleDescriptor.parse is static, and passing a cache into that method would also be a massive change.
I toyed with the idea of having a RoleDescriptorParser that had the cache as an instance field. That would have been much nicer, but would have been an enormous PR that I don't think we'd be comfortable backporting to 7.17

Perhaps we need to chat about which big change we'd prefer (tighten up the static cache, or remove the need for a static cache) before we go much further.

On the question of sharing caches, my thoughts are:

  • The FieldPermissionsCache is really just an automaton cache like the one in Automatons, but it knows some special behaviours around meta fields, and grant/except. It's possible to merge it with the Automatons cache (in fact the earlier patch you saw tried to do that). That might also be a reasonable direction to take (do we really need both caches?) though it is complicated by the "subset" checking in FieldPermissionsCache (see more discussion below)
  • Because FieldPermissionsCache is just automaton caches, I think having a single instance is better, but I accept there are tradeoffs. I'd reasonably expect for some clusters the current 2 cache model means the same automatons are being constructed multiple times. If I've followed the code correctly, in cases where users have 1 role that has a single index group, the 2 caches would end up with the same entries (unnecessarily).
  • We have a setting for FieldPermissionsCache (authz.store.roles.field_permissions.cache.max_size_in_bytes). It's is strange therefore that we have more than 1 instance of the cache, each doing its own accounting. In current releases it's technically possible for the field permissions cache(s) to use twice as much RAM as configured, because there are 2 of them. I definitely don't want to have three of them,
  • The FieldPermissionsCache doesn't ever expire entries (unless full) which is probably a mistake, but it also seems like an argument for having only 1 instance. If we're worried about introducing pressure on the cache, that implies its using a lot of RAM, and the 2 (now 3) different usages would compete. While it's possible that could impact an existing cluster negatively, it still seems like the right behaviour. There should be 1 block of RAM allocated to this, and it should evict based on a global LRU rather than an LRU per caller (RBACEngine vs CompositeRolesStore)

Aside: Subset Checking: I find the existing subset checking to be a bit weird. The design is that you can have

{ "grant": "*", "except": "secret.*" } 

but not

{ "grant": "public.*", "except": "private.*" } 

However, because it's a subset check rather than an intersection check, you also can't say

{ "grant": "shared.*", "except": "*.secret" } 

It is what it is, and I don't want to argue to change it, but it means that I don't think we'd want to push this exact check into Automatons - it's a specific behaviour for FieldPermissions, not a general "allow/deny" check.

@ywangd
Copy link
Member

ywangd commented Apr 3, 2023

For the cache sharing, I think you have some good points on having a single cache. It is more intuitive (a single setting configures a single cache). The existing 2-cache structure also has memory duplication issue. So the change here actually fixes it with a different tradeoff. I think it is reasonable to assume that the memory duplication issue (single index section matching the requested index) is more likely to happen than the issue with cache sharing. Based on these, I am on board with the approach of a single shared cache.

As for how much changes we want for this PR, I do think we want to keep it as small as possible for the sake of 7.17 backport. But I also find it is rather non-ideal that it is possible to call new FieldPermissions(...) multiple times and get separate underlying caches. What do you think about the following changes:

private static final AtomicReference<Cache<FieldPermissionsDefinition, FieldPermissions>> CACHE_REF = new AtomicReference<>(); public FieldPermissionsCache(Settings settings) { CACHE_REF.updateAndGet(c -> { if (c == null) { return CacheBuilder.<FieldPermissionsDefinition, FieldPermissions>builder() .setMaximumWeight(CACHE_SIZE_SETTING.get(settings)) .weigher((key, fieldPermissions) -> fieldPermissions.ramBytesUsed()) .build(); } else { return c; } }); } public FieldPermissions getFieldPermissions(FieldPermissionsDefinition fieldPermissionsDefinition) { return CACHE_REF.get().computeIfAbsent(...); }

It helps getting rid of the static init method and the INSTANCE variable. It only generates a single underlying cache on mulitple constructor calls. It should also help fixing the current test failures with SetOnce. It is fairly small change as well. What do you think?

@tvernum tvernum requested a review from ywangd April 6, 2023 04:28
@tvernum
Copy link
Contributor Author

tvernum commented Apr 6, 2023

@ywangd I've updated this based on our conversation. I think it ended up being a much simpler change, though we acknowledge there's some debt to refactor later.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I like how small the changeset is!

}

public void testParseIndicesPrivilegesFailsWhenExceptFieldsAreNotSubsetOfGrantedFields() {
resetFieldPermssionsCache();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question: Is this reset necessary or nice to have just in case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's no longer necessary. It was needed in an earlier version where sometimes tests ran without a cache, because the behaviour with/without the cache was subtly different.
Now it's just there because it makes sense to have a known clean base for the test (and not have it be at risk of flakiness due to what's existing in the cache)

@elasticsearchmachine
Copy link
Collaborator

elasticsearchmachine commented Apr 6, 2023

Backport

Status Branch Result
7.17
8.7
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Apr 6, 2023
When parsing role descriptors, we ensure that the FieldPermissions (`"field_security":{ "grant":[ ... ], "except":[ ... ] }`) are valid - that is that any patterns compile correctly, and the "except" is a subset of the "grant". However, the previous implementation would not use the FieldPermissionsCache for this, so it would compile (union, intersect & minimize) automatons every time a role was parsed. This was particularly an issue when parsing roles (from the security index) in the GET /_security/role/ endpoint. If there were a large number of roles with field level security the automaton parsing could have significant impact on the performance of this API.
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Apr 6, 2023
When parsing role descriptors, we ensure that the FieldPermissions (`"field_security":{ "grant":[ ... ], "except":[ ... ] }`) are valid - that is that any patterns compile correctly, and the "except" is a subset of the "grant". However, the previous implementation would not use the FieldPermissionsCache for this, so it would compile (union, intersect & minimize) automatons every time a role was parsed. This was particularly an issue when parsing roles (from the security index) in the GET /_security/role/ endpoint. If there were a large number of roles with field level security the automaton parsing could have significant impact on the performance of this API. Backport of: elastic#94931
elasticsearchmachine pushed a commit that referenced this pull request Apr 6, 2023
* Reuse FieldPermissionsCache in Role parsing (#94931) When parsing role descriptors, we ensure that the FieldPermissions (`"field_security":{ "grant":[ ... ], "except":[ ... ] }`) are valid - that is that any patterns compile correctly, and the "except" is a subset of the "grant". However, the previous implementation would not use the FieldPermissionsCache for this, so it would compile (union, intersect & minimize) automatons every time a role was parsed. This was particularly an issue when parsing roles (from the security index) in the GET /_security/role/ endpoint. If there were a large number of roles with field level security the automaton parsing could have significant impact on the performance of this API. * Fix spotless and imports
elasticsearchmachine pushed a commit that referenced this pull request Apr 6, 2023
When parsing role descriptors, we ensure that the FieldPermissions (`"field_security":{ "grant":[ ... ], "except":[ ... ] }`) are valid - that is that any patterns compile correctly, and the "except" is a subset of the "grant". However, the previous implementation would not use the FieldPermissionsCache for this, so it would compile (union, intersect & minimize) automatons every time a role was parsed. This was particularly an issue when parsing roles (from the security index) in the GET /_security/role/ endpoint. If there were a large number of roles with field level security the automaton parsing could have significant impact on the performance of this API. Backport of: #94931
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC Team:Security Meta label for security team v7.17.10 v8.7.1 v8.8.0

4 participants