Skip to content

Conversation

@asgerf
Copy link
Contributor

@asgerf asgerf commented Oct 28, 2022

Adds some classes to do basic stuff with DataFlow nodes without having to map down to CFG and AST nodes, and a few other tidbits I found myself missing.

To ensure some degree of battle-hardening I (mostly) ported three library models to use the new nodes where it makes sense. The affected models are Rails.qll, Railties.qll, and ActionController.qll.

Migrating from AST nodes to DataFlow nodes revealed a couple of pre-existing issues, some of which have now been fixed or fixed enough to avoid a regression:

Some other issues which were too hard to fix prior to opening this PR:

  • We're missing flows into captured variables in some cases, in particular when returning a lambda. I attempted a fix in this draft PR which would align closer with how we do things in JS, but after a whole day of debugging strange call graph changes, I've put that aside for now. Instead, I've resorted to a rather unambitious workaround in ab91b57.
  • Blocks are assumed to capture self, but of course that's not always the case. There is currently no way to tell the analysis what the value of self is, so we have to work around it for now. This means using things like getParent+() and treating the module self as if it's an instance of the module.

getConst() and what about API graphs?

One of the changes that may look odd is the introduction of DataFlow::getConst() and ConstRef. It provides convenient access to constants that may resolve to a given qualified name. For example, the incantation

DataFlow::getConst("ActionController").getConst("Base").getADescendentModule()

gives us all descendents of ActionController::Base.

This was probably the biggest use case for API graphs in Ruby, as you can get to (almost) the same thing with API::getTopLevelMember("ActionController").getMember("Base"). So why add getConst()?

  • getConst() doesn't rely on the call graph, which means it can be used for things like getACallSimple() and other pre-call graph customizations (we have quite a few of those in JS and I expect we'll find the need in Ruby as well).
  • I intend to share more of the API graphs implementation between languages (stay tuned!), but the rules for constant lookup in Ruby are pretty language-specific, so it would be nice to have some of this factored out anyway.
  • getConst() handles the rules of constant lookup more precisely than API graphs, in particular when related to include, nested modules, and modules with more than one declaration. I think API graphs could be extended to make this work, but it's not really a natural fit for its use/def tracking, it would need to be a bunch of new predicates specific to constant lookup, which might as well exist earlier in the pipeline.

I replaced some uses of API graphs with getConst, not because it matters in practice, I just had to ensure getConst() is being used enough to test it in the wild.

I hope the name getConst() makes more sense in the context of Ruby. I think the name in API graphs, getMember, was used mostly because that's what JS called it.

Evaluation

Evaluation shows

  • 100 new taint sinks
  • 491 new taint sources, resulting in 1890 new tainted expressions
  • 3 new call edges due to the change in flowsTo
  • There seems to be a slight performance hit, which I might investigate a bit further, but I wanted to get the PR up for now.

The new sources and sinks are mainly due to resolving methods upwards in the hierarchy in ActionController. The combination getADescendentModule().getAnInstanceMethod() takes "sideways" method resolution into account without having to really think about it, so I like that.

@github-actions github-actions bot added the Ruby label Oct 28, 2022
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@asgerf asgerf force-pushed the rb/data-flow-layer-capture2 branch from 99c8eba to be1d931 Compare October 28, 2022 12:15
@asgerf asgerf force-pushed the rb/data-flow-layer-capture2 branch 2 times, most recently from 04a5c91 to c2e33a2 Compare October 28, 2022 17:50
* end
* ```
*/
ModuleNode getADescendentModule() { MkAncestorLookup(result) = this.getATargetScope() }
Copy link
Contributor

@hmac hmac Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand why we've mixed them together, but I'm slightly concerned about the conflation of classes and modules here. You can't subclass a module and you can't include, extend or prepend a class. Users may find it confusing that they need to use getADescendentModule() to find subclasses.

Additionally, if our module resolution gets confused at any point and conflates a class A with some unrelated module M::A in a different namespace M, we may find that we can't find that fetching subclasses of A also yields modules that include M::A.

Is there an easy way to split the class and module logic into separate predicates, i.e. something like this?

ModuleNode getADescendentModule; ClassNode getASubclass;

Overall though, I don't think this is a huge deal if it has to stay as-is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, it might be enough to add

ClassNode getASubclass() { result = this.getADescendentModule() }

but I'm also happy to leave this as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have Module#getASubClass() from before this PR, which returns an immediate subclass (not transitive). For that reason I'm trying to avoid adding things called getSubClass/getSubclass that don't behave the same way.

@hmac
Copy link
Contributor

hmac commented Nov 7, 2022

On the naming of getConst, is there any reason not to use the full word getConstant? It seems a bit arbitrary to shorten that term when we don't do so for anything else.

@asgerf
Copy link
Contributor Author

asgerf commented Nov 7, 2022

On the naming of getConst, is there any reason not to use the full word getConstant?

I preferred a short name given how ubiquitous the corresponding DataFlow::moduleImport and DataFlow::globalVarRef are in JavaScript. I'm not sure it's a good enough reason though, since it's not that much shorter. I'd be happy to hear if anyone else has thoughts on this as well.

@asgerf
Copy link
Contributor Author

asgerf commented Nov 7, 2022

New evaluation shows we gain about 1,600 new call edges due to handling of proc, and is otherwise unchanged.

hvitved
hvitved previously approved these changes Nov 8, 2022
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see what DCA says.

@asgerf
Copy link
Contributor Author

asgerf commented Nov 8, 2022

Latest evaluation still looks good

predicate forceCachingBackref() { exists(any(ConstRef const).getConstant(_)) }
}

private import Cached

Check warning

Code scanning / CodeQL

Dead code

This code is never used, and it's not publicly exported.
@asgerf asgerf merged commit 859dc7b into github:main Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-change-note-required This PR does not need a change note Ruby

5 participants