Hi again,
Sorry for my late response, but I've been busy with other stuff.
I'm writing firstly to make sure I want to do sane things and that they are doable with current clang's CFG infrastructure.
For example, one of the errors I would like to detect is the following (from a patch that fixed a crash in PHP):
- if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(4, &domain, &msgid1, &msgid2, &count, &category) == FAILURE) {
+ if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(5, &domain, &msgid1, &msgid2, &count, &category) == FAILURE) {
WRONG_PARAM_COUNT;
}
(ZEND_NUM_ARGS() is just an int variable).
My question is how can I track those values? I think that tracking them in a more general way (e.g. var > 5; var2 < var1+3) needs a full SAT solver. But simplifying things, is this doable with the PersistentMap, for example?
Hi Nuno,
Static analysis technology can be based on a whole variety of "solvers" that perform symbolic logic. Bit-blasting to propositional logic and solving with SAT is only one approach. Bit-blasting, while extremely powerful, is a very heavy hammer, and there are hundreds of tradeoffs when choosing the right solver and analysis method for a particular problem. Sometimes a "simpler" approach, such as those based on traditional dataflow analysis, can accomplish the task just as well, but in the process be more scalable. It really depends on your problem, but you certainly don't *need* a SAT solver for tracking and solving the constraints you described, nor would it always be the best approach.
For example, there are many techniques out there use specialized solvers within an inter-procedural, path-sensitive, dataflow framework. For example, I highly suggest checking out the work by Dawson Engler on his group's work on Metacompilation (and a related project at MSR was ESP):
Checking System Rules Using System-Specific, Programmer-Written Compiler Extension
Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem.
Appeared in: Proceedings of the 4th Symposium on Operating System Design and Implementation.
Manuvir Das, Sorin Lerner, and Mark Seigle
ESP: Path-Sensitive Program Verification in Polynomial Time
PLDI '02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, Berlin, Germany, June 2002.
http://research.microsoft.com/esp/pubs/esp-dataflow.pdf
If you want an example of a specialized solver built on top of the Metacompilation work, check out Archer, which found buffer overruns in systems code such as the Linux kernel:
ARCHER -- An Automated Tool for Detecting Buffer Access Errors
Yichen Xie, Andy Chou, and Dawson Engler
Proceedings of ESEC/FSE 2003, Helsinki, Finland
I have also seen some recent work on doing path-sensitive analysis using type systems, or using denotational semantics. None of these are based on SAT solvers. This isn't an argument against using SAT solvers; I'm just saying there are a lot of options, and it really depends on your problem domain. My main argument is that often taking a more "light-weight" approach can get you more traction both in the short term, but also in terms of the overall effectiveness in the tool (in terms of scalability, etc.).
In terms of things that could be used to help implement such techniques, we now have the ImmutableMap and ImmutableSet data types in the LLVM libraries (see later in my email), which supersede the PerstistentMap implementation. These could be used to track mappings from variable names to constraints (the ARCHER paper provides some excellent details on how such constraints can be tracked). There of course needs to be some other supporting infrastructure for doing path-sensitive analysis, which we are working on (see below), or you would have to develop yourself.
Also I would like to find memory overflow bugs, like:
char dest[10];
if (size <= 11)
memcpy(dest, input, size);
At the CallExpr, how do I know that 'size' is not sanitized correctly (i.e. size <= sizeof(dest))?
An excellent paper to read would be the ARCHER paper (mentioned above), which describes a specialized solver for finding these kinds of errors. It's a fairly scalable system that had a specialized constraint solver that kept track of linear constraints on buffers and extents. In such a system, flagging an error like this is trivial.
Also, from which example of the Analysis dir should I base my code?
The Analysis dir contains very nascent code. In the last couple of months I put in some code for a flow-sensitive dataflow analysis framework, and implemented "Live Variables" and "Uninitialized Values" on top of that framework. As you have seen, these analyses themselves have bugs, some of which have been introduced as the rest of clang has evolved. That said, I think it is a good baseline to build flow-sensitive analyses on top of, and in the process the API will be refined as users decide what they want (and don't want) from the framework.
I have also added both an ImmutableSet and ImmutableMap data structures to the LLVM core libraries, which implement a functional set and map similar to those in the Objective Caml libraries. There is still some tweaking to do on their interfaces, but these ADTs were implemented with the intention of being used for program analysis in clang. These can be used to efficiently implement dataflow "stores" that map from symbols to values, and their functional nature allows them to efficiently represent the dataflow values at different program points while exploiting the fact that most dataflow values across program points are very similar (and thus much of their representation can be shared).
Clang currently lacks, however, a system for doing path-sensitive analysis, as well as the infrastructure to support inter-procedural analysis. Work on this is currently underway. Thus there is currently no example code, however, in the Analysis directory (other than the clients of the flow-sensitive solver) on which to base your own code. At this point you would have to build the necessary infrastructure yourself, or depending on your timeline, work with the libraries that we provide as we gradually bring them online. We are currently building the logic into clang to serialize out its ASTs, and eventually build an infrastructure for reasoning about a collection of serialized ASTs (which could constitute an "image" of a code base, for example) to enable inter-procedural analysis (among other things).
Our initial goals are to provide a dataflow-based path-sensitive solver and supporting libraries, e.g., similar to those in ESP or Metacompilation, all of which are based on the standard Reps-Horwitz-Sagiv (RHS) inter-procedural, path-sensitive dataflow algorithm:
Precise interprocedural dataflow analysis via graph reachability
This approach, as illustrated by the ESP and Metacompilation systems, permits a wide range of checking tools, and it was also the basis of the model-checker SLAM that was developed at MSR (although SLAM operated on boolean programs and solved constraints using a theorem prover, it was still based on RHS).
Our current goals for clang are to provide the infrastructure for building such tools, and we plan on focusing on building such libraries starting in the next month or so. I should caution that developing these libraries will be a process, and will not happen overnight.
Note that we are not engineering clang to support only one kind of analysis framework. For example, a program analysis system such as those based on SAT (such as the Saturn project at Stanford) or any other technology are all reasonable things that could be built on clang.
My advice is that you that you approach the problem that you wish to tackle by first simplifying it. Don't go for the most general solution, and think about how you would go about solving a special subset of the problem. From that you can build simple analyses, which can gradually be refined on a need by need basis. I would do this regardless of whether or not you are using clang or some other framework. If your project will be over a long period of time, you may potentially be able to use some of the infrastructure we are bringing online (and perhaps, if you are interested, you can also be active in its development). Otherwise, I would stick with a project that is manageable and only requires that you develop the infrastructure you need to get some initial results. This will allow you to quickly experiment with new things, which will better help you understand your problem domain as well as the right program analysis approach to solving it.
Ted