CS243: Program Analysis and Optimization Winter 2023
The JoeQ Compiler System
The joeq compiler system is a compiler analysis framework, compiler, and full virtual machine written in the Java programming language. It is open source, distributed under the LGPL - for full information, check out the joeq homepage.
A description of the JoeQ system appears in the article:
J. Whaley, Joeq: A Virtual Machine and Compiler Infrastructure. In ACM SIGPLAN 2003 Workshop on Interpreters, Virtual Machines and Emulators, June 2003.
The joeq system is extremely large. We will restrict ourselves to a small, standalone subset of the complete system.
We use a slightly modified joeq that adds Java generics. Instead of using joeq.Util.Templates.ListIterator.BasicBlock or joeq.Util.Templates.ListIterator.Operand, you should use java.util.ListIterator<BasicBlock> or java.util.ListIterator<Operand>. Since only the portion of joeq that is used for our project has been modified accordingly, there can be minor descrepancies between the interface of joeq we provide and the JavaDoc in the sourceforge website (e.g., QuadIterator.nextQuad() is obsolete. You should use QuadIterator.next()).
Contents
- Installation
- The resources you need to run JoeQ.
- Java Basics
- If you've never actually used the Java VM before, this will get you started, as well as pointing you to official resources.
- Program Representation
- A quick overview of how the Java VM stores code as bytecodes, and how bytecode is represented as Quads in joeq.
- Code Organization
- How joeq stores and operates on programs, from class files down to individual instructions.
- Details of the Quad representation
- The raw material your analyses will work upon.
- Details of the Visitor classes
- The interfaces you must implement, or the classes you must extend, to do useful work.
- The Analysis Interface
- The Main.Helper class and the primary operations it supports.
- Writing and executing a compiler pass
- The basics of creating a pass and running it over arbitrary code.
- Examples
- Determining all possible thrown exceptions.
Manipulating variables in quads.
Manipulating control flow graphs.
Installation Setting up the JoeQ framework
Java Basics Quick overview
The Java programming language is a general-purpose object-oriented concurrent language. Its syntax is similar to C and C++, but it omits many of the features that make C and C++ complex, confusing, and unsafe.
Java programs run on a standardized platform called the Java virtual machine. The Java virtual machine is an abstract computing machine. Like a real computing machine, it has an instruction set and manipulates various memory areas at run time.
In this class, we will be using the Java 2 Platform, Standard Edition v1.5.0 (commonly called Java 5). This is installed on the Myth machines. See the homework handout for details on installing Java 5 on your own machine.
This page assumes that you've got some basic familiarity with the Java language, but haven't had to use it on Myth. If you've never dealt with Java before, read The Java Tutorial and familiarize yourself with the language.
Using the Java Development Kit
There are three main programs that you will use in the JDK:
javac- convert Java source to Java bytecodejava- run Java bytecodejavap- disassemble Java bytecode
All of these run from the command line. We will delay discussion of javap until our discussion of the Java bytecode system, and concentrate more on the more frequently used javac and java.
Create the following file on a Myth machine, and name it Hello.java:
public class Hello { public static void main(String[] args) { System.out.println("Hello, world!"); } } Now execute the following commands from that directory. You should get the same results as listed here.
myth:~/examples> javac Hello.java myth:~/examples> java Hello Hello, world! myth:~/examples>
The javac command reads Hello.java and produces Hello.class, which is then executed by the java command. Note that you do not specify the .class extension when running code.
If your shell doesn't find these programs, run source hw2.env (or equivalent), where hw2.env is distributed as part of the homework starter code, first.
The CLASSPATH
One of Java's more unusual features is that there is no static linking step. The name of any class provides sufficient information to allow the java command to locate the relevant class file. Package names correspond to directories, and the name of the .class file must correspond to the name of the class given. The VM, however, must be informed where to begin searching. This is the purpose of the CLASSPATH variable. This is a list of directories, much like the PATH variable under Unix or Win32. Each directory listed is a possible "root" for the class search. Archives of Java code (which usually have the extension .jar) may also be listed as if they were directories.
With the 1.5 JDK, CLASSPATH must include the current directory, named by a single period. (If CLASSPATH is completely undefined, that's OK too.)
Program Representation An Intermediate Representation of code as bytecodes and quads
A Java source program is first converted to a class file containing bytecode, which is then translated into an IR representation in the compiler called Quads. This page familiarizes you with the program representations by way of examples.
Java Bytecode
The Java bytecode is a rather high-level representation of a Java program. While some information, like local variable names, is dropped, high-level information such as class layouts and object hierarchies is retained. Java bytecode is stack-oriented: operands are pushed on the operand stack and arithmetic operations are applied to the top variables on the stack. The stack architecture was chosen because their programs are compact.
One can examine the bytecode of a class by invoking the bytecode disassembler using the command javap -c <classname>. You do not need to know the details of Java bytecode for this class. We include a brief discussion here so that you can understand the process by which Java source code is translated to our own internal compiler representation. If you are interested in finding out more, here are the overviews of the class file format, and the compilation from Java source to bytecode.
The Quad Intermediate Representation (IR)
The representation that you will be using for the first two assignments is also a rather high-level IR. Like Java bytecode, it retains source program information such as field accesses and virtual method invocations. This supports the implementation of high-level optimizations such as minimizing the cost of virtual function invocations.
Instead of a stack architecture, however, we will use as our model a machine with an unbounded number of pseudo registers. Pseudo registers hold local variables of a method, as well as temporary variables generated by the compiler to store intermediate results. All data must first be loaded into pseudo registers before they can be operated on. This architecture is more conducive to program optimization than the stack architecture.
All the stack operations in the class files are translated into a series of simple instructions, each accepting up to three input operands and writing to one result variable. Hence, the IR is called Quads. Instructions are organized in a control flow graph, where nodes are the basic blocks and edges are the possible flow of control. Furthermore, the compiler also puts in the verification checks imposed by the Java semantics. For example, references are checked for NULL values before they can be used. These checks are inserted into the Quad representation directly.
Examples
Below we will show a few simple examples to illustrate how the same program can be represented at the source, bytecode, and the quad representation. You are encouraged to write new Java examples and use the same steps to find out how they are represented at the byte code and, more importantly, as quads.
Example 1
This example illustrates how basic expressions are represented as quads.
Source Program class ExprTest { int test(int a) { int b, c, d, e, f; c = a + 10; f = a + c; if (f > 2) { f = f - c; } return f; } } Bytecode We first use javac to compile the Java source to a class file, then run the disassembler over the class file.
myth:~/examples> javac ExprTest.java myth:~/examples> javap -c ExprTest Compiled from ExprTest.java class ExprTest extends java.lang.Object { ExprTest(); int test(int); } Method ExprTest() 0 aload_0 1 invokespecial #1 <Method java.lang.Object()> 4 return Method int test(int) 0 iload_1 1 bipush 10 3 iadd 4 istore_3 5 iload_1 6 iload_3 7 iadd 8 istore 6 10 iload 6 12 iconst_2 13 if_icmple 22 16 iload 6 18 iload_3 19 isub 20 istore 6 22 iload 6 24 ireturn myth:~/examples> javap first prints out the names of the methods defined for each class, then the definition of the individual methods. By default, all classes extend java.lang.Object; an appropriate constructor is automatically generated by the compiler if one does not exist.
For each method, javap prints out its signature--for example, test accepts an integer and returns an integer. A frame is created for each invocation. Location 0 holds the this pointer; the parameter and local variables a,b,c,d,e,f are numbered 1 to 6, respectively. Instructions are labeled by their position in the array of bytecodes representing the procedure.
Instructions such as load are prefixed by the result type: a,b,c,d,f,i,j,s, and z represent reference, byte, character, double, float, integer, long, short, boolean, respectively. An instruction's parameter is either represented as a suffix or an extra operand. iload_1 and iload_6 load the 1st and 6th variables from the frame onto the stack, respectively. The difference is just an optimization in encoding; the former, which is more common, is encoded in one byte and the latter is encoded in two.
iconst refers to pushing an integer constant on the stack. if_icmple_22 is a conditional branch based on an integer comparison between two operands on the stack. Namely, if the top of stack is less than or equal to the second operand on the stack then go to instruction 22.
You can print out a textual representation of the quad IR by using the PrintQuads.java program (also included as part of the example programs in the joeq distribution). Run it using the following commands:
myth:~/examples> javac examples.PrintQuads myth:~/examples> java examples.PrintQuads examples.ExprTest Class: ExprTest Method: <init>()V Control flow graph for ExprTest.<init> ()V: BB0 (ENTRY) (in: <none>, out: BB2) BB2 (in: BB0 (ENTRY), out: BB1 (EXIT)) 2 NULL_CHECK T-1 <g>, R0 ExprTest 1 INVOKESPECIAL_V% java.lang.Object.<init> ()V, (R0 ExprTest) 3 RETURN_V BB1 (EXIT) (in: BB2, out: <none>) Exception handlers: [] Register factory: Local: (I=1,F=1,L=1,D=1,A=1) Stack: (I=1,F=1,L=1,D=1,A=1) Method: test(I)I Control flow graph for ExprTest.test (I)I: BB0 (ENTRY) (in: <none>, out: BB2) BB2 (in: BB0 (ENTRY), out: BB3, BB4) 1 ADD_I T0 int, R1 int, IConst: 10 2 MOVE_I R3 int, T0 int 3 ADD_I T0 int, R1 int, R3 int 4 MOVE_I R6 int, T0 int 5 IFCMP_I R6 int, IConst: 2, LE, BB4 BB3 (in: BB2, out: BB4) 6 SUB_I T0 int, R6 int, R3 int 7 MOVE_I R6 int, T0 int BB4 (in: BB2, BB3, out: BB1 (EXIT)) 8 RETURN_I R6 int BB1 (EXIT) (in: BB4, out: <none>) Exception handlers: [] Register factory: Local: (I=7,F=7,L=7,D=7,A=7) Stack: (I=2,F=2,L=2,D=2,A=2) myth:~/examples>
This command invokes a program that loads in classes, then invokes the compiler pass joeq.Compiler.Quad.PrintCFG on each method in the class given.
Here we see that BB0 and BB1 are the entry and exit blocks, respectively. There is a conditional flow of control from BB2 around BB3 arriving at BB4. The first operand of each quad is the destination variable.
The this pointer is allocated to R0. The parameters and local variables a,b,c,d,e,f are allocated to pseudo registers R1 to R6, respectively. Intermediate results are stored into temporary registers. For example, the result of R1 + 10 is stored into T0, before it is stored into R3.
The IFCMP_I instruction is similar to the if_icmpl instruction, except that the comparison operation is one of the parameters and the target is basic block BB4. The type of the operations is attached to the operation as a suffix. The initialization routine includes an INVOKESPECIAL_V% operation. INVOKESPECIAL invokes an instance method which requires special handling, such as an instance initialization method, a private method, or a superclass method. The suffix _V indicates that the function invoked returns void, and the % symbol indicates that the invoked function may need to be loaded dynamically. java.lang.Object.<init> ()V says to invoke the initialization function in java.lang.Object, its superclass. The signature of the class is that it takes no explicit argument and returns a void. It passes to it the this pointer in R0 which is an instance of the class ExprTest.
Example 2
Here is another example to illustrate how fields and arrays are handled.
Source Program class ArrayTest { int A[]; ArrayTest() { A = new int[10]; } int access(int i) { return A[i]; } } Quads Control flow graph for ArrayTest.access (I)I: BB0 (ENTRY) (in: <none>, out: BB2) BB2 (in: BB0 (ENTRY), out: BB1 (EXIT)) 1 NULL_CHECK T-1 <g>, R0 ArrayTest 2 GETFIELD_A T0 int[], R0 ArrayTest, .A, T-1 3 NULL_CHECK T-1 <g>, T0 int[] 4 BOUNDS_CHECK T0 int[], R1 int, T-1 5 ALOAD_I T0 int, T0 int[], R1 int, T-1 6 RETURN_I T0 int BB1 (EXIT) (in: BB2, out: <none>)
The first NULL_CHECK checks if the this pointer is not null. T-1, read T minus one, is a fake location referenced by the subsequent operation (GETFIELD_A) that uses the checked pointer. This fake dependence between the definition and the use of T-1 prevents the instruction scheduler from inverting the order of NULL_CHECK and GETFIELD_A. The GETFIELD_A operation stores the A field of the instance, which is a reference to an array, into the temporary variable T0. The NULL and BOUNDS checks are then performed. The ALOAD_I instruction loads into a register an indexed array location of type int. The NEWARRAY is a special instruction that creates a new array of a given size.
Code Organization How JoeQ stores code as an IR
The joeq system allows access to program code at a wide variety of levels. This section will cover the important methods of each level of focus. Where appropriate, links to the relevant joeq javadocs will be available.
Java Elements: The joeq.Class package
The Class package contains classes that represent the various components of Java class files. Many of the String-like return values in this package return objects of type Utf8, explicitly representing the Unicode format the JVM uses; these may be turned back into Strings by calling toString upon them.
Types: jq_Type and jq_Primitive
jq_Type is the basic type representing types. It has a subclass, jq_Primitive, representing the primitive types. The jq_Primitive class has the following static fields for referring to each type:
| jq_Primitive.BOOLEAN |
| jq_Primitive.BYTE |
| jq_Primitive.CHAR |
| jq_Primitive.DOUBLE |
| jq_Primitive.FLOAT |
| jq_Primitive.INT |
| jq_Primitive.LONG |
| jq_Primitive.SHORT |
| jq_Primitive.VOID |
Neither jq_Type nor jq_Primitive have any methods of interest to us.
Arrays: jq_Array
Arrays are represented with the jq_Array class. The methods of note here are:
int getDimensionality(): Returns how many dimensions the array has:int[][]is dimensionality 2.jq_Type getElementType(): Returns the element type.int[][]has element typeint[].jq_Type getInnermostElementType(): Returns the result of the maximum number of nestedgetElementType()calls.int[][]has innermost element typeint.
Classes: jq_Class
The jq_Class class represents a class file as a whole. You acquire the jq_Class for a type by calling joeq.Main.Helper.load; see the section on joeq.Main.Helper for details. Important methods in jq_Class are:
String getName():Returns the name of the class.jq_Class getSuperclass():Returns the parent class, or null if the class is java.lang.Object.jq_Class[] getDeclaredInterfaces():Returns an array of all interfaces this class claims to implement. This does not include superinterfaces of those interfaces, nor does it contain interfaces declared by superclasses.jq_InstanceField[] getDeclaredInstanceFields(): Returns all instance fields that this class declares. It does not mention fields declared by superclasses.jq_StaticField[] getDeclaredStaticFields(): Returns all static fields that this class declares. It does not mention fields declared by superclasses.jq_InstanceMethod[] getVirtualMethods(): Returns all instance methods that this class declares. It does not mention methods declared by superclasses.jq_StaticMethod[] getDeclaredStaticMethods(): Returns all static methods that this class declares. It does not mention methods declared by superclasses.
Methods: jq_Method
Utf8 getName(): Returns the (simple) name of this method.jq_Type[] getParamTypes(): The arguments to the routine.jq_Type getReturnType(): The return type of the routine.jq_Class getDeclaringClass(): Returns the class that declared this method.
Fields: jq_Field
Utf8 getName(): Name of the field. Doesn't include the class designator.jq_Type getType(): The type of this field's value.jq_Class getDeclaringClass(): Returns the class that declared this field.
Code: The jeoq.Compiler Package
Most of the code for representing the actual program code in joeq is contained in the joeq.Compiler package and its subpackages.
joeq.Compiler.Quad.ControlFlowGraph
BasicBlock entry(): The first BasicBlock in the CFG. (This is always empty.)BasicBlock exit(): The last BasicBlock in the CFG. (This is also always empty.)jq_Method getMethod(): The method that corresponds to this CFG.
Iterating over CFGs: joeq.Compiler.Quad.QuadIterator
The QuadIterator loops through a ControlFlowGraph a Quad at a time. It is constructed taking a ControlFlowGraph as its constructor argument. The following methods then become available:
Quad next(): Returns the next quad in a reverse-post-order listing of Quads.Quad previous(): Returns the previous quad in a reverse-post-order listing of Quads.boolean hasNext(): Returns true if next is legal.boolean hasPrevious(): Returns true if previous is legal.java.util.Iterator successors(): Returns an iterator of the possible successor quads of the most recently returned quad. If a possible successor is the method exit, it includes the null value in the iteration.java.util.Iterator predecessors(): Returns an iterator of the possible predecessor quads of the most recently returned quad. If a possible predecessor is the method entrance, it includes the null value in the iteration.
joeq.Compiler.Quad.BasicBlock
The BasicBlock class represents a joeq basic block. These have one entrance, and, assuming no exceptions are thrown, one exit. (If exceptions are thrown, the "basic block" may have multiple exit points.)
There are no methods of interest to us defined in BasicBlock. We will usually be focussing either on ControlFlowGraphs or Quads. If you find this strange, consider what happens if you think about all the instructions in a basic block as residing in it's own basic block - that's what the Quad representation gives you, so you can implement all your algorithms directly on Quads.
joeq.Compiler.Quad.Quad
At the bottom of the code hierarchy is the quad, a construct very similar to three-address assembly code. Relevant methods on quads themselves are:
Operator getOperator()retrieve the Operator for this Quad. You are unlikely to ever check this directly, as the use of a QuadVisitor will usually do this for you.List getThrownExceptions(): Returns a list of jq_Classes - one for each exception that can be thrown.toString(): A textual mnemonic for this quad.
The operators and operands of Quads are numerous enough that they they warrant their own section: the Quads Overview.
joeq.Compiler.BytecodeAnalysis
Important This package deals mostly with the JVM's code representation -- we don't have to deal with it in this class. Note, however, that it has a ControlFlowGraph class of its own, so do not import joeq.Compiler.Quad.* and joeq.Compiler.BytecodeAnalysis.* in the same code.
Quads Details of the Quad representation of bytecodes
The program representation that we will be using is this class is a control flow graph, where basic blocks consist of simple operations known as Quads. Quads consist of an Operator and up to four Operands.
Operator types
All operators are inherited from the Operator class. Operators have the following methods:
getThrownExceptions(): gets the exception handlers for that operation.getDefinedRegisters(q): gets all the registers defined by quad q.getUsedRegisters(q): gets all the registers used by quad q.getReg1(q),getReg2(q),getReg3(q): gets the named register from quad q.INSTANCE: All operators are implemented using the Singleton pattern. This field holds the object associated with each operator.
All the operations are hierarchically ordered. For example, there are different operations to write to a field in a class, depending on the type of the field. For example, the PUTFIELD_{I,F,L,D,A,B,C,S,Z} Operators represent array load instructions for the types integer, floating-point, long, double, reference, byte, character, short, and boolean. PUTFIELD_{I,F,L,D,A,B,C,S,Z}_DYNLINK are similar to PUTFIELD_{I,F,L,D,A,B,C,S,Z}, except that the invoked function may need to be loaded dynamically. We use the % symbol to stand for DYNLINK when printing the quads. All the various PUTFIELD variants inherit from the Putfield class, which inherits from Operator. Associated with each class is a set of methods, allowing the compiler writer to refer to the operands of a quad by more meaningful names than getOp1, getOp2 etc. For example, the first operand for Binary operations can be retrieved by invoking getDest on the quad containing the operator.
| Operator | Description | Subclasses | Methods |
|---|---|---|---|
| Move | Move between pseudo registers | | getMoveOp |
| Binary | Binary operation, with two sources and one destination | | getDest |
| Unary | Unary operation, with one source and one destination | | getDest |
| Goto | unconditional branch | | getTarget |
| IntIfCmp | conditional branch | | getSrc1 |
| TableSwitch | table switch instruction | | setTarget |
| LookupSwitch | lookup switch instruction | | setMatch |
| Getstatic | load from a static field | | getDest |
| Putstatic | store to a static field | | getSrc |
| Getfield | load from an object field | | getDest |
| Putfield | store to an object field | | getBase |
| ALoad | array load | | getDest |
| AStore | array store | | getValue |
| ALength | array length | | getDest |
| BoundsCheck | array bounds check | | getRef |
| NullCheck | null pointer check | | getDest |
| ZeroCheck | null pointer check | | getDest |
| StoreCheck | array object type store check | | getRef |
| New | allocate a new object | | getDest |
| NewArray | allocate a new array | | getDest |
| CheckCast | run time type check | | getDest |
| InstanceOf | run time type check | | getDest |
| Invoke | call another method | | setParam |
| Jsr | call an internal subroutine | | getDest |
| Ret | return from an internal subroutine | | getTarget |
| Return | return from the current method | | getSrc |
| Monitor | access atomic lock for an object | | getSrc |
Operand types
The following provides a short description of all the different kinds of operands.
| Operand | Description |
|---|---|
| RegisterOperand | specifies an abstract location |
| IConstOperand | integer constant (32 bit) |
| FConstOperand | floating point constant operand (32 bit) |
| LConstOperand | long integer constant operand (64 bit) |
| DConstOperand | double-precision floating point constant operand (64 bit) |
| AConstOperand | object reference constant |
| MethodOperand | specifies the target method of a method call instruction |
| ParamListOperand | specifies the parameter list for a method call instruction |
| FieldOperand | specifies an object field associated with a get/put field instruction |
| TypeOperand | specifies a type for use in type check instructions |
| ConditionOperand | Specifies a condition code associated with a conditional branch instruction |
| TargetOperand | specifies the target of a branch instruction |
| BasicBlockTableOperand | specifies a table of targets, for use in switch instructions |
Basic blocks and the control flow graph
Quads are organized into BasicBlocks. BasicBlocks are single-entry, meaning that control flow can only enter at the start of a basic block. Barring exceptions and method calls, they are also single-exit, meaning that control flow can only leave at the end of a basic block. Important This is an important difference between the basic blocks that were covered in lecture. Exceptions can cause basic blocks to exit early.
Each BasicBlock contains a list of Quads. They also include a set of predecessor and successor basic blocks. Basic blocks are numbered with unique integer identifiers. There are two special basic blocks, the entry basic block and the exit basic block. These basic blocks always have identifiers 0 and 1, respectively. These blocks mark the method entry point and exit point. They never contain instructions. Obviously, the entry basic block has no predecessors, and the exit basic block has no successors.
Basic blocks also have exception handlers. When an exception occurs within a basic block, control flow jumps to the handler for that basic block that matches the type of the exception that occurred.
BasicBlocks are contained in a ControlFlowGraph. The ControlFlowGraph contains references to the start and end nodes and a list of all exception handlers.
Visitors Details of the Visitor class to traverse the IR
Central to the joeq analysis system is the concept of the Visitor. Visitors are described fully in the Design Patterns book by Gamma et al., but the concept is fairly straightforward. Each Visitor class has a number of methods of the form void visitX(X argument), which are called by the analysis machinery upon encountering objects of that type. Each pass is defined by one or more Visitor classes. The granularity of the analysis is determined by what kind of Visitor the class uses.
The Visitor classes are all actually interfaces. However, since some of the interfaces specify many, many methods, each includes a static nested class called EmptyVisitor that implements each method with a no-op. This allows one to override only those methods that your pass requires.
joeq.Class.jq_TypeVisitor
The jq_TypeVisitor visits arrays, primitives, and classes. For our purposes, the visitClass(jq_Class c) method is the only one we'll ever really need to override.
joeq.Class.jq_MethodVisitor
This Visitor can distinguish between static and virtual methods, or between various sorts of initializers. All visited methods will invoke their most specific method, then the generic visitMethod. For our purposes, we'll almost never use anything other than visitMethod.
joeq.Compiler.Quad.ControlFlowGraphVisitor
The only method in this interface is void visitCFG(ControlFlowGraph cfg). It's easy to implement.
joeq.Compiler.Quad.BasicBlockVisitor
The only method in this interface is void visitBasicBlock(BasicBlock bb). It's also easy to implement.
joeq.Compiler.Quad.QuadVisitor
The QuadVisitor is the bulkiest of the Visitor classes. There is one method for each of the major Operator subclasses (the left column in the chart in the Quads section), and a generic visitQuad. When this visits a Quad, it first calls the appropriate specific visit routine, and then calls visitQuad.
You will often use QuadVisitors in conjunction with QuadIterators or with the BasicBlock.visitQuads function. This will allow you to adjust evaluation order to react to control flow or intermediate analysis results.
joeq.Main.Helper Analysis Interface How to use Visitor classes you write
The joeq system has a lot of state and has many very complex internal dependencies. The joeq.Main.Helper class hides all of these complexities and requirements behind a very simple interface. In fact, there are only two method names you need to know.
jq_Type load(String name): This routine takes a class name specified by the argument and searches the classpath for a class of that name. If such a class exists, it loads it into a jq_Class object and returns it; otherwise, it throws a NoClassDefFoundError. For our purposes, it will always work to downcast the returned jq_Type to a jq_Class.void runPass(target, pass): This is actually a family of functions, but you do not ordinarily need to worry about that. The target is a chunk of code: a jq_Class, a jq_Method, a ControlFlowGraph, a BasicBlock, or a Quad. The pass is a relevant visitor: a jq_TypeVisitor, a jq_MethodVisitor, a ControlFlowGraphVisitor, a BasicBlockVisitor, or a QuadVisitor. You may run a pass on a target if the target is at least as fine-grained as the pass; so, for instance, a BasicBlock can receive a BasicBlockVisitor or a QuadVisitor, but a jq_Class can receive any kind of pass.
Writing and executing a compiler pass
Writing a compiler pass in joeq is easy, once you learn a few basics. You should be familiar with the basic operation of the visitors and the joeq code hierarchy before dealing with these samples.
Here's a simple compiler pass that counts all the quads it visits:
import joeq.Compiler.Quad.*; public class QuadCounter extends QuadVisitor.EmptyVisitor { public int count = 0; public void visitQuad(Quad q) { count++; } } There's not a lot to say about this class. The QuadVisitor interface specifies a vast number of methods, and we only care about one of them (the most generic, which is always invoked). Thus, we extend the associated EmptyVisitor instead of implementing the core interface. The operation of the pass is fairly straightforward -- any time a quad accepts this pass, we bump the counter.
This is all well and good, but how do we run our pass on anything? As a QuadVisitor, we can run it on a code chunk of any size -- a class, a method, a basic block, or even a single quad (though that last isn't likely to be too interesting). To run this pass, we simply use one of joeq.Main.Helper's many runPass methods. We add a main method that reads in class names from the command line and runs itself on those passes.
import joeq.Compiler.Quad.*; import joeq.Main.Helper; import joeq.Class.jq_Class; public class QuadCounter extends QuadVisitor.EmptyVisitor { public int count = 0; public void visitQuad(Quad q) { count++; } public static void main(String[] args) { jq_Class[] c = new jq_Class[args.length]; for (int i = 0; i < args.length; i++) { c[i] = (jq_Class)Helper.load(args[i]); } QuadCounter qc = new QuadCounter(); for (int i = 0; i < args.length; i++) { qc.count = 0; Helper.runPass(c[i], qc); System.out.println(c[i].getName() + " has " + qc.count + " Quads."); } } } What if you only wanted to count the number of "move" instructions? The "visitQuad" method matches all quads. The visitor interface also has other kinds of visit methods that match specific quads; that is to say, they are only called on quads that match a specific type. Here is an example of a visitor that counts the number of instructions that load from memory and the number of instructions that store into memory.
import joeq.Compiler.Quad.*; public class LoadStoreCounter extends QuadVisitor.EmptyVisitor { public int loadCount = 0, storeCount = 0; public void visitLoad(Quad q) { loadCount++; } public void visitStore(Quad q) { storeCount++; } } Examples
Example 1: Finding Thrown Exceptions
Let's write a pass that determines the set of uncaught exceptions that can be thrown by a method.
We will use a flow-insensitive technique. The problem is complicated by the fact that exceptions that are thrown can be caught within the method itself, and we don't want to include those exceptions in our result. Exception handlers are attached to the basic blocks that they protect, so we need to visit the basic blocks to learn about the exception handlers.
We will use one visitor object that is both a BasicBlockVisitor and a QuadVisitor. The visitBasicBlock routine will cache the current basic block and call visitQuads() to visit the quads.
import java.util.*; import joeq.Class.*; import joeq.Compiler.Quad.*; import joeq.Main.Helper; public class FindThrownExceptions extends QuadVisitor.EmptyVisitor implements BasicBlockVisitor { private Set<jq_Class> thrownExceptions = new HashSet<jq_Class>(); private BasicBlock currentBasicBlock; public void visitBasicBlock(BasicBlock bb) { currentBasicBlock = bb; Helper.runPass(bb, (QuadVisitor) this); } public void visitExceptionThrower(Quad q) { for (jq_Class c : q.getThrownExceptions()) { // need to make sure this exception type is not caught by a handler if (currentBasicBlock.getExceptionHandlers().mustCatch(c) == null) { thrownExceptions.add(c); } } } public String toString() { return "Possible thrown exceptions:" + thrownExceptions; } public static void main(String[] args) { jq_Class[] c = new jq_Class[args.length]; for (int i = 0; i < args.length; i++) { c[i] = (jq_Class) Helper.load(args[i]); } FindThrownExceptions fte = new FindThrownExceptions(); for (int i = 0; i < args.length; i++) { Helper.runPass(c[i], (BasicBlockVisitor) fte); } System.out.println(fte); } } Now let's take a closer look. Our class, FindThrownExceptions, both implements BasicBlockVisitor and extends QuadVisitor.EmptyVisitor. This means that in order to call Helper.runPass, we need to cast it to indicate which kind of Visitor we mean it to be treated as.
There are two fields. The first is thrownExceptions, which contains the exceptions seen thus far. It is initialized to an empty set. The second is currentBasicBlock, which keeps track of the current basic block as we iterate through the quads. We need to keep track of the current basic block because quads do not contain a reference to their enclosing basic block, and we need the enclosing basic block to find the exception handlers.
The visitBasicBlock method simply sets the currentBasicBlock field and then runs the Quad visitor on itself. That includes visitExceptionThrower, which matches quads that can throw exceptions. We start by iterating over the types of exceptions that the quad can throw, obtained via getThrownExceptions(). For each of those exception types, if that type is not necessarily caught by an exception handler, we add it to the set thrownExceptions.
Lastly, the main routine loads the classes from the command line, find all uncaught exceptions they throw (the thrownExceptions set only grows in size), and prints them out.
Example 2: "Using" Variables
In quad format, variables are stored in pseudo-registers. There are an infinite number of pseudo-registers. Pseudo-registers have an associated type: int, float, long, double, or object reference.
We will compute for each pseudo-register the set of quads that define that register ('def' set) and the set of quads that use that register ('use' set). This can be used, for example, to calculate 'gen' and 'kill' sets in the reaching definition dataflow problem.
This, again, is a flow-insensitive problem. We don't need to know about the basic blocks, so we simply make a quad visitor.
import java.util.*; import joeq.Class.*; import joeq.Compiler.Quad.*; import joeq.Compiler.Quad.Operand.RegisterOperand; import joeq.Compiler.Quad.RegisterFactory.Register; import joeq.Main.Helper; public class FindDefsAndUses extends QuadVisitor.EmptyVisitor { private Map<Register, List<Quad>> registersToDefs = new HashMap<Register, List<Quad>>(); // Register -> List private Map<Register, List<Quad>> registersToUses = new HashMap<Register, List<Quad>>(); // Register -> List public void visitQuad(Quad q) { for (RegisterOperand def : q.getDefinedRegisters()) addToMultiMap(registersToDefs, def.getRegister(), q); for (RegisterOperand use : q.getUsedRegisters()) addToMultiMap(registersToUses, use.getRegister(), q); } private void addToMultiMap(Map<Register, List<Quad>> m, Register key, Quad value) { List<Quad> s = m.get(key); if (s == null) m.put(key, s = new LinkedList<Quad>()); s.add(value); } public String toString() { return "Defs: " + registersToDefs + "\nUses: " + registersToUses; } public static void main(String[] args) { jq_Class[] c = new jq_Class[args.length]; for (int i = 0; i < args.length; i++) { c[i] = (jq_Class) Helper.load(args[i]); } FindDefsAndUses fdu = new FindDefsAndUses(); for (int i = 0; i < args.length; i++) { Helper.runPass(c[i], fdu); } System.out.println(fdu.toString()); } } The important point is that quads include methods getDefinedRegisters() and getUsedRegisters() that return a list of registers that are defined/used by that quad. The visitQuad method simply iterates over those lists and adds the quads to a multi-map keyed on the register.
Example 3: Manipulating control flow graphs
Now let's look at an example of control flow graph manipulation. Let's write a pass that will merge two basic blocks which are immediately adjacent where the first basic block has no other successors and the second basic block has no other predecessors. Obviously, we need to ignore the entry and exit basic blocks. Additionally, let's maintain exception semantics by requiring that the exception handler set is the same for both basic blocks.
Code goes here.
This one is a little more involved. When visiting a basic block, we look at the predecessors of the current basic block. By merging basic blocks with their predecessors rather than their successors, we can avoid having to re-compute the traversal order if we traverse forward through the graph.
We check that the basic block is not the exit node and that there is only one predecessor. Likewise, we check that the predecessor has only one successor and that the predecessor is not the entry node. Finally, we check the equality of the exception handlers. If all of these tests pass, the basic blocks are mergeable.
We check the last quad in the previous basic block; if it is a branch we remove it. We add all of the quads in the current basic block to the previous basic block, merge the flags of the two basic blocks, remove the successor edge between them, and add all successors of the second basic block to the first basic block. And the blocks are merged!
When should I use iterators versus visitors for IR elements?
These two techniques are interchangeable, but in some cases it is more convenient to use one rather than the other.
Visitors are good because they separate the traversal order from the operations performed on each element. Because they are objects, they can be passed around and be used to store results. They also allow advanced selection of objects based on their type, a la QuadVisitor.
However, visitors require a class definition and are therefore cumbersome for simple traversals. If a traversal is very simple and doesn't warrant a new class definition, it may be better to use an iterator.