Static analysis project
Control-flow graph
The point of the graph transformation is that you don't need to handle the language defined in abstract_syntax_tree.ml, which is complex. The graph in cfg.ml is much simpler, it is comprised of:
- node objects, that denote (roughly) program locations:
type node = {
node_id: int; (* unique identifier *)
node_pos: position; (* position in the source *)
node_out: arc list; (* arcs with this node as source *)
node_in: arc list; (* arcs with this node as destination *)
}
- arc objects, that link a source node to a destination node and are labelled with an instruction:
type arc = {
arc_id: int; (* unique identifier *)
arc_src: node; (* source node *)
arc_dst: node; (* destination node *)
arc_inst: inst; (* instruction *)
}
To ease graph traversal, as standard, each node contains a list of out-going arcs and in-coming arcs. Each arc and each node in the graph is given a unique integer identifier. For ease of programming, Set, Map, and Hashtbl collection modules over these objects are also provided.
Each arc is labelled with an instruction to execute when going from its source node to its destination node:
- CFG_skip does nothing. It is useful to model jumps or returns without a value.
- CFG_assign (var,expr) assigns the value of the integer expression expr to the variable var. The expressions in the graph are slightly simpler than those in the abstract syntax tree because function calls have been extracted to separate instructions. In place of the call, we put in the expression a variable where the function's return value has been stored by the function call.
- CFG_guard expr takes the transition to the destination node only if the boolean expression expr evaluates to true. Generally, a single node will feature two out-going arcs with a CFG_guard instruction: one with CFG_guard expr, and the other one with CFG_guard !expr, corresponding to the two branches of if conditionals or loops. Note that our expressions feature non-determinism (both as random integer intervals and as random boolean choices); hence, it is possible for a program environment to be propagated along both the expr and the !expr branches.
- CFG_assert expr is similar to CFG_guard expr, but executing the instruction causes a program execution error if expr is not true: i.e., the analyzer displays an assertion violation message and continues the analysis propagating only the memory environments that satisfy the condition.
- CFG_call f indicates a call to the function f. Note that all the information about the function arguments and return values are omitted in the instruction. The graph translation takes care of generating CFG_assign instructions storing the actual arguments into the formal arguments prior to the CFG_call instruction. The function itself, if it returns a value, takes care of storing this value into a special variable before returning.
Each program variable is associated a var structure:
var_id: int; (* unique variable identifier *)
var_name: id; (* original name, in the program *)
var_type: typ; (* variable type *)
var_pos: extent; (* position of the variable declaration *)
}
Note that the program can declare several different variables with the same name, with different scopes:
{
x++; // first x
{
int x;
{
int x;
x++; // third x
}
x++; // second x
}
}
In that case, each version x has its own structure with a different unique identifier. The graph transformation thus resolves scoping. Note that by variable, we actually denote the set of all global variables, local variables, and formal function arguments. Additionally, the translation to a graph may add temporary variables (e.g. to hold the return value of functions), which also have a structure. In your analyzer, you don't have to distinguish different kinds of variables: you can just assume that each var structure denotes a different global variable (to simplify, all global, local, formal argument and temporary variables are live at all program points)..
To each function in the program source is associated a func structure:
func_id: int; (* unique function identifier *)
func_name: string; (* function name *)
func_pos: extent; (* function position in the source *)
func_entry: node; (* entry node *)
func_exit: node; (* exit node *)
func_args: var list; (* list of formal arguments *)
func_ret: var option; (* variable used to store the return value *)
func_calls: arc list; (* list of calls to the function *)
}
The execution of the function starts at its entry node and finishes at its exit node: i.e., both the normal function return and every return instruction will jump to the exit node. For each function returning an integer, a synthetic variable is created (func_rec field). It is used to store the return value: i.e., return expr; is modeled as storing the value of expr into the variable and then jumping to the exit node. To ease inter-procedural analysis, we remember all the instructions, in other functions, that call this function.
The result of the translation is a prog structure that contains the whole control-flow graph for all the functions in the program.
cfg_vars: var list; (* list of all the variables *)
cfg_funcs: func list; (* list of all the functions *)
cfg_nodes: node list; (* list of all the nodes in the program *)
cfg_arcs: arc list; (* list of all the arcs in the program *)
cfg_init_entry: node; (* first node of code initializing global variables *)
cfg_init_exit: node; (* last node of initialization code *)
}
As expected, the structure contains the list of all variables and function structures, as well as the nodes and arcs of the whole program.
The control-flow graph is actually composed of several unconnected parts: one subgraph for each function, plus an initialization subgraph.
This initialization subgraph handles the initialization of global variables; its entry point is indicated by the cfg_init_entry field.
The initialization part must thus be analyzed prior to analyzing any program function.
Author: Antoine Miné