Many static analysis tools assume that the whole program is available in order to analyze it. In practice, however, that is rarely the case: quite often you are stuck with a program that you can run but not analyze, due to missing code from some library that you don't actually need!
JPhantom tackles this problem for Java, while remaining agnostic to the tool that will eventually perform the static analysis. It applies a preprocessing step that detects phantom code and replaces it with dummy code, thus producing a complete program that should now be analyzable.
The real challenge of this task lies in the various type constraints that the final program (plus its dummy complement) must satisfy in order to form valid Java bytecode. You can read more about it here and here.
This project is still in its early stages but, in the long run, it aims to host various analyses written in Datalog for C/C++ programs, or generally for any language that compiles to LLVM bitcode. Right now it provides a context-insensitive pointer analysis, is able to recover most of the C++ source-level information (such as v-tables and the class hierarchy), and resolves virtual calls with reasonable precision.
There is also a presentation covering the basic ideas behind cclyzer's design and the modeling of its pointer analysis.
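To give a flavor of how such an analysis looks in Datalog, here is a minimal sketch of virtual-call resolution through v-tables. The relation names below are illustrative, not cclyzer's actual schema:

    // Resolve a virtual call: if the receiver variable may point to an
    // object of some type, the call may target the function stored at
    // the call site's v-table index for that type.
    CallGraphEdge(?invocation, ?function) <-
       VirtualCallBase(?invocation, ?base),
       VarPointsTo(?base, ?obj),
       ObjectHasType(?obj, ?type),
       VTableEntry(?type, ?index, ?function),
       VirtualCallIndex(?invocation, ?index).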
We present a points-to analysis for C/C++ that recovers much of the available high-level structure information of types and objects, by applying two key techniques: (1) It records the type of each abstract object and, in cases when the type is not readily available, the analysis uses an allocation-site plus type abstraction to create multiple abstract objects per allocation site, so that each one is associated with a single type. (2) It creates separate abstract objects that represent (a) the fields of objects of either struct or class type, and (b) the (statically present) constant indices of arrays, resulting in a limited form of array-sensitivity. We apply our approach to the full LLVM bitcode intermediate language and show that it yields much higher precision than past analyses, allowing accurate distinctions between subobjects, v-table entries, array components, and more. Especially for C++ programs, this precision is invaluable for a realistic analysis. Compared to the state-of-the-art past approach, our techniques exhibit substantially better precision along multiple metrics and realistic benchmarks (e.g., 40+% more variables with a single points-to target).
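As a rough illustration of technique (1), an untyped heap allocation (e.g., a malloc result) can be split into per-type abstract objects at the point where a type first becomes evident, such as a cast. This is only a sketch, with relation names invented for exposition:

    // Materialize one abstract object per (allocation site, type) pair
    // when an untyped allocation reaches a cast to a concrete type.
    TypedAbstractObject(?alloc, ?type) <-
       UntypedAllocation(?alloc),
       VarPointsTo(?var, ?alloc),
       CastTo(?var, ?type).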
Reflection is a highly dynamic language feature that poses grave problems for static analyses. In the Java setting, reflection is ubiquitous in large programs. Any handling of reflection will be approximate, and overestimating its reach in a large codebase can be catastrophic for precision and scalability. We present an approach for handling reflection with improved empirical soundness (as measured against prior approaches and dynamic information) in the context of a points-to analysis. Our approach is based on the combination of string-flow and points-to analysis from past literature augmented with (a) substring analysis and modeling of partial string flow through string builder classes; (b) new techniques for analyzing reflective entities based on information available at their use-sites. In experimental comparisons with prior approaches, we demonstrate a combination of both improved soundness (recovering the majority of missing call-graph edges) and increased performance.
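As an illustration of technique (b), a use-site rule can bound the type of a reflectively created object by the cast later applied to it. The following is a hedged sketch with invented relation names, not the paper's actual rules:

    // If the result of a reflective newInstance call is cast to type T,
    // assume the created object may be of any subtype of T.
    ReflectiveObjectType(?obj, ?type) <-
       ReflectiveAllocation(?invocation, ?obj),
       ResultCastTo(?invocation, ?castType),
       Subtype(?type, ?castType).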
Pointer analysis is a fundamental static program analysis, with a rich literature and wide applications. The goal of pointer analysis is to compute an approximation of the set of program objects that a pointer variable or expression can refer to. We present an introduction and survey of pointer analysis techniques, with an emphasis on distilling the essence of common analysis algorithms. To this end, we focus on a declarative presentation of a common core of pointer analyses: algorithms are modeled as configurable, yet easy-to-follow, logical specifications. The specifications serve as a starting point for a broader discussion of the literature, as independent threads spun from the declarative model.
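For instance, the common core of a flow-insensitive, field-sensitive points-to analysis fits in a handful of Datalog rules over program instructions. The following is the standard declarative formulation (a sketch, not tied to any particular tool's schema):

    // An allocation makes the variable point to the new abstract object.
    VarPointsTo(?var, ?obj) <-
       Alloc(?var, ?obj).
    // A move (assignment) propagates points-to facts.
    VarPointsTo(?to, ?obj) <-
       Move(?to, ?from), VarPointsTo(?from, ?obj).
    // A field store, base.fld = from, populates the field of every
    // object the base may point to.
    FieldPointsTo(?baseObj, ?fld, ?obj) <-
       Store(?base, ?fld, ?from),
       VarPointsTo(?base, ?baseObj), VarPointsTo(?from, ?obj).
    // A field load, to = base.fld, reads those fields back.
    VarPointsTo(?to, ?obj) <-
       Load(?to, ?base, ?fld),
       VarPointsTo(?base, ?baseObj), FieldPointsTo(?baseObj, ?fld, ?obj).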
Context-sensitivity is the primary approach for adding more precision to a points-to analysis, while hopefully also maintaining scalability. An oft-reported problem with context-sensitive analyses, however, is that they are bi-modal: either the analysis is precise enough that it manipulates only manageable sets of data, and thus scales impressively well, or the analysis gets quickly derailed at the first sign of imprecision and becomes orders-of-magnitude more expensive than would be expected given the program’s size. There is currently no approach that makes precise context-sensitive analyses (of any flavor: call-site-, object-, or type-sensitive) scale across the board at a level comparable to that of a context-insensitive analysis. To address this issue, we propose introspective analysis: a technique for uniformly scaling context-sensitive analysis by eliminating its performance-detrimental behavior, at a small precision expense. Introspective analysis consists of a common adaptivity pattern: first perform a context-insensitive analysis, then use the results to selectively refine (i.e., analyze context-sensitively) program elements that will not cause explosion in the running time or space. The technical challenge is to appropriately identify such program elements. We show that a simple but principled approach can be remarkably effective, achieving scalability (often with dramatic speedup) for benchmarks previously completely out-of-reach for deep context-sensitive analyses.
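In Datalog terms, the adaptivity pattern amounts to a two-phase computation. The sketch below is purely illustrative (the heuristic relation and threshold are invented): first flag program elements whose context-insensitive results are already voluminous, then grant contexts only to the rest.

    // Methods whose context-insensitive points-to volume is already
    // large are likely to explode under context-sensitivity.
    ExplosionRisk(?method) <-
       InsensitivePointsToVolume(?method, ?n), ?n > 10000.
    // Re-analyze with context only the remaining reachable methods.
    RefineWithContext(?method) <-
       Reachable(?method), !ExplosionRisk(?method).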
We present the problem of class hierarchy complementation: given a partially known hierarchy of classes together with subtyping constraints (“A has to be a transitive subtype of B”) complete the hierarchy so that it satisfies all constraints. The problem has immediate practical application to the analysis of partial programs—e.g., it arises in the process of providing a sound handling of “phantom classes” in the Soot program analysis framework. We provide algorithms to solve the hierarchy complementation problem in the single inheritance and multiple inheritance settings. We also show that the problem in a language such as Java, with single inheritance but multiple subtyping and distinguished class vs. interface types, can be decomposed into separate single- and multiple-subtyping instances. We implement our algorithms in a tool, JPhantom, which complements partial Java bytecode programs so that the result is guaranteed to satisfy the Java verifier requirements. JPhantom is highly scalable and runs in mere seconds even for large input applications and complex constraints (with a maximum of 14s for a 19MB binary).
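To give a taste of where the constraints come from, consider how bytecode facts induce subtyping constraints over phantom types. The rule below is an illustrative Datalog sketch with invented relation names (JPhantom itself is implemented in Java, operating directly on bytecode):

    // A value of type A assigned to a location declared of type B
    // (a store, argument pass, or return) requires A to be a
    // transitive subtype of B.
    SubtypeConstraint(?a, ?b) <-
       AssignedValueType(?insn, ?a),
       DeclaredTargetType(?insn, ?b).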
We present set-based pre-analysis: a virtually universal optimization technique for flow-insensitive points-to analysis. Points-to analysis computes a static abstraction of how object values flow through a program’s variables. Set-based pre-analysis relies on the observation that much of this reasoning can take place at the set level rather than the value level. Computing constraints at the set level results in significant optimization opportunities: we can rewrite the input program into a simplified form with the same essential points-to properties. This rewrite results in removing both local variables and instructions, thus simplifying the subsequent value-based points-to computation. Effectively, set-based pre-analysis puts the program in a normal form optimized for points-to analysis. Compared to other techniques for off-line optimization of points-to analyses in the literature, the new elements of our approach are the ability to eliminate statements, and not just variables, as well as its modularity: set-based pre-analysis can be performed on the input just once, e.g., allowing the pre-optimization of libraries that are subsequently reused many times and for different analyses. In experiments with Java programs, set-based pre-analysis eliminates 30% of the program’s local variables and 30% or more of computed context-sensitive points-to facts, over a wide set of benchmarks and analyses, resulting in a 24% average speedup (max: 103%, median: 20%).
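To convey the set-level intuition, here is a sketch with invented relation names: a simple move only induces a set-inclusion edge, and variables in an inclusion cycle have provably equal points-to sets, so all but one of them can be rewritten away before the value-level analysis runs.

    // A move makes the target's points-to set include the source's.
    Includes(?to, ?from) <-
       Move(?to, ?from).
    // Transitive closure of inclusion.
    IncludesTC(?x, ?y) <- Includes(?x, ?y).
    IncludesTC(?x, ?z) <- IncludesTC(?x, ?y), Includes(?y, ?z).
    // Mutual inclusion means the two sets are equal.
    EquivalentVars(?x, ?y) <-
       IncludesTC(?x, ?y), IncludesTC(?y, ?x).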
Program analysis often requires manual inspection not just of the source code but of the actual IR that is passed to the static analysis tool. Most of the time, this intermediate representation is a compiled binary format such as Java bytecode.
Thus, to open an Emacs buffer containing Java bytecode, for instance, one first needs to run a command that disassembles the file, and only then open the disassembled contents.
Things are even worse when such a file is part of an archive (e.g., inside a jar file): then you have to extract it first as well.
This is a task that can easily be automated by Emacs, assuming your system is properly configured and the relevant disassembler commands (e.g., javap) are available in your PATH.
Since most of the static analysis tools I have been working on use the Datalog language (more specifically, the proprietary LogicBlox engine), I have been maintaining an Emacs mode for this particular Datalog dialect.
At the moment it mainly provides syntax highlighting and indentation, but I will be pushing new features from time to time.