A must-alias (or “definite-alias”) analysis computes sets of
expressions that are guaranteed to be aliased at a given program
point. The analysis has been shown to have significant
practical impact, and it is actively used in popular research
frameworks and commercial tools. We present a custom data
structure that speeds up must-alias analysis by nearly two
orders of magnitude (while computing identical results). The
data structure achieves efficiency by encoding multiple alias
sets in a single linked structure, and compactly representing
the aliasing relations of arbitrarily long expressions. We
explore the data structure’s performance in both an imperative
and a declarative setting and contrast it extensively with
prior techniques. With our approach, must-alias analysis
can be performed efficiently, over large Java benchmarks, in
under half a minute, making the analysis cost acceptable for
most practical uses.
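As a point of reference for what such an analysis computes, the sketch below shows a deliberately plain representation of must-alias information at a single program point: access-path expressions are mapped to an "alias class" identifier, and two expressions are reported as definite aliases exactly when they share a class. The paper's custom data structure replaces this flat, per-point encoding with a shared linked structure that also covers arbitrarily long expressions; the class and method names below are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Naive must-alias representation for one program point:
// expressions mapped to the same class id are definite aliases.
final class AliasClasses {
    private final Map<String, Integer> classOf = new HashMap<>();
    private int nextId = 0;

    // Record that two expressions are definitely aliased here.
    void union(String e1, String e2) {
        Integer c1 = classOf.get(e1), c2 = classOf.get(e2);
        if (c1 == null && c2 == null) {
            int id = nextId++;
            classOf.put(e1, id);
            classOf.put(e2, id);
        } else if (c1 == null) {
            classOf.put(e1, c2);
        } else if (c2 == null) {
            classOf.put(e2, c1);
        } else if (!c1.equals(c2)) {
            // Merge one class into the other (size-based merging omitted).
            final int from = c2, to = c1;
            classOf.replaceAll((e, c) -> c == from ? to : c);
        }
    }

    // Must-alias query: true only when both expressions are known and share a class.
    boolean mustAlias(String e1, String e2) {
        Integer c1 = classOf.get(e1);
        return c1 != null && Objects.equals(c1, classOf.get(e2));
    }

    public static void main(String[] args) {
        AliasClasses after = new AliasClasses();
        after.union("x", "y.f");      // e.g., after the statement x = y.f;
        after.union("y.f", "z");      // and z = y.f; on the same path
        System.out.println(after.mustAlias("x", "z"));   // true
        System.out.println(after.mustAlias("x", "y"));   // false: no evidence
    }
}
```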
Static analysis aims to achieve an understanding of program
behavior, by means of automatic reasoning that requires only the
program’s source code and not any actual execution. To reach a
truly broad level of program understanding, static analysis
techniques need to create an abstraction of memory that covers all
possible executions. Such abstract models may quickly degenerate,
losing essential structural information about the memory
objects they describe, due to the use of specific programming
idioms and language features, or because of practical analysis
limitations. In many cases, some of the lost memory structure may
be retrieved, though it requires complex inference that takes
advantage of indirect uses of types. Such recovered structural
information may, then, greatly benefit static analysis.
This dissertation shows how we can recover structural information,
first (i) in the context of C/C++, and next, in the context of
higher-level languages without direct memory access, like Java,
where we identify two primary causes of losing memory structure:
(ii) the use of reflection, and (iii) analysis of partial
programs. We show that, in all cases, the recovered structural
information greatly benefits static analysis on the program.
For C/C++, we introduce a structure-sensitive pointer analysis
that refines its abstraction based on type information that it
discovers on-the-fly. This analysis is implemented in cclyzer, a
static analysis tool for LLVM bitcode. Next, we present techniques
that extend a standard Java pointer analysis by building on top of
state-of-the-art handling of reflection. The principle is similar
to that of our structure-sensitive analysis for C/C++: track the
use of reflective objects, during pointer analysis, to gain
important insights into their structure, which can be used to
“patch” the handling of reflective operations in the running
analysis, in a mutually recursive fashion. Finally, to address the
challenge of analyzing partial Java programs in full generality,
we define the problem of “program complementation”: given a
partial program, we seek to provide definitions for its missing
parts so that the “complement” satisfies all static and dynamic
typing requirements induced by the code under analysis.
Essentially, complementation aims to recover the structure of
phantom types. Apart from discovering missing class members (i.e.,
fields and methods), satisfying the subtyping constraints leads
to the formulation of a novel typing problem in the OO context,
regarding type hierarchy complementation. We offer algorithms to
solve this problem in various inheritance settings, and
implement them in JPhantom, a practical tool for Java bytecode
complementation.
We give a declarative model of a rich family of must-alias
analyses. Our emphasis is on careful and compact modeling,
while exposing the key points where the algorithm can adjust
its inference power. The model is executable, in the Datalog
language, and forms the basis of a full-fledged must-alias
analysis of Java bytecode in the DOOP framework.
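To give a flavor of the rule-based view such a model takes, the sketch below evaluates a tiny must-alias relation to a fixpoint in plain Java, closing statement-introduced pairs under symmetry and transitivity the way a Datalog engine would. The actual model additionally distinguishes program points and access paths; the rules shown in the comments are a simplification, not the model's own.

```java
import java.util.HashSet;
import java.util.Set;

// Rule-style view of must-alias: start from pairs introduced by individual
// statements (e.g., a move x = y) and close the relation under symmetry and
// transitivity, roughly as a Datalog engine would evaluate
//   MustAlias(x, y) :- Move(x, y).
//   MustAlias(y, x) :- MustAlias(x, y).
//   MustAlias(x, z) :- MustAlias(x, y), MustAlias(y, z).
// (Illustrative only; program points, access paths, and the optional rules
// that tune inference power are omitted.)
public class MustAliasFixpoint {
    record Pair(String left, String right) {}

    static Set<Pair> close(Set<Pair> moves) {
        Set<Pair> rel = new HashSet<>(moves);
        boolean changed = true;
        while (changed) {                      // naive fixpoint iteration
            changed = false;
            Set<Pair> derived = new HashSet<>();
            for (Pair p : rel) derived.add(new Pair(p.right(), p.left()));   // symmetry
            for (Pair p : rel)
                for (Pair q : rel)
                    if (p.right().equals(q.left()))
                        derived.add(new Pair(p.left(), q.right()));          // transitivity
            changed = rel.addAll(derived);
        }
        return rel;
    }

    public static void main(String[] args) {
        Set<Pair> moves = Set.of(new Pair("a", "b"), new Pair("b", "c"));
        Set<Pair> mustAlias = close(moves);
        System.out.println(mustAlias.contains(new Pair("c", "a")));  // true
    }
}
```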
We present a points-to analysis for C/C++ that recovers much of
the available high-level structure information of types and
objects, by applying two key techniques: (1) It records the type
of each abstract object and, in cases when the type is not readily
available, the analysis uses an allocation-site plus type
abstraction to create multiple abstract objects per allocation
site, so that each one is associated with a single type. (2) It
creates separate abstract objects that represent (a) the fields of
objects of either struct or class type, and (b) the (statically
present) constant indices of arrays, resulting in a limited form
of array-sensitivity.
We apply our approach to the full LLVM bitcode intermediate
language and show that it yields much higher precision than past
analyses, allowing accurate distinctions between subobjects,
v-table entries, array components, and more. Especially for C++
programs, this precision is invaluable for a realistic
analysis. Compared to the prior state-of-the-art approach, our
techniques exhibit substantially better precision across multiple
metrics and realistic benchmarks (e.g., 40+% more variables with a
single points-to target).
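The allocation-site-plus-type abstraction of technique (1) can be pictured with a small bookkeeping sketch: when an allocation's type is not evident at the site itself (e.g., a malloc whose result is later cast), abstract objects are keyed by the pair of allocation site and discovered type, so one site may yield several single-typed abstract objects. The Java rendering below is only an illustration, since the analysis itself operates on LLVM bitcode; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Bookkeeping for an allocation-site + type abstraction: untyped allocation
// sites are split into one abstract object per type discovered on the fly
// (e.g., at a cast of the allocated pointer), so every abstract object
// carries a single, definite type.
public class AllocSitePlusType {
    record AbstractObject(String allocSite, String type) {}

    private final Map<AbstractObject, AbstractObject> objects = new HashMap<>();

    // Called when the analysis sees evidence that the value allocated at
    // `allocSite` is used with type `type` (e.g., it flows into a cast to `type`).
    AbstractObject objectFor(String allocSite, String type) {
        AbstractObject key = new AbstractObject(allocSite, type);
        return objects.computeIfAbsent(key, k -> k);
    }

    public static void main(String[] args) {
        AllocSitePlusType heap = new AllocSitePlusType();
        // One wrapper-style allocation site, observed under two different casts:
        AbstractObject asCircle = heap.objectFor("alloc@factory.c:42", "struct Circle");
        AbstractObject asSquare = heap.objectFor("alloc@factory.c:42", "struct Square");
        // The site is split into two single-typed abstract objects.
        System.out.println(asCircle.equals(asSquare));                      // false
        System.out.println(asCircle == heap.objectFor("alloc@factory.c:42",
                                                      "struct Circle"));    // true: canonicalized
    }
}
```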
Reflection is a highly dynamic language feature that poses grave
problems for static analyses. In the Java setting, reflection is
ubiquitous in large programs. Any handling of reflection will be
approximate, and overestimating its reach in a large codebase can
be catastrophic for precision and scalability. We present an
approach for handling reflection with improved empirical soundness
(as measured against prior approaches and dynamic information) in
the context of a points-to analysis. Our approach is based on the
combination of string-flow and points-to analysis from past
literature, augmented with (a) substring analysis and modeling of
partial string flow through string builder classes; and (b) new
techniques for analyzing reflective entities based on information
available at their use-sites. In experimental comparisons with
prior approaches, we demonstrate a combination of both improved
soundness (recovering the majority of missing call-graph edges)
and increased performance.
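One way to picture the use-site techniques of (b): even when the string argument of a Class.forName is unknown, the points where the reflective result is used (a cast applied to a reflectively created instance, a member name passed to getMethod) constrain which classes the call may denote. The sketch below performs such filtering over a hypothetical list of candidate classes; it is a simplified stand-in for the analysis's actual back-propagation of use-site information.

```java
import java.util.List;

// Use-site filtering: the unknown target of a reflective Class.forName(s)
// can only be a class that (a) is compatible with the type the reflective
// instance is cast to at its use site, and (b) has the member name
// observed at a getMethod/getField use site.
public class UseSiteFilter {

    // Candidate classes stand in for the analysis's class metadata.
    static List<Class<?>> filter(List<Class<?>> candidates,
                                 Class<?> castTargetAtUseSite,
                                 String methodNameAtUseSite) {
        return candidates.stream()
                .filter(castTargetAtUseSite::isAssignableFrom)
                .filter(c -> {
                    for (var m : c.getMethods())
                        if (m.getName().equals(methodNameAtUseSite)) return true;
                    return false;
                })
                .toList();
    }

    public static void main(String[] args) {
        // Suppose the reflective result is cast to CharSequence and then
        // getMethod("charAt", ...) is invoked on its class.
        List<Class<?>> candidates =
                List.of(String.class, StringBuilder.class, Integer.class, Thread.class);
        System.out.println(filter(candidates, CharSequence.class, "charAt"));
        // Prints [class java.lang.String, class java.lang.StringBuilder]
    }
}
```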
Pointer Analysis
Yannis Smaragdakis, George Balatsouras
Foundations and Trends® in Programming Languages, 2015
Pointer analysis is a fundamental static program analysis, with a
rich literature and wide applications. The goal of pointer
analysis is to compute an approximation of the set of program
objects that a pointer variable or expression can refer to. We
present an introduction and survey of pointer analysis techniques,
with an emphasis on distilling the essence of common analysis
algorithms. To this end, we focus on a declarative presentation of
a common core of pointer analyses: algorithms are modeled as
configurable, yet easy-to-follow, logical specifications. The
specifications serve as a starting point for a broader discussion
of the literature, as independent threads spun from the
declarative model.
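The common core that such a declarative presentation distills is small; for instance, a context- and flow-insensitive, subset-based analysis can be read as two rules: a variable points to what it allocates, and to whatever the variables copied into it point to. The Java fixpoint below evaluates exactly those two rules over toy input facts, as a minimal stand-in for a Datalog specification; the fact names are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal subset-based (Andersen-style) points-to core, read off two rules:
//   VarPointsTo(v, heap)  :- Alloc(v, heap).
//   VarPointsTo(to, heap) :- Move(to, from), VarPointsTo(from, heap).
public class PointsToCore {
    public static void main(String[] args) {
        // Input facts: variable -> allocation site, and copy assignments to <- from.
        Map<String, String> alloc = Map.of("a", "new@1", "b", "new@2");
        Map<String, String> move = Map.of("c", "a", "d", "c", "e", "b");

        Map<String, Set<String>> varPointsTo = new HashMap<>();
        alloc.forEach((v, heap) ->
                varPointsTo.computeIfAbsent(v, k -> new HashSet<>()).add(heap));

        boolean changed = true;
        while (changed) {                       // iterate the Move rule to a fixpoint
            changed = false;
            for (var e : move.entrySet()) {
                Set<String> fromSet = varPointsTo.getOrDefault(e.getValue(), Set.of());
                Set<String> toSet =
                        varPointsTo.computeIfAbsent(e.getKey(), k -> new HashSet<>());
                changed |= toSet.addAll(fromSet);
            }
        }
        System.out.println(varPointsTo);
        // e.g. {a=[new@1], b=[new@2], c=[new@1], d=[new@1], e=[new@2]} (map order may vary)
    }
}
```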
Context-sensitivity is the primary approach for adding more
precision to a points-to analysis, while hopefully also
maintaining scalability. An oft-reported problem with
context-sensitive analyses, however, is that they are bi-modal:
either the analysis is precise enough that it manipulates only
manageable sets of data, and thus scales impressively well, or the
analysis gets quickly derailed at the first sign of imprecision
and becomes orders-of-magnitude more expensive than would be
expected given the program’s size. There is currently no approach
that makes precise context-sensitive analyses (of any flavor:
call-site-, object-, or type-sensitive) scale across the board at
a level comparable to that of a context-insensitive analysis. To
address this issue, we propose introspective analysis: a technique
for uniformly scaling context-sensitive analysis by eliminating
its performance-detrimental behavior, at a small precision
expense. Introspective analysis consists of a common adaptivity
pattern: first perform a context-insensitive analysis, then use
the results to selectively refine (i.e., analyze
context-sensitively) program elements that will not cause
explosion in the running time or space. The technical challenge is
to appropriately identify such program elements. We show that a
simple but principled approach can be remarkably effective,
achieving scalability (often with dramatic speedup) for benchmarks
previously completely out-of-reach for deep context-sensitive
analyses.
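The adaptivity pattern is simple to state in code: run the cheap context-insensitive analysis, derive per-method cost metrics from its results, and grant context only to program elements whose metrics stay under a threshold. The sketch below hard-codes a single hypothetical metric and threshold; an actual introspective analysis combines several metrics, and the names used here are illustrative.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Introspective-analysis selection step: after a context-insensitive pass,
// keep context-sensitivity only for methods that look cheap to refine.
public class IntrospectiveSelection {

    // Hypothetical metric gathered from the context-insensitive results,
    // e.g., total points-to facts attributable to each method.
    static Set<String> methodsToRefine(Map<String, Integer> factsPerMethod,
                                       int threshold) {
        return factsPerMethod.entrySet().stream()
                .filter(e -> e.getValue() <= threshold)   // skip "hot" methods
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Map<String, Integer> factsPerMethod = Map.of(
                "List.add", 120,
                "Util.format", 45_000,    // imprecision hot spot: left context-insensitive
                "Config.get", 300);
        System.out.println(methodsToRefine(factsPerMethod, 10_000));
        // e.g. [List.add, Config.get]
    }
}
```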
We present the problem of class hierarchy complementation: given a
partially known hierarchy of classes together with subtyping
constraints (“A has to be a transitive subtype of B”) complete the
hierarchy so that it satisfies all constraints. The problem has
immediate practical application to the analysis of partial
programs—e.g., it arises in the process of providing a sound
handling of “phantom classes” in the Soot program analysis
framework. We provide algorithms to solve the hierarchy
complementation problem in the single inheritance and multiple
inheritance settings. We also show that the problem in a language
such as Java, with single inheritance but multiple subtyping and
distinguished class vs. interface types, can be decomposed into
separate single- and multiple-subtyping instances. We implement
our algorithms in a tool, JPhantom, which complements partial Java
bytecode programs so that the result is guaranteed to satisfy the
Java verifier requirements. JPhantom is highly scalable and runs
in mere seconds even for large input applications and complex
constraints (with a maximum of 14s for a 19MB binary).
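For intuition about the easier end of the problem: in a setting with multiple subtyping (e.g., among Java interfaces), every constraint "A must be a transitive subtype of B" can be discharged by adding a direct edge from each missing type to its required supertypes. The sketch below does exactly that and nothing more; the single-inheritance case that JPhantom must also solve requires the more involved algorithms described above, and all names here are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hierarchy complementation in the multiple-subtyping setting: each
// constraint "sub must be a (transitive) subtype of sup" can be satisfied
// by declaring a direct edge sub -> sup for the missing (phantom) types.
// (Detecting unsatisfiable, cyclic constraint sets and handling the much
// harder single-inheritance case are omitted in this sketch.)
public class InterfaceComplementation {
    record Constraint(String sub, String sup) {}

    static Map<String, Set<String>> complete(Set<String> knownTypes,
                                             List<Constraint> constraints) {
        Map<String, Set<String>> supertypes = new HashMap<>();
        for (Constraint c : constraints) {
            // Constraints on known types are assumed to already hold.
            if (knownTypes.contains(c.sub())) continue;
            supertypes.computeIfAbsent(c.sub(), k -> new HashSet<>()).add(c.sup());
        }
        return supertypes;   // declared supertypes for each phantom type
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("app/Main");
        List<Constraint> constraints = List.of(
                new Constraint("lib/MissingA", "java/io/Serializable"),
                new Constraint("lib/MissingA", "lib/MissingB"),
                new Constraint("lib/MissingB", "java/lang/Runnable"));
        System.out.println(complete(known, constraints));
        // e.g. {lib/MissingA=[java/io/Serializable, lib/MissingB],
        //       lib/MissingB=[java/lang/Runnable]}
    }
}
```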
We present set-based pre-analysis: a virtually universal
optimization technique for flow-insensitive points-to
analysis. Points-to analysis computes a static abstraction of how
object values flow through a program’s variables. Set-based
pre-analysis relies on the observation that much of this reasoning
can take place at the set level rather than the value
level. Computing constraints at the set level results in
significant optimization opportunities: we can rewrite the input
program into a simplified form with the same essential points-to
properties. This rewrite results in removing both local variables
and instructions, thus simplifying the subsequent value-based
points-to computation. Effectively, set-based pre-analysis puts
the program in a normal form optimized for points-to analysis.
Compared to other techniques for off-line optimization of
points-to analyses in the literature, the new elements of our
approach are the ability to eliminate statements, and not just
variables, as well as its modularity: set-based pre-analysis can
be performed on the input just once, e.g., allowing the
pre-optimization of libraries that are subsequently reused many
times and for different analyses. In experiments with Java
programs, set-based pre-analysis eliminates 30% of the program’s
local variables and 30% or more of computed context-sensitive
points-to facts, over a wide set of benchmarks and analyses,
resulting in a 24% average speedup (max: 103%, median: 20%).
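The flavor of the rewrite can be conveyed by one of its simplest instances: a local variable defined by a single copy and never redefined contributes nothing to the points-to sets of other variables, so its reads can be redirected to the copy's source and the copy deleted before the expensive value-based analysis runs. The sketch below applies only that one simplification over a toy instruction list; an actual set-based pre-analysis derives its rewrites from set-level constraints and removes considerably more, and the instruction encoding is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// One simple set-level rewrite in the spirit of set-based pre-analysis:
// a local defined by a single copy and never redefined can be eliminated,
// with its reads redirected to the copy's source; the points-to sets of
// all remaining variables are unchanged.  Instructions are encoded as
// String arrays: {"alloc", to, site} or {"copy", to, from}.
public class CopyElimination {

    static List<String[]> simplify(List<String[]> program) {
        Map<String, Integer> defCount = new HashMap<>();
        Map<String, String> copySource = new HashMap<>();
        for (String[] insn : program) {
            defCount.merge(insn[1], 1, Integer::sum);
            if (insn[0].equals("copy")) copySource.put(insn[1], insn[2]);
        }
        // Variables safe to eliminate: defined exactly once, by a copy.
        Map<String, String> replace = new HashMap<>();
        copySource.forEach((v, src) -> {
            if (defCount.get(v) == 1) replace.put(v, src);
        });
        List<String[]> out = new ArrayList<>();
        for (String[] insn : program) {
            if (insn[0].equals("copy") && replace.containsKey(insn[1])) continue; // drop
            String[] rewritten = insn.clone();
            if (insn[0].equals("copy")) {
                String from = insn[2];
                Set<String> seen = new HashSet<>();          // guards against copy cycles
                while (replace.containsKey(from) && seen.add(from)) from = replace.get(from);
                rewritten[2] = from;
            }
            out.add(rewritten);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> program = List.of(
                new String[]{"alloc", "a", "new@7"},
                new String[]{"copy", "tmp", "a"},     // eliminable
                new String[]{"copy", "b", "tmp"},     // eliminable
                new String[]{"copy", "result", "b"},  // kept: 'result' is defined twice
                new String[]{"copy", "result", "a"});
        simplify(program).forEach(insn -> System.out.println(String.join(" ", insn)));
        // alloc a new@7
        // copy result a
        // copy result a
    }
}
```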