Many static analysis tools assume that the whole program is available for analysis. In practice, however, that is rarely the case, and you often end up with a program that you can run but not analyze, because the code of some library that you don't actually need is missing.
JPhantom tries to deal with this problem for Java, while being agnostic to the actual tool that will perform the static analysis. It applies a preprocessing step that detects phantom code and replaces it with dummy code, thus producing a complete program that should now be analyzable.
The real challenge of this task lies in the various type constraints that the final program (plus its dummy complement) must satisfy in order to form valid Java bytecode. You can read more about it here and here.
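To make the idea of phantom code concrete, here is a minimal sketch in Python. It is not JPhantom's implementation: the `ClassInfo` model, `find_phantoms`, and `render_stub` are hypothetical names invented for illustration, and the sketch only discovers missing classes and the methods they must declare. It deliberately ignores the hard part mentioned above, namely choosing supertypes and signatures so that every type constraint is satisfied.

```python
from dataclasses import dataclass, field


@dataclass
class ClassInfo:
    """Toy model of one available class (hypothetical, for illustration)."""
    name: str
    superclass: str = "Object"
    # (owner class, method name) pairs invoked by this class's code
    calls: set = field(default_factory=set)


def find_phantoms(classes, library=frozenset({"Object"})):
    """Map each referenced-but-undefined class to the methods it must declare."""
    known = {c.name for c in classes} | set(library)
    phantoms = {}
    for c in classes:
        if c.superclass not in known:
            phantoms.setdefault(c.superclass, set())
        for owner, method in c.calls:
            if owner not in known:
                phantoms.setdefault(owner, set()).add(method)
    return phantoms


def render_stub(name, methods):
    """Render a dummy Java-like class declaring the required methods."""
    lines = [f"class {name} {{"]
    for m in sorted(methods):
        lines.append(
            f"    void {m}() {{ throw new UnsupportedOperationException(); }}"
        )
    lines.append("}")
    return "\n".join(lines)


app = [
    ClassInfo("App", calls={("Logger", "info"), ("App", "run")}),
    ClassInfo("Handler", superclass="BaseHandler"),
]
for name, methods in sorted(find_phantoms(app).items()):
    print(render_stub(name, methods))
```

Here `Logger` and `BaseHandler` are reported as phantom because the available code refers to them without defining them. A real complement would also have to ensure, for example, that `BaseHandler` is a class rather than an interface, since `Handler` extends it.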
This project is still in its early stages but, in the long run, it aims to host various analyses written in Datalog for C/C++ programs, or more generally for any language that compiles to LLVM bitcode. Right now it provides a context-insensitive pointer analysis and is able to recover most of the C++ source information, such as v-tables and the class hierarchy, and to resolve virtual calls with reasonable precision.
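For readers unfamiliar with the term, a context-insensitive pointer analysis can be sketched as a fixpoint computation over a handful of inclusion rules. The Python below is a textbook, Andersen-style toy and not the project's actual Datalog rules. The function name `points_to` and the four input relations are assumptions made for this illustration, and it works on abstract statements rather than LLVM bitcode.

```python
def points_to(alloc, copy, load, store):
    """Compute points-to sets by iterating inclusion rules to a fixpoint.

    alloc: (p, o) pairs for  p = &o
    copy:  (p, q) pairs for  p = q
    load:  (p, q) pairs for  p = *q
    store: (p, q) pairs for  *p = q
    """
    pts = {}

    def add(target, objs):
        # Merge objs into pts[target]; report whether anything new was added.
        current = pts.setdefault(target, set())
        before = len(current)
        current |= objs
        return len(current) != before

    for p, o in alloc:
        add(p, {o})

    changed = True
    while changed:
        changed = False
        for p, q in copy:
            changed |= add(p, pts.get(q, set()))
        for p, q in load:
            for o in list(pts.get(q, ())):
                changed |= add(p, pts.get(o, set()))
        for p, q in store:
            for o in list(pts.get(p, ())):
                changed |= add(o, pts.get(q, set()))
    return pts


# p = &a; q = &b; r = p; *r = q; s = *p
result = points_to(
    alloc=[("p", "a"), ("q", "b")],
    copy=[("r", "p")],
    load=[("s", "p")],
    store=[("r", "q")],
)
print(sorted(result["s"]))
```

Each rule corresponds naturally to one Datalog rule, and the `while` loop plays the role of the Datalog engine's fixpoint evaluation. The analysis is context-insensitive because every variable has a single points-to set regardless of the calling context.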
There is also a presentation about the basic ideas behind cclyzer's design and modeling of its pointer analysis.
Static analysis aims to achieve an understanding of program behavior, by means of automatic reasoning that requires only the program’s source code and not any actual execution. To reach a truly broad level of program understanding, static analysis techniques need to create an abstraction of memory that covers all possible executions. Such abstract models may quickly degenerate after losing essential structural information about the memory objects they describe, due to the use of specific programming idioms and language features, or because of practical analysis limitations. In many cases, some of the lost memory structure may be retrieved, though it requires complex inference that takes advantage of indirect uses of types. Such recovered structural information may, then, greatly benefit static analysis.
This dissertation shows how we can recover structural information, first (i) in the context of C/C++, and next, in the context of higher-level languages without direct memory access, like Java, where we identify two primary causes of losing memory structure: (ii) the use of reflection, and (iii) analysis of partial programs. We show that, in all cases, the recovered structural information greatly benefits static analysis on the program.
For C/C++, we introduce a structure-sensitive pointer analysis that refines its abstraction based on type information that it discovers on-the-fly. This analysis is implemented in cclyzer, a static analysis tool for LLVM bitcode. Next, we present techniques that extend a standard Java pointer analysis by building on top of state-of-the-art handling of reflection. The principle is similar to that of our structure-sensitive analysis for C/C++: track the use of reflective objects, during pointer analysis, to gain important insights on their structure, which can be used to “patch” the handling of reflective operations on the running analysis, in a mutually recursive fashion. Finally, to address the challenge of analyzing partial Java programs in full generality, we define the problem of “program complementation”: given a partial program we seek to provide definitions for its missing parts so that the “complement” satisfies all static and dynamic typing requirements induced by the code under analysis. Essentially, complementation aims to recover the structure of phantom types. Apart from discovering missing class members (i.e., fields and methods), satisfying the subtyping constraints leads to the formulation of a novel typing problem in the OO context, regarding type hierarchy complementation. We offer algorithms to solve this problem in various inheritance settings, and implement them in JPhantom, a practical tool for Java bytecode complementation.
Program analysis often requires manual inspection of not just the source code but the actual IR that is passed to the static analysis tool. Most of the time, this intermediate representation is stored in a binary format, such as Java bytecode, that cannot be read directly.
Thus, to open an Emacs buffer containing Java bytecode, for instance, one first needs to run a command that disassembles the file, and only then open the disassembled contents.
Things are even worse when such a file is part of an archive (e.g., inside a jar file), since it has to be extracted first.
This is a task that can be easily automated by Emacs, assuming your system is properly configured and the actual disassembler commands (e.g., javap) are included in your PATH.
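The manual steps being automated (extract the class file from the archive, then run the disassembler on it) can be sketched outside Emacs as follows. This is only an illustration in Python, not the Emacs code itself. The helper names `javap_command`, `extract_member`, and `disassemble` are hypothetical, and `disassemble` assumes that `javap` is available on your PATH.

```python
import subprocess
import tempfile
import zipfile


def javap_command(class_file):
    """Build the disassembler command line for one class file."""
    # -c prints the disassembled bytecode, -p also shows private members
    return ["javap", "-c", "-p", class_file]


def extract_member(archive, member, dest_dir):
    """Extract one entry from a jar (zip) archive and return its path."""
    with zipfile.ZipFile(archive) as jar:
        return jar.extract(member, dest_dir)


def disassemble(archive, member):
    """Return the javap output for a class stored inside a jar."""
    with tempfile.TemporaryDirectory() as tmp:
        class_file = extract_member(archive, member, tmp)
        result = subprocess.run(
            javap_command(class_file),
            capture_output=True,
            text=True,
            check=True,  # raise if javap fails, e.g. on a corrupt class file
        )
        return result.stdout
```

A call such as `disassemble("app.jar", "com/example/Main.class")`, with a made-up archive and entry name, would return the text that you would otherwise view in a buffer.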
Since most of the static analysis tools I have been working on use the Datalog language, and more specifically, the (proprietary) LogicBlox Engine, I have been maintaining an Emacs mode for this exact version of Datalog.
It mainly provides highlighting and indentation at the moment, but I will be pushing some new features from time to time.