G. Balatsouras | Homepage

George Balatsouras
About Me
Projects
Publications
All Things Emacs
Blog
Contact Me
gr.linkedin.com/
last.fm/user/gbalats
github.com/gbalats
bitbucket.org/gbal

My name is George Balatsouras. I finished my PhD in 2017, at the University of Athens, in the area of programming languages (static analysis in particular). My advisor was Yannis Smaragdakis.

JPhantom

Many static analysis tools assume that the whole program will be available, in order to analyze it. In practice, however, that is rarely the case, and there are many times that you are stuck with a program which you are able to run, but not analyze (due to some missing code of some library that you don't actually need)!

JPhantom tries to deal with this problem for Java, while being agnostic to the actual tool that will perform the static analysis. It applies a preprocessing step that detects phantom code and replaces it with dummy code, thus producing a complete program that should be now analyzable.

The real challenge of this task is the various type constraints that the final program (plus its dummy complement) must satisfy in order to form valid Java bytecode. You can read more about it here and here.

Currently, JPhantom is being used by Doop and Droidel.

CCLYZER

This project is still at its early stages but, in the long run, it aims to host various analyses written in Datalog for C/C++ programs, or generally any LLVM bitcode generating language. Right now it provides a context-insensitive pointer analysis and is able to recover most of the C++ source information, like v-tables, class hierarchy, etc, and resolve virtual calls with reasonable precision.

Even though the project is open source and hosted on github, it requires either PA-Datalog or the original (proprietary) LogicBlox Engine to run.

There is also a presentation about the basic ideas behind cclyzer's design and modeling of its pointer analysis.

An Efficient Data Structure for Must-Alias Analysis abstract pdf George Kastrinis, George Balatsouras, Kostas Ferles, Nefeli Prokopaki-Kostopoulou, Yannis Smaragdakis CC '18

A must-alias (or “definite-alias”) analysis computes sets of expressions that are guaranteed to be aliased at a given pro- gram point. The analysis has been shown to have significant practical impact, and it is actively used in popular research frameworks and commercial tools. We present a custom data structure that speeds up must-alias analysis by nearly two orders of magnitude (while computing identical results). The data structure achieves efficiency by encoding multiple alias sets in a single linked structure, and compactly representing the aliasing relations of arbitrarily long expressions. We ex- plore the data structure’s performance in both an imperative and a declarative setting and contrast it extensively with prior techniques. With our approach, must-alias analysis can be performed efficiently, over large Java benchmarks, in under half a minute, making the analysis cost acceptable for most practical uses.

Recovering Structural Information for Better Static Analysis abstract pdf George Balatsouras PhD Thesis

Static analysis aims to achieve an understanding of program behavior, by means of automatic reasoning that requires only the program’s source code and not any actual execution. To reach a truly broad level of program understanding, static analysis techniques need to create an abstraction of memory that covers all possible executions. Such abstract models may quickly degenerate after losing essential structural information about the memory objects they describe, due to the use of specific programming idioms and language features, or because of practical analysis limitations. In many cases, some of the lost memory structure may be retrieved, though it requires complex inference that takes advantage of indirect uses of types. Such recovered structural information may, then, greatly benefit static analysis.

This dissertation shows how we can recover structural information, first (i) in the context of C/C++, and next, in the context of higher-level languages without direct memory access, like Java, where we identify two primary causes of losing memory structure: (ii) the use of reflection, and (iii) analysis of partial programs. We show that, in all cases, the recovered structural information greatly benefits static analysis on the program.

For C/C++, we introduce a structure-sensitive pointer analysis that refines its abstraction based on type information that it discovers on-the-fly. This analysis is implemented in cclyzer, a static analysis tool for LLVM bitcode. Next, we present techniques that extend a standard Java pointer analysis by building on top of state-of-the-art handling of reflection. The principle is similar to that of our structure-sensitive analysis for C/C++: track the use of reflective objects, during pointer analysis, to gain important insights on their structure, which can be used to “patch” the handling of reflective operations on the running analysis, in a mutually recursive fashion. Finally, to address the challenge of analyzing partial Java programs in full generality, we define the problem of “program complementation”: given a partial program we seek to provide definitions for its missing parts so that the “complement” satisfies all static and dynamic typing requirements induced by the code under analysis. Es- sentially, complementation aims to recover the structure of phantom types. Apart from discovering missing class members (i.e., fields and methods), satisfying the subtyping con- straints leads to the formulation of a novel typing problem in the OO context, regarding type hierarchy complementation. We offer algorithms to solve this problem in various in- heritance settings, and implement them in JPhantom, a practical tool for Java bytecode complementation.

A Datalog Model of Must-Alias Analysis abstract pdf George Balatsouras, Kostas Ferles, George Kastrinis, Yannis Smaragdakis SOAP '17

We give a declarative model of a rich family of must-alias analyses. Our emphasis is on careful and compact modeling, while exposing the key points where the algorithm can adjust its inference power. The model is executable, in the Datalog language, and forms the basis of a full-fledged must-alias analysis of Java bytecode in the DOOP framework.

Structure-Sensitive Points-To Analysis for C and C++ abstract pdf slides George Balatsouras, Yannis Smaragdakis SAS '16

We present a points-to analysis for C/C++ that recovers much of the available high-level structure information of types and objects, by applying two key techniques: (1) It records the type of each abstract object and, in cases when the type is not readily available, the analysis uses an allocation-site plus type abstraction to create multiple abstract objects per allocation site, so that each one is associated with a single type. (2) It creates separate abstract objects that represent (a) the fields of objects of either struct or class type, and (b) the (statically present) constant indices of arrays, resulting in a limited form of array-sensitivity. We apply our approach to the full LLVM bitcode intermediate language and show that it yields much higher precision than past analyses, allowing accurate distinctions between subobjects, v-table entries, array components, and more. Especially for C++ programs, this precision is invaluable for a realistic analysis. Compared to the state-of-the-art past approach, our techniques exhibit substantially better precision along multiple metrics and realistic benchmarks (e.g., 40+% more variables with a single points-to target).

More Sound Static Handling of Java Reflection abstract pdf Yannis Smaragdakis, George Balatsouras, George Kastrinis, Martin Bravenboer APLAS '15

Reflection is a highly dynamic language feature that poses grave problems for static analyses. In the Java setting, reflection is ubiquitous in large programs. Any handling of reflection will be approximate, and overestimating its reach in a large codebase can be catastrophic for precision and scalability. We present an approach for handling reflection with improved empirical soundness (as measured against prior approaches and dynamic information) in the context of a points-to analysis. Our approach is based on the combination of string-flow and points-to analysis from past literature augmented with (a) substring analysis and modeling of partial string flow through string builder classes; (b) new techniques for analyzing reflective entities based on information available at their use-sites. In experimental comparisons with prior approaches, we demonstrate a combination of both improved soundness (recovering the majority of missing call-graph edges) and increased performance.

Pointer Analysis abstract pdf Yannis Smaragdakis, George Balatsouras Foundations and Trends® in Programming Languages 2015

Pointer analysis is a fundamental static program analysis, with a rich literature and wide applications. The goal of pointer analysis is to compute an approximation of the set of program objects that a pointer variable or expression can refer to. We present an introduction and survey of pointer analysis techniques, with an emphasis on distilling the essence of common analysis algorithms. To this end, we focus on a declarative presentation of a common core of pointer analyses: algorithms are modeled as configurable, yet easy-to-follow, logical specifications. The specifications serve as a starting point for a broader discussion of the literature, as independent threads spun from the declarative model.

Introspective Analysis: Context-Sensitivity, Across the Board abstract pdf Yannis Smaragdakis, George Kastrinis, George Balatsouras PLDI '14

Context-sensitivity is the primary approach for adding more precision to a points-to analysis, while hopefully also maintaining scalability. An oft-reported problem with context-sensitive analyses, however, is that they are bi-modal: either the analysis is precise enough that it manipulates only manageable sets of data, and thus scales impressively well, or the analysis gets quickly derailed at the first sign of imprecision and becomes orders-of-magnitude more expensive than would be expected given the program’s size. There is currently no approach that makes precise context-sensitive analyses (of any flavor: call-site-, object-, or type-sensitive) scale across the board at a level comparable to that of a context-insensitive analysis. To address this issue, we propose introspective analysis: a technique for uniformly scaling context-sensitive analysis by eliminating its performance-detrimental behavior, at a small precision expense. Introspective analysis consists of a common adaptivity pattern: first perform a context-insensitive analysis, then use the results to selectively refine (i.e., analyze context-sensitively) program elements that will not cause explosion in the running time or space. The technical challenge is to appropriately identify such program elements. We show that a simple but principled approach can be remarkably effective, achieving scalability (often with dramatic speedup) for benchmarks previously completely out-of-reach for deep context-sensitive analyses.

Class Hierarchy Complementation: Soundly Completing a Partial Type Graph abstract pdf slides George Balatsouras, Yannis Smaragdakis OOPSLA '13

We present the problem of class hierarchy complementation: given a partially known hierarchy of classes together with subtyping constraints (“A has to be a transitive subtype of B”) complete the hierarchy so that it satisfies all constraints. The problem has immediate practical application to the analysis of partial programs—e.g., it arises in the process of providing a sound handling of “phantom classes” in the Soot program analysis framework. We provide algorithms to solve the hierarchy complementation problem in the single inheritance and multiple inheritance settings. We also show that the problem in a language such as Java, with single inheritance but multiple subtyping and distinguished class vs. interface types, can be decomposed into separate single- and multiple-subtyping instances. We implement our algorithms in a tool, JPhantom, which complements partial Java bytecode programs so that the result is guaranteed to satisfy the Java verifier requirements. JPhantom is highly scalable and runs in mere seconds even for large input applications and complex constraints (with a maximum of 14s for a 19MB binary).

Set-Based Pre-Processing for Points-To Analysis abstract pdf Yannis Smaragdakis, George Balatsouras, George Kastrinis OOPSLA '13

We present set-based pre-analysis: a virtually universal optimization technique for flow-insensitive points-to analysis. Points-to analysis computes a static abstraction of how object values flow through a program’s variables. Set-based pre-analysis relies on the observation that much of this reasoning can take place at the set level rather than the value level. Computing constraints at the set level results in significant optimization opportunities: we can rewrite the in- put program into a simplified form with the same essential points-to properties. This rewrite results in removing both local variables and instructions, thus simplifying the subsequent value-based points-to computation. Effectively, set-based pre-analysis puts the program in a normal form optimized for points-to analysis. Compared to other techniques for off-line optimization of points-to analyses in the literature, the new elements of our approach are the ability to eliminate statements, and not just variables, as well as its modularity: set-based pre-analysis can be performed on the input just once, e.g., allowing the pre-optimization of libraries that are subsequently reused many times and for different analyses. In experiments with Java programs, set-based pre-analysis eliminates 30% of the program’s local variables and 30% or more of computed context-sensitive points-to facts, over a wide set of benchmarks and analyses, resulting in a 24% average speedup (max: 103%, median: 20%).

Automatic Bytecode Disassembly

Program analysis often requires manual inspection of not just the source code but the actual IR that is passed to the static analysis tool. Most of the times this intermediate representation is a compressed format such as Java bytecode.

Thus, to open an Emacs buffer containing Java bytecode for instance, one first needs to run a command that disassembles the file, and then he may open its disassembled contents.

Things are even worse when such a file is part of an archive (e.g., inside a jar file). Then you have to extract it first too.

This is a task that can be easily automated by emacs, assuming your system is properly configured and the actual disassembler commands are included in your PATH (e.g., javap).

To that end, I wrote these two simple Emacs extensions that handle the task for (i) Java Bytecode and (ii) LLVM Bitcode .

LB Datalog Mode

Since most of the static analysis tools I have been working on use the Datalog language, and more specifically, the (proprietary) LogicBlox Engine, I have been maintaining an Emacs mode for this exact version of Datalog.

It mainly provides highlighting and indentation at the moment, but I will be pushing some new features from time to time.

Added Feature Updates:

Comment *Do What I Mean* plus movement commands (by clause or atom).
Better integration with the *compilation* buffer.
Expand-region support (clause expansion).
Workspace specific commands (attaching to a workspace, adding logic, or performing a query).
Some custom snippets.

24 Oct 2016 » Datalog Macro-Constructors
10 Dec 2015 » Compiling Autotooled projects to LLVM Bitcode

Working Address Click to open map

Dept. of Informatics and Telecommunications
National and Kapodistrian University of Athens
Ilisia, 15784, Greece

George Balatsouras

ABOUT ME

PROJECTS