Datalog Macro-Constructors
CClyzer has been my main project for some time now, and I've been pondering for quite a while how to add context-sensitivity to it.
The merge and record functions (described in Pick Your Contexts Well: Understanding Object-Sensitivity) are a very useful abstraction for parameterizing context-sensitivity. Doop has been using them extensively, to obtain limitless analysis variations (w.r.t. context-sensitivity) without requiring any changes in the core analysis logic. Since both CClyzer and Doop are written in Datalog (and are very similar in their analysis logic), this abstraction seems like a very good fit.
The main idea of this abstraction is that there are primarily two places where one needs to create contexts:
- At function calls—where you should use merge
- At memory allocation instructions—where you should use record
So, by supplying the appropriate definitions of merge and record, you can get various flavors of context-sensitivity (e.g., call-site sensitivity, object sensitivity, type sensitivity … you name it). I'm not gonna explain the differences between these variations here; anyone who wants to dig deeper can take a look at these slides.
This post is more about how this concept of merge and record functions can be implemented in Datalog (or to be more precise, in the LogicBlox engine’s dialect of Datalog, with its constructor extensions and so on).
Background: Context-Sensitivity in Datalog
In a context-sensitive analysis for Java, the typical Datalog rule that applies the merge function (the case for record is similar, so I'm not gonna show it) looks very close to this:
MERGE_MACRO(callerCtx, invocation, hctx, heap, calleeCtx),
CallGraph:Edge(callerCtx, invocation, calleeCtx, tomethod),
VarPointsTo(hctx, heap, calleeCtx, this)
<-
VarPointsTo(hctx, heap, callerCtx, base),
ReachableMethod(inmethod),
Instruction:Method[invocation] = inmethod,
VirtualMethodInvocation:Base[invocation] = base,
HeapAllocation:Type[heap] = heaptype,
VirtualMethodInvocation:Name[invocation] = name,
VirtualMethodInvocation:Signature[invocation] = signature,
MethodLookup[name, signature, heaptype] = tomethod,
Method:ThisVar[tomethod] = this.
whereas the corresponding context-insensitive version of this rule would look like:
CallGraph:Edge(invocation, tomethod),
VarPointsTo(heap, this)
<-
VarPointsTo(heap, base),
ReachableMethod(inmethod),
Instruction:Method[invocation] = inmethod,
VirtualMethodInvocation:Base[invocation] = base,
HeapAllocation:Type[heap] = heaptype,
VirtualMethodInvocation:Name[invocation] = name,
VirtualMethodInvocation:Signature[invocation] = signature,
MethodLookup[name, signature, heaptype] = tomethod,
Method:ThisVar[tomethod] = this.
Let's look at the 2nd, rather simplified, version to get a grip on what's going on. This rule basically states the following:

- at a virtual method invocation instruction, `invocation`, inside method `inmethod`
- which has already been inferred to be reachable by the analysis (`ReachableMethod(inmethod)`)
- if the receiver of the call (`base`) points to some heap object `heap` with dynamic type `heaptype`, and
- the method being called, given its `name` and `signature`, dynamically dispatches to `tomethod`

Then:

- we can infer that `this` will point to object `heap`, and
- record a call-graph edge from the given `invocation` instruction to `tomethod`
The difference in the context-sensitive setting is that the `VarPointsTo` and `CallGraph:Edge` predicates are augmented with extra context arguments. So, where the analysis previously inferred that variable `base` points to abstract object `heap`, it now infers that variable `base` under context `callerCtx` points to the `(heap, hctx)` pair. Likewise, a call-graph edge now connects an invocation instruction under some calling context, `(invocation, callerCtx)`, with the method being called and its callee context, `(tomethod, calleeCtx)`.
MERGE_MACRO(callerCtx, invocation, hctx, heap, calleeCtx),
CallGraph:Edge(callerCtx, invocation, calleeCtx, tomethod), <-----------
VarPointsTo(hctx, heap, calleeCtx, this) <-----------
<-
VarPointsTo(hctx, heap, callerCtx, base), <-----------
ReachableMethod(inmethod),
Instruction:Method[invocation] = inmethod,
VirtualMethodInvocation:Base[invocation] = base,
HeapAllocation:Type[heap] = heaptype,
VirtualMethodInvocation:Name[invocation] = name,
VirtualMethodInvocation:Signature[invocation] = signature,
MethodLookup[name, signature, heaptype] = tomethod,
Method:ThisVar[tomethod] = this.
However, the callee context depends on the flavor of context-sensitivity chosen, and that's where the merge function comes in. So, given a number of possible arguments, only some of which will eventually be used, merge creates a possibly new callee context and returns it (through its last argument).
Note here that the `calleeCtx` variable is existentially quantified: it only appears in the rule's head and is not bound anywhere in the rule's body. Such a thing is not possible in vanilla Datalog. LB Datalog's constructors, though, are a mechanism that can facilitate such a need, since they can be used to create new, previously non-existent entities and return them via existentially quantified variables.
LB Datalog Constructors
In LB Datalog, you define a constructor as follows:
Context(ctx) -> .
Context:New[invoc1, invoc2] = ctx ->
Instruction(invoc1), Instruction(invoc2), Context(ctx).
lang:constructor(`Context:New).
The above states that `Context` is an entity predicate (a custom-defined type) and that the constructor predicate `Context:New` can be used to create a new context `ctx`, given two invocation instructions. Just as in algebraic data types, the constructor and its arguments uniquely identify the entity they create. E.g., `Context:New` will create at most one context per combination of invocation instructions: after the context is created, any subsequent calls to the constructor with the same arguments will yield the same existing context.
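For instance, here is a minimal sketch of this uniqueness property (the `Invoc1` and `Invoc2` predicates are hypothetical inputs, not part of the analysis): two rules construct contexts from the same key, so whenever a pair of invocations satisfies both predicates, the two heads denote one and the same `Context` entity.

// Hypothetical input predicates pairing invocation instructions.
Invoc1(i1, i2) -> Instruction(i1), Instruction(i2).
Invoc2(i1, i2) -> Instruction(i1), Instruction(i2).

// Both rules construct from the same key (i1, i2); a pair satisfying both
// Invoc1 and Invoc2 yields a single context, not two.
Context(ctx), Context:New[i1, i2] = ctx <- Invoc1(i1, i2).
Context(ctx), Context:New[i1, i2] = ctx <- Invoc2(i1, i2).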
In fact, one can declare multiple constructor predicates for the same entity (e.g., `Context`), and pattern match entities via the constructor that was used to create them. Most importantly, a constructor atom in the head of a rule can create a new entity and bind it to an existentially quantified variable.
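As a sketch (with made-up constructor and predicate names), a `Context` entity could have two constructors, and a rule can then match only those contexts that were built by one of them:

Context:ByInvoc[invoc] = ctx ->
Instruction(invoc), Context(ctx).
Context:ByHeap[heap] = ctx ->
HeapAllocation(heap), Context(ctx).
lang:constructor(`Context:ByInvoc).
lang:constructor(`Context:ByHeap).

ContextInvocation(ctx, invoc) -> Context(ctx), Instruction(invoc).

// Matches only contexts created via Context:ByInvoc, recovering the
// invocation that was used as the constructor argument.
ContextInvocation(ctx, invoc) <-
Context:ByInvoc[invoc] = ctx.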
So, suppose we were set on implementing our analysis using a fixed 2 call-site sensitive approach. The rule for virtual method invocations would become:
Context(calleeCtx), <------------
Context:New[lastinvoc, invocation] = calleeCtx, <------------
CallGraph:Edge(callerCtx, invocation, calleeCtx, tomethod),
VarPointsTo(hctx, heap, calleeCtx, this)
<-
VarPointsTo(hctx, heap, callerCtx, base),
ReachableMethod(inmethod),
Instruction:Method[invocation] = inmethod,
VirtualMethodInvocation:Base[invocation] = base,
HeapAllocation:Type[heap] = heaptype,
VirtualMethodInvocation:Name[invocation] = name,
VirtualMethodInvocation:Signature[invocation] = signature,
MethodLookup[name, signature, heaptype] = tomethod,
Method:ThisVar[tomethod] = this,
Context:New[_, lastinvoc] = callerCtx. <------------
A few things to note:

- `calleeCtx` is existentially quantified, since it is only bound in the rule's head and not in its body
- the `Context(calleeCtx)` atom is needed too, besides the constructor atom; in general, new entities created by constructors require two atoms in some rule's head, using the syntax: `SomeEntity(X), SomeEntity:Constructor[...] = X`
- `Context:New` is used in the rule's body to pattern match (caller) contexts that were created using this constructor, and to get the 2nd invocation (which was used as a constructor argument)
- `Context:New` is used in the rule's head to create a (new) callee context by appending the current invocation instruction to all but the first invocation element of the caller context (in the case of 2 call-site sensitivity, that is just a single element)
This is all valid LB Datalog code. However, without merge, this approach lacks the genericity to support many different kinds of contexts; it is fixed to 2 call-site sensitivity.
The Case For Macros
Could we use constructor predicates for implementing the merge and record functions in pure LB Datalog?
First, they seem syntactically similar, since both create an entity and bind it to an existentially quantified variable. As constructors, merge and record would look like this:
Merge[callerCtx, invocation, hctx, heap] = calleeCtx ->
Context(callerCtx), Context(calleeCtx),
Instruction(invocation), HeapContext(hctx), HeapAllocation(heap).
Record[ctx, heap] = hctx ->
Context(ctx), HeapAllocation(heap), HeapContext(hctx).
However, despite their syntactic similarity, they differ significantly in meaning. For one thing, merge and record do not need all their arguments to uniquely identify a context. For instance, in 2 call-site sensitivity, merge only needs `callerCtx` and `invocation`. In other kinds of context-sensitivity, another subset of the arguments would be needed.

Constructors are not able to express this need: the entirety of their arguments is used, as is, to uniquely identify the created context. Two invocations of the constructor that differ in their `hctx` argument will always create different contexts.
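To see the problem concretely, here is what would go wrong if `Merge` were a real constructor (a sketch with hypothetical bindings):

// The same call site, analyzed under two different heap contexts:
//   Merge[callerCtx, invocation, hctx1, heap] = ctx1
//   Merge[callerCtx, invocation, hctx2, heap] = ctx2
// All four arguments identify the constructed entity, so ctx1 != ctx2,
// even though 2 call-site sensitivity demands that ctx1 = ctx2 here.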
Also, even if we could choose which arguments to ignore, we would probably still need more flexibility than that. To achieve 2 call-site sensitivity, we should be able to pattern match the `callerCtx` argument against its constructor (`Context:New`), extract the two invocations that created it, and use just the 2nd invocation along with the current one to create our new context. This is already demonstrated in the code of the previous section.
Since these features are not provided by LB Datalog, Doop uses the C macro preprocessor to implement this functionality.
Using macros, merge could be defined in the following way for 2 call-site sensitivity:
#define MergeMacro(callerCtx, invocation, hctx, heap, calleeCtx) \
Context(calleeCtx), \
Context:New[Context:LastInvoc[callerCtx], invocation] = calleeCtx
where `Context:LastInvoc` is defined so as to return the last invocation from the constructor arguments:
Context:LastInvoc[ctx] = invoc <-
Context:New[_, invoc] = ctx.
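A companion macro for record could look like this; the definition below is my own sketch (not necessarily Doop's actual one), assuming a `HeapContext` entity with a `HeapContext:New` constructor over a single invocation, as in the examples further below. It gives each object a 1-invocation heap context: the last call site of the context under which it was allocated.

#define RecordMacro(ctx, heap, hctx) \
HeapContext(hctx), \
HeapContext:New[Context:LastInvoc[ctx]] = hctx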
Many kinds of context-sensitivity can be defined by supplying the relevant macro-definitions for merge and record.
The Case Against Macros
The thing is that we want to be able to choose what context-sensitivity to use when the analysis runs. Hence, Doop's use of the C preprocessor is certainly out of the norm: it transforms the analysis logic at runtime, by applying different macro-substitutions depending on the user's choice (normally given through some command-line option).
This rather hackish strategy has several drawbacks:
- Doop cannot easily compile its logic using the LB Datalog compiler, since the macro-substitutions have not yet been performed at compile time.
- Instead of having a pure LB Datalog syntax, we end up with an unorthodox Datalog+CPP blend that breaks syntax highlighting and limits tool support.
- Line control: if we are not careful with our macro definitions, the line numbers in reported errors will be out of whack.
- The approach is limited, and for some cases more macro hacks are needed to support a new kind of context.
- The analysis workflow gets burdened with this extra step of applying macro-substitutions (otherwise, the analysis tool would not need a preprocessor at all).
- Since we lack tool support, we cannot use LB Datalog's compiler and module mechanism (via projects) to modularize the analysis logic. The current solution? More macro hacks with `#include` directives that combine many separate logic files into giant ones, further messing up our line numbers!
This is not a comprehensive list. Since Doop was not designed to compile its analysis logic in the first place, some of these points did not seem to matter at the time. But having worked on Doop, I can say that not being able to compile your logic leads to a huge waste of productivity: you wait several seconds for the analysis to fail at runtime with some compile error, over and over again.
Derived Predicates
There is another feature of LB Datalog that is relevant here: derived predicates. When a predicate is declared to be derived, the compiler will not store its contents in the database. Instead, it will inline its definition at its use sites.
So, the following LB Datalog program:
lang:derivationType[`Foo] = "Derived".
Foo(x,y) <-
Bar(x,z), Bar(z,y).
Foobar(x,y) <-
Foo(x,y), Bar(x,y).
will be transformed to:
Foobar(x,y) <-
Bar(x,z), Bar(z,y), Bar(x,y).
and no table will be created to store the contents of `Foo`. This is very handy when the derived predicate would be too large to store in the database.
As one would expect, derived predicates come with many restrictions; one is that constructors cannot be declared to be derived. This makes sense, since a constructor creates new entities: it would serve no purpose if we did not store them in the database.
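In other words, hypothetically combining the two directives on the same predicate, as below, is precisely the combination the compiler rejects:

Context:New[invoc1, invoc2] = ctx ->
Instruction(invoc1), Instruction(invoc2), Context(ctx).
lang:constructor(`Context:New).
lang:derivationType[`Context:New] = "Derived". // rejected: constructors cannot be derived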
Beyond Derived Predicates: Macro-Constructors
Semantically, what the compiler does in the case of derived predicates is apply a logical equivalence. In propositional logic:
d <= b
a <= d /\ c
to
a <= d /\ c <= b /\ c
a <= b /\ c
where `d` corresponds to the derived predicate, and is eventually eliminated. This is only one of the possible logical equivalences that could be exploited by the Datalog compiler. The following generalization is just as valid:
d /\ x <= b
a <= d /\ c
to
a /\ x <= d /\ x /\ c <= b /\ c
a /\ x <= b /\ c
Now imagine that `d` corresponds to the merge predicate, and `a` to the actual constructor. How is this so different from:
Merge[...] = calleeCtx,
CallGraph:Edge(...),
VarPointsTo(...)
<-
...body...
Context:New[lastinvoc, invocation] = calleeCtx
<-
Merge[callerCtx, invocation, ...] = calleeCtx,
Context:New[_, lastinvoc] = callerCtx.
being transformed to:
Context:New[lastinvoc, invocation] = calleeCtx,
CallGraph:Edge(...),
VarPointsTo(...)
<-
...body...
Context:New[_, lastinvoc] = callerCtx.
It's the same logical equivalence being applied, only to constructors as well. It allows us to declare a pseudo-constructor predicate (i.e., `Merge`) that resembles a normal constructor, but whose contents are not stored in the database (which is exactly what we want, just as with derived predicates). Instead, another rule (like the second one above), which uses an actual constructor in its head, creates the entities, by being automatically weaved into all the rules (like the first one) that seemed to create them via the pseudo-constructor.
Syntax-wise, all we need is a new language directive that declares a predicate to be of this pseudo-constructor kind (let's call them macro-constructors for now). Other than that, everything is syntactically valid LB Datalog code. The compiler will make the transformation under the hood, pretty much the same way it handles derived predicates.
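By analogy with `lang:constructor`, such a declaration could hypothetically look like this (the directive name is made up; no such directive exists in LB Datalog today):

lang:macroConstructor(`Merge).
lang:macroConstructor(`Record).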
So, in the realm of static analysis, the core of the analysis itself would comprise rules of the 1st form (the ones with a macro-constructor like `Merge` in their head) that look like they create new contexts. The context-sensitivity code would be decoupled from the core of the analysis, as rules of the 2nd form (the ones with macro-constructors like `Merge` in their body, mapping them to some actual constructor in their head; the two atoms should also return the same entity, e.g., `calleeCtx`).
It is somewhat counter-intuitive, though, in the sense that uses of the `Merge` macro will seem like they end up creating new contexts. But if `Merge` were stored in the database, this is exactly what would happen: the `Merge` facts would be created by the first rule, and their creation would trigger the creation of `Context:New` facts due to the second rule. The transformation achieves the same end result semantically, without storing any `Merge` facts at all.
More Contexts
Having decoupled the core analysis logic, it is very easy to create a new context-sensitivity variant using macro-constructors.
For 2 call-site sensitivity + 1 heap context, here’s all we need:
Context(ctx) -> .
HeapContext(hctx) -> .
Context:New[invoc1, invoc2] = ctx ->
Instruction(invoc1), Instruction(invoc2), Context(ctx).
HeapContext:New[invoc] = hctx ->
Instruction(invoc), HeapContext(hctx).
lang:constructor(`Context:New).
lang:constructor(`HeapContext:New).
Context:New[lastinvoc, invocation] = calleeCtx <-
Merge[callerCtx, invocation, _, _] = calleeCtx,
Context:New[_, lastinvoc] = callerCtx.
HeapContext:New[lastinvoc] = hctx <-
Record[ctx, _] = hctx,
Context:New[_, lastinvoc] = ctx.
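To see the sliding window of call sites in action, here is a small trace with hypothetical invocation instructions `i1`, `i2`, `i3`:

// callerCtx was created as Context:New[i1, i2]; merging at call site i3
// pattern matches lastinvoc = i2, so:
//   Merge[callerCtx, i3, _, _] = calleeCtx
//   ==> calleeCtx = Context:New[i2, i3]   (the two most recent call sites)
// and recording under calleeCtx keeps only the most recent call site:
//   Record[calleeCtx, heap] = hctx
//   ==> hctx = HeapContext:New[i3]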
For 2 object sensitivity + 2 heap context:
Context(ctx) -> .
HeapContext(hctx) -> .
Context:New[heap1, heap2] = ctx ->
HeapAllocation(heap1), HeapAllocation(heap2), Context(ctx).
HeapContext:New[heap1, heap2] = hctx ->
HeapAllocation(heap1), HeapAllocation(heap2), HeapContext(hctx).
lang:constructor(`Context:New).
lang:constructor(`HeapContext:New).
Context:New[lastheap, heap] = calleeCtx <-
Merge[_, _, hctx, heap] = calleeCtx,
HeapContext:New[_, lastheap] = hctx.
HeapContext:New[heap1, heap2] = hctx <-
Record[ctx, _] = hctx,
Context:New[heap1, heap2] = ctx.
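Similarly, a quick trace with hypothetical heap objects `h1`, `h2`, `h3`:

// the receiver object h3 was allocated under hctx = HeapContext:New[h1, h2],
// so merging on a call with receiver h3 gives:
//   Merge[_, _, hctx, h3] = calleeCtx
//   ==> calleeCtx = Context:New[h2, h3]   (the two most recent allocation sites)
// and an object allocated under calleeCtx inherits both of them:
//   Record[calleeCtx, heap] = newHctx
//   ==> newHctx = HeapContext:New[h2, h3]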
There is more to macro-constructors than just elegantly expressing all the existing Doop contexts. The macro definition of merge is much more restricted, whereas with macro-constructors we can have rules of arbitrary complexity.
We could, for instance, supply many different 2nd form (macro-in-body) rules for the same macro-constructor. In such a case, the compiler should expand rules of the 1st form (macro-in-head) multiple times, once for each relevant macro-in-body rule of the given macro-constructor.
We could even have complex context-sensitivity logic that creates contexts using different constructors, depending on some arbitrary criterion. The core logic (1st form, macro-in-head rules) stays the same: all we need is to supply multiple 2nd form (macro-in-body) rules that use different constructors in their heads to create contexts. The possibilities seem endless.
In a practical scenario, suppose we were able to perform a preprocessing step that identified factory methods (either by some heuristic property of their structure or by some naming convention, e.g., those prefixed by `new`). It would make sense to give heap context to objects created therein, even when all other heap objects have no context at all.
So let’s take an ordinary 1 object sensitive analysis without heap context:
Context(ctx) -> .
HeapContext(hctx) -> .
Context:New[heap] = ctx ->
HeapAllocation(heap), Context(ctx).
HeapContext:New[] = hctx ->
HeapContext(hctx).
lang:constructor(`Context:New).
lang:constructor(`HeapContext:New).
Context:New[heap] = calleeCtx <-
Merge[_, _, hctx, heap] = calleeCtx,
HeapContext:New[] = hctx.
HeapContext:New[] = hctx <-
Record[ctx, _] = hctx,
Context:New[_] = ctx.
and add heap context selectively, just to allocations inside factory methods:
Context(ctx) -> .
HeapContext(hctx) -> .
Context:New[heap] = ctx ->
HeapAllocation(heap), Context(ctx).
HeapContext:New0[] = hctx ->
HeapContext(hctx).
HeapContext:New1[heap] = hctx ->
HeapAllocation(heap), HeapContext(hctx).
lang:constructor(`Context:New).
lang:constructor(`HeapContext:New0).
lang:constructor(`HeapContext:New1).
Context:New[heap] = calleeCtx <-
Merge[_, _, _, heap] = calleeCtx.
// Rule a1
HeapContext:New0[] = hctx <-
Record[_, heap] = hctx,
AssignHeapAllocation:Heap[allocinstr] = heap,
Instruction:Method[allocinstr] = inmethod,
!FactoryMethod(inmethod).
// Rule a2
HeapContext:New1[lastheap] = hctx <-
Record[ctx, heap] = hctx,
AssignHeapAllocation:Heap[allocinstr] = heap,
Instruction:Method[allocinstr] = inmethod,
FactoryMethod(inmethod),
Context:New[lastheap] = ctx.
Notice that there are two actual heap context constructors (`HeapContext:New0` and `HeapContext:New1`) and two corresponding rules, let's call them `a1` and `a2`, that use these constructors to create heap contexts. The transformation should take every rule `r` of the program (in the core analysis logic) with `Record` in its head, and replace `r` with two new rules that are the result of weaving `a1` and `a2` into `r`.
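For illustration, take a hypothetical core rule `r` whose head records a `VarPointsTo` fact via `Record` (its body is elided below). Weaving `a1` and `a2` into `r` should produce something like:

// r woven with a1: non-factory allocations get the empty heap context
HeapContext:New0[] = hctx,
VarPointsTo(hctx, heap, ctx, var)
<-
...body of r...,
AssignHeapAllocation:Heap[allocinstr] = heap,
Instruction:Method[allocinstr] = inmethod,
!FactoryMethod(inmethod).

// r woven with a2: factory allocations get a 1-object heap context
HeapContext:New1[lastheap] = hctx,
VarPointsTo(hctx, heap, ctx, var)
<-
...body of r...,
AssignHeapAllocation:Heap[allocinstr] = heap,
Instruction:Method[allocinstr] = inmethod,
FactoryMethod(inmethod),
Context:New[lastheap] = ctx.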
Implementation of Macro-Constructors
There are many obvious constraints that the compiler should enforce regarding macro-constructors. These come to mind:
For rules that contain a macro-constructor atom in their bodies:
- the head of the rule should contain an actual constructor atom
- this atom and the macro-constructor one should bind the same variable as the returned (newly created) entity
- we could even enforce the body of the rule to be an and-expression and the head to contain only the constructor atom, just to make the transformation easier, even though more flexibility could be possible in some cases
For rules that contain a macro-constructor atom in their head:
- the macro-constructor atom should have the same restrictions as if it were an ordinary constructor
There is no major challenge in the transformation itself; the challenge lies rather in the modularity of the approach. When should this transformation take place? When compiling a project, or lazily at runtime? What if we add some logic that uses macro-constructors, without providing the actual constructors yet?
The following approach seems to achieve a good compromise:
- During compilation, we treat macro-constructors as regular constructors, but mark them specially.
- We prohibit the addition of any logic that refers to macro-constructors, without first having loaded the logic that resolves them to the actual constructors.
- If we have already loaded such logic, we perform the transformation each time we load additional logic that uses the predefined macro-constructors (macro-in-head), so as to completely eliminate them from the actual logic to be added to the database.
This way, our primary limitation will be that we’ll have to provide and load definitions for the macro-constructors prior to adding any more logic—just as, in conventional languages, all symbols have to be resolved at link-time to produce an executable.