Final 1.3 update #688

ivg · 2017-08-11T17:11:27Z

This PR brings multiple new features from our development branches. In general this PR focuses on three main features:

BIL correctness and normalization
BIL readability and performance
Primus interpreter

Note for reviewers and readers: I wouldn't expect you guys to check the code thoroughly, but you are welcome to ask questions, request clarifications, and discuss changes. Most likely just reading this PR message would be enough for the review.

Note: the Table of Contents doesn't work since Github doesn't emit anchors on the PR page. They will work (I hope so), when we will move this page to the release page.

Detailed Description

BIL Typechecker

We added a type checker, and now force all BIL code to be well-typed. Lifters must produce only well-typed code. So far, we will issue a warning for each piece of code that is not, along the 1.4 we are planning to fix all issues if any. The Type module now have the infer function, that will infer type for an expression, and check that will type check a BIL program.

BIL Effect Analysis

A precise effect analysis was added to ensure correctness of program transformations. The effect analysis distinguishes between different effects and coeffects:

reads (coeffect)
loads (coeffect)
writes (effect)
raises (effect)

The raises effect includes the division by zero CPU exception and a conservative AI will infer a possible presence of the division by zero in a BIL expression.
A new module, called Eff is added, that provides functions for the effect analysis.

BIL Simplification and Constant Propagation

The old constant propagation and simplification algorithm was buggy, and didn't take into account effects when removing statements and subexpressions. It was rewritten from scratch with an algorithm that is sound and parametrized by the set of effects, that may be ignored during the analysis. The new analysis treats operations accurately with respect to their algebraic properties and takes into account operations associativity. It is also now a big step analysis and no longer requires to be wrapped into the fixpoint function. Basically, we are moving from the open and semi-open recursion to the closed recursion wherever it is possible, as it is more efficient and intuitive approach. We will probably remove the fixpoint function in BAP.2.0.

BIL normal form and normalization utilities

We introduce a notion of a BIL Normal Form, that is a sub-language of BIL that disallows certain constructions, such as let and if/then/else expressions, complex memory expressions, and complex conditionals. This doesn't impose, that those expressions are not allowed in general, and lifters can still emit code that is easier to write. All such expressions are removed, i.e., are rewritten during the normalization process. The normalization procedure is described in details in the documentation, as well as in comments. Here is a brief highlight:

no let expressions
no if/then/else expressions
all memory operations are one-byte only (size and endianness can be ignored)
no anonymous memory storages - load expressions may load only from a variable
no loads and stores inside conditional expressions
no ill-typed expressions

SSA transformation plugin

This is a simple plugin that just transfers all functions into SSA form. It was moved from the bap-plugins repository.

Dead code elimination plugin

A sound analysis that will remove all unused code. This plugin is very useful, as it will remove lots of unnecessary code that are emitted by lifters. In particular, on x86 it removes flag compuations that are not used anywhere. As a result, this reduces the size of the output to 50% on x86 and makes BIL much more readable. It also highly increases the speed of interpretation. Especially, since many flag computations are quite computationally hard.

New pretty printer for bitvectors

Printers from the Z library proved to be slow, buggy, and ugly. Thus we decided to provide our own pretty-printer that will suit all the needs. A generic printer is added, that allows a user to customize the look of the textual representation of a bitvector. We also provide 9 different instantiations of the generic printer, e.g., pp_hex, pp_dec, pp_oct and pp_bin that will print a bitvector in the corresponding format. We also remove suffixes from a bitvector to make output more readable and concise. The full version is available via pp_hex_full, pp_dec_full, ... printers. We also change a default textual representation (i.e., to_string and pp) functions. The new representation now includes the signedness suffix, and moves away from using true and false for the one-bit words.

New functions for bitvectors

We added the to_int[XX]_exn family of functions, that will efficiently cast a bitvector into a corresponding OCaml type without extra allocations whenever possible. For example, Word.to_int_exn is usually an identity function w.r.t. the GC (i.e., it still has a check that value fits, and if it is, the it is an identity).

We also added the Word.unsigned function that will cast away the signedness of a bitvector.

The Unsafe module was added to the bitvector interface that provides arithmetic instructions that do not check that operands have correct widths. The module can be used for interpreting a well-typed BIL code. Moreover, these operations are included by default in the Bitvector top-level interface, thus, a Word.(a + b) will result in the unsafe (+) operator.

New pretty printer for BIL

We implemented a new pretty printer that (i) doesn't emit unnecessary parentheses anymore, and uses more concise syntax whenever it is possible. We followed the C language for the operator precedence (not OCaml). Here is the precedence table:

   +-------------------------------------------+----+
   | x[y], <cast>:<N>[x], extract:<N>:<M>[x]   | 10 |
   +-------------------------------------------+----+
   | ~x, -x                                    |  9 |
   +-------------------------------------------+----+
   | *,/,/$, %, %$                             |  8 |
   +-------------------------------------------+----+
   | +,-                                       |  7 |
   +-------------------------------------------+----+
   | <<,>>,~>>                                 |  6 |
   +-------------------------------------------+----+
   | <,>,<=,>=                                 |  5 |
   +-------------------------------------------+----+
   | =, <>                                     |  4 |
   +-------------------------------------------+----+
   | &                                         |  3 |
   +-------------------------------------------+----+
   | ^                                         |  2 |
   +-------------------------------------------+----+
   | |                                         |  1 |
   +-------------------------------------------+----+
   | let.., x with y <- z, if.., . (dot)       |  0 |
   +-------------------------------------------+----+

Higher precedence means that operators binds tighter.
Any bil statement has precedence lower than an expression

The load and store expressions are also simplifed, now it is not necessary to put the endiannes and size if an expression is byte sized.

Identifiable values for the Primus interpreter

We used the word type for values in the Primus interpreter, however, it showed that it is very hard to write a data tracking analysis without being able to identify values. For example, in the taint propagation plugin, we implemented a stack machine for tracking the taint. It is not only hard to maintain and understand, but also is inefficient. This defeats the main reason why we decided to use the word type as a value representation in the interpreter -- the efficiency.

We now represent a computation result as a pair of a word and a unique identifier. Basically the same as with the Bil.Result.t except that we are not using a variant type to represent storages and bottom values. In the Primus model, there are no such values, and everything is represented with a machine word, coined with an identifier.

To make it easier to work with the new values, we lifted most of the bitvector interface into the value interface.

New Primus interpreter

The Primus Interpreter is reimplemented and doesn't use the eval or microx code anymore (though there is still lots of code sharing since we factorized a lot of code from microx to the BIL library). The interpreter now tracks data much more closely, and provides an interface that allows code summaries, that are implemented in Lisp or in OCaml or in any other language to describe all code effect precisely. We also revisited observations that are made by the interpreter, and moved memory and environment observations into the interpreter.

New Primus Mark Visitied plugin

This plugin will mark each term that is visited (evaluated) by the Primus interpreter with the visited attribute. It is useful for tracking the coverage.

Diagnostic messages with better backtraces

There is an impedance mismatch between OCaml exceptions and the error monad. Sometimes we switch from one to another and loose the backtracing information. Now the backtraces are always stored in Or_error.t values and are printed in case of failure.

Caveats, Limitations, and Regressions

The normalized form is not represented in a type system, thus it maybe applied more than once. In future we are planning to add phantom types for that.
The dead-code elimination plugin is computationally intensive but is not cached. Thus the startup time of each binary is actually increased. We will later and the caching layer to alleviate this.

Fixes and updates for the x86 lifter

The shift operation representation was slightly incorrect, as it was using x := x instruction to represent that the value of x is unchanged. From the analysis perspective this is not entirely true, as it is an update of a variable, an unnecessary usage (that keeps DCE from removing this statement) and, in many cases, it left the right hand side left variable upper exposed, that was introducing a source of undefined values, as well as prohibited certain kind of optimizations on the interprocedural values (we were forced to treat x as an implicit input parameter of a function, where it is upper exposed).

The shift lifter also introduced unnecessary temporal variables, as well as used lots of ite. The former is removed, the latter is translated into one if statement (basically all ite were guarded by the same condition, so we can gather them, that will lead to much better CFG).

Bug fixes in ARM lifter

The ARM lifter was emitting badly typed code. Now it is fixed.

Support for tracing and other minor fixes

We added a handler for SIGINT so that gprof and other utilites, that requires a correct exit call, can work properly. We also added BAP_DEBUG=true to the Travis environment to ease the debugging on it.

TODO

Note, this is a work in progress, there are still a few minor issues that should be resolved before this branch would be ready for the merge. Here comes a list

Mark all new functions with the @since tag
Document all new functions
update all lifters with the typechecking code
remove calls to Bil.fixpoint as it is no longer necessary (in most cases)
check that all new plugins have sufficient manpages and documentation
check that all usages of Word.string_of_value are consistent, so that we will not break any integrations with external tools
update README.md with nice examples of BIL. Ideally, use the well-known strcpy as an example.
rewrite observer notifications so that flambda can inline them
cleanup the observations, make sure that we don't have any dead observation handlers.

The main change is that now there is only one main state in the machine (aka project), not a separate copy per machine. This commit also fixes leave-{blk,sub} events and adds several new observations, such as exn-raised, that occurs every time a machine switches to an exceptional control flow. The commit also adds few common method to the monad interface.

Now it is possible to attach arbitrary values to a bitvector.

also fixes a bug in Eval.binop.

This bug leads to a type error, as a result of a byte load and a half word load is stored in a 32 bit variable.

a type checker didn't held correctly shifts and the normalizer didn't properly recurse

allow a backend to choose whether to propagate consts or not.

this is temporal cludge.

BIR becomes unreadable. We will apply normalization when needed. Later, we may add a memoization or stuff like that.

this enables better diagnostics of compilation and configuration errors on travis.

the plugin will mark all terms that were visited by Primus with the visited attribute.

Should do this long ago, but finally found some time. No more unnecessary parentheses! Note: operator precedences follow the C language, not OCaml. +-------------------------------------------+----+ | x[y], <cast>:<N>[x], extract:<N>:<M>[x] | 10 | +-------------------------------------------+----+ | ~x, -x | 9 | +-------------------------------------------+----+ | *,/,/$, %, %$ | 8 | +-------------------------------------------+----+ | +,- | 7 | +-------------------------------------------+----+ | <<,>>,~>> | 6 | +-------------------------------------------+----+ | <,>,<=,>= | 5 | +-------------------------------------------+----+ | =, <> | 4 | +-------------------------------------------+----+ | & | 3 | +-------------------------------------------+----+ | ^ | 2 | +-------------------------------------------+----+ | | | 1 | +-------------------------------------------+----+ | let.., x with y <- z, if.., . (dot) | 0 | +-------------------------------------------+----+ 1) Higher precedence means that operators binds tighter. 2) Any bil statement has precedence lower than an expression.

1. at some branches the simplified didn't recurce 2. if a cast casts to the same type it is removed

it was wrong.

The ssa just translates a program into the SSA form The dead-code-elimination is a conservative deadcode elimination plugin, that helps alot with x86 binaries by removing tons of unused flags calcualations.

The string_of_value function now emits the `0x` prefix for hexnumbers, unless instructed explicitly with the `prefix:false` flag. The reason for this change is to enable a consitent handling of signed and unsinged, negative and positive values.

long time ago we forgot to make it public.

1. Now it will run until a fix point is reached 2. A simple constant propagation is added, so that more virtuals are removed

there is no need to have two different names for memory.

maverickwoo · 2017-08-15T20:51:12Z

OK, thanks for the clarifications. I now understand that BAP 1 can handle 1 thread and BAP 2 can handle 2 threads. :)

maurer · 2017-08-15T21:15:15Z

Yeah, what you wrote there was (almost) exactly what I was suggesting. The main difference is that I'm not suggesting that you put all basic block addresses there, but instead the list you assumed it was possible to jump to when you did your dead code analysis, e.g. only the immediate successor nodes for the indirect jump. Hopefully this is usually a short list... We could even package it as a separate pass that you do before any analysis that assumes a complete CFG. This would get us two wins:

If __cfi_abort or whatever we call it gets called, we can use what it got called with to revise/expand the CFG
When running code that has been deadcode-eliminated, there is no uncertainty over whether the execution is viable - any deadcode-eliminated trace that doesn't include a call to __cfi_abort is definitely equivalent to a trace of the original program.

ivg · 2017-08-15T21:40:14Z

Yeah, @maurer, I think it would be a good idea to have this pass. The main concern is that it will by default clobber the output. We can alleviate this by rewriting the test like this:


when r0 = a0 call r0
when r0 = a1 call r0
...
when r0 = aN call r0
call __cfi_abort with noreturn

It will be easier to analyse, and easier to eliminate clauses that are not feasible.

Yeah, looks nice, we will add this cfi-or-abort pass. But let's it do later, after the release.

Updated bap.mli with a documentation for new functions. Also, renamed Image.Scheme.reference to Image.Scheme.relocation.

Now it is enough just to type `make doc` and everything will work out of box, if a correct version of OCaml is used, i.e., 4.03. The Makefile recipe will pull the latest version of argot (git is required) and use it to build the documentation. BAP should be installed either from the source tree, or from opam, doesn't matter.

also removes some unnecessary stuff from it.

I've removed few unnecessary funciton, e.g., `State.modify` and also fixed the type of the call/cc function.

also fixes few bugs that were found as a result 1) a zero width word was created (should be a one bit width) 2) a bug in the typechecker that led to a stackoverflow So far no more type errors

also added parameters to the primus loader.

because my last update to primus broke it(((

The __primus_linker_unresolved_call function is called in case if a linker can resolved a call. Also adds the `pc` method to the interpreter interface and publish it as a primitive `get-current-program-counter`.

not actually in the random number generator, but rather in a casting it to a word.

just fail the machine.

Now we can specify a list of entry points, or a magic `all-subroutines` to start execution in parallel from all specified entry points. If no specified, then all subroutine terms marked with the `entry_point` attribute are entered.

The primus-limit plugin will terminate a machine after a certain amount of computations has happened.

ivg added 30 commits July 5, 2017 16:18

adds attributes to bitvectors

80e282c

Now it is possible to attach arbitrary values to a bitvector.

names the input interface of the Regular functor

f1c01cf

wip

c727172

wip

0270a94

wip

57ac0c6

wip

356644a

wip

9ffefbd

wip - now it compiles at least

b2de45e

Merge branch 'master' into identifiable-values

693ef92

ensures normalization and well-typedness

3dce937

also fixes a bug in Eval.binop.

fixes a bug in arm lifter

cc61387

This bug leads to a type error, as a result of a byte load and a half word load is stored in a 32 bit variable.

published Exp.normalization function.

5cfa5f3

fixes a bug in type checker and exp normalizer

5a1951a

a type checker didn't held correctly shifts and the normalizer didn't properly recurse

do not force constant folding

a9b0014

allow a backend to choose whether to propagate consts or not.

so far force the normalization in the exp

91fc4f3

this is temporal cludge.

removes exp normalization

a603537

BIR becomes unreadable. We will apply normalization when needed. Later, we may add a memoization or stuff like that.

sets BAP_DEBUG to true

8134dca

this enables better diagnostics of compilation and configuration errors on travis.

adds primus mark visted plugin

7ac0c14

the plugin will mark all terms that were visited by Primus with the visited attribute.

fixes simplifier, adds cast simplification

220bef3

1. at some branches the simplified didn't recurce 2. if a cast casts to the same type it is removed

apply constant folding by default

7aab046

fixes the Bitvector.unsigned function

1a08c7d

it was wrong.

Merge branch 'master' into identifiable-values

9b53efa

removes unnecessary optimizations from the IR.lift

31c9d8d

adds ssa and dead-code-elimination plugins

56ec68e

The ssa just translates a program into the SSA form The dead-code-elimination is a conservative deadcode elimination plugin, that helps alot with x86 binaries by removing tons of unused flags calcualations.

makes string_of_value more consistent

ca78a3d

The string_of_value function now emits the `0x` prefix for hexnumbers, unless instructed explicitly with the `prefix:false` flag. The reason for this change is to enable a consitent handling of signed and unsinged, negative and positive values.

publishes Exp.substitute function

dfa249f

long time ago we forgot to make it public.

enhances dead-code elimination plugin

3c91a53

1. Now it will run until a fix point is reached 2. A simple constant propagation is added, so that more virtuals are removed

renames mem32 and mem64 in x86 lifter to mem

1a8788d

there is no need to have two different names for memory.

maurer approved these changes Aug 15, 2017

View reviewed changes

ivg added 2 commits August 16, 2017 08:14

docs update and few renamings

ad61176

Updated bap.mli with a documentation for new functions. Also, renamed Image.Scheme.reference to Image.Scheme.relocation.

updated testsuite

ff1f1cb

gitoleg approved these changes Aug 16, 2017

View reviewed changes

ivg added 20 commits August 16, 2017 09:20

fix dash/underscore issue

38368bd

updates Primus docs

f63ac4c

also removes some unnecessary stuff from it.

adds monads interfaces

63c7883

updates monads docs

04b7d1d

cleans and documents the monads library

ef26b3b

I've removed few unnecessary funciton, e.g., `State.modify` and also fixed the type of the call/cc function.

removes extra blank lines from bap.mli

9779116

adds type checker to backends

a93d7d7

also fixes few bugs that were found as a result 1) a zero width word was created (should be a one bit width) 2) a bug in the typechecker that led to a stackoverflow So far no more type errors

removes unnecessary calls to Bil.fixpoint

6d71e01

Merge branch 'master' into final-1.3-update

99ae4d0

adds missing manpages

f37bee9

also added parameters to the primus loader.

typo in the docs

f296ce4

gone crazy and rewrote the taint propagation plugin

a05087c

because my last update to primus broke it(((

makes oasy little bit less hungry

a99e870

dispatches linker unresolved calls to a lisp stub

70752f3

The __primus_linker_unresolved_call function is called in case if a linker can resolved a call. Also adds the `pc` method to the interpreter interface and publish it as a primitive `get-current-program-counter`.

fixes a bug in random number generation

66ee85c

not actually in the random number generator, but rather in a casting it to a word.

don't fail the whole program on a division by zero

66f3c6c

just fail the machine.

enhances the run plugin with the multi-entry mode

6f4e4f2

Now we can specify a list of entry points, or a magic `all-subroutines` to start execution in parallel from all specified entry points. If no specified, then all subroutine terms marked with the `entry_point` attribute are entered.

flushes output from primus after each character

9aef93b

adds a Primus component that ensures termination

87ad64f

The primus-limit plugin will terminate a machine after a certain amount of computations has happened.

gitoleg merged commit 817eac8 into BinaryAnalysisPlatform:master Aug 24, 2017

ivg deleted the final-1.3-update branch March 7, 2018 15:38

ivg mentioned this pull request Mar 12, 2018

fixes BIL normalization procedure #789

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Final 1.3 update #688

Final 1.3 update #688

ivg commented Aug 11, 2017 •

edited

Loading

maverickwoo commented Aug 15, 2017

maurer commented Aug 15, 2017 •

edited

Loading

ivg commented Aug 15, 2017

Final 1.3 update #688

Final 1.3 update #688

Conversation

ivg commented Aug 11, 2017 • edited Loading

Table of Contents

Detailed Description

BIL Typechecker

BIL Effect Analysis

BIL Simplification and Constant Propagation

BIL normal form and normalization utilities

SSA transformation plugin

Dead code elimination plugin

New pretty printer for bitvectors

New functions for bitvectors

New pretty printer for BIL

Identifiable values for the Primus interpreter

New Primus interpreter

New Primus Mark Visitied plugin

Diagnostic messages with better backtraces

Caveats, Limitations, and Regressions

Fixes and updates for the x86 lifter

Bug fixes in ARM lifter

Support for tracing and other minor fixes

TODO

maverickwoo commented Aug 15, 2017

maurer commented Aug 15, 2017 • edited Loading

ivg commented Aug 15, 2017

ivg commented Aug 11, 2017 •

edited

Loading

maurer commented Aug 15, 2017 •

edited

Loading