knowledge base improvements #1217

Merged
ivg merged 3 commits into BinaryAnalysisPlatform:master from knowledge-base-improvements
Sep 10, 2020

@ivg (Member) commented Sep 10, 2020

This PR brings a few improvements to BAP that are summarized in the
following demo:

asciicast: https://asciinema.org/a/358996

Important highlights of the PR:

  • a REPL for querying and modifying the knowledge base
  • a portable and efficient representation of the knowledge base

REPL

The REPL uses linenoise and features completion (hit TAB) and
context-dependent hints (it prints what the grammar expects as you
type). It is, of course, extensible, i.e., it is possible to implement
your own commands and call them from the REPL. Script mode and direct
input of commands from the command line are also supported.

Efficient KB Representation

The KB representation is more efficient (more than a 2x improvement in
space) and is portable across different versions of bap (the
representation is itself versioned).

To enable this improvement we changed the representation of
Knowledge.Name to an interned form using a hash function with a low
probability of collisions, much like the polymorphic variants in
OCaml, except that we use 63 bits instead of 31. Of course, hash
collisions are detected and properly reported.
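
The interning idea can be sketched as follows. This is only an
illustration under stated assumptions: the `hash` function below is a
toy polynomial hash over OCaml's native int (63 bits on 64-bit
platforms), not the function BAP uses, and `table`, `intern`, and
`Collision` are hypothetical names that do not come from the BAP API.

```ocaml
(* Raised when two different names hash to the same identifier. *)
exception Collision of string * string

(* Hypothetical table that remembers which name produced each hash,
   so that a collision can be detected instead of silently ignored. *)
let table : (int, string) Hashtbl.t = Hashtbl.create 64

(* Toy polynomial hash kept non-negative with [land max_int].
   An assumption for illustration, NOT the hash function used by BAP. *)
let hash (s : string) : int =
  let h = ref 7 in
  String.iter (fun c -> h := (!h * 31 + Char.code c) land max_int) s;
  !h

(* Return the interned identifier of [s], reporting collisions. *)
let intern (s : string) : int =
  let id = hash s in
  match Hashtbl.find_opt table id with
  | None -> Hashtbl.add table id s; id
  | Some s' when String.equal s s' -> id
  | Some s' -> raise (Collision (s', s))
```

Once interned, two names can be compared and stored as plain integers,
which is where a saving in space and time would come from.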

This also slightly improved the performance and memory footprint of
bap in general, as names are used everywhere in BAP: in variables, in
sorts, etc.

Although the representation uses bin_prot, it is designed to enable
interaction with other languages as well as extensibility. Each
property is stored as <ID> <LEN> <PAYLOAD>, where <ID> is the name
of the property (interned), <LEN> is the length of the payload (so
that it can be skipped if it is not supported by the parser), and
<PAYLOAD> is the string of bytes in the format specific to the
property serializer (which itself may include a version tag).
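
A minimal sketch of why the length prefix makes unknown properties
skippable is shown below. The fixed-width little-endian fields (8 bytes
for the ID, 4 bytes for the length) are an assumption made to keep the
example self-contained; the actual encoding uses bin_prot and will
differ. The `write` and `read` functions are hypothetical and are not
part of BAP.

```ocaml
(* Append one record: 8-byte ID, 4-byte length, then the payload bytes.
   The field widths are an assumption, not the real on-disk format. *)
let write buf ~id payload =
  Buffer.add_int64_le buf (Int64.of_int id);
  Buffer.add_int32_le buf (Int32.of_int (String.length payload));
  Buffer.add_string buf payload

(* Decode every record whose ID is listed in [known]; any other record
   is skipped by jumping over its payload using the stored length. *)
let read ~known (data : string) : (int * string) list =
  let bytes = Bytes.of_string data in
  let rec loop pos acc =
    if pos >= Bytes.length bytes then List.rev acc
    else
      let id = Int64.to_int (Bytes.get_int64_le bytes pos) in
      let len = Int32.to_int (Bytes.get_int32_le bytes (pos + 8)) in
      let next = pos + 12 + len in
      if List.mem id known
      then loop next ((id, Bytes.sub_string bytes (pos + 12) len) :: acc)
      else loop next acc
  in
  loop 0 []
```

A reader that does not understand a property never has to parse its
payload, so an older parser can still read a file that contains newer
properties.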

Optimized Loading And Storing

Both loading and storing of the cache are now done via memory
mapping (which means that the knowledge base must be a regular
file). Since all the information is now stored in the knowledge base,
just loading it is enough to get the project, which makes loading the
project 20x to 25x faster than it was before. This affects both
loading from the cache and loading from the specified knowledge base.

Interaction With The Cache

The cache, as before, stores (along with other data) a knowledge base
for each file, indexed by the digest of the input file and all
parameters that affect the disassembly. The only change is that the
result of disassembly is now also stored in the knowledge base
(previously it was stored as a separate file). When no project is
specified (or the project file doesn't exist) the file is loaded from
the cache. This enables fast extraction of the file's KB from the
cache, e.g.,

bap /bin/ls --project ls.proj --update

will load /bin/ls from the cache and immediately store it in
ls.proj, provided that ls.proj did not already exist.

Lazy Project

The project data structure includes several heavyweight data
representations, such as the whole-program CFG, the Symtab, which
includes a CFG for each function, and the program data structure. This
information takes a lot of space both on disk and in RAM and was
computed even if it was never used. Moreover, it is easily computable
from the KB, which uses a much more efficient representation. To
address this we made the abovementioned data structures lazy, i.e., if
you don't use the program IR then it will not be computed. This saves
a lot of space and time.
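
The effect of making a field lazy can be sketched with OCaml's built-in
`lazy`. The record below is a hypothetical stand-in and not the real
Project type; the `builds` counter exists only to make the deferred
computation observable.

```ocaml
(* Hypothetical stand-in for a project with one expensive field. *)
type project = {
  name : string;
  cfg : string list Lazy.t;  (* placeholder for a heavyweight structure *)
}

(* Counts how many times the expensive structure was actually built. *)
let builds = ref 0

(* Creating a project does not build the CFG, it only records how to. *)
let create name =
  { name; cfg = lazy (incr builds; ["entry"; "exit"]) }

(* The first access forces the computation; later accesses reuse it. *)
let cfg p = Lazy.force p.cfg
```

A caller that never asks for `cfg` pays nothing for it, and a caller
that asks repeatedly pays only once.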

New API

The following APIs were added:

  • [Project.State] that represents the disassembled binary;
  • [Project.Analysis] for writing your own KB analyses.

Minor Tweaks

Tweaks the pretty-printing representation of the knowledge, BIR, and
BIL. It is now much more readable, concise, and properly indented.

Bug Fixes

Fixes #1216
Fixes #1169
Fixes #1168

ivg added 3 commits September 10, 2020 13:09
This PR brings a few improvements to BAP that are summarized in the
following demo:

https://asciinema.org/a/358996

as not all subroutines have names.
We have changed the binary representation, so the traces are no longer
valid, and for technical reasons we can't create new tests in the near
future, so the only solution is to temporarily disable them.
@ivg ivg merged commit 168ca30 into BinaryAnalysisPlatform:master Sep 10, 2020
@ivg ivg deleted the knowledge-base-improvements branch December 1, 2021 19:13