This PR brings a few improvements to BAP that are summarized in the
following demo:
Important highlights of the PR:
REPL
The REPL uses linenoise and features completion (hit TAB) and
context-dependent hints (it prints what the grammar expects as you
type). It is, of course, extensible, i.e., it is possible to implement
your own commands and call them from the REPL. A script mode and
direct input of commands from the command line are also
supported.
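To illustrate the idea of user-defined commands with completion, here is a minimal sketch in Python. It is not the BAP implementation: the registry, the `command` decorator, and the `echo`/`exit` commands are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

# Hypothetical registry: command name -> handler taking the argument words.
COMMANDS: Dict[str, Callable[[List[str]], str]] = {}

def command(name: str):
    """Register a function as a REPL command under the given name."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

def complete(prefix: str) -> List[str]:
    """Return the sorted command names that start with the typed prefix."""
    return sorted(name for name in COMMANDS if name.startswith(prefix))

def run(line: str) -> str:
    """Dispatch one line of input; the same entry point can serve a
    script file or a command passed on the command line."""
    words = line.split()
    if not words:
        return ""
    name, args = words[0], words[1:]
    if name not in COMMANDS:
        return f"unknown command: {name}"
    return COMMANDS[name](args)

@command("echo")
def echo(args: List[str]) -> str:
    return " ".join(args)

@command("exit")
def exit_(args: List[str]) -> str:
    return "bye"
```

With this in place, `complete("e")` lists both registered commands, and a new command becomes available to the interactive, script, and command-line modes as soon as it is registered.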
Efficient KB Representation
The KB representation is more efficient (more than a 2x improvement in
space) and is portable across different versions of bap (the
representation is itself versioned).
To enable this improvement we changed the representation of
Knowledge.Name into an interned form using a hash function with a low
probability of collisions. This is much like the polymorphic variants in
OCaml, except that we use 63 bits instead of 31. Of course, hash
collisions are detected and properly reported.
This also slightly improved the performance and memory footprint of bap in
general, as names are used everywhere in BAP: in variables, in sorts,
etc.
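The interning scheme can be sketched roughly as follows. This is an illustration in Python, not the code from the PR: the choice of BLAKE2b as the hash function, the `Interner` class, and the `Collision` exception are assumptions made for the example.

```python
import hashlib

MASK_63 = (1 << 63) - 1  # keep only the low 63 bits of the hash

class Collision(Exception):
    """Raised when two different names hash to the same identifier."""

class Interner:
    def __init__(self):
        self._names = {}  # identifier -> the name that owns it

    def intern(self, name: str) -> int:
        # Assumption: any well-mixed hash works; BLAKE2b is used here
        # only because it ships with the standard library.
        digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
        ident = int.from_bytes(digest, "big") & MASK_63
        known = self._names.setdefault(ident, name)
        if known != name:
            # The collision is detected and reported, never ignored.
            raise Collision(f"{name!r} and {known!r} share id {ident}")
        return ident

    def name(self, ident: int) -> str:
        """Recover the original name from its identifier."""
        return self._names[ident]
```

Interning the same name twice yields the same integer, so names can be compared and stored as fixed-size integers instead of strings.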
Although the representation is using bin_prot it is designed to enable
interaction with other languages as well as extensibility. Each
property is stored as
<ID> <LEN> <PAYLOAD>
where <ID> is the name of the property (interned), <LEN> is the length
of the payload (so that it can be skipped if it is not supported by the
parser), and <PAYLOAD> is the string of bytes in the format specific to
the property serializer (which itself may include a version tag).
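The benefit of the length field can be shown with a small sketch. The byte layout below (an 8-byte little-endian ID followed by a 4-byte length) is an assumption made for illustration only; the actual encoding uses bin_prot and its field widths are not described here.

```python
import struct
from typing import Dict, Iterable, Set, Tuple

# Assumed header layout for this sketch: 8-byte ID, 4-byte payload length.
HEADER = struct.Struct("<QI")

def encode(props: Iterable[Tuple[int, bytes]]) -> bytes:
    """Serialize (id, payload) pairs as a sequence of ID/LEN/PAYLOAD records."""
    out = bytearray()
    for ident, payload in props:
        out += HEADER.pack(ident, len(payload))
        out += payload
    return bytes(out)

def decode(data: bytes, known: Set[int]) -> Dict[int, bytes]:
    """Read the records, keeping known properties and skipping the rest."""
    result = {}
    offset = 0
    while offset < len(data):
        ident, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        if ident in known:
            result[ident] = data[offset:offset + length]
        # The length lets the parser jump over payloads it cannot interpret.
        offset += length
    return result
```

A reader that only understands some property IDs can still walk the whole stream, which is what makes the format extensible and friendly to parsers written in other languages.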
Optimized Loading And Storing
Both loading and storing of the cache are now done via memory
mapping (which means that the knowledge base should be a regular
file). Since all the information is now stored in the knowledge base,
just loading it is enough to get the project, which makes loading the
project 20 to 25 times faster than it was before. This affects both
loading from the cache and loading from a specified knowledge base.
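As a rough illustration of loading and storing through a memory mapping, here is a sketch in Python. The `store`, `load`, and `demo` functions and the `kb.bin` file name are hypothetical and are not part of BAP.

```python
import mmap
import os
import tempfile

def store(path: str, data: bytes) -> None:
    """Write data through a writable mapping of a regular file.
    Note: an empty file cannot be mapped, so data must be non-empty."""
    with open(path, "wb") as f:
        f.truncate(len(data))  # the file must have its final size before mapping
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), len(data)) as m:
        m[:] = data
        m.flush()

def load(path: str) -> bytes:
    """Map the file read-only and copy its bytes out. A real consumer
    would rather decode records directly from the mapping."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return m[:]

def demo() -> bytes:
    """Round-trip a small blob through a temporary regular file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kb.bin")
        store(path, b"knowledge base")
        return load(path)
```

Mapping only works on regular files, which is why a pipe or a socket cannot serve as the knowledge base in this scheme.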
Interaction With The Cache
As before, the cache stores, along with other data, one knowledge base per
file, indexed by the digest of the input file and all parameters that
affect the disassembly. The only thing that changed is that the
result of disassembly is now also stored in the knowledge base (previously
it was stored as a separate file). When no project is specified (or the
project file doesn't exist) the file is loaded from the cache. This
enables fast extraction of the file's KB from the cache, e.g.,
running bap on /bin/ls with ls.proj given as the project will load
/bin/ls from the cache and immediately store it in ls.proj, provided
that ls.proj didn't exist.
Lazy Project
The project data structure includes a lot of heavyweight data,
such as the whole-program CFG, the Symtab that includes a CFG for each
function, and the program data structure. This information takes a lot
of space both on disk and in RAM and was computed even if it was never
used. Moreover, it is easily computable from the KB, which uses a much
more efficient representation. To address this we made the
above-mentioned data structures lazy, i.e., if you don't use the
program IR then it will not be computed. This saves a lot of space and
time.
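The effect of making these fields lazy can be sketched as below. This is a Python illustration only: the `Project` class, its `kb` dictionary, and the `computed` log are hypothetical stand-ins and do not reflect the actual project data structure.

```python
from functools import cached_property

class Project:
    def __init__(self, kb: dict):
        self.kb = kb        # compact source of truth (stand-in for the KB)
        self.computed = []  # records which heavy fields were actually built

    @cached_property
    def cfg(self):
        """Built on first access only, then cached on the instance."""
        self.computed.append("cfg")
        return sorted(self.kb["edges"])

    @cached_property
    def program(self):
        """Never built unless some analysis asks for it."""
        self.computed.append("program")
        return [f"sub {name}" for name in self.kb["functions"]]
```

An analysis that only reads `cfg` never pays for `program`, and repeated reads of `cfg` reuse the cached value.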
New API
The following APIs were added:
Minor Tweaks
Tweaks the pretty-printing representation of the knowledge, BIR, and
BIL. It is now much more readable, concise, and properly indented.
Bug Fixes
Fixes #1216
Fixes #1169
Fixes #1168