Artifact for the paper "Learning Type Inference for Enhanced Dataflow Analysis"
This repository provides the means to add neural type inference to the code analysis platform Joern.
The newly introduced pass queries a Large Language Model during the usual post-processing passes of the jssrc2cpg
language frontend to infer additional type information where it is missing.
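As a hypothetical illustration (the snippet below is not taken from the paper or the test code), this is the kind of location the pass targets:

```typescript
// Hypothetical example: JSON.parse returns `any`, so static analysis alone
// cannot tell what `data` is. This is the kind of missing type information
// the post-processing pass asks the model to fill in, based on how `data`
// is used further down.
const data = JSON.parse('{"name": "joern"}');
console.log(data.name);
```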
For this process to make use of the neural type inference server, the JoernTI backend must be installed first.
You can initialize the joernti submodule by running:
git submodule update --init --recursive
Before running the type inference passes with Joern, follow the JoernTI install instructions and start the backend server:
joernti codetidal5 --run-as-server
You can then proceed to use JoernTI together with Joern:
sbt stage astGenDlTask
./joernti-codetidal5 <target_source_directory> -Dlog4j.configurationFile=log4j2.xml
While the default values are usually sufficient, additional configuration options are available:
=== JoernTI x CodeTIDAL5 ===
Usage: joernti-codetidal5 [options] input
--help
input source code directory (JavaScript or TypeScript)
-o, --output <value> output path for the CPG (Default 'cpg.bin')
-h, --hostname <value> JoernTI server hostname (Default 'localhost')
-p, --port <value> JoernTI server port (Default 1337)
--typeDeclDir <value> the TypeScript type declaration files to improve type info of the analysis
--logTypeInference log the slice based type inference results (Default false for performance)
-m, --min-calls <value> the minimum number of calls required for a usage slice (Default 1)
--exclude-op-calls excludes <operator> calls from the slices, e.g. <operator>.add, <operator>.assignment, etc.
One notable configuration is to set --typeDeclDir ./type_decl_es5, which checks inferred types for constraint violations against the ES5 standard library types.
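As a purely hypothetical illustration (not actual tool output), a violation of this kind arises when an inferred type contradicts how the value is used according to the ES5 declarations:

```typescript
// Hypothetical example: `value` carries no annotation, so the model is asked
// for a type. If it suggested `number`, the call below would contradict the
// ES5 standard library, since Number declares no `toUpperCase` member; the
// suggestion would therefore be reported as a constraint violation.
function shout(value) {
  return value.toUpperCase();
}
```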
To validate this artifact against the results of the paper, a good combination is:
./joernti-codetidal5 <target_source_directory> --logTypeInference --typeDeclDir ./type_decl_es5
The --logTypeInference argument produces CSV files listing what was inferred and prints any schema-violating inferences.
Note: This demo is aimed at version v0.0.44 of JoernTI.
We make a CodeTIDAL5 checkpoint available on Hugging Face: https://huggingface.co/joernio/codetidal5
The current version is fine-tuned for 175k steps on the adjusted (cf. Experiments) ManyTypes4TypeScript dataset. We plan on uploading refined versions in the future.
To experiment with the ML model and the datasets used in ./experiments, install the dependencies, including CUDA and PyTorch 2.0 (GPU required):
cd ./experiments
./install_cuda_pytorch.sh
You can find scripts and instructions on how to generate a training dataset for type inference with an encoder-decoder model such as CodeT5 in ./experiments/training_dataset.
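Schematically, and purely as an assumed illustration rather than the actual on-disk format produced by those scripts, a training example pairs a code context in which one type annotation has been masked with the original annotation as the target:

```typescript
// Assumed, schematic illustration of the training task (not the real
// dataset layout): the annotation at one site is masked out and the model
// is trained to generate the original type for that site.
//
// Masked input context (the return type of `greet` is hidden):
function greet(name: string) /* : <MASK> */ {
  return `Hello, ${name}!`;
}
// Target output for the masked site: string
```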
We also publish a dataset of object usage slices for ~300k TypeScript programs, extracted with Joern Slice. The slices were obtained from open-source programs in The Stack dataset.
An example can be found in ./testcode/test_slice.
If you use JoernTI / CodeTIDAL5 in your research or wish to refer to the baseline results, we kindly ask you to cite us:
@inproceedings{joernti2023,
  title={Learning Type Inference for Enhanced Dataflow Analysis},
  author={Seidel, Lukas and {Baker Effendi}, David and Pinho, Xavier and Rieck, Konrad and {van der Merwe}, Brink and Yamaguchi, Fabian},
  booktitle={28th European Symposium on Research in Computer Security (ESORICS)},
  year={2023}
}
Some code and graphics in this repository are part of the work first published at the 28th European Symposium on Research in Computer Security by Springer Nature.