This repo contains the code, dataset, and trained models for the paper "Cerebro: Static Subsuming Mutant Selection", published in IEEE Transactions on Software Engineering (TSE).
The bib entry for citing the paper is available here:
The dataset is composed of the following:
- Codebase gathered for the 48 GNU Coreutils [1] programs in C and 10 projects in Java from Apache Commons Proper [2], Joda-Time [3], and Jsoup [4];
- Mutant information in JSON format for every program/project, with Mutant ID, Source Code File Name, Mutation Type, and Line #;
- Subsuming Mutant Label information in JSON format, mapped to every mutant by ID, for every program/project;
- Abstracted Code for every original source code file and mutant for every program/project; and
- Mutant Annotation Sequences in pairs of lhs (input) and rhs (expected output) for all mutants in every project/program, with mappings between Sequence File Indexes and Mutant IDs, and between Sequences and Original Code File Indexes.
Tools/dependencies that we require before executing the code:
- Apache Maven ( available here: https://maven.apache.org/download.cgi )
- srcML ( available here: https://www.srcml.org/ )
NOTE: before building, modify the variables below in the data.java file so that they point to your own repository locations and/or dependencies:
static String dirDataset = "D:/ag/github/Cerebro/dataset";
Commands to execute:
mvn clean package
java -jar D:/ag/github/Cerebro/code/target/cerebro-1.0.jar [arguments]
Options by task:
To prepare the dataset for model training:
java -jar D:/ag/github/Cerebro/code/target/cerebro-1.0.jar prep [language] [sequence-length] [abstraction-level]
where:
[language] is c or java;
[sequence-length] is the desired number of tokens in a sequence (a numeric value), e.g. 25, 50, or 100;
[abstraction-level] is full or partial.
For example, to create the dataset for the Java projects with sequence length 100 and full abstraction, run:
java -jar D:/ag/github/Cerebro/code/target/cerebro-1.0.jar prep java 100 full
To create the dataset for the C projects with sequence length 50 and no abstraction (only code comments removed), run:
java -jar D:/ag/github/Cerebro/code/target/cerebro-1.0.jar prep c 50 partial
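When several sequence lengths are needed, the prep command can be generated in a loop. The following is a minimal sketch, not part of the repository: the helper name prep_commands and the JAR variable are hypothetical, and the function only prints the commands (a dry run) so they can be reviewed before being executed.

```shell
# Hypothetical helper (not shipped with Cerebro): print the prep command
# for each of the sequence lengths mentioned in this README.
# JAR defaults to the path used in this README; override it for your checkout.
JAR="${JAR:-D:/ag/github/Cerebro/code/target/cerebro-1.0.jar}"

prep_commands() {
    lang="$1"          # c or java
    abstraction="$2"   # full or partial
    for len in 25 50 100; do
        echo "java -jar $JAR prep $lang $len $abstraction"
    done
}

# Dry run: list the commands for the Java projects with full abstraction.
prep_commands java full
```

Piping the output to sh (for example, prep_commands java full | sh) would execute the listed commands one after another.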
To test the performance of a model by evaluating the sequences it generated:
java -jar D:/ag/github/Cerebro/code/target/cerebro-1.0.jar test [language] [sequence-length] [abstraction-level]
The values for [language], [sequence-length], and [abstraction-level] are the same as described above.
To generate the XML files used as input for the simulation:
java -jar D:/ag/github/Cerebro/code/target/cerebro-1.0.jar combinetosimulate [language] [sequence-length] [abstraction-level]
The values for [language], [sequence-length], and [abstraction-level] are the same as described above.
Where to find trained models in the repo?
The trained models are available at the following location:
dataset/subsuming-mutant-prediction-[language]/smp/smp-[language]-[sequence-length]-[fold#]/model
e.g. the model trained for the Java projects with abstracted sequences of length 100 is available at:
dataset/subsuming-mutant-prediction-java/smp/smp-java-100-01/model
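The path pattern above can be filled in mechanically. The following sketch uses a hypothetical helper, model_path, that is not part of the repository; it assumes the fold number is passed as a plain integer (1, not 01) and is zero-padded to two digits, as in the example path.

```shell
# Hypothetical helper (not shipped with Cerebro): build the model directory
# path from language, sequence length, and fold number.
model_path() {
    lang="$1"      # c or java
    seq_len="$2"   # e.g. 25, 50, or 100
    fold="$3"      # plain integer, e.g. 1 (padded to 01 below)
    printf 'dataset/subsuming-mutant-prediction-%s/smp/smp-%s-%s-%02d/model\n' \
        "$lang" "$lang" "$seq_len" "$fold"
}

model_path java 100 1
# prints: dataset/subsuming-mutant-prediction-java/smp/smp-java-100-01/model
```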
Tools/dependencies that we require to train/test the models:
- seq2seq ( available here: https://google.github.io/seq2seq/getting_started/#download-setup )
- Tkinter (available here: https://docs.python.org/3.8/library/tkinter.html )
- TensorFlow ( available here: https://www.tensorflow.org/install/pip )
- PyYAML ( available here: https://pyyaml.org/wiki/LibYAML )
- Perl (available here: https://www.cpan.org/modules/INSTALL.html )
for model training:
please refer to the script train.sh available at Cerebro/dataset/subsuming-mutant-prediction-java/smp/seq2seq/train.sh
./train.sh [dirpath] [training-samples-num * epoch-num] [dirpath]/model [config] 1 [training-samples-num] [training-samples-num] 0
Below is a sample usage that trains a model for 10 epochs on the Java projects with sequence length 50 and 135,903 training samples (135,903 × 10 = 1,359,030):
./train.sh ../smp-java-50-01 1359030 ../smp-java-50-01/model length_51-g-1-2 1 135903 135903 0
Please refer to the configurations available in the directory Cerebro/dataset/subsuming-mutant-prediction-java/smp/seq2seq/configs.
For sequence lengths 25, 50, and 100, use length_26-g-1-2, length_51-g-1-2, and length_101-g-1-2, respectively.
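The arguments to train.sh can be derived from three numbers: the training sample count, the number of epochs, and the sequence length. The sketch below uses a hypothetical helper, train_command, that only prints the command line. It assumes the configuration name is always the sequence length plus one, a pattern inferred from the three configuration names listed above rather than documented anywhere.

```shell
# Hypothetical helper (not shipped with Cerebro): assemble the train.sh
# command line and print it without running it.
train_command() {
    dir="$1"        # e.g. ../smp-java-50-01
    samples="$2"    # number of training samples
    epochs="$3"     # number of epochs
    seq_len="$4"    # 25, 50, or 100
    total=$((samples * epochs))               # training-samples-num * epoch-num
    config="length_$((seq_len + 1))-g-1-2"    # assumed pattern: 50 -> length_51-g-1-2
    echo "./train.sh $dir $total $dir/model $config 1 $samples $samples 0"
}

train_command ../smp-java-50-01 135903 10 50
# prints: ./train.sh ../smp-java-50-01 1359030 ../smp-java-50-01/model length_51-g-1-2 1 135903 135903 0
```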
for model testing:
please refer to the script test.sh available at Cerebro/dataset/subsuming-mutant-prediction-java/smp/seq2seq/test.sh
./test.sh [dirpath]/test [dirpath]/model [desired-generated-sequences-file-name]
Below is a sample usage that takes the trained model at ../smp-java-50-01/model and the test set at ../smp-java-50-01/test and writes the generated sequences to the file genrhs-smp-java-50-01.txt:
./test.sh ../smp-java-50-01/test ../smp-java-50-01/model genrhs-smp-java-50-01.txt
Note: a few models were larger than 100 MB, so each was split into two files to allow check-in. These models are:
dataset/subsuming-mutant-prediction-java/smp/pa-smp-java-50-01/model/model.ckpt.data-00000-of-00001
dataset/subsuming-mutant-prediction-java/smp/pa-smp-java-50-02/model/model.ckpt.data-00000-of-00001
dataset/subsuming-mutant-prediction-java/smp/pa-smp-java-50-03/model/model.ckpt.data-00000-of-00001
dataset/subsuming-mutant-prediction-java/smp/pa-smp-java-50-04/model/model.ckpt.data-00000-of-00001
dataset/subsuming-mutant-prediction-java/smp/pa-smp-java-50-05/model/model.ckpt.data-00000-of-00001
In these cases, model.ckpt.data-00000-of-00001 was split into model.ckpt.data-00000-of-00001.001 and model.ckpt.data-00000-of-00001.002.
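The split checkpoints have to be rebuilt before the models can be loaded. The following is a minimal sketch with a hypothetical helper, join_checkpoint. It assumes the two parts are a plain byte-wise split that can be concatenated in order; if the files were produced by an archiver with its own container format, use that tool to rejoin them instead.

```shell
# Hypothetical helper (not shipped with Cerebro): rebuild a checkpoint data
# file from its .001 and .002 parts, assuming a plain byte-wise split.
join_checkpoint() {
    model_dir="$1"
    base="$model_dir/model.ckpt.data-00000-of-00001"
    # Concatenate the parts in order into the original file name.
    cat "$base.001" "$base.002" > "$base"
}
```

For example, join_checkpoint dataset/subsuming-mutant-prediction-java/smp/pa-smp-java-50-01/model would rebuild the first model in the list above.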
[1] GNU Coreutils. https://www.gnu.org/software/coreutils/ (last accessed April 24, 2021).
[2] Apache Commons Proper. https://commons.apache.org (last accessed April 24, 2021).
[3] Joda-Time. https://github.com/JodaOrg/joda-time/ (last accessed April 24, 2021).
[4] Jsoup. https://github.com/jhy/jsoup (last accessed April 24, 2021).