The current repository is associated with the article "Study of Deep Learning Techniques for Side-Channel Analysis and Introduction to ASCAD Database" available on the eprints.
Databases, Neural Networks models as well as scripts are provided here as a complementary material to the article: please refer to it for various explanations and details about SCA and Deep Learning.
Copyright (C) 2018, ANSSI and CEA
The databases, the Deep Learning models and the companion python scripts of this repository are placed into the public domain.
- Ryad BENADJILA (mailto:ryad.benadjila@ssi.gouv.fr)
- Eleonora CAGLI (mailto:eleonora.cagli@cea.fr)
- Cécile DUMAS (mailto:cecile.dumas@cea.fr)
- Emmanuel PROUFF (mailto:emmanuel.prouff@ssi.gouv.fr)
- Rémi STRULLU (mailto:remi.strullu@ssi.gouv.fr)
This work has been partially funded through the H2020 project REASSURE.
The scripts and the data are split in two places mainly because git is not suited for large files.
In order to get everything up and running, here are the steps to follow (we provide the steps using a Unix shell syntax, but you can adapt this and use your favorite shell of course):
- Clone the current repository to get the scripts:
$ git clone https://github.com/ANSSI-FR/ASCAD.git
- In the new
ASCAD
folder, download and decompress the data package with the raw data by using:
$ cd ASCAD $ wget https://www.data.gouv.fr/s/resources/ascad/20180530-163000/ASCAD_data.zip $ unzip ASCAD_data.zip
Please be aware that this last step should download around 4.2 GB, and the decompression will generate around 7.3 GB of useful data.
Now you should be able to use the provided python scripts.
If you have the pip
Python package manager installed, getting
the scripts dependencies is as simple as:
$ pip install keras numpy h5py matplotlib tensorflow for Python 2, or: $ pip3 install keras numpy h5py matplotlib tensorflow for Python 3.
For GPU acceleration, you might also want to install tensorflow-gpu
:
$ pip install tensorflow-gpu for Python 2, or: $ pip3 install tensorflow-gpu for Python3.
A quick test to check that everything is properly installed and working fine can be done by launching the ASCAD_test_models.py script from the main ASCAD folder:
$ python ASCAD_test_models.py
More details are provided in the dedicated section.
The current repository only contains scripts: the raw data that are manipulated by these scripts can be found here: ASCAD_data.
The zip file SHA-256 hash value is:
ASCAD_data.zip
a6884faf97133f9397aeb1af247dc71ab7616f3c181190f127ea4c474a0ad72c
We also provide the SHA-256 hash values of the sub-files when this zip archive is decompressed:
ASCAD_databases/ASCAD.h5:
f56625977fb6db8075ab620b1f3ef49a2a349ae75511097505855376e9684f91
ASCAD_databases/ASCAD_desync50.h5:
8716a01d4aea2df0650636504803af57dd597623854facfa75beae5a563c0937
ASCAD_databases/ASCAD_desync100.h5:
f6b9e967af287e82f0a152320e58f8f0ded35cd74d499b5f7b1505a5ce338b8e
ASCAD_databases/ATMega8515_raw_traces.h5:
51e722f6c63a590ce2c4633c9a9534e8e1b22a9cde8e4532e32c11ac089d4625
ASCAD_trained_models/mlp_best_ascad_desync0_node200_layernb6_epochs200_classes256_batchsize100.h5:
d97a6e0f742744d0854752fce506b4a0612e0b86d0ec81a1144aada4b6fb35a3
ASCAD_trained_models/mlp_best_ascad_desync50_node200_layernb6_epochs200_classes256_batchsize100.h5
582a590c69df625fd072f837c98e147a83e4e20e04465ff48ca233b02bc75925
ASCAD_trained_models/mlp_best_ascad_desync100_node200_layernb6_epochs200_classes256_batchsize100.h5:
9f4d761197b91b135ba24dd84104752b7e32f192ceed338c26ddba08725663a9
ASCAD_trained_models/cnn_best_ascad_desync0_epochs75_classes256_batchsize200.h5:
11ff0613d71ccd026751cb90c2043aff24f98adb769cb7467e9daf47567645be
ASCAD_trained_models/cnn_best_ascad_desync50_epochs75_classes256_batchsize200.h5:
be9045672095a094d70d2ee1f5a76277cab6a902c51e4ebf769282f464828a11
ASCAD_trained_models/cnn_best_ascad_desync100_epochs75_classes256_batchsize200.h5:
866d3ea0e357e09ff30fdc9c39b6ef3096262c50cebd42018a119b1190339fcc
WARNING: all the paths and examples that are provided below suppose that you have downloaded and decompressed the raw data file as explained in the previous section.
ANSSI has provided source code implementations of two masked AES on the ATMega8515 MCU target, which can be found on the following github repository: ANSSI-FR/secAES-ATmega8515. ATMega8515 uses external clocking: the acquisitions have been performed using a smartcard reader providing a 4 MHz clock, compatible with the ISO7816-3 standard default values.
The ASCAD databases correspond to the first version (v1) of the masked AES (the second version has improved security and is currently kept for further studies). The v1 implementation allowed us to perform the acquisition of EM (ElectroMagnetic) measurements in traces of 100,000 time samples (at 2 giga-samples per second) on an ATMega8515 based WB Electronics 64 Kbit ATMega chipcard (see the secAES-ATmega8515 material for more information). The traces are synchronized, and no specific hardware countermeasure has been activated on the ATMega8515.
An extract of 60,000 traces from the acquisition campaign has been compiled in one
HDF5 file
of 5.6 GB named ATMega8515_raw_traces.h5
. The structure of this HDF5 file is
described in the article "Study of Deep Learning Techniques for Side-Channel
Analysis and Introduction to ASCAD Database".
ASCAD (ANSSI SCA Database) is a set of databases that aims at providing a benchmarking reference for the SCA community: the purpose is to have something similar to the MNIST database that the Machine Learning community has been using for quite a while now to evaluate classification algorithms performances.
The databases, which are HDF5 files, basically contain two labelled datasets:
- A 50,000 traces profiling dataset that is used to train the (deep) Neural Networks models.
- A 10,000 traces attack dataset that is used to check the performance of the trained models after the profiling phase.
The details of the ASCAD HDF5 structure are given in the article, as well as a thorough discussion about the elements that need to be addressed when applying Deep Learning techniques to SCA.
The ASCAD database is in fact extracted from the ATMega8515_raw_traces.h5
file containing raw traces: in order to avoid useless heavy
data processing, only the 700 samples of interest are kept
(these samples correspond to the time window of the leaking
operation under attack, see the article for details).
The ASCAD_generate.py script is used to generate ASCAD from the ATMega8515 original database. Actually, the repository contains three HDF5 ASCAD databases:
ASCAD_data/ASCAD_databases/ASCAD.h5
: this corresponds to the original traces extracted without modification.ASCAD_data/ASCAD_databases/ASCAD_desync50.h5
: this contains traces desynchronized with a 50 samples maximum window.ASCAD_data/ASCAD_databases/ASCAD_desync100.h5
: this contains traces desynchronized with a 100 samples maximum window.
The traces desynchronization is simulated artificially (and can be tuned) by the python script ASCAD_generate.py that generates the database: this allows us to test the efficiency of Neural Networks against jitter. See the section dedicated to the generation script for details on how to customize the desynchronization parameter.
The best trained CNN and MLP models that we have obtained are provided in the
ASCAD_data/ASCAD_trained_models/
folder. Six models have
been selected: best CNNs for desynchronizations 0, 50 and 100,
best MLPs for desynchronization values of 0, 50, and 100
time samples.
WARNING: these models are the best ones we have obtained through the methodology described in the article. We certainly do not pretend that they are the optimal models across all the possible ones. The main purpose of sharing ASCAD is precisely to explore and evaluate new models.
We have performed our training sessions on two setups:
- The first platform is composed of one gamer market Nvidia GeForce GTX 1080 Ti.
- The second platform is composed of one professional computing market Nvidia Tesla K80.
Both setups were running an Ubuntu 16.04 distro with Keras 2.1.1 and TensorFlow-GPU 1.2.1. When using the GPU acceleration, the computation should not be very CPU and RAM intensive (at most one CPU core work load and 1 to 2 GB of RAM).
See below for how to test these trained models.
In order to use ASCAD companion scripts, here is the list of dependencies that need to be installed in your python setup:
- The
h5py
HDF5 library (http://www.h5py.org/). - The
keras
Deep Learning library (https://keras.io/, tests have been performed on version 2.2.1). - The
numpy
scientific computing library (http://www.numpy.org/). - The
matplotlib
plotting library (https://matplotlib.org/).
Note that these libraries are generally packaged in most of Linux distributions,
and/or are available through the pip
Python package manager. The case of the keras
library is a bit special since many backends can be used (TensorFlow, Theano, ...)
and depending on the target platform, CPU or GPU acceleration may be configured
and used or not. For ASCAD scripts, we strongly suggest (specifically for the
profiling/training phase) to use a GPU backed configuration. Configuring
keras
backends and GPU acceleration won't be detailed here: please refer
to this Keras
and this TensorFlow
resources for more details on the topic (you will also certainly need to
handle Nvidia CUDA drivers
and libraries for you platform).
Finally, the scripts should work with Python 2 as well as Python 3.
The ASCAD_generate.py script is used to generate ASCAD
databases from the ATMega8515 raw traces database. This script is versatile
and the extract_traces
function accepts some parameters:
def extract_traces(traces_file, labeled_traces_file, profiling_index = [n for n in range(0, 50000)], attack_index = [n for n in range(50000, 60000)], target_points=[n for n in range(45400, 46100)], profiling_desync=0, attack_desync=0)
traces_file
: this is the file name of the HDF5 raw traces with metadata database,ATMega8515_raw_traces.h5
in our case.labeled_traces_file
: this is the name of the HDF5 output file.profiling_index
: this is a list corresponding to the index of profiling traces (default is [0 .. 49999])attack_index
: this is a list corresponding to the index of attack traces (default is [50000 .. 59999]).target_points
: this is the window of interest in the traces, default is [45400 .. 46099] since this is the leaking spot for the value of interest in the case described in the article.profiling_desync
: this is the maximum desychronization applied to the profiling original traces, following uniformly randomly chosen values below this maximum for each trace.attack_desync
: this is the maximum desychronization applied to the attack original traces, following uniformly randomly chosen values below this maximum for each trace.
The labelize
function is also of interest in the script: tuning it enables to
generate databases that focus on other leaking spots of the masked AES (say
byte 5 of the first round, byte 10 of the second round, and so on ...).
By tuning all these parameters, one is able to generate multiple ASCAD databases specialized in various values of interest, with customized desynchronization as well as customized profiling and attacking traces.
The trained models in ASCAD_data/ASCAD_trained_models/
can be tested using
the ASCAD_test_models.py script.
The script computes the ranking of the real key byte among the
256 possible candidate bytes depending on the number of attack traces
the trained model takes as input for prediction: this is a
classical classification algorithm efficiency check in SCA
(see the article for a more formal definition of the rank).
The evolution of the rank with respect to the number of traces is
plotted using matplotlib
.
Without any argument, the script will compute the rank on all the trained models (CNN_Best for desynchronizations 0, 50, 100 and MLP_Best for desynchronizations 0, 50, 100) for 2000 traces. One can also modify this number of traces with one argument:
$ python ASCAD_test_models.py or: $ python ASCAD_test_models.py 5000
Optionally, the script takes two or three arguments as inputs:
- An already trained model HDF5 file (for instance those in the
ASCAD_data/ASCAD_trained_models/
folder). - An ASCAD database one wants to check the trained model on.
- A third optional parameter is the maximum number of attack traces to process.
$ python ASCAD_test_models.py ASCAD_data/ASCAD_train_models/cnn_best_ascad_desync0.h5 ASCAD_data/ASCAD_databases/ASCAD.h5 or: $ python ASCAD_test_models.py ASCAD_data/ASCAD_trained_models/cnn_best_ascad_desync0.h5 ASCAD_data/ASCAD_databases/ASCAD.h5 5000
The script uses the attack traces set in a hardcoded way (since we want to check the model efficiency on traces that have not been used for training). However, it is pretty straightforward to tune it to compute the ranking on the ASCAD profiling traces to further confirm the cross-validation results obtained in the article.
The six trained CNNs and MLPs that we provide are all derived from one CNN architecture and one MLP architecture with architectural hyper-parameters discussed in the article (the main difference is the training that is performed on the three desynchronized {0, 50, 100} ASCAD databases).
We provide the ASCAD_train_models.py
script in order to train the models with
the training hyper-parameters that we explore and analyze in
the article: trained models should yield in similar
performances compared to what we provide in
ASCAD_data/ASCAD_trained_models/
.
The training is performed on the 50,000 profiling traces, but one can easily tune the script to modify this for including other sets of traces if necessary.