In this project created for EPFL's CS-433: Machine Learning, we explore the use of a transformer model for page-prefetching.
You can find the requirements in the `requirements.txt` file under `models/transformer/`. To install them, run the following command:

```bash
pip install -r requirements.txt
```
The project is structured as follows:
```
data/
├── prepro/
├── processed/
└── raw/
dataset_gathering/
models/
├── config.py
├── train.py
├── infer.py
├── model.py
├── data_parser.py
├── dataset.py
├── make_tokens.py
├── runs/
└── trainings/
```
- In `data/` you can find our raw data. The data as directly collected can be found in `raw/`, the preprocessed data in `prepro/`, and the processed, fully-cleaned data which we use in training in `processed/`.
- `dataset_gathering/` contains the code used to collect the raw data.
- `models/` contains the code used to train and evaluate the model, as well as the code of the transformer model itself:
  - `config.py` can be used to tweak its parameters, e.g. the number of layers, the number of heads, the number of epochs, etc.
  - `train.py` contains the code to train the model.
  - `infer.py` contains the code to use the model for inference.
  - `model.py` contains the code of the model itself.
  - `data_parser.py` contains the code to parse the data.
  - `dataset.py` contains the code to create the dataset, parsing the raw data into a structure usable by our model and tokenizing it.
  - `make_tokens.py` contains the tokenizer code, for both the input and the output.
  - `runs/` contains the TensorBoard logs, which you can use to visualize the training.
  - `trainings/` contains the saved models, which you can use for inference and for further training.
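As a quick sanity check before training, you can verify that this folder layout is in place. The sketch below uses only the standard library; the directory names are taken from the tree above, and the script itself is not part of the project's code.

```python
from pathlib import Path

# Directories the project layout above expects to exist.
EXPECTED_DIRS = [
    "data/raw",
    "data/prepro",
    "data/processed",
    "models/runs",
    "models/trainings",
]

def check_layout(root: str = ".") -> None:
    """Print which of the expected directories are missing under `root`."""
    root_path = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root_path / d).is_dir()]
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Project layout looks complete.")

if __name__ == "__main__":
    check_layout()
```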
In `config.py` you can tweak many of the model's parameters, such as the number of layers, the number of heads, the number of epochs, etc., but also parameters for the tokenizers. Here, we explain how these parameters affect the model and what values they can take.
- `DATA_PATH`: the folder where the data file is.
- `OBJDUMP_PATH`: the folder where the `objdump` output for the libraries used by the traced programs is.
- `GENERATOR_PREFIX`: prefix of the folder where the generator will be saved.
- `SEED_FN`: the seed to use for the generator.
- `STATE_FN`: the name of the state to use for saving / loading the generator state.
- `TRACE_TYPE`: the type of trace to use, either `fltrace` or `bpftrace`.
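For illustration, these constants might be set along the following lines in `config.py`. The names come from this README; the values shown are made-up placeholders, not the project's actual defaults.

```python
# Hypothetical values -- adjust to your own setup.
DATA_PATH = "data/processed/"          # folder containing the data file
OBJDUMP_PATH = "data/objdumps/"        # objdump output for the traced programs' libraries
GENERATOR_PREFIX = "generator_"        # prefix of the folder where the generator is saved
SEED_FN = 42                           # seed used by the generator
STATE_FN = "generator_state.pkl"       # name used for saving / loading the generator state
TRACE_TYPE = "fltrace"                 # either "fltrace" or "bpftrace"
```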
`BPF_FEATURES` is a list of features used to train / infer with the model, collected with the BPF method. It contains:

- `prev_faults`: a list of hex addresses, containing the addresses of the previous page faults.
- `flags`: a bitmap containing the flags of the page fault, including `rW`.
- `ip`: the instruction pointer of the CPU.
- `regs`: a list of the values of the CPU registers.
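To make these feature names concrete, a single BPF-collected sample could be represented as a dictionary like the one below. This is purely illustrative; the actual format is defined by `data_parser.py` and `dataset.py`, and the values here are invented.

```python
# Hypothetical example of one BPF-collected page-fault sample.
bpf_sample = {
    "prev_faults": ["0x7f3a1c2d4000", "0x7f3a1c2d5000"],  # addresses of previous page faults
    "flags": 0b0101,                                       # page-fault flag bitmap (includes rW)
    "ip": "0x55d2a1b3c4d5",                                # CPU instruction pointer
    "regs": ["0x0", "0x7ffd12345678", "0x1f"],             # CPU register values
}
```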
`FL_FEATURES` is a list of features used to train / infer with the model, collected with `fltrace`. It contains:

- `prev_faults`: a list of hex addresses, containing the addresses of the previous page faults.
- `rW`: whether the page was read from or written to.
- `ips`: the stack trace of the program.
`OUTPUT_FEATURES` contains the output features of the model, which by default is a single element: the hex addresses of the next pages to prefetch.
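Likewise, here is a sketch of what an `fltrace`-collected sample and the corresponding target could look like, again with invented values and a hypothetical layout:

```python
# Hypothetical fltrace-collected sample and its prediction target.
fl_sample = {
    "prev_faults": ["0x7f3a1c2d4000", "0x7f3a1c2d5000"],  # previous page-fault addresses
    "rW": "W",                                            # page was written to
    "ips": ["0x401a2f", "0x4018c3", "0x400f10"],          # program stack trace
}

# OUTPUT_FEATURES: the next pages the model should learn to prefetch.
target = ["0x7f3a1c2d6000", "0x7f3a1c2d7000"]
```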
Transformer parameters are set in the `TransformerModelParams` class and Meta Transformer parameters are set in `MetaTransformerParams`, both in `config.py`. They are:

- `d_model`: the dimension of the model.
- `T`: the number of transformer block layers.
- `H`: the number of attention heads per transformer layer.
- `dropout`: the dropout rate.
- `d_ff`: the dimension of the feed-forward layer.
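As a rough sketch, such a parameter class could look like the following. The field names match the list above, but the default values are placeholders and the real class in `config.py` may differ.

```python
from dataclasses import dataclass

@dataclass
class TransformerModelParams:
    """Hypothetical sketch of the transformer hyper-parameter container."""
    d_model: int = 512    # model (embedding) dimension
    T: int = 6            # number of transformer block layers
    H: int = 8            # attention heads per transformer layer
    dropout: float = 0.1  # dropout rate
    d_ff: int = 2048      # feed-forward layer dimension
```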
The configuration is created in the `get_config` function in `config.py`; you can modify most parameters there.
- `bpe_special_tokens`: the special tokens used by the tokenizers. Default: `[UNK]`.
- `pad_token`: the padding token used by the tokenizers. Default: `[[PAD]]`.
- `list_elem_separation_token`: the token used to separate elements in a list; handled by the `TokenizerWrapper` of `special_tokenizers.py`.
- `feature_separation_token`: the token used to separate features in a list. Default: `[FSP]`.
- `start_stop_generating_tokens`: the tokens used to indicate the start and the end of a sentence. Default: `[GTR]` and `[GTP]`.
- `batch_size`: the batch size to use for training. Default: 16.
- `num_epochs`: the number of epochs to train for. Default: 10.
- `lr`: the learning rate to use for training. Default: 10^(-4).
- `trace_type`: the type of trace to use, see `TRACE_TYPE` above. Default: `fltrace`; can also be set to `bpftrace`.
- `train_on_trace`: for `fltrace` only, whether we train on one trace or multiple.
- `datasource`: the name of the benchmark on which we train.
- `subsample`: the subsampling rate to use for the data. Default: 0.2.
- `objdump_path`: see `OBJDUMP_PATH` above.
- `model_folder`: the folder where the model will be saved. Default: `models`.
- `preload`: which version of the model to preload. Default: `latest` (takes the highest epoch number).
- `tokenizer_files`: format string path to the tokenizer files. Default: `trained_tokenizers/[TRACETYPE]/tokenizer_[src,tgt].json`.
- `train_test_split`: the train / test split to use. Default: 0.75.
- `attention_model`: the type of model to use for attention. Default: `transformer`; can also be set to `retnet`.
- `attention_model_params`: the parameters of the attention model. Default: `TransformerModelParams`; not needed for RetNet.
- `decode_algorithm`: the algorithm to use for decoding. Default: `beam` (beam search); can also be set to `greedy` (greedy decode).
- `beam_size`: if `decode_algorithm` is `beam`, the beam size to use. Default: 4.
- `past_window`: the size of the past window of previous faults to use. Default: 10.
- `k_predictions`: the number of predictions to make. Default: 10.
- `code_window`: tuple of the number of instructions before and after the instruction pointer, i.e. the code window around the IP. Default: (1, 2).
- `input_features`: the features to use as input. Default: `BPF_FEATURES`; can also be set to `FL_FEATURES`.
- `output_features`: the features to use as output. Default: `OUTPUT_FEATURES`.
- `base_tokenizer`: the base tokenizer to use. Default: `hextet`; can also be set to `bpe` or `text`. See the tokenizers section.
- `embedding_technique`: the embedding technique used on the tokens. See the embeddings section. Default: `tok_concat`; can also be set to `onetext`, `meta_transformer` or `embed_concat`.
- `meta_transformer_params`: the parameters of the meta transformer. Default: `MetaTransformerParams`.
- `page_masked`: for `bpftrace` only, map all address accesses to their page numbers.
- `max_weight_save_history`: used when `mass_train == True` in training; defines how many epochs we should save at most. Default: 3.
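To give a feel for how these options are used, the snippet below assumes `get_config` returns a plain dictionary with the keys listed above and that it is run from the `models/` directory (both are assumptions; check `config.py` for the actual return type), and shows how you might override a few values before training.

```python
# Hypothetical usage sketch: assumes get_config() returns a dict with the keys above.
from config import get_config

config = get_config()

# Override a few options for a quick experiment.
config["batch_size"] = 32
config["num_epochs"] = 5
config["lr"] = 1e-4
config["decode_algorithm"] = "greedy"  # skip beam search for faster decoding

print(config["attention_model"], config["trace_type"], config["base_tokenizer"])
```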
In `models/trained_tokenizers/special_tokenizers.py`, we define generic classes of tokenizers, which are then trained on a specific vocabulary.
We have three generic classes:

- `SimpleCustomVocabTokenizer`
- `TokenizerWrapper`
- `ConcatTokenizer`

Details can be found in the docstrings of the classes.
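If you want to inspect a trained tokenizer file directly (e.g. one matching the `tokenizer_files` pattern above), the sketch below uses the Hugging Face `tokenizers` library, which the JSON files and the `bpe` option suggest but which is not confirmed here; the file path is a placeholder.

```python
# Hypothetical inspection of a trained tokenizer JSON file.
# Assumes the files follow the Hugging Face `tokenizers` JSON format.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("trained_tokenizers/fltrace/tokenizer_src.json")  # placeholder path

encoding = tok.encode("0x7f3a1c2d4000")
print(encoding.tokens)  # how the hex address is split into tokens
print(encoding.ids)     # corresponding token ids
```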
To train the model on our dataset, simply run the `train.py` script, i.e.:

```bash
python train.py
```
Important! This assumes you have already copied the dataset into the right folders, as specified above.
You can tweak the parameters of the model in the `config.py` file.
To use the model for inference, simply run the `infer.py` script, i.e.:

```bash
python infer.py
```
You can define your input string in the `infer.py` file (`data` parameter) and the maximum length of the output (`max_length` parameter).
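As an illustration only, setting those two parameters inside `infer.py` might look like the lines below; the exact variable layout in the script may differ, and the input string here is invented.

```python
# Hypothetical edit inside infer.py -- names follow the README, values are made up.
data = "0x7f3a1c2d4000 0x7f3a1c2d5000"  # input string fed to the model
max_length = 64                          # maximum length of the generated output
```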
Victor Garvalov @vigarov, Alex Müller @ktotam1, Thibault Czarniak @t1b00.
Thank you to Professors Martin Jaggi, Nicolas Flammarion, Sasha, and our wonderful TAs.