This project is a Python module that facilitates BERT pretraining. The existing open source solutions for pretraining this model are convoluted, and we have simplified the procedure. The project's goal is to open the code to the wider Machine Learning community so that ML practitioners can train their own BERT models on their own data. The code was originally created to train the latest iteration of VMware's BERT model (vBERT) for Machine Learning and Natural Language Processing researchers within VMware.
The demo notebook (Demo.ipynb) is located in the demo folder.
Set up a Python 3.7 or 3.8 virtual environment and install the requirements with

```
pip install .
```

from within the root folder, or

```
pip install git+https://github.com/vmware-labs/bert-pretraining
```
Create the pretraining data using create_pretraining_data.py from https://github.com/google-research/bert.
You can create a separate eval file if you want to evaluate your model's MLM and NSP accuracies on a separate eval set during training.
You can also split a single file into training and eval sets by using the split_ratio parameter in the config object.
The pretraining parameters are handled through the Pretraining_Config class; a minimal configuration sketch is shown below. Please follow Demo.ipynb to run a sample BERT pretraining.
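As an illustration only, here is a minimal sketch of how a run might be configured. The import path and the run_pretraining() entry point are assumptions made for this sketch (Demo.ipynb shows the actual calls); the parameter names follow the table below.

```python
# Minimal sketch -- the import path and run_pretraining() entry point are
# assumptions; Demo.ipynb shows the actual module layout and calls.
from bert_pretraining import Pretraining_Config, run_pretraining  # hypothetical import

config = Pretraining_Config()                 # parameter names follow the table below
config.model_name = "DEMOBERT"
config.is_base = True                         # BERT-Base rather than BERT-Large
config.max_seq_length = 128                   # must match the tfrecord file
config.max_predictions_per_seq = 20           # must match the tfrecord file
config.num_train_steps = 1000
config.num_warmup_steps = 10
config.learning_rate = 1e-05
config.train_batch_size = 32
config.input_file = "./input/demo_MSL128.tfrecord"
config.eval_file = None                       # optionally a separate eval tfrecord, or set split_ratio
config.output_dir = "./ckpts"
config.log_csv = "./eval_results.csv"
config.num_gpu = 3

run_pretraining(config)                       # hypothetical entry point; see Demo.ipynb
```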
PRETRAINING_CONFIG PARAMS
Parameter | Default Value | Description |
---|---|---|
model_name | DEMOBERT | Model name |
is_base | True | Boolean to select between BERT-Base and BERT-Large |
max_seq_length | 128 | Maximum sequence length (MSL); must be consistent with the tfrecord file (generate two separate tfrecord files if you want to pretrain BERT with different MSLs, e.g. 128 and 512) |
max_predictions_per_seq | 20 | Number of tokens masked for MLM; must be consistent with the tfrecord file |
num_train_steps | 1000 | Number of steps to train the model for; training terminates early if the end of the tfrecord file is reached (meaningful pretraining requires far more steps) |
num_warmup_steps | 10 | Number of warmup steps; BERT uses 1% of training steps as warmup |
learning_rate | 1e-05 | Model Learning rate |
train_batch_size | 32 | Training batch size (split across GPUs) |
save_intermediate_checkpoints | True | Save a checkpoint every 'x' training steps, where 'x' is set by save_intermediate_checkpoint_steps. A checkpoint is always saved at the end of training |
save_intermediate_checkpoint_steps | 25000 | Saves a checkpoint after every 'x' training steps (not counting warmup steps) |
eval_batch_size | 32 | Evaluation batch size (split across GPUs) |
max_eval_steps | 1000 | Number of evaluation steps when there is no separate eval file. If a separate eval file or a split_ratio is provided, the entire eval dataset is used for evaluation |
eval_point | 1000 | Performs evaluation every 'x' training steps |
split_ratio | None | Percentage of the training dataset to hold out for evaluation if you want to split the training tfrecord into train and eval datasets. If no split ratio is provided, the training file is used for evaluation (the number of eval steps is controlled by max_eval_steps) |
init_checkpoint | None | If you are resuming training, provide the path to the previous checkpoint. If you are initializing training from a non-default checkpoint (BERT-Base, BERT-Large), provide the model checkpoint name/path |
input_file | ./input/demo_MSL128.tfrecord | Input tfrecord file created using create_pretraining_data.py from https://github.com/google-research/bert |
eval_file | None | If you want to use a separate eval dataset, provide a tfrecord file created using create_pretraining_data.py from https://github.com/google-research/bert |
log_csv | ./eval_results.csv | File which stores the evaluation results ** |
output_dir | ./ckpts | Directory to store the checkpoints |
num_gpu | 3 | Number of GPUs to use for training |
** The output log_csv file records the hyperparameters and evaluation results
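To inspect the log programmatically, a minimal sketch using pandas (assuming pandas is available and the default ./eval_results.csv path is used; the column names come from the file's own header):

```python
import pandas as pd

# Load the evaluation log written during pretraining; each row records the
# hyperparameters together with the evaluation results.
results = pd.read_csv("./eval_results.csv")
print(results.tail())
```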
The demo.tfrecord file was created from the wikicorpus dataset.
The bert-pretraining project team welcomes contributions from the community. Before you start working with this project, please read and sign our Contributor License Agreement (https://cla.vmware.com/cla/1/preview). If you wish to contribute code and you have not signed our Contributor License Agreement (CLA), our bot will prompt you to do so when you open a Pull Request. For any questions about the CLA process, please refer to our CONTRIBUTING.md.
Apache-2.0