This example loads a DistilBERT model and confirms its accuracy and speed on GLUE data.
pip install neural-compressor
pip install -r requirements.txt
Note: Use a validated ONNX Runtime version.
Download the GLUE data with the prepare_data.sh script.
export GLUE_DIR=path/to/glue_data
export TASK_NAME=MRPC
bash prepare_data.sh --data_dir=$GLUE_DIR --task_name=$TASK_NAME
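Before moving on, it can help to confirm that the download produced the files the later steps read. The following is a minimal sketch, not part of this example's scripts. It assumes the task data ends up under `$GLUE_DIR/$TASK_NAME` and that the task ships `train.tsv` and `dev.tsv`; both the layout and the file names are assumptions, so adjust `expected` to whatever `prepare_data.sh` actually writes. The helper name `missing_glue_files` is hypothetical.

```python
from pathlib import Path


def missing_glue_files(glue_dir, task_name, expected=("train.tsv", "dev.tsv")):
    """Return the expected data files that are absent from <glue_dir>/<task_name>.

    The directory layout and the default file names are assumptions made for
    illustration; they are not taken from prepare_data.sh.
    """
    task_dir = Path(glue_dir) / task_name
    return [str(task_dir / name) for name in expected
            if not (task_dir / name).is_file()]
```

For instance, `missing_glue_files(os.environ["GLUE_DIR"], os.environ["TASK_NAME"])` (after `import os`) returns an empty list when every expected file is present.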
Please refer to the Bert-GLUE_OnnxRuntime_quantization guide for detailed model export instructions. The following is a simple example.
Use Hugging Face Transformers to fine-tune the model based on the MRPC example with a command like:
python prepare_model.py --input_model='distilbert-base-uncased' --output_model=bert.onnx
Static quantization with QDQ format:
# --input_model and --output_model are paths to *.onnx files
bash run_quant.sh --input_model=path/to/model \
  --output_model=path/to/model_tune \
  --dataset_location=path/to/glue_data \
  --quant_format="QDQ"
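To make the QDQ format more concrete: a QDQ model wraps tensors in QuantizeLinear/DequantizeLinear pairs, so each value is mapped to an integer and back to a float. The sketch below illustrates that arithmetic in plain Python for asymmetric uint8 quantization. It is an illustration only and does not reproduce what `run_quant.sh` does internally; the function names are hypothetical, and the uint8 range and min/max calibration are assumptions.

```python
def qdq_params(values, qmin=0, qmax=255):
    """Derive a scale and zero point from the observed value range.

    The range is widened to include 0.0 so that zero is exactly representable.
    """
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """QuantizeLinear: round, shift by the zero point, then saturate."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """DequantizeLinear: map the integer back to a float."""
    return (q - zero_point) * scale


def fake_quantize(values):
    """Apply a quantize/dequantize round trip to every value."""
    scale, zero_point = qdq_params(values)
    return [dequantize(quantize(v, scale, zero_point), scale, zero_point)
            for v in values]
```

For values inside the calibrated range, the round trip changes each value by at most half of `scale`, which is the precision cost that the accuracy benchmark measures.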
Benchmark the model for performance or accuracy:
# --input_model is the path to a *.onnx file
bash run_benchmark.sh --input_model=path/to/model \
--dataset_location=path/to/glue_data \
--batch_size=batch_size \
--mode=performance # or accuracy
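As a rough illustration of what a performance measurement involves, the sketch below times a callable that processes one batch and reports average latency and throughput. It is not the logic of `run_benchmark.sh`; the `benchmark` helper, its `run_batch` argument, and the warmup and iteration counts are all hypothetical and chosen for illustration.

```python
import time


def benchmark(run_batch, batch_size, warmup=5, iters=20):
    """Time `run_batch` and return (average latency in seconds, samples per second).

    `run_batch` is any zero-argument callable that processes one batch,
    for example a wrapper around an inference call.
    """
    # Warmup runs are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        run_batch()

    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    total = time.perf_counter() - start

    latency = total / iters
    throughput = (batch_size * iters) / total if total > 0 else float("inf")
    return latency, throughput
```

Larger batch sizes usually raise throughput while increasing per-batch latency, so it is worth comparing the original and quantized models at the same `--batch_size`.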