From 25599294c942cfed2c6f8329e14791e4a2f91539 Mon Sep 17 00:00:00 2001 From: Daniel King <43149077+dakinggg@users.noreply.github.com> Date: Mon, 5 Feb 2024 10:48:29 -0800 Subject: [PATCH] Update lora docs (#941) --- TUTORIAL.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/TUTORIAL.md b/TUTORIAL.md index cb66007b96..dccbe933ef 100644 --- a/TUTORIAL.md +++ b/TUTORIAL.md @@ -357,19 +357,25 @@ Currently we support [Learned Positional Embeddings](https://arxiv.org/pdf/1706. | RoPE (Hugging Face Implementation) |
model:
attn_config:
rope: True
rope_impl: hf
| 62.3 | | ### Can I finetune using PEFT / LoRA? -- The LLM Foundry codebase does not directly have examples of PEFT or LORA workflows. However, our MPT model is a subclass of HuggingFace `PretrainedModel`, and https://github.com/mosaicml/llm-foundry/pull/346 added required features to enable HuggingFace’s [PEFT](https://huggingface.co/docs/peft/index) / [LORA](https://huggingface.co/docs/peft/conceptual_guides/lora) workflows for MPT. MPT models with LoRA modules can be trained either using LLM Foundry or Hugging Face's [accelerate](https://huggingface.co/docs/accelerate/index). Within LLM Foundry, run (`scripts/train/train.py`), adding `lora` arguments to the config `.yaml`, like so: +- LLM Foundry does support LoRA via an integration with the [PEFT](https://github.com/huggingface/peft) library. Within LLM Foundry, run (`scripts/train/train.py`), adding `peft_config` arguments to the `model` section of the config `.yaml`, like so: ```yaml -lora: - args: - r: 16 - lora_alpha: 32 - lora_dropout: 0.05 - target_modules: ['Wqkv'] +model: + ... + peft_config: + r: 16 + peft_type: LORA + task_type: CAUSAL_LM + lora_alpha: 32 + lora_dropout: 0.05 + target_modules: + - q_proj + - k_proj + target_modules: + - 'Wqkv' ``` -- In the current release, these features have Beta support. - For efficiency, The MPT model concatenates the `Q`, `K`, and `V` matrices in each attention block into a single `Wqkv` matrix that is three times wider. Currently, LoRA supports a low-rank approximation to this `Wqkv` matrix. -- When evaluating with PEFT / LoRA seperated weight, just set `pretrained_lora_id_or_path` in `model`(Find an example [here](scripts/eval/yamls/hf_lora_eval.yml#L19)). +- When evaluating with PEFT / LoRA separated weight, just set `pretrained_lora_id_or_path` in `model`(Find an example [here](scripts/eval/yamls/hf_lora_eval.yml#L19)). ### Can I quantize these models and/or run on CPU? - The LLM Foundry codebase does not directly have examples of quantization or limited-resource inference. But you can check out [GGML](https://github.com/ggerganov/ggml) (same library that powers llama.cpp) which has built support for efficiently running MPT models on CPU! You _can_ load your model in 8-bit precision for inference using the [bitsandbytes library](https://github.com/TimDettmers/bitsandbytes) and Hugging Face's [accelerate](https://huggingface.co/docs/accelerate/index) via `load model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True, device_map="auto", trust_remote_code=True)`, although we have not extensively benchmarked the performance (see the Hugging Face [quantization documentation](https://huggingface.co/docs/transformers/main/main_classes/quantization) for more detail).