address comments
pacman100 committed Oct 12, 2023
1 parent 6618083 commit 8afc304
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/usage_guides/fsdp.md
@@ -96,7 +96,7 @@ all-gather while executing in the forward pass. only use with Static graphs.
Useful in cases such as parameter-efficient fine-tuning.
Please refer to this [blog](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019)
- `CPU RAM Efficient Model loading`: If True, only the first process loads the pretrained model checkpoint while all other processes have empty weights. Only applicable for 🤗 Transformers models. When using this, `Sync Module States` needs to be True, else all processes except the main process would have random empty weights, leading to unexpected behaviour during training.
+ `CPU RAM Efficient Model loading`: If True, only the first process loads the pretrained model checkpoint while all other processes have empty weights. Only applicable for 🤗 Transformers models. This should be set to False if you experience errors when loading the pretrained model via `from_pretrained`. When using this, `Sync Module States` needs to be True, else all processes except the main process would have random empty weights, leading to unexpected behaviour during training.
`Sync Module States`: If True, each individually wrapped FSDP unit will broadcast module parameters from rank 0
```
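For readers of the guide, here is a minimal sketch of how the two options described in this hunk could be enabled when building the FSDP plugin in code rather than through `accelerate config`. The keyword name `cpu_ram_efficient_loading` is an assumption inferred from the option name in the doc and may differ between Accelerate versions; `sync_module_states` mirrors the PyTorch FSDP argument of the same name.

```python
# Minimal sketch (not part of this commit): enabling the two options above in code.
# `cpu_ram_efficient_loading` is an assumed keyword name for this plugin;
# `sync_module_states` mirrors the PyTorch FSDP argument of the same name.
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    cpu_ram_efficient_loading=True,  # only rank 0 loads the pretrained checkpoint; others start with empty weights
    sync_module_states=True,         # rank 0 broadcasts its weights to all other ranks at init
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```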
