
Enhancements for Efficient Utilization and Optimization in Fine-tuning Llama 2 70B Example #7

Open
adamlin120 opened this issue Oct 4, 2023

Hi @pacman100,

Firstly, thank you for the well-detailed article! I am writing to provide some feedback and seek clarification.

  1. Optimizer Selection:

    • The blog post uses the "paged_adamw_32bit" optimizer. When I change this to "adamw_torch", I hit an Out Of Memory (OOM) error. Could you clarify what role the default optimizer plays in keeping the example within memory, and why the OOM appears with "adamw_torch"? (The first sketch after this list shows the single setting I am changing.)
  2. GPU Utilization:

    • In attempting to replicate the described setup on 2 nodes × 8 H100s, I observed GPU utilization of only around 20%, with each GPU drawing roughly 200 W (measured as in the second sketch after this list). Do you have any recommendations for raising GPU utilization to speed up training and make full use of the hardware?
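For reference, here is a minimal sketch of the only setting I am changing between the two runs; the other arguments are placeholders rather than the exact values from your example:

```python
from transformers import TrainingArguments

# Placeholder values except for `optim`, which is the one knob I am toggling.
training_args = TrainingArguments(
    output_dir="./llama-70b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    optim="paged_adamw_32bit",   # as in the blog post: runs fine
    # optim="adamw_torch",       # same hardware and settings: OOM
)
```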
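And for context on where the utilization numbers come from, this is roughly how I am polling the GPUs on each node (a small monitoring sketch using pynvml, separate from the training script):

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
try:
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            print(f"GPU {i}: util={util}% power={watts:.0f} W")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

During training this reports roughly 20% utilization and about 200 W per GPU.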

Your guidance will be immensely beneficial!

Thank you for your time.
