[DDP 8xMI300X] GPT2-1.5B FP8 is 25% slower than BF16 & OOMs on the same batch size #76
Comments
@OrenLeung This issue was due to the fact that our dev branch does not yet have all of the recent DDP and FSDP optimizations from NVTE. We have a PR in review (#66) that should be merged soon and should resolve this issue. 8xMI300X DDP FP8 TE (batch size 28): 315 TFLOP/s
Hi @wenchenvincent , thanks for looking into this. These results look much better. Can you provide the
To be competitive with H100 on a perf-per-TCO basis, MI300X needs to hit 398 TFLOP/s/GPU (a rough break-even sketch follows below). Are there any other PRs or optimizations in the pipeline? cc: @hliuca
Here are my preliminary Nvidia results for GPT2-1.5B FP8 full training:
Full response in the Llama3 70B proxy GitHub issue, #78 (comment).
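For reference, the perf-per-TCO break-even target quoted above follows from a simple ratio. The sketch below is illustrative only; the H100 throughput and TCO ratio arguments are hypothetical placeholders, not numbers from this thread.

```python
# Illustrative perf-per-TCO break-even arithmetic (not measured data).
# Both arguments are hypothetical placeholders; only the ~398 TFLOP/s/GPU
# target mentioned above comes from this thread.

def mi300x_breakeven_tflops(h100_tflops_per_gpu: float, tco_ratio: float) -> float:
    """Throughput MI300X needs so that TFLOP/s per TCO dollar matches H100.

    tco_ratio = (per-GPU MI300X TCO) / (per-GPU H100 TCO).
    """
    return h100_tflops_per_gpu * tco_ratio

# Example call with arbitrary placeholder values:
print(mi300x_breakeven_tflops(h100_tflops_per_gpu=500.0, tco_ratio=0.8))
```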
After #66 was merged to main, I now get 322 TFLOP/s/GPU on this model in our internal codebase. After 32 warmup steps: Mean TFLOP/s: 322.79, Mean MFU: 12.37%, similar to @wenchenvincent's TFLOP/s.
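For context on how those two numbers relate, MFU is just achieved throughput divided by the accelerator's peak. A minimal sketch follows, assuming a dense FP8 peak of roughly 2615 TFLOP/s for MI300X; that peak is my assumption, not a figure reported in this thread.

```python
# Minimal MFU sketch: MFU = achieved TFLOP/s / peak TFLOP/s.
# The peak below is an assumed approximate MI300X dense FP8 peak,
# not a number reported in this issue.

MI300X_FP8_PEAK_TFLOPS = 2615.0  # assumption; check your hardware datasheet

def mfu(achieved_tflops_per_gpu: float, peak_tflops: float = MI300X_FP8_PEAK_TFLOPS) -> float:
    return achieved_tflops_per_gpu / peak_tflops

print(f"{mfu(322.79):.2%}")  # ~12.3%, in line with the 12.37% quoted above
```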
Problem Description
Even with NVTE_USE_HIPBLASLT=1 and installing TE inside the container instead of through the Dockerfile, as suggested by #74 (comment), FP8 is 25% slower than BF16. Furthermore, it even OOMs at the same batch size that BF16 can fit. With Transformer Engine on Nvidia H100, I can usually fit a larger batch size than with BF16, and it never OOMs at the same batch size.

The command to run this is
python ./train_gpt_ddp_reprod.py
using the reprod script and TE install instructions below.

cc: @hliuca
Preliminary results (TFLOP/s/GPU):
On this model with DDP, H100 saw a 16% increase in TFLOP/s/GPU from using FP8.
Operating System
Ubuntu
CPU
AMD CPU
GPU
MI300X
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Docker Image
FROM rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

RUN apt install nano
RUN pip install uv
RUN uv pip install --system ipython pytest fire pydantic pybind11

RUN pip3 uninstall -y torch
RUN pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2

WORKDIR /workspace/llm-train-bench/

CMD ["/usr/bin/bash"]
TE Install Instructions (done inside the Docker container)
Reprod Script
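The actual reprod script and TE install steps are attached above. Purely to illustrate the pattern being benchmarked (a GPT2-1.5B-sized model wrapped in DDP with Transformer Engine's FP8 autocast), here is a minimal, hypothetical sketch; every module, size, and flag in it is an assumption on my part, not the author's code.

```python
# Hypothetical minimal sketch of DDP + Transformer Engine FP8 training
# (NOT the reprod script from this issue). Model sizes and hyperparameters
# are assumptions; only the batch size 28 and the NVTE_USE_HIPBLASLT flag
# are taken from the discussion above.
import os
os.environ.setdefault("NVTE_USE_HIPBLASLT", "1")  # must be set before TE loads

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format


def main():
    dist.init_process_group("nccl")  # RCCL backend on ROCm builds
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Stand-in stack of TE transformer layers at GPT2-1.5B-like dimensions
    # (hidden 1600, 25 heads); a real script would build the full 48-layer model.
    model = torch.nn.Sequential(
        *[te.TransformerLayer(
              hidden_size=1600,
              ffn_hidden_size=6400,
              num_attention_heads=25,
              params_dtype=torch.bfloat16,
          ) for _ in range(4)]
    ).cuda()
    model = DDP(model, device_ids=[rank])

    recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for _ in range(10):
        # TE defaults to sequence-first layout: [seq, batch, hidden]; batch size 28.
        x = torch.randn(1024, 28, 1600, device="cuda", dtype=torch.bfloat16)
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe, fp8_group=dist.group.WORLD):
            y = model(x)
        loss = y.float().pow(2).mean()  # dummy loss just to drive backward
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script along these lines would be launched with torchrun --nproc_per_node=8 so that each of the 8 MI300X GPUs runs one DDP rank.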
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response