Commit

Merge branch 'main' into soft_cap_attn
ShashankMosaicML authored Sep 20, 2024
2 parents bf5e94e + 2e3d14f commit 9a68b8f
Showing 34 changed files with 664 additions and 178 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/code-quality.yaml
@@ -34,7 +34,7 @@ jobs:
uses: actions/checkout@v3
with:
repository: mosaicml/ci-testing
-ref: v0.2.0
+ref: v0.2.2
path: ./ci-testing
- uses: ./ci-testing/.github/actions/code-quality
with:
2 changes: 1 addition & 1 deletion .github/workflows/coverage.yaml
@@ -16,7 +16,7 @@ jobs:
uses: actions/checkout@v3
with:
repository: mosaicml/ci-testing
-ref: v0.2.0
+ref: v0.2.2
path: ./ci-testing
- uses: ./ci-testing/.github/actions/coverage
with:
11 changes: 7 additions & 4 deletions .github/workflows/docker.yaml
@@ -17,12 +17,14 @@ jobs:
strategy:
matrix:
include:
-- name: "2.3.1_cu121"
-base_image: mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04
+- name: "2.4.0_cu124"
+base_image: mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04
dep_groups: "[all]"
-- name: "2.3.1_cu121_aws"
-base_image: mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04-aws
+te_commit: 901e5d2
+- name: "2.4.0_cu124_aws"
+base_image: mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04-aws
dep_groups: "[all]"
+te_commit: 901e5d2
steps:

- name: Checkout
@@ -89,3 +91,4 @@ jobs:
BRANCH_NAME=${{ github.head_ref || github.ref_name }}
BASE_IMAGE=${{ matrix.base_image }}
DEP_GROUPS=${{ matrix.dep_groups }}
+TE_COMMIT=${{ matrix.te_commit }}
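
For readers tracing how the new `te_commit` matrix value reaches the Dockerfile, here is a minimal Python sketch, not the workflow's actual implementation. It assumes the four build arguments listed in this hunk are handed to `docker build` as `--build-arg KEY=VALUE` pairs. The helper name `docker_build_command` and the local image tag `llm-foundry:<name>` are hypothetical and used only for illustration.

```python
import shlex


def docker_build_command(entry: dict, branch_name: str) -> list[str]:
    """Assemble a `docker build` argv from one matrix entry (hypothetical helper)."""
    # These keys mirror the build-args shown in the workflow hunk above.
    build_args = {
        "BRANCH_NAME": branch_name,
        "BASE_IMAGE": entry["base_image"],
        "DEP_GROUPS": entry["dep_groups"],
        "TE_COMMIT": entry["te_commit"],
    }
    cmd = ["docker", "build"]
    for key, value in build_args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    # The image tag below is an assumption made for this sketch.
    cmd += ["-t", f"llm-foundry:{entry['name']}", "."]
    return cmd


# Values copied from the first matrix entry in this diff.
entry = {
    "name": "2.4.0_cu124",
    "base_image": "mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04",
    "dep_groups": "[all]",
    "te_commit": "901e5d2",
}
cmd = docker_build_command(entry, branch_name="main")
print(shlex.join(cmd))
```

With the TransformerEngine commit carried as a build argument, moving the pin is a one-line change to the matrix entry and does not require editing the Dockerfile.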
9 changes: 4 additions & 5 deletions .github/workflows/pr-cpu.yaml
@@ -17,19 +17,18 @@ jobs:
pytest-cpu:
name: ${{ matrix.name }}
runs-on: ubuntu-latest
container: ${{ matrix.container }}
strategy:
matrix:
include:
-- name: "cpu-2.3.1"
+- name: "cpu-2.4.0"
pip_deps: "[all-cpu]"
-container: mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu20.04
+container: mosaicml/pytorch:2.4.0_cpu-python3.11-ubuntu20.04
markers: "not gpu"
pytest_command: "coverage run -m pytest"
steps:
- name: Checkout code
uses: actions/checkout@v2
- name: Run PR CPU Tests
-uses: mosaicml/ci-testing/.github/actions/[email protected].0
+uses: mosaicml/ci-testing/.github/actions/[email protected].2
with:
name: ${{ matrix.name }}
container: ${{ matrix.container }}
24 changes: 12 additions & 12 deletions .github/workflows/pr-gpu.yaml
@@ -22,15 +22,15 @@ jobs:
fail-fast: false
matrix:
include:
-- name: "gpu-2.3.1-1"
-container: mosaicml/llm-foundry:2.3.1_cu121-latest
+- name: "gpu-2.4.0-1"
+container: mosaicml/llm-foundry:2.4.0_cu124-latest
markers: "gpu"
pip_deps: "[all]"
pytest_command: "coverage run -m pytest"
-ci_repo_gpu_test_ref: v0.2.0
+ci_repo_gpu_test_ref: v0.2.2
steps:
- name: Run PR GPU Tests
-uses: mosaicml/ci-testing/.github/actions/[email protected].0
+uses: mosaicml/ci-testing/.github/actions/[email protected].2
with:
container: ${{ matrix.container }}
git_repo: mosaicml/llm-foundry
@@ -51,15 +51,15 @@ jobs:
fail-fast: false
matrix:
include:
-- name: "gpu-2.3.1-2"
-container: mosaicml/llm-foundry:2.3.1_cu121-latest
+- name: "gpu-2.4.0-2"
+container: mosaicml/llm-foundry:2.4.0_cu124-latest
markers: "gpu"
pip_deps: "[all]"
pytest_command: "coverage run -m pytest"
-ci_repo_gpu_test_ref: v0.2.0
+ci_repo_gpu_test_ref: v0.2.2
steps:
- name: Run PR GPU Tests
-uses: mosaicml/ci-testing/.github/actions/[email protected].0
+uses: mosaicml/ci-testing/.github/actions/[email protected].2
with:
container: ${{ matrix.container }}
git_repo: mosaicml/llm-foundry
@@ -80,15 +80,15 @@ jobs:
fail-fast: false
matrix:
include:
-- name: "gpu-2.3.1-4"
-container: mosaicml/llm-foundry:2.3.1_cu121-latest
+- name: "gpu-2.4.0-4"
+container: mosaicml/llm-foundry:2.4.0_cu124-latest
markers: "gpu"
pip_deps: "[all]"
pytest_command: "coverage run -m pytest"
-ci_repo_gpu_test_ref: v0.2.0
+ci_repo_gpu_test_ref: v0.2.2
steps:
- name: Run PR GPU Tests
-uses: mosaicml/ci-testing/.github/actions/[email protected].0
+uses: mosaicml/ci-testing/.github/actions/[email protected].2
with:
container: ${{ matrix.container }}
git_repo: mosaicml/llm-foundry
2 changes: 1 addition & 1 deletion .github/workflows/smoketest.yaml
@@ -32,7 +32,7 @@ jobs:
uses: actions/checkout@v3
with:
repository: mosaicml/ci-testing
-ref: v0.2.0
+ref: v0.2.2
path: ./ci-testing
- uses: ./ci-testing/.github/actions/smoketest
with:
3 changes: 2 additions & 1 deletion Dockerfile
@@ -6,6 +6,7 @@ FROM $BASE_IMAGE

ARG BRANCH_NAME
ARG DEP_GROUPS
+ARG TE_COMMIT

ENV TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0"

@@ -15,7 +16,7 @@ ADD https://raw.githubusercontent.com/mosaicml/llm-foundry/$BRANCH_NAME/setup.py
RUN rm setup.py

# Install TransformerEngine
-RUN NVTE_FRAMEWORK=pytorch CMAKE_BUILD_PARALLEL_LEVEL=4 MAX_JOBS=4 pip install git+https://github.com/NVIDIA/TransformerEngine.git@b5a7c9f
+RUN NVTE_FRAMEWORK=pytorch CMAKE_BUILD_PARALLEL_LEVEL=4 MAX_JOBS=4 pip install git+https://github.com/NVIDIA/TransformerEngine.git@$TE_COMMIT

# Install and uninstall foundry to cache foundry requirements
RUN git clone -b $BRANCH_NAME https://github.com/mosaicml/llm-foundry.git
14 changes: 7 additions & 7 deletions README.md
@@ -107,30 +107,30 @@ Something missing? Contribute with a PR!


# Hardware and Software Requirements
-This codebase has been tested with PyTorch 2.2 with NVIDIA A100s and H100s.
+This codebase has been tested with PyTorch 2.4 with NVIDIA A100s and H100s.
This codebase may also work on systems with other devices, such as consumer NVIDIA cards and AMD cards, but we are not actively testing these systems.
If you have success/failure using LLM Foundry on other systems, please let us know in a Github issue and we will update the support matrix!

| Device | Torch Version | Cuda Version | Status |
| -------------- | ------------- | ------------ | ---------------------------- |
-| A100-40GB/80GB | 2.3.1 | 12.1 | :white_check_mark: Supported |
-| H100-80GB | 2.3.1 | 12.1 | :white_check_mark: Supported |
+| A100-40GB/80GB | 2.4.0 | 12.4 | :white_check_mark: Supported |
+| H100-80GB | 2.4.0 | 12.4 | :white_check_mark: Supported |

## MosaicML Docker Images
We highly recommend using our prebuilt Docker images. You can find them here: https://hub.docker.com/orgs/mosaicml/repositories.

The `mosaicml/pytorch` images are pinned to specific PyTorch and CUDA versions, and are stable and rarely updated.

The `mosaicml/llm-foundry` images are built with new tags upon every commit to the `main` branch.
-You can select a specific commit hash such as `mosaicml/llm-foundry:2.3.1_cu121-36ab1ba` or take the latest one using `mosaicml/llm-foundry:2.3.1_cu121-latest`.
+You can select a specific commit hash such as `mosaicml/llm-foundry:2.4.0_cu124-36ab1ba` or take the latest one using `mosaicml/llm-foundry:2.4.0_cu124-latest`.

**Please Note:** The `mosaicml/llm-foundry` images do not come with the `llm-foundry` package preinstalled, just the dependencies. You will still need to `pip install llm-foundry` either from PyPi or from source.

| Docker Image | Torch Version | Cuda Version | LLM Foundry dependencies installed? |
| ------------------------------------------------------ | ------------- | ----------------- | ----------------------------------- |
-| `mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04` | 2.3.1 | 12.1 (Infiniband) | No |
-| `mosaicml/llm-foundry:2.3.1_cu121-latest` | 2.3.1 | 12.1 (Infiniband) | Yes |
-| `mosaicml/llm-foundry:2.3.1_cu121_aws-latest` | 2.3.1 | 12.1 (EFA) | Yes |
+| `mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04` | 2.4.0 | 12.4 (Infiniband) | No |
+| `mosaicml/llm-foundry:2.4.0_cu124-latest` | 2.4.0 | 12.4 (Infiniband) | Yes |
+| `mosaicml/llm-foundry:2.4.0_cu124_aws-latest` | 2.4.0 | 12.4 (EFA) | Yes |
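
The tag scheme visible in the rows above can be sketched as a small Python helper. This is an illustration inferred from the example tags in this README, not an official API; the function name `foundry_image` and its parameters are hypothetical.

```python
def foundry_image(
    torch_version: str,
    cuda_version: str,
    suffix: str = "latest",
    aws: bool = False,
) -> str:
    """Compose an llm-foundry image reference (hypothetical helper).

    `suffix` is either "latest" or a short commit hash, following the
    examples quoted in the README text.
    """
    # "12.4" becomes "cu124", matching the tags in the table.
    cuda_tag = "cu" + cuda_version.replace(".", "")
    # The EFA (AWS) images carry an "_aws" marker before the suffix.
    variant = "_aws" if aws else ""
    return f"mosaicml/llm-foundry:{torch_version}_{cuda_tag}{variant}-{suffix}"


latest = foundry_image("2.4.0", "12.4")
pinned = foundry_image("2.4.0", "12.4", suffix="36ab1ba")
aws_latest = foundry_image("2.4.0", "12.4", aws=True)
print(latest)
print(pinned)
print(aws_latest)
```

Pinning to a commit hash rather than `latest` keeps a job reproducible when new images are published on each commit to `main`.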


# Installation