feat: move model loader functionality to augmentation #119

willmj · 2025-01-07T19:56:52Z

Description
Step 1 of 3 for enabling LoRA on ScatterMoE: move the model loader functionality into the augmentation step, so that the plugin no longer has to be standalone.

Testing
Testing on fms-hf-tuning with the augmentation function instead of the model loader shows results similar to #390:

      {
          "model_name_or_path": "/ibm_dmf_lakehouse/models/base_training/shared/granite-3.0-3b-a800m-base/r240924a",
          "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
          "output_dir": "/testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone-FAST",
          "save_model_dir": "/testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone-FAST/save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-5,
          "response_template": "\n### Response:",
          "dataset_text_field": "output",
          "fast_moe": 1
      }

Results:

{'loss': 0.834, 'grad_norm': 326.0, 'learning_rate': 9e-06, 'epoch': 1.0}
{'loss': 0.4279, 'grad_norm': 0.076171875, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.0}
{'loss': 0.1377, 'grad_norm': 3.78125, 'learning_rate': 7e-06, 'epoch': 3.0}
{'loss': 0.0384, 'grad_norm': 0.81640625, 'learning_rate': 6e-06, 'epoch': 4.0}
{'loss': 0.0031, 'grad_norm': 0.003997802734375, 'learning_rate': 5e-06, 'epoch': 5.0}
{'loss': 0.0006, 'grad_norm': 0.002044677734375, 'learning_rate': 4.000000000000001e-06, 'epoch': 6.0}
{'loss': 0.0002, 'grad_norm': 0.0032196044921875, 'learning_rate': 3e-06, 'epoch': 7.0}
{'loss': 0.0001, 'grad_norm': 0.002288818359375, 'learning_rate': 2.0000000000000003e-06, 'epoch': 8.0}
{'loss': 0.0001, 'grad_norm': 0.0087890625, 'learning_rate': 1.0000000000000002e-06, 'epoch': 9.0}
{'loss': 0.0001, 'grad_norm': 0.0115966796875, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 2125.7018, 'train_samples_per_second': 4.704, 'train_steps_per_second': 2.352, 'train_loss': 0.14420232288464904, 'epoch': 10.0}

model location: /testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone/save_model

Signed-off-by: Will Johnson <[email protected]>

fabianlim · 2025-01-07T23:48:41Z

plugins/accelerated-moe/src/fms_acceleration_moe/framework_plugin_scattermoe.py

        rank, world_size = 0, 1
        if torch.distributed.is_initialized():
            world_size = torch.distributed.get_world_size()
            rank = torch.distributed.get_rank()

-        # shard the MOE, and store the component names, eventually needed
-        # to configure the FSDP
+        model_name = model.config.name_or_path


I would suggest adding a check for the presence of name_or_path in model.config and, if it is missing, raising a ValueError explaining that ScatterMoE requires name_or_path in the config to point to the model.
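A minimal sketch of what such a check could look like, for illustration only. The helper name `get_model_name_or_raise` and the exact error message are assumptions and are not taken from this PR. The stand-in objects below only mimic a model that exposes `config.name_or_path`.

```python
from types import SimpleNamespace


def get_model_name_or_raise(model):
    """Return model.config.name_or_path, or raise if it is missing or empty.

    Hypothetical helper: the name and message are illustrative only.
    """
    config = getattr(model, "config", None)
    name = getattr(config, "name_or_path", None)
    if not name:
        raise ValueError(
            "ScatterMoE requires model.config.name_or_path to point to the "
            "model, but it is missing or empty."
        )
    return name


# Usage with stand-in objects (no real model is loaded here).
ok_model = SimpleNamespace(config=SimpleNamespace(name_or_path="/path/to/model"))
print(get_model_name_or_raise(ok_model))

bad_model = SimpleNamespace(config=SimpleNamespace(name_or_path=""))
try:
    get_model_name_or_raise(bad_model)
except ValueError as err:
    print(f"ValueError: {err}")
```

Treating an empty string the same as a missing attribute is a design choice in this sketch: a config built in memory may carry an empty name_or_path, which would be equally unusable for locating the checkpoint.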


willmj added 2 commits January 3, 2025 11:07

feat: add augmentation, comment out model loader (first draft)

c5d6200

Signed-off-by: Will Johnson <[email protected]>

feat: move model loader functionality to augmentation

b154026

Signed-off-by: Will Johnson <[email protected]>

willmj requested a review from fabianlim as a code owner January 7, 2025 19:56

willmj added 2 commits January 7, 2025 15:04

lint: remove unused import

c1e866a

Signed-off-by: Will Johnson <[email protected]>

fmt

17d0d6c

Signed-off-by: Will Johnson <[email protected]>

fabianlim requested changes Jan 7, 2025


fix: raise error if no model name

2f7df93

Signed-off-by: Will Johnson <[email protected]>
