
MosaicBERT Migration #481

Closed

cojennin wants to merge 82 commits

Conversation


@cojennin commented Jul 21, 2023

This PR migrates MosaicBERT (blogpost, workshop paper) from mosaicml/examples to mosaicml/llm-foundry. The goal of the PR is to replicate the functionality of MosaicBERT pretraining and finetuning with minimal changes to the MosaicBERT code.

This addition is significant: it expands the scope of the llm-foundry repo beyond autoregressive models and the MPT architecture, and it lays the groundwork for BERT-like embedding models.

In the original mosaicml/examples repo, you can pretrain both BERT and MosaicBERT. You can also finetune composer checkpoints of BERT and MosaicBERT on the GLUE benchmark. This port maintains these capabilities.

File Structure

Here's how the files have roughly been copied over:

mosaicml/examples/benchmarks/      →  mosaicml/llm-foundry
bert/src/bert_layers.py            →  llmfoundry/layers/mosaicbert_layers.py
bert/src/bert_padding.py           →  llmfoundry/models/utils/bert_padding.py
bert/src/hf_bert.py                →  llmfoundry/models/hf/hf_bert.py
bert/src/mosaic_bert.py            →  llmfoundry/layers/mosaicbert/modeling_mosaicbert.py
bert/src/configuration_bert.py     →  llmfoundry/layers/mosaicbert/config_mosaicbert.py
bert/tests/... (6 files)           →  llm-foundry/tests/... (6 files)

Pretraining

Both HF BERT and MosaicBERT pretraining can be run with

cd llm-foundry/scripts
composer train/train.py /mnt/config/parameters.yaml
  • Within the llmfoundry/models folder, we added hf_bert.py to the hf subfolder and additionally created a new folder mosaicbert with the files configuration_mosaicbert.py and modeling_mosaicbert.py.

  • We also added ComposerMosaicBertForMaskedLM, ComposerMosaicBertForSequenceClassification, ComposerHFBertForMaskedLM and ComposerHFBertForSequenceClassification to model_registry.py (see the sketch after this list).

  • Note that we also renamed ComposerBertForMaskedLM to ComposerMosaicBertForMaskedLM to make it clear exactly when we use MosaicBERT versus classic BERT.

  • We changed model_config to resolved_om_model_config.

  • LanguageCrossEntropy(ignore_index=-100, vocab_size=model.config.vocab_size) was changed to LanguageCrossEntropy(ignore_index=-100) for compatibility with newer Composer versions.

  • mosaicbert_layers.py is kept in the layers folder instead of the models/mosaicbert folder. The MosaicBERT layers are modified versions of the original transformers BERT layers, and this structure differs slightly from how we have designed layers elsewhere in llm-foundry. To change as few files as possible, we have left mosaicbert_layers.py unchanged.
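
As referenced above, here is a rough sketch of what the model_registry.py additions might look like. The class names are the ones from this PR, but the import paths, the classification registry keys, and the existing entries are assumptions for illustration only:

# model_registry.py (illustrative sketch; import paths and keys are assumed)
from llmfoundry.models.hf.hf_bert import (
    ComposerHFBertForMaskedLM, ComposerHFBertForSequenceClassification)
from llmfoundry.models.mosaicbert.modeling_mosaicbert import (
    ComposerMosaicBertForMaskedLM, ComposerMosaicBertForSequenceClassification)

COMPOSER_MODEL_REGISTRY = {
    # ... existing MPT / HF causal LM entries ...
    'hf_bert_masked_lm': ComposerHFBertForMaskedLM,
    'mosaicbert_masked_lm': ComposerMosaicBertForMaskedLM,
    # Classification keys below are hypothetical; the actual names may differ.
    'hf_bert_classification': ComposerHFBertForSequenceClassification,
    'mosaicbert_classification': ComposerMosaicBertForSequenceClassification,
}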

Finetuning

The files necessary for running GLUE finetuning live in the scripts/train/finetune_bert_glue/ folder.

Instead of creating the function create_hf_bert_classification(), as in the mosaicml/examples repo, we simply define build_composer_model(model_cfg, tokenizer), which matches the model name (e.g. mosaicbert_masked_lm, hf_bert_masked_lm, etc.) against the model registry.
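
A minimal sketch of that dispatch, assuming a registry like the one sketched above; details such as error handling are guesses rather than the PR's exact code:

from typing import Union

from omegaconf import DictConfig
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

def build_composer_model(model_cfg: DictConfig, tokenizer: Tokenizer):
    # Look up the Composer wrapper class by its registered name
    # (e.g. 'mosaicbert_masked_lm' or 'hf_bert_masked_lm') and construct it.
    if model_cfg.name not in COMPOSER_MODEL_REGISTRY:
        raise ValueError(f'Not sure how to build model with name={model_cfg.name}')
    return COMPOSER_MODEL_REGISTRY[model_cfg.name](model_cfg, tokenizer)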

A few other small changes were made to glue.py.

Experiments

  • HuggingFace BERT pretraining (wandb and wandb)

  • MosaicBERT pretraining (wandb)

  • HuggingFace BERT finetuning on 1 GPU (wandb)

  • HuggingFace BERT finetuning on 8 GPUs (wandb)

  • MosaicBERT finetuning on 1 GPU (wandb)

  • MosaicBERT finetuning on 8 GPUs (wandb)

TO DO

  • port over test_glue.py, test_classification.py, test_main.py

Nice to Have

  • README for pretraining
  • README for finetuning

Potential To Dos once this PR is approved:

  • The logic in class ComposerMosaicBertForMaskedLM(HuggingFaceModel) could be consolidated, since it now inherits from HuggingFaceModel

Other changes:

  • pretrained_checkpoint = resolved_om_model_config.get('pretrained_checkpoint') was removed from hf_bert.py

@jacobfulano changed the title from "DRAFT: BERT Migration" to "MosaicBERT Migration" on Aug 29, 2023

@jacobfulano (Contributor) commented:

@dakinggg @alextrott16 this PR is ready for review

@alextrott16 (Contributor) left a comment

Overall, this looks very solid. I do think some YAMLs got overlooked and need to be updated to reflect how everything is landing here in llm-foundry. I think that needs to be addressed before stamping Approve.

Also, please weigh in on the "signature" of the model constructors, i.e. the config structure that the new BERT model constructors expect. I'm thinking they should be as unified as possible with the existing constructors. If unifying seems dumb, please say so. Or, if it's just out of scope for now, then I'm OK with scoping a refactor into a future PR.

Finally, let's get all the tests/sanity checks cleared before we merge. For the record, the PR summary (which includes that checklist) is absolutely fantastic and is so appreciated by this humble reviewer!

image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
integrations:
- integration_type: git_repo
git_repo: mosaicml/examples
Contributor

Seems like this MCLI YAML hasn't been updated for llm-foundry yet.

# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Contributor

Double header

# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
# SPDX-License-Identifier: Apache-2.0
Contributor

Double header... although they aren't exact duplicates, so maybe this is intentional? If it's not intentional, please clean up the headers :)

@@ -0,0 +1,81 @@
# This YAML is built to work with the `sequence_classification.py` starter script!
Contributor

I don't think this starter script has been ported over (which is fine), so I'm guessing this YAML might not be fully updated to reflect llm-foundry. Please make sure these YAMLs are up to date and the comments make sense.

Also, all the new finetune YAMLs seem to use a different naming convention than the one established throughout. For example, the YAML filenames use "bert-hf" and "bert-mosaic" instead of "hf-bert" and "mosaicbert". Please rename them for consistency.

Contributor

Actually, these non-GLUE finetuning YAMLs might not really even work here without the sequence_classification.py starter script. As the comment references, that starter script includes a build_my_dataloader function, which is meant to be edited to set up your custom dataset at runtime. I don't think there are any plans to re-introduce such a starter script, so these YAMLs won't really work.

With this PR, will we actually have support for finetuning a BERT model on your own, non-GLUE dataset? If not, that should motivate a follow-up PR to add that in!

# This YAML is built to work with the `sequence_classification.py` starter script!
#
# Follow the instructions in that script to modify the `build_my_dataloader` function
# and fine-tune a BERT model on your own dataset!
Contributor

See comments for HF version of this YAML.

Currently, MosaicBERT is available for masked language modeling :class:`BertForMaskedLM` and sequence
classification :class:`BertForSequenceClassification`. We aim to expand this catalogue in future releases.

See :file:`./mosaic_bert.py` for utilities to simplify working with MosaicBERT in Composer, and for example usage
Contributor

Out of date reference path here.

architecture of a Hugging Face model.
tokenizer_name (str, optional): Tokenizer name used to preprocess the dataset and validate the models inputs.
gradient_checkpointing (bool, optional): Use gradient checkpointing. Default: ``False``.

Contributor

Please update the docstring.

I think the signature for these HF BERTs is designed to match the signature for MosaicBERT. I wonder if that is the right approach here... The alternative is for all the hf_X models to have the same signature, so this one would have the same signature as hf_causal_lm, for instance.

In any case, I think we should think carefully about how to unify the config structures that these model wrapper classes in llm-foundry expect.

Collaborator

My suggestion is for all of the Composer wrapper classes to have the same signature, which, for now, takes an om_model_config and a tokenizer. Chuck is slowly removing all of the direct config passing, so at some point that signature will get changed, but let's just be consistent for now.
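
For concreteness, the suggested shared signature might look roughly like this (an illustrative sketch; class bodies and the type alias are assumptions, not the PR's code):

from typing import Union

from composer.models import HuggingFaceModel
from omegaconf import DictConfig
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

class ComposerHFBertForMaskedLM(HuggingFaceModel):
    # Every Composer wrapper class takes the same two arguments ...
    def __init__(self, om_model_config: DictConfig, tokenizer: Tokenizer):
        ...

class ComposerMosaicBertForMaskedLM(HuggingFaceModel):
    # ... so a trainer script can construct any of them interchangeably.
    def __init__(self, om_model_config: DictConfig, tokenizer: Tokenizer):
        ...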

@dakinggg (Collaborator) left a comment

A couple of high-level comments: we need tests with this PR and we need documentation with this PR. I suggest a small section of the train README about BERT, and a README in finetune_bert_glue.

# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Collaborator

drop examples license?

architecture of a Hugging Face model.
tokenizer_name (str, optional): Tokenizer name used to preprocess the dataset and validate the models inputs.
gradient_checkpointing (bool, optional): Use gradient checkpointing. Default: ``False``.

Collaborator

My suggestion is for all of the Composer wrapper classes to have the same signature, which, for now, takes an om_model_config and a tokenizer. Chuck is slowly removing all of the direct config passing, so at some point that signature will get changed, but let's just be consistent for now.

resolved_om_model_config: Any = om.to_container(om_model_config,
resolve=True)

try:
Collaborator

This can be removed and import moved to top of file. You can't install foundry without installing transformers.
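
In other words, something like the following (a paraphrased sketch; the exact symbols guarded in hf_bert.py may differ):

# Before (paraphrased): guarded import inside the model builder
#     try:
#         import transformers
#     except ImportError as e:
#         raise e
#
# After: unconditional import at the top of hf_bert.py, since llm-foundry
# cannot be installed without transformers.
import transformers
from transformers import BertConfig, BertForMaskedLM  # illustrative symbols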

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]


class ComposerHFBertForMaskedLM(HuggingFaceModel):
Collaborator

This is just a generic masked LM as far as I can tell, except for the one line that defaults the pretrained model to bert-base-uncased. Can we drop the "bert" from the class name? Then this more directly mirrors ComposerHFCausalLM, as ComposerHFMaskedLM.
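
A rough sketch of what that more generic wrapper could look like; the constructor details and config keys here are assumptions for illustration, not the PR's actual code:

from typing import Union

from composer.models import HuggingFaceModel
from omegaconf import DictConfig
from transformers import (AutoModelForMaskedLM, PreTrainedTokenizer,
                          PreTrainedTokenizerFast)

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

class ComposerHFMaskedLM(HuggingFaceModel):
    # Generic HF masked-LM wrapper: nothing BERT-specific except the default
    # checkpoint name, mirroring the one line called out above.
    def __init__(self, om_model_config: DictConfig, tokenizer: Tokenizer):
        pretrained_name = om_model_config.get('pretrained_model_name_or_path',
                                              'bert-base-uncased')
        model = AutoModelForMaskedLM.from_pretrained(pretrained_name)
        super().__init__(model=model, tokenizer=tokenizer, use_logits=True)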


For more information, see `Transformers <https://huggingface.co/transformers/>`_.

Args:
Collaborator

Update docstring

metric_names=['SpearmanCorrCoef'])
self.evaluators = [stsb_evaluator]

# Hardcoded for STSB due to a bug (Can be removed once torchmetrics fixes https://github.com/Lightning-AI/metrics/issues/1294)
Collaborator

This can be removed now, I believe.

# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Collaborator

license

}


def build_algorithm(name: str, kwargs: Any):
Collaborator

All of these builders should be moved to the llmfoundry builders file.
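
For reference, such a builder is typically just a name-to-class mapping over Composer algorithms; a minimal sketch follows (the set of names actually supported in the GLUE script and in llmfoundry's builders file may differ):

from typing import Any, Dict

from composer import algorithms

def build_algorithm(name: str, kwargs: Dict[str, Any]):
    # Map a YAML-provided name to a Composer algorithm instance.
    if name == 'gradient_clipping':
        return algorithms.GradientClipping(**kwargs)
    elif name == 'alibi':
        return algorithms.Alibi(**kwargs)
    elif name == 'gated_linear_units':
        return algorithms.GatedLinearUnits(**kwargs)
    elif name == 'low_precision_layernorm':
        return algorithms.LowPrecisionLayerNorm(**kwargs)
    else:
        raise ValueError(f'Not sure how to build algorithm: {name}')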

@@ -0,0 +1,110 @@
# Note that some of the fields in this template haven't been filled in yet.
Collaborator

We should adapt the YAML structure to match the LLM structure. The difference I notice in this one is model_config instead of config_overrides.
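
To illustrate the difference (a hedged sketch, not code from the PR): a config_overrides section in the existing LLM YAMLs corresponds to attribute overrides applied on top of the pretrained Hugging Face config, roughly like this:

from transformers import AutoConfig

# Hypothetical dict that a `config_overrides:` YAML section would resolve to.
config_overrides = {'attention_probs_dropout_prob': 0.0}

config = AutoConfig.from_pretrained('bert-base-uncased')
for key, value in config_overrides.items():
    if not hasattr(config, key):
        raise ValueError(f'Config does not have attribute {key}')
    setattr(config, key, value)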

# Mosaic BERT 'base' generally uses the default architecture values from the Hugging Face BertConfig object
# Note: if using the pretrained_checkpoint argument to create a model from an existing checkpoint, make sure
# the model_config settings match the architecture of the existing model
model_config:
Collaborator

same

@dakinggg (Collaborator) commented Feb 2, 2024

@cojennin should we close this?

@dakinggg (Collaborator) commented Mar 6, 2024

Closing as this is not being actively worked on.

@dakinggg closed this Mar 6, 2024