
MosaicBERT Migration #481

Closed

cojennin wants to merge 82 commits

Conversation


@cojennin commented Jul 21, 2023

This PR migrates MosaicBERT (blogpost, workshop paper) from mosaicml/examples to mosaicml/llm-foundry. The goal of the PR is to replicate the functionality of MosaicBERT pretraining and finetuning with minimal changes to the MosaicBERT code.

This addition is significant: it expands the scope of the llm-foundry repo beyond autoregressive models and the MPT architecture, and it lays the groundwork for BERT-like embedding models.

In the original mosaicml/examples repo, you can pretrain both BERT and MosaicBERT. You can also finetune composer checkpoints of BERT and MosaicBERT on the GLUE benchmark. This port maintains these capabilities.

File Structure

Here's how the files have roughly been copied over:

mosaicml/examples/benchmarks/      →  mosaicml/llm-foundry
bert/src/bert_layers.py            →  llmfoundry/layers/mosaicbert_layers.py
bert/src/bert_padding.py           →  llmfoundry/models/utils/bert_padding.py
bert/src/hf_bert.py                →  llmfoundry/models/hf/hf_bert.py
bert/src/mosaic_bert.py            →  llmfoundry/layers/mosaicbert/modeling_mosaicbert.py
bert/src/configuration_bert.py     →  llmfoundry/layers/mosaicbert/config_mosaicbert.py
bert/tests/... (6 files)           →  llm-foundry/tests/... (6 files)

Pretraining

Both HF BERT and MosaicBERT pretraining can be run with

cd llm-foundry/scripts
composer train/train.py /mnt/config/parameters.yaml
  • Within the llmfoundry/models folder, we added hf_bert.py to the hf subfolder and additionally created a new folder mosaicbert with the files configuration_mosaicbert.py and modeling_mosaicbert.py.

  • We also added ComposerMosaicBertForMaskedLM, ComposerMosaicBertForSequenceClassification, ComposerHFBertForMaskedLM and ComposerHFBertForSequenceClassification to model_registry.py (see the sketch after this list).

  • Note that we also renamed ComposerBertForMaskedLM to ComposerMosaicBertForMaskedLM to make it clear exactly when we use MosaicBERT versus classic BERT.

  • We changed model_config to resolved_om_model_config.

  • LanguageCrossEntropy(ignore_index=-100, vocab_size=model.config.vocab_size) was changed to LanguageCrossEntropy(ignore_index=-100) for compatibility with newer Composer versions.

  • mosaicbert_layers.py is kept in the layers folder instead of the models/mosaicbert folder. The MosaicBERT layers are modified versions of the original transformers BERT layers, and this structure differs slightly from how we have designed layers elsewhere in llm-foundry. To change as few files as possible, we have left mosaicbert_layers.py unchanged.
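
As referenced above, here is a rough sketch of what the model_registry.py additions might look like. The class names are the ones from this PR, but the import paths, the classification registry keys, and the existing entries are assumptions for illustration only:

# model_registry.py (illustrative sketch; import paths and keys are assumed)
from llmfoundry.models.hf.hf_bert import (
    ComposerHFBertForMaskedLM, ComposerHFBertForSequenceClassification)
from llmfoundry.models.mosaicbert.modeling_mosaicbert import (
    ComposerMosaicBertForMaskedLM, ComposerMosaicBertForSequenceClassification)

COMPOSER_MODEL_REGISTRY = {
    # ... existing MPT / HF causal LM entries ...
    'hf_bert_masked_lm': ComposerHFBertForMaskedLM,
    'mosaicbert_masked_lm': ComposerMosaicBertForMaskedLM,
    # Classification keys below are hypothetical; the actual names may differ.
    'hf_bert_classification': ComposerHFBertForSequenceClassification,
    'mosaicbert_classification': ComposerMosaicBertForSequenceClassification,
}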

Finetuning

The files necessary for running GLUE finetuning live in the scripts/train/finetune_bert_glue/ folder.

Instead of creating the function create_hf_bert_classification(), as in the mosaicml/examples repo, we simply define build_composer_model(model_cfg, tokenizer), which matches the model name (e.g. mosaicbert_masked_lm, hf_bert_masked_lm, etc.) against the model registry.
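
A minimal sketch of that dispatch, assuming a registry like the one sketched above; details such as error handling are guesses rather than the PR's exact code:

from typing import Union

from omegaconf import DictConfig
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

def build_composer_model(model_cfg: DictConfig, tokenizer: Tokenizer):
    # Look up the Composer wrapper class by its registered name
    # (e.g. 'mosaicbert_masked_lm' or 'hf_bert_masked_lm') and construct it.
    if model_cfg.name not in COMPOSER_MODEL_REGISTRY:
        raise ValueError(f'Not sure how to build model with name={model_cfg.name}')
    return COMPOSER_MODEL_REGISTRY[model_cfg.name](model_cfg, tokenizer)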

A few other small changes were made to glue.py.

Experiments

  • HuggingFace BERT pretraining (wandb and wandb)

  • MosaicBERT pretraining (wandb)

  • HuggingFace BERT finetuning on 1 GPU (wandb)

  • HuggingFace BERT finetuning on 8 GPUs (wandb)

  • MosaicBERT finetuning on 1 GPU (wandb)

  • MosaicBERT finetuning on 8 GPUs (wandb)

TO DO

  • port over test_glue.py, test_classification.py, test_main.py

Nice to Have

  • README for pretraining
  • README for finetuning

Potential To Dos once this PR is approved:

  • The logic in class ComposerMosaicBertForMaskedLM(HuggingFaceModel) could be consolidated, since it now inherits from HuggingFaceModel

Other changes:

  • pretrained_checkpoint = resolved_om_model_config.get('pretrained_checkpoint') was removed from hf_bert.py

@jacobfulano changed the title from "DRAFT: BERT Migration" to "MosaicBERT Migration" on Aug 29, 2023

@jacobfulano (Contributor) commented:

@dakinggg @alextrott16 this PR is ready for review

@alextrott16 (Contributor) left a comment

Overall, this looks very solid. I do think some YAMLs got overlooked and need to be updated to reflect how everything is landing here in llm-foundry. I think that needs to be addressed before stamping Approve.

Also, please weigh in on the "signature" of the model constructors, i.e. the config structure that the new BERT model constructors expect. I'm thinking they should be as unified as possible with the existing constructors. If unifying seems dumb, please say so. Or, if it's just out of scope for now, then I'm OK with scoping a refactor into a future PR.

Finally, let's get all the tests/sanity checks cleared before we merge. For the record, the PR summary (which includes that checklist) is absolutely fantastic and is so appreciated by this humble reviewer!

image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
integrations:
- integration_type: git_repo
git_repo: mosaicml/examples
Contributor

Seems like this MCLI YAML hasn't been updated for llm-foundry yet.

# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Contributor

Double header

# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
# SPDX-License-Identifier: Apache-2.0
Contributor

Double header... although they aren't exact duplicates, so maybe this is intentional? If it's not intentional, please clean up the headers :)

@@ -0,0 +1,81 @@
# This YAML is built to work with the `sequence_classification.py` starter script!
Contributor

I don't think this starter script has been ported over (which is fine), so I'm guessing this YAML might not be fully updated to reflect llm-foundry. Please make sure these YAMLs are up to date and the comments make sense.

Also, all the new finetune YAMLs seem to use a different naming convention than the one established throughout. For example, the YAML filenames use "bert-hf" and "bert-mosaic" instead of "hf-bert" and "mosaicbert". Please rename them for consistency.

Contributor

Actually, these non-GLUE finetuning YAMLs might not really even work here without the sequence_classification.py starter script. As the comment references, that starter script includes a build_my_dataloader function, which is meant to be edited to set up your custom dataset at runtime. I don't think there are any plans to re-introduce such a starter script, so these YAMLs won't really work.

With this PR, will we actually have support for finetuning a BERT model on your own, non-GLUE dataset? If not, that should motivate a follow-up PR to add that in!

# This YAML is built to work with the `sequence_classification.py` starter script!
#
# Follow the instructions in that script to modify the `build_my_dataloader` function
# and fine-tune a BERT model on your own dataset!
Contributor

See comments for HF version of this YAML.

Currently, MosaicBERT is available for masked language modeling :class:`BertForMaskedLM` and sequence
classification :class:`BertForSequenceClassification`. We aim to expand this catalogue in future releases.

See :file:`./mosaic_bert.py` for utilities to simplify working with MosaicBERT in Composer, and for example usage
Contributor

Out of date reference path here.

architecture of a Hugging Face model.
tokenizer_name (str, optional): Tokenizer name used to preprocess the dataset and validate the models inputs.
gradient_checkpointing (bool, optional): Use gradient checkpointing. Default: ``False``.

Contributor

Please update the docstring.

I think the signature for these HF BERTs is designed to match the signature for MosaicBERT. I wonder if that is the right approach here... The alternative is for all the hf_X models to have the same signature, so this one would have the same signature as hf_causal_lm, for instance.

In any case, I think we should think carefully about how to unify the config structures that these model wrapper classes in llm-foundry expect.

Collaborator

My suggestion is for all of the Composer wrapper classes to have the same signature, which, for now, takes an om_model_config and a tokenizer. Chuck is slowly removing all of the direct config passing, so at some point that signature will get changed, but let's just be consistent for now.
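
For concreteness, the suggested shared signature might look roughly like this (an illustrative sketch; class bodies and the type alias are assumptions, not the PR's code):

from typing import Union

from composer.models import HuggingFaceModel
from omegaconf import DictConfig
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

class ComposerHFBertForMaskedLM(HuggingFaceModel):
    # Every Composer wrapper class takes the same two arguments ...
    def __init__(self, om_model_config: DictConfig, tokenizer: Tokenizer):
        ...

class ComposerMosaicBertForMaskedLM(HuggingFaceModel):
    # ... so a trainer script can construct any of them interchangeably.
    def __init__(self, om_model_config: DictConfig, tokenizer: Tokenizer):
        ...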

@dakinggg (Collaborator) left a comment

A couple of high-level comments: we need tests with this PR and we need documentation with this PR. I suggest a small section of the train README about BERT, and a README in finetune_bert_glue.

# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Collaborator

drop examples license?

architecture of a Hugging Face model.
tokenizer_name (str, optional): Tokenizer name used to preprocess the dataset and validate the models inputs.
gradient_checkpointing (bool, optional): Use gradient checkpointing. Default: ``False``.

Collaborator

My suggestion is for all of the Composer wrapper classes to have the same signature, which, for now, takes an om_model_config and a tokenizer. Chuck is slowly removing all of the direct config passing, so at some point that signature will get changed, but let's just be consistent for now.

resolved_om_model_config: Any = om.to_container(om_model_config,
resolve=True)

try:
Collaborator

This can be removed and import moved to top of file. You can't install foundry without installing transformers.
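
In other words, something like the following (a paraphrased sketch; the exact symbols guarded in hf_bert.py may differ):

# Before (paraphrased): guarded import inside the model builder
#     try:
#         import transformers
#     except ImportError as e:
#         raise e
#
# After: unconditional import at the top of hf_bert.py, since llm-foundry
# cannot be installed without transformers.
import transformers
from transformers import BertConfig, BertForMaskedLM  # illustrative symbols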

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]


class ComposerHFBertForMaskedLM(HuggingFaceModel):
Collaborator

This is just a generic masked LM as far as I can tell, except for the one line that defaults the pretrained model to bert-base-uncased. Can we drop the "bert" from the class name? Then this more directly mirrors ComposerHFCausalLM, as ComposerHFMaskedLM.
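
A rough sketch of what that more generic wrapper could look like; the constructor details and config keys here are assumptions for illustration, not the PR's actual code:

from typing import Union

from composer.models import HuggingFaceModel
from omegaconf import DictConfig
from transformers import (AutoModelForMaskedLM, PreTrainedTokenizer,
                          PreTrainedTokenizerFast)

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

class ComposerHFMaskedLM(HuggingFaceModel):
    # Generic HF masked-LM wrapper: nothing BERT-specific except the default
    # checkpoint name, mirroring the one line called out above.
    def __init__(self, om_model_config: DictConfig, tokenizer: Tokenizer):
        pretrained_name = om_model_config.get('pretrained_model_name_or_path',
                                              'bert-base-uncased')
        model = AutoModelForMaskedLM.from_pretrained(pretrained_name)
        super().__init__(model=model, tokenizer=tokenizer, use_logits=True)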


For more information, see `Transformers <https://huggingface.co/transformers/>`_.

Args:
Collaborator

Update docstring

metric_names=['SpearmanCorrCoef'])
self.evaluators = [stsb_evaluator]

# Hardcoded for STSB due to a bug (Can be removed once torchmetrics fixes https://github.com/Lightning-AI/metrics/issues/1294)
Collaborator

This can be removed now, I believe.

# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Collaborator

license

}


def build_algorithm(name: str, kwargs: Any):
Collaborator

All of these builders should be moved to the llmfoundry builders file.
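
For reference, such a builder is typically just a name-to-class mapping over Composer algorithms; a minimal sketch follows (the set of names actually supported in the GLUE script and in llmfoundry's builders file may differ):

from typing import Any, Dict

from composer import algorithms

def build_algorithm(name: str, kwargs: Dict[str, Any]):
    # Map a YAML-provided name to a Composer algorithm instance.
    if name == 'gradient_clipping':
        return algorithms.GradientClipping(**kwargs)
    elif name == 'alibi':
        return algorithms.Alibi(**kwargs)
    elif name == 'gated_linear_units':
        return algorithms.GatedLinearUnits(**kwargs)
    elif name == 'low_precision_layernorm':
        return algorithms.LowPrecisionLayerNorm(**kwargs)
    else:
        raise ValueError(f'Not sure how to build algorithm: {name}')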

@@ -0,0 +1,110 @@
# Note that some of the fields in this template haven't been filled in yet.
Collaborator

We should adapt the YAML structure to match the LLM structure. The difference I notice in this one is model_config instead of config_overrides.
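
To illustrate the difference (a hedged sketch, not code from the PR): a config_overrides section in the existing LLM YAMLs corresponds to attribute overrides applied on top of the pretrained Hugging Face config, roughly like this:

from transformers import AutoConfig

# Hypothetical dict that a `config_overrides:` YAML section would resolve to.
config_overrides = {'attention_probs_dropout_prob': 0.0}

config = AutoConfig.from_pretrained('bert-base-uncased')
for key, value in config_overrides.items():
    if not hasattr(config, key):
        raise ValueError(f'Config does not have attribute {key}')
    setattr(config, key, value)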

# Mosaic BERT 'base' generally uses the default architecture values from the Hugging Face BertConfig object
# Note: if using the pretrained_checkpoint argument to create a model from an existing checkpoint, make sure
# the model_config settings match the architecture of the existing model
model_config:
Collaborator

same

@dakinggg (Collaborator) commented Feb 2, 2024

@cojennin should we close this?

@dakinggg (Collaborator) commented Mar 6, 2024

Closing as this is not being actively worked on.

@dakinggg closed this Mar 6, 2024