MosaicBERT Migration #481
@dakinggg @alextrott16 this PR is ready for review
Overall, this looks very solid. I do think some YAMLs got overlooked and need to be updated to reflect how everything is landing here in llm-foundry. I think that needs to be addressed before stamping Approve.
Also, please weigh in on the "signature" of the model constructors, i.e. the config structure that the new BERT model constructors expect. I'm thinking they should be as unified as possible with the existing constructors. If unifying seems dumb, please say so. Or, if it's just out of scope for now, then I'm OK with scoping a refactor into a future PR.
Finally, let's get all the tests/sanity checks cleared before we merge. For the record, the PR summary (which includes that checklist) is absolutely fantastic and is so appreciated by this humble reviewer!
image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
integrations:
- integration_type: git_repo
  git_repo: mosaicml/examples
Seems like this MCLI YAML hasn't been updated for llm-foundry yet.
# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
Double header
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
# SPDX-License-Identifier: Apache-2.0
Double header... although they aren't exact duplicates, so maybe this is intentional? Well, if not intentional, please clean up the headers :)
@@ -0,0 +1,81 @@
# This YAML is built to work with the `sequence_classification.py` starter script!
I don't think this starter script has been ported over (which is fine), so I'm guessing this YAML might not be fully updated to reflect llm-foundry. Please make sure these YAMLs are up to date and the comments make sense.

Also, all the new finetune YAMLs seem to use a different naming convention than has been established throughout. For example, the YAML filenames use "bert-hf" and "bert-mosaic" instead of "hf-bert" and "mosaicbert". Please rename them for consistency.
Actually, these non-GLUE finetuning YAMLs might not really even work here without the sequence_classification.py starter script. As the comment references, that starter script includes a build_my_dataloader function, which is meant to be edited to set up your custom dataset at runtime. I don't think there are any plans to re-introduce such a starter script, so these YAMLs won't really work.
With this PR, will we actually have support for finetuning a BERT model on your own, non-GLUE dataset? If not, that should motivate a follow-up PR to add that in!
# This YAML is built to work with the `sequence_classification.py` starter script!
#
# Follow the instructions in that script to modify the `build_my_dataloader` function
# and fine-tune a BERT model on your own dataset!
See comments for HF version of this YAML.
Currently, MosaicBERT is available for masked language modeling :class:`BertForMaskedLM` and sequence
classification :class:`BertForSequenceClassification`. We aim to expand this catalogue in future releases.

See :file:`./mosaic_bert.py` for utilities to simplify working with MosaicBERT in Composer, and for example usage
Out of date reference path here.
architecture of a Hugging Face model.
tokenizer_name (str, optional): Tokenizer name used to preprocess the dataset and validate the models inputs.
gradient_checkpointing (bool, optional): Use gradient checkpointing. Default: ``False``.
Please update the docstring.
I think the signature for these HF BERTs is designed to match the signature for Mosaic BERT. I wonder if that is the right approach here... The alternative is for all the hf_X models to have the same signature. So, this would have the same signature as hf_causal_lm, for instance.

In any case, I think we should think carefully about how to unify the config structures that these model wrapper classes in llm-foundry expect.
My suggestion is for all of the Composer wrapper classes to have the same signature, which, for now, takes an om_model_config and a tokenizer. Chuck is slowly removing all of the direct config passing, so at some point that signature will get changed, but let's just be consistent for now.
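For concreteness, a minimal sketch of that shape, assuming the two-argument form described here (this is not the final llm-foundry API):

```python
# Sketch only: every Composer wrapper class would expose the same two-argument
# constructor, so builders and YAMLs can treat HF BERT, MosaicBERT, etc. uniformly.
from composer.models import HuggingFaceModel
from omegaconf import DictConfig
from transformers import PreTrainedTokenizerBase


class ComposerHFBertForMaskedLM(HuggingFaceModel):
    """Same __init__ shape as every other wrapper in the model registry."""

    def __init__(self, om_model_config: DictConfig,
                 tokenizer: PreTrainedTokenizerBase):
        ...
```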
A couple of high-level comments: we need tests with this PR, and we need documentation with this PR. I suggest a small section of the train README about BERT, and a README in finetune_bert_glue.
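For example, a first smoke test could be as small as checking that the new BERT entries exist in the registry; the import path and key names below are assumptions based on the PR summary, not the final test suite.

```python
# Hypothetical smoke test; the registry module path and key names are guesses.
from llmfoundry.models.model_registry import COMPOSER_MODEL_REGISTRY


def test_bert_models_are_registered():
    for name in ('mosaicbert_masked_lm', 'hf_bert_masked_lm'):
        assert name in COMPOSER_MODEL_REGISTRY
```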
# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
drop examples license?
architecture of a Hugging Face model.
tokenizer_name (str, optional): Tokenizer name used to preprocess the dataset and validate the models inputs.
gradient_checkpointing (bool, optional): Use gradient checkpointing. Default: ``False``.
My suggestion is for all of the Composer wrapper classes to have the same signature, which, for now, takes an om_model_config and a tokenizer. Chuck is slowly removing all of the direct config passing, so at some point that signature will get changed, but let's just be consistent for now.
resolved_om_model_config: Any = om.to_container(om_model_config,
                                                resolve=True)

try:
This can be removed and the import moved to the top of the file. You can't install foundry without installing transformers.
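In other words, assuming the guarded block is the usual try/except ImportError around the transformers import, it can collapse to an unconditional import at the top of the module (the specific names shown are illustrative):

```python
# transformers is a hard dependency of foundry, so no ImportError guard is needed.
from typing import Union

from transformers import (BertConfig, BertForMaskedLM, PreTrainedTokenizer,
                          PreTrainedTokenizerFast)

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
```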
Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]


class ComposerHFBertForMaskedLM(HuggingFaceModel):
This is just a generic masked LM as far as I can tell, except for the one line that defaults the pretrained model to bert-base-uncased. Can we drop the "bert" from the class name? Then this more directly mirrors ComposerHFCausalLM, as ComposerHFMaskedLM.
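A sketch of what that rename could look like, with the config field name and construction details as assumptions rather than the PR's actual code:

```python
# Hypothetical generic wrapper mirroring ComposerHFCausalLM; BERT only appears
# as a default value when the config does not name a pretrained model.
from composer.models import HuggingFaceModel
from omegaconf import DictConfig
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerBase


class ComposerHFMaskedLM(HuggingFaceModel):

    def __init__(self, om_model_config: DictConfig,
                 tokenizer: PreTrainedTokenizerBase):
        name = om_model_config.get('pretrained_model_name_or_path',
                                   'bert-base-uncased')
        config = AutoConfig.from_pretrained(name)
        model = AutoModelForMaskedLM.from_config(config)
        super().__init__(model=model, tokenizer=tokenizer, use_logits=True)
```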
For more information, see `Transformers <https://huggingface.co/transformers/>`_.

Args:
Update docstring
metric_names=['SpearmanCorrCoef'])
self.evaluators = [stsb_evaluator]

# Hardcoded for STSB due to a bug (Can be removed once torchmetrics fixes https://github.com/Lightning-AI/metrics/issues/1294)
This can be removed now, I believe.
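For reference, once the workaround is gone the STSB evaluator presumably reduces to something like this (the label and dataloader here are placeholders, not the exact glue.py code):

```python
# Sketch: with the torchmetrics fix in place, SpearmanCorrCoef can be referenced
# by name like any other metric, with no hardcoded special case.
from composer.core import Evaluator


def build_stsb_evaluator(stsb_dataloader):
    return Evaluator(label='glue_stsb',
                     dataloader=stsb_dataloader,
                     metric_names=['SpearmanCorrCoef'])
```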
# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

# Copyright 2022 MosaicML Examples authors
license
}


def build_algorithm(name: str, kwargs: Any):
All these builders should be moved to the llmfoundry builders file.
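Something along these lines, i.e. the scripts would import from a shared builders module instead of defining their own; the algorithm names handled here are illustrative, not the full set:

```python
# Sketch of a shared builder living in llmfoundry's builders module.
from typing import Any

from composer import algorithms


def build_algorithm(name: str, kwargs: Any):
    if name == 'gradient_clipping':
        return algorithms.GradientClipping(**kwargs)
    elif name == 'alibi':
        return algorithms.Alibi(**kwargs)
    elif name == 'gated_linear_units':
        return algorithms.GatedLinearUnits(**kwargs)
    else:
        raise ValueError(f'Not sure how to build algorithm: {name}')
```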
@@ -0,0 +1,110 @@
# Note that some of the fields in this template haven't been filled in yet.
We should adapt the YAML structure to match the LLM structure. The difference I notice in this one is model_config instead of config_overrides.
# Mosaic BERT 'base' generally uses the default architecture values for from the Hugging Face BertConfig object
# Note: if using the pretrained_checkpoint argument to create a model from an existing checkpoint, make sure
# the model_config settings match the architecture of the existing model
model_config:
same
@cojennin should we close this?
Closing as this is not being actively worked on.
This PR migrates MosaicBERT (blogpost, workshop paper) from mosaicml/examples --> mosaicml/llmfoundry. The goal of the PR is to replicate the functionality of MosaicBERT pretraining and finetuning with minimal changes to the MosaicBERT code.

This addition is significant, as it expands the ambit of the llmfoundry repo beyond autoregressive models and the MPT architecture. This also lays the groundwork for BERT-like embedding models.

In the original mosaicml/examples repo, you can pretrain both BERT and MosaicBERT. You can also finetune Composer checkpoints of BERT and MosaicBERT on the GLUE benchmark. This port maintains these capabilities.

File Structure

Here's how the files have roughly been copied over:
Pretraining
Both HF BERT and MosaicBERT pretraining can be run with the main training script.
Within the llmfoundry/models folder, we added hf_bert.py to the hf subfolder and additionally created a new folder, mosaicbert, with the files configuration_mosaicbert.py and modeling_mosaicbert.py.

We also added ComposerMosaicBertForMaskedLM, ComposerMosaicBertForSequenceClassification, ComposerHFBertForMaskedLM, and ComposerHFBertForSequenceClassification to model_registry.py.
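Roughly, the registry entries and the lookup used by the finetuning script amount to the following; the classification key names, the dictionary name, and the import paths are assumptions, while the class names and the masked-LM keys are taken from this summary:

```python
# Sketch only: import paths, the dict name, and the *_classification keys are guesses.
from llmfoundry.models.hf.hf_bert import (ComposerHFBertForMaskedLM,
                                           ComposerHFBertForSequenceClassification)
from llmfoundry.models.mosaicbert.modeling_mosaicbert import (
    ComposerMosaicBertForMaskedLM, ComposerMosaicBertForSequenceClassification)

COMPOSER_MODEL_REGISTRY = {
    'mosaicbert_masked_lm': ComposerMosaicBertForMaskedLM,
    'mosaicbert_classification': ComposerMosaicBertForSequenceClassification,
    'hf_bert_masked_lm': ComposerHFBertForMaskedLM,
    'hf_bert_classification': ComposerHFBertForSequenceClassification,
}


def build_composer_model(model_cfg, tokenizer):
    """Map the model name in the YAML onto its Composer wrapper class."""
    if model_cfg.name not in COMPOSER_MODEL_REGISTRY:
        raise ValueError(f'Not sure how to build model with name={model_cfg.name}')
    return COMPOSER_MODEL_REGISTRY[model_cfg.name](model_cfg, tokenizer)
```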
Note that we also changed ComposerBertForMaskedLM to ComposerMosaicBertForMaskedLM to make it clear exactly when we use MosaicBERT versus the classic BERT.

- We changed model_config to resolved_om_model_config.
- LanguageCrossEntropy(ignore_index=-100, vocab_size=model.config.vocab_size) was changed to LanguageCrossEntropy(ignore_index=-100) for updated Composer compatibility.
- mosaicbert_layers.py is kept in the layers folder instead of the models/mosaicbert folder. The mosaicbert_layers are modified versions of the original transformers BERT layers, and this structure is slightly different from how we have designed the layers for llm-foundry. In order to change as few files as possible, we have kept mosaicbert_layers.py the same.

Finetuning
In the scripts folder, we specify the files necessary for running GLUE finetuning in train/finetune_bert_glue/. Instead of creating the function create_hf_bert_classification(), as in the mosaicml/examples repo, we simply define build_composer_model(model_cfg, tokenizer) to match the model name (e.g. mosaicbert_masked_lm, hf_bert_masked_lm, etc.) to the model registry. A few other small changes were made to glue.py.

Experiments
- HuggingFace BERT pretraining (wandb and wandb)
- MosaicBERT pretraining (wandb)
- HuggingFace BERT finetuning on 1 GPU (wandb)
- HuggingFace BERT finetuning on 8 GPUs (wandb)
- MosaicBERT finetuning on 1 GPU (wandb)
- MosaicBERT finetuning on 8 GPUs (wandb)
TO DO

- test_glue.py, test_classification.py, test_main.py
Nice to Have

Potential To Dos once this PR is approved:

- ComposerMosaicBertForMaskedLM(HuggingFaceModel) could be consolidated, because now it inherits from HuggingFaceModel.
Other changes:

- pretrained_checkpoint = resolved_om_model_config.get('pretrained_checkpoint') was removed from hf_bert.py.