Fix transformers vocab not being saved to files #4023
base: main
Conversation
@@ -28,7 +28,7 @@ class PretrainedTransformerIndexer(TokenIndexer):

    model_name : `str`
        The name of the `transformers` model to use.
-   namespace : `str`, optional (default=`tags`)
+   namespace : `str`, optional (default=`from_transformers`)
        We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace.
        We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
Suggested change:
-   We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
+   We use `from_transformers` as the default namespace, which gets special treatment in our
+   `Vocabulary` class to not get padding or UNK tokens.
Actually, how are UNK and padding handled here? Is it possible to set them using the transformer? We probably do want UNK handling and padding tokens, we just need them set in the right way, both here and during loading. This might need some additional changes to how we save and load vocabularies.

And, while we're at it, we probably want the padding and unk tokens to be namespace-dependent.
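For reference, the HuggingFace tokenizer does expose its padding and UNK tokens directly, so in principle they could be read from the transformer at the point where the namespace is populated. A minimal sketch using the `transformers` library (not code from this PR; the model name is just an example):

```python
from transformers import AutoTokenizer

# Load the same pretrained tokenizer the indexer would wrap.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The special tokens are exposed as attributes, e.g. "[PAD]" / "[UNK]" for BERT.
print(tokenizer.pad_token, tokenizer.unk_token)

# The full token -> id mapping that would be copied into the vocabulary namespace.
encoding_dictionary = tokenizer.get_vocab()
print(len(encoding_dictionary))  # 30522 for bert-base-uncased
```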
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    resave_to_files: bool = False,
This is getting to be a bigger change, but it's probably necessary to really fix this the right way. But I think this function should take arguments for setting padding and oov tokens, and those should set them in a namespace-specific way. And the padding and oov tokens then would need to be serialized with the vocabulary and read from the serialized files. Then we wouldn't need the `from_transformers` namespace added to the non-padded namespaces.

I would wait to do any work on this, though, until @dirkgr has had a chance to glance at it and see if he agrees. I think this is the right approach, but I could be missing something.
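For concreteness, a rough sketch of the kind of signature being suggested here; the method and argument names are hypothetical, not taken from this PR or from AllenNLP's current `Vocabulary`:

```python
from typing import Dict, Optional

def add_transformer_vocab(  # hypothetical name; intended as a method on Vocabulary
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    padding_token: Optional[str] = None,  # e.g. tokenizer.pad_token
    oov_token: Optional[str] = None,      # e.g. tokenizer.unk_token
) -> None:
    """
    Copy the transformer's token -> id mapping into `namespace` and record the
    namespace-specific padding/OOV tokens, so that they can be serialized with
    the vocabulary and restored on load, rather than relying on a special
    non-padded `from_transformers` namespace.
    """
    ...
```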
It's a minimal fix, but I am very unhappy about the amount of magic going on. The way we do vocabularies already involves an untenable amount of magic, and this adds more. Saving state like this, especially the name of a file that gets overwritten, is just asking for trouble the next time this is used in a scenario we didn't think of. I sketched an alternative here: #4034. I'll add some discussion over there. It is not as minimal as this is, but it has some other advantages.
Fixes #3097. Here we add a method to the vocab that populates one of its namespaces from the transformers encoding (which is fetched in the indexer), with an option to re-save the vocab to files (which is the main problem in #3097). I reserved another `non_padded_namespace` family, `*from_transformers`, to use by default for transformers instead of `tags`.

Not saving the transformers vocab to disk prevents `from_archive` model loading (because the vocab folder can be empty), with all the implications. I know this fix is not ideal, but I think it is an improvement over what we have currently. We can also open a separate issue dedicated to figuring out the transformers <-> allennlp relationship.
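To make the shape of the change concrete, here is a rough sketch of the kind of `Vocabulary` method described above. Only the argument list comes from the diff in this PR; the method name, the `_serialization_dir` attribute, and the file handling are illustrative assumptions:

```python
from typing import Dict

def add_transformer_vocab(  # hypothetical name; intended as a method on Vocabulary
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    resave_to_files: bool = False,
) -> None:
    # Copy the transformer's token -> id mapping into the (non-padded) namespace.
    for token, index in encoding_dictionary.items():
        self._token_to_index[namespace][token] = index
        self._index_to_token[namespace][index] = token
    # Optionally re-write the serialized vocabulary files, so that `from_archive`
    # loading sees this namespace instead of an empty vocabulary directory.
    # `_serialization_dir` stands in for wherever the vocabulary was last saved.
    if resave_to_files and getattr(self, "_serialization_dir", None):
        self.save_to_files(self._serialization_dir)
```

The indexer side would then fetch the mapping with something like `tokenizer.get_vocab()` and call this method the first time it sees the vocabulary.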