Get a Vocabulary object from the reader
#4034
Conversation
Looking conceptually, it makes sense to me. Right now there is already a unidirectional link between In the particular case of
@dirkgr how would you know which labels to use then? In my understanding, you still need to make a whole run through the data to build the labels vocabulary.
That's what this code does as it's written. But usually, the reader already knows the full vocabulary for tags, because it's a small, closed set. So it could just put that into the `Vocabulary`.
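For concreteness, a minimal sketch of what that could look like for a tagging reader with a closed tag set. The reader class and the BIO tag list are made up for illustration, and `get_vocabulary` is the method this PR proposes:

from typing import Optional

from allennlp.data import DatasetReader, Vocabulary


class BioTaggingReader(DatasetReader):
    # Illustrative only; `_read` and `text_to_instance` are omitted.
    POSSIBLE_TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

    def get_vocabulary(self) -> Optional[Vocabulary]:
        # The tag namespace is fully known up front, so no pass over the
        # data is needed to enumerate it.
        vocab = Vocabulary()
        vocab.add_tokens_to_namespace(self.POSSIBLE_TAGS, "labels")
        return vocab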
@@ -229,6 +229,12 @@ def _read(self, file_path: str) -> Iterable[Instance]:
        """
        raise NotImplementedError

    def get_vocabulary(self) -> Optional[Vocabulary]:
        """Returns the vocabulary used in the created instances. By default, this
        returns `None`, which causes the vocabulary to be automatically discovered
This isn't actually the behavior of the current code in train.py. But we can assume that we fix that to match this description. What I would like to wrap my head around is what happens in the following cases:
- A glove vocabulary
- A BERT QA model (no tag vocab needed)
- A glove tagging model
- A BERT tagging model
What do you imagine a person specifying, and where, and how does vocabulary creation happen? Putting this method here is less scary than I originally thought, but I'm still somewhat skeptical that we really gain what you're hoping. Outlining what code / parameters a person has to specify in each of the above cases would help clarify that.
1:
def get_vocabulary(self):
    # if we write a utility function
    return Vocabulary.from_glove(d=100)

    # if we don't write a utility function
    vocabulary = Vocabulary.empty()
    with open(cached_path("http://.../vocab.txt")) as f:
        vocabulary.add_tokens_to_namespace([t.strip() for t in f], "tokens")
    return vocabulary

    # or we could write a utility function for loading from txt files
    return Vocabulary.from_txt("http://.../vocab.txt")
This is assuming that Glove ships with a vocab.txt where each line contains a vocab item, and the order matches the order of the vectors.
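If we did write that helper, it might look roughly like this (purely hypothetical; nothing like it exists in the library today, and a real version would also have to account for the padding/OOV entries that `Vocabulary` adds so the indices actually line up with the vectors):

from allennlp.common.file_utils import cached_path
from allennlp.data import Vocabulary


def vocabulary_from_txt(path: str, namespace: str = "tokens") -> Vocabulary:
    # Hypothetical helper: one vocabulary item per line, in the same order
    # as the pretrained vectors.
    vocabulary = Vocabulary()
    with open(cached_path(path)) as f:
        vocabulary.add_tokens_to_namespace([line.strip() for line in f], namespace)
    return vocabulary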
2:
def get_vocabulary(self):
    return Vocabulary.from_transformers(self.transformer_model)
3:
def get_vocabulary(self):
    vocabulary = Vocabulary.from_glove(d=100)
    vocabulary.add_tokens_to_namespace(self.possible_tags, "tags")
    return vocabulary
4:
def get_vocabulary(self):
    vocabulary = Vocabulary.from_transformers(self.transformer_model)
    vocabulary.add_tokens_to_namespace(self.possible_tags, "tags")
    return vocabulary
How does the current mechanism make sure that in the case of Glove, we get the right indices for the right tokens?
For case 1, we typically don't load the whole glove vocabulary, because that's massive. We just load the embeddings for the tokens that we've seen, saving a ton of space. You can't do that without looking through your instances and computing a vocabulary.
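Schematically, that looks something like the sketch below (not AllenNLP's actual loading code; the function name and file path are placeholders). The vocabulary is computed from the instances first, and only the matching rows of the pretrained file are kept:

import torch

from allennlp.data import Vocabulary


def load_seen_glove_rows(vocab: Vocabulary, glove_path: str, dim: int = 100) -> torch.Tensor:
    # Keep only the embedding rows for tokens that actually appear in the
    # vocabulary built from the instances; everything else stays randomly
    # initialized. This is why case 1 needs a pass over the data.
    token_to_index = vocab.get_token_to_index_vocabulary("tokens")
    weight = torch.randn(len(token_to_index), dim) * 0.1
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            if token in token_to_index:
                weight[token_to_index[token]] = torch.tensor([float(v) for v in values])
    return weight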
For cases 3 and 4, if you're saying that the reader has to hard-code a list of tags / labels, this seems problematic. It means that I can't have a generic "classification csv file" dataset reader, because the reader would have to know the label set.
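For example, a generic classification reader along these lines has no way of knowing the label set ahead of time (illustrative sketch, not an existing reader):

import csv
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField


class ClassificationCsvReader(DatasetReader):
    # Illustrative only: reads "text,label" rows. The label set is whatever
    # happens to appear in the file, so the reader cannot enumerate it up
    # front, and a complete vocabulary still requires a pass over the data.
    def __init__(self, tokenizer, token_indexers, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer
        self.token_indexers = token_indexers

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path) as f:
            for text, label in csv.reader(f):
                yield Instance({
                    "text": TextField(self.tokenizer.tokenize(text), self.token_indexers),
                    "label": LabelField(label),
                })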
Do you think this can also solve this #3960 (comment)? In which
Keeping some notes: @roys174 is working on a project where he has to control which index gets assigned to which label. Returning a `Vocabulary` from the reader would give him a way to do that in code.
You can already do that by specifying the vocab manually. If you only want it for some namespaces, you can do that and extend the vocab. There still could be benefits to having this specified in code rather than a separate vocab file, just saying that it's not necessary, as we already handle this case.
So you would write a custom vocab.txt and specify it in the config?
Yes, if I wanted to solve this problem with current code, that's what I'd do.
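For reference, the manual route with current code looks roughly like this (sketch only; a serialized vocabulary directory has one <namespace>.txt per namespace with one token per line in index order, plus non_padded_namespaces.txt, though the exact files can vary by version):

import os
import tempfile

from allennlp.data import Vocabulary

# Write a vocabulary directory by hand so the label indices are fixed.
vocab_dir = tempfile.mkdtemp()
with open(os.path.join(vocab_dir, "labels.txt"), "w") as f:
    f.write("negative\nneutral\npositive\n")  # line order determines the indices
with open(os.path.join(vocab_dir, "non_padded_namespaces.txt"), "w") as f:
    f.write("labels\n")  # no padding/OOV entries in this namespace

vocab = Vocabulary.from_files(vocab_dir)
assert vocab.get_token_index("neutral", "labels") == 1

A training config would then point its vocabulary section at that directory (and optionally extend it for the remaining namespaces) instead of rebuilding everything from the data.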
We're handling this in a different way in Tango.
This expands on the idea of getting the vocabulary from the reader, while being compatible with the current automatic discovery of the vocab.
I think @matt-gardner's comments in #4023 (review) still stand and would need to be addressed here as well.
@matt-gardner, you said in #3097 (comment) that you think vocab and reader should not be coupled. Could you expand on that? In my understanding, the reader would be the best place to couple this, since the reader presumably knows the vocabulary it draws from. This would be true for closed vocab sets such as POS tags or BIO tags, and it's also true in the case of a vocabulary drawn from a transformer.
In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.
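Concretely, the saving could look something like this on the training side (schematic; how train.py actually wires this up is not settled here):

from typing import Iterable

from allennlp.data import DatasetReader, Instance, Vocabulary


def vocabulary_for_training(reader: DatasetReader, train_path: str) -> Vocabulary:
    # Prefer the vocabulary the reader already knows about (the method this
    # PR adds); fall back to a full pass over the data only if it returns None.
    vocab = reader.get_vocabulary()
    if vocab is None:
        instances: Iterable[Instance] = reader.read(train_path)
        vocab = Vocabulary.from_instances(instances)
    return vocab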
@maksym-del, I stole some of your code from #4023. Thank you!