This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Get a Vocabulary object from the reader #4034

Closed
wants to merge 2 commits into main from VocabFromReader

Conversation

dirkgr
Member

@dirkgr dirkgr commented Apr 6, 2020

This expands on the idea of getting the vocabulary from the reader, while being compatible with the current automatic discovery of the vocab.

I think @matt-gardner's comments in #4023 (review) still stand and would need to be addressed here as well.

@matt-gardner, you said in #3097 (comment) that you think vocab and reader should not be coupled. Could you expand on that? In my understanding, the reader would be the best place to couple this, since the reader presumably knows the vocabulary it draws from. This would be true for closed vocab sets such as POS tags or BIO tags, and it's also true in the case of a vocabulary drawn from a transformer.

In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.

@maksym-del, I stole some of your code from #4023. Thank you!
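To make the shape of the proposal concrete, here is a minimal sketch (the class and method names below are illustrative stand-ins, not the actual AllenNLP API): readers default to returning `None`, which keeps today's automatic discovery, while a reader that already knows its vocabulary can return it directly and skip the pass over the data.

```python
from typing import Dict, Iterable, Optional


class Vocabulary:
    """Tiny stand-in for allennlp.data.Vocabulary, for illustration only."""

    def __init__(self) -> None:
        self.namespaces: Dict[str, Dict[str, int]] = {}

    def add_tokens_to_namespace(self, tokens: Iterable[str], namespace: str) -> None:
        ns = self.namespaces.setdefault(namespace, {})
        for token in tokens:
            ns.setdefault(token, len(ns))


class DatasetReader:
    def get_vocabulary(self) -> Optional[Vocabulary]:
        # Default: None means "discover the vocabulary from the instances",
        # i.e. the current behavior is preserved.
        return None


class ClosedSetReader(DatasetReader):
    """A reader that already knows the full vocabulary it draws from."""

    def get_vocabulary(self) -> Optional[Vocabulary]:
        vocab = Vocabulary()
        vocab.add_tokens_to_namespace(["O", "B-PER", "I-PER"], "tags")
        return vocab


# A reader that returns None falls back to automatic discovery;
# one that returns a Vocabulary lets training skip that pass.
assert DatasetReader().get_vocabulary() is None
assert ClosedSetReader().get_vocabulary().namespaces["tags"]["I-PER"] == 2
```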

@MaksymDel
Contributor

MaksymDel commented Apr 7, 2020

Conceptually, this makes sense to me.

Right now there is already a unidirectional link between DatasetReader and Vocabulary: the reader produces instances that are then consumed by the vocab.

In the particular case of HF's transformers, the reader produces one more piece of information for the vocab: a token<->id mapping. The vocab consumes it the same way it consumes instances, and uses it to instantiate itself. Curious what @matt-gardner thinks of this.
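A hedged sketch of that consumption step (the dict literal below stands in for a real tokenizer; with the `transformers` library installed, `tokenizer.get_vocab()` returns exactly this kind of token-to-id mapping, and `vocab_from_mapping` is a hypothetical helper, not an existing AllenNLP function):

```python
# Stand-in for the mapping transformers' tokenizer.get_vocab() returns:
# token -> integer id, with ids fixed by the pretrained model, so the
# vocab must preserve them rather than re-derive them from counts.
token_to_id = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4}


def vocab_from_mapping(mapping, namespace="tokens"):
    """Instantiate namespace dicts directly from a token<->id mapping."""
    return {
        namespace: {
            "token_to_index": dict(mapping),
            "index_to_token": {i: t for t, i in mapping.items()},
        }
    }


vocab = vocab_from_mapping(token_to_id)
assert vocab["tokens"]["token_to_index"]["hello"] == 4
assert vocab["tokens"]["index_to_token"][0] == "[PAD]"
```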


> In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.

@dirkgr how would you know which labels to use then? In my understanding, you still need to make a whole run through the data to build the labels vocabulary.

@dirkgr
Member Author

dirkgr commented Apr 7, 2020

> In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.

> @dirkgr how would you know which labels to use then? In my understanding, you still need to make a whole run through the data to build the labels vocabulary.

That's what this code does as it's written. But usually, the reader already knows the full vocabulary for tags, because it's a small, closed set. So it could just put that into the Vocabulary returned from get_vocabulary(), and then we don't have to run through the data before training. We just need to add a way of telling the training code that the vocabulary is already complete and there is nothing to be done.
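A sketch of that idea (hypothetical names throughout; in particular, the `is_complete` flag below is the "nothing to be done" signal the comment describes, not an existing AllenNLP attribute):

```python
class TagVocabulary:
    """Illustrative stand-in: a vocabulary the reader hands over fully
    formed, so training can skip the discovery pass entirely."""

    def __init__(self, tags):
        # Closed tag set: indices are assigned once, in a known order.
        self.tag_to_index = {tag: i for i, tag in enumerate(tags)}
        # Hypothetical flag telling the training code that the vocabulary
        # is already complete and there is nothing left to discover.
        self.is_complete = True


BIO_TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
vocab = TagVocabulary(BIO_TAGS)
assert vocab.is_complete
assert vocab.tag_to_index["B-LOC"] == 3
```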

@@ -229,6 +229,12 @@ def _read(self, file_path: str) -> Iterable[Instance]:
"""
raise NotImplementedError

def get_vocabulary(self) -> Optional[Vocabulary]:
"""Returns the vocabulary used in the created instances. By default, this
returns `None`, which causes the vocabulary to be automatically discovered
Contributor

@matt-gardner matt-gardner commented Apr 7, 2020


This isn't actually the behavior of the current code in train.py. But we can assume that we fix that to match this description. What I would like to wrap my head around is what happens in the following cases:

  1. A glove vocabulary
  2. A BERT QA model (no tag vocab needed)
  3. A glove tagging model
  4. A BERT tagging model

What do you imagine a person specifying, and where, and how does vocabulary creation happen? Putting this method here is less scary than I originally thought, but I'm still somewhat skeptical that we really gain what you're hoping. Outlining what code / parameters a person has to specify in the above cases would help clarify that.

@dirkgr
Member Author

1 (three alternative implementations):

def get_vocabulary(self):
    # if we write a utility function
    return Vocabulary.from_glove(d=100)

def get_vocabulary(self):
    # if we don't write a utility function
    vocabulary = Vocabulary.empty()
    with open(cached_path("http://.../vocab.txt")) as f:
        vocabulary.add_tokens_to_namespace([t.strip() for t in f], "tokens")
    return vocabulary

def get_vocabulary(self):
    # or we could write a utility function for loading from txt files
    return Vocabulary.from_txt("http://.../vocab.txt")

This is assuming that Glove ships with a vocab.txt where each line contains a vocab item, and the order matches the order of the vectors.

2:

def get_vocabulary(self):
    return Vocabulary.from_transformers(self.transformer_model)

3:

def get_vocabulary(self):
    vocabulary = Vocabulary.from_glove(d=100)
    vocabulary.add_tokens_to_namespace(self.possible_tags, "tags")
    return vocabulary

4:

def get_vocabulary(self):
    vocabulary = Vocabulary.from_transformers(self.transformer_model)
    vocabulary.add_tokens_to_namespace(self.possible_tags, "tags")
    return vocabulary

How does the current mechanism make sure that in the case of Glove, we get the right indices for the right tokens?

@matt-gardner
Contributor

For case 1, we typically don't load the whole glove vocabulary, because that's massive. We just load the embeddings for the tokens that we've seen, saving a ton of space. You can't do that without looking through your instances and computing a vocabulary.

For cases 3 and 4, if you're saying that the reader has to hard-code a list of tags / labels, this seems problematic. It means that I can't have a generic "classification csv file" dataset reader, because the reader would have to know the label set.
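The pruning in case 1 can be sketched like this (toy data in place of a real GloVe file; the real loader reads vectors from disk, but the selection logic is the same):

```python
# Toy "GloVe" table standing in for a real embeddings file: token -> vector.
glove = {
    "the": [0.1, 0.2],
    "cat": [0.3, 0.4],
    "sat": [0.5, 0.6],
    "aardvark": [0.7, 0.8],  # never appears in our data
}

# Tokens actually observed while iterating over the instances.
instances = [["the", "cat", "sat"], ["the", "cat"]]
seen = {token for instance in instances for token in instance}

# Load embeddings only for tokens we have seen, instead of the full table.
# This saving is exactly what requires a pass over the instances first.
pruned = {token: vec for token, vec in glove.items() if token in seen}

assert set(pruned) == {"the", "cat", "sat"}
assert "aardvark" not in pruned
```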

@bryant1410
Contributor

bryant1410 commented Apr 22, 2020

Do you think this can also solve #3960 (comment)? In that case, BucketBatchSampler needs a Vocabulary but only has a Dataset that doesn't necessarily have one:

self.vocab = data_source.vocab

@dirkgr
Member Author

dirkgr commented May 13, 2020

Keeping some notes: @roys174 is working on a project where he has to control which index gets assigned to which label. Returning a Vocabulary from the reader should allow us to do that.

@matt-gardner
Contributor

matt-gardner commented May 13, 2020

You can already do that by specifying the vocab manually. If you only want it for some namespaces, you can do that and extend the vocab.

There still could be benefits to having this specified in code rather than in a separate vocab file; I'm just saying it's not necessary, as we already handle this case.
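In outline, the workflow being described looks like this (a sketch of the mechanism, not the exact AllenNLP file format): write the label namespace out by hand so its indices are fixed by line order, then let discovery extend the vocab with the remaining namespaces.

```python
# Hand-written label file: the line number fixes the index of each label.
labels_txt = "contradiction\nentailment\nneutral\n"

# The manually specified namespace, with indices fixed by line order.
label_to_index = {label: i for i, label in enumerate(labels_txt.splitlines())}

# Namespaces discovered from the data extend the vocab without touching
# the hand-specified one.
vocab = {"labels": label_to_index}
vocab.setdefault("tokens", {}).update({"the": 0, "cat": 1})

assert vocab["labels"]["entailment"] == 1
assert vocab["tokens"]["cat"] == 1
```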

@dirkgr
Member Author

dirkgr commented May 13, 2020

So you would write a custom vocab.txt and specify it in the config?

@matt-gardner
Contributor

Yes, if I wanted to solve this problem with the current code, that's what I'd do.

@schmmd schmmd changed the base branch from master to main December 23, 2020 18:48
@dirkgr
Member Author

dirkgr commented Jul 13, 2021

We're handling this in a different way in Tango.

@dirkgr dirkgr closed this Jul 13, 2021
@dirkgr dirkgr deleted the VocabFromReader branch July 13, 2021 18:32