This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Get a Vocabulary object from the reader #4034

Closed
wants to merge 2 commits into main from VocabFromReader

Conversation

dirkgr
Member

@dirkgr dirkgr commented Apr 6, 2020

This expands on the idea of getting the vocabulary from the reader, while being compatible with the current automatic discovery of the vocab.

I think @matt-gardner's comments in #4023 (review) still stand and would need to be addressed here as well.

@matt-gardner, you said in #3097 (comment) that you think vocab and reader should not be coupled. Could you expand on that? In my understanding, the reader would be the best place to couple this, since the reader presumably knows the vocabulary it draws from. This would be true for closed vocab sets such as POS tags or BIO tags, and it's also true in the case of a vocabulary drawn from a transformer.

In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.

@maksym-del, I stole some of your code from #4023. Thank you!
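To make the shape of the proposal concrete, here is a minimal sketch (the class and method names below are illustrative stand-ins, not the actual AllenNLP API): readers default to returning `None`, which keeps today's automatic discovery, while a reader that already knows its vocabulary can return it directly and skip the pass over the data.

```python
from typing import Dict, Iterable, Optional


class Vocabulary:
    """Tiny stand-in for allennlp.data.Vocabulary, for illustration only."""

    def __init__(self) -> None:
        self.namespaces: Dict[str, Dict[str, int]] = {}

    def add_tokens_to_namespace(self, tokens: Iterable[str], namespace: str) -> None:
        ns = self.namespaces.setdefault(namespace, {})
        for token in tokens:
            ns.setdefault(token, len(ns))


class DatasetReader:
    def get_vocabulary(self) -> Optional[Vocabulary]:
        # Default: None means "discover the vocabulary from the instances",
        # i.e. the current behavior is preserved.
        return None


class ClosedSetReader(DatasetReader):
    """A reader that already knows the full vocabulary it draws from."""

    def get_vocabulary(self) -> Optional[Vocabulary]:
        vocab = Vocabulary()
        vocab.add_tokens_to_namespace(["O", "B-PER", "I-PER"], "tags")
        return vocab


# A reader that returns None falls back to automatic discovery;
# one that returns a Vocabulary lets training skip that pass.
assert DatasetReader().get_vocabulary() is None
assert ClosedSetReader().get_vocabulary().namespaces["tags"]["I-PER"] == 2
```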

@MaksymDel
Contributor

MaksymDel commented Apr 7, 2020

Conceptually, this makes sense to me.

Right now there is already a unidirectional link between DatasetReader and Vocabulary: the reader produces instances that are then consumed by the vocab.

In the particular case of HF's transformers, the reader produces one more piece of information for the vocab: a token<->id mapping. The vocab consumes it the same way it consumes instances, and uses it to instantiate itself. Curious what @matt-gardner thinks of this.
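A hedged sketch of that consumption step (the dict literal below stands in for a real tokenizer; with the `transformers` library installed, `tokenizer.get_vocab()` returns exactly this kind of token-to-id mapping, and `vocab_from_mapping` is a hypothetical helper, not an existing AllenNLP function):

```python
# Stand-in for the mapping transformers' tokenizer.get_vocab() returns:
# token -> integer id, with ids fixed by the pretrained model, so the
# vocab must preserve them rather than re-derive them from counts.
token_to_id = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4}


def vocab_from_mapping(mapping, namespace="tokens"):
    """Instantiate namespace dicts directly from a token<->id mapping."""
    return {
        namespace: {
            "token_to_index": dict(mapping),
            "index_to_token": {i: t for t, i in mapping.items()},
        }
    }


vocab = vocab_from_mapping(token_to_id)
assert vocab["tokens"]["token_to_index"]["hello"] == 4
assert vocab["tokens"]["index_to_token"][0] == "[PAD]"
```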


> In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.

@dirkgr how would you know which labels to use then? In my understanding, you still need to make a whole run through the data to build the labels vocabulary.

@dirkgr
Member Author

dirkgr commented Apr 7, 2020

> In fact, if those are the only cases encountered, we don't need to run vocab discovery at all, saving a whole run through the data. I believe this is a common case, especially on architectures that are all-in on transformers and their tokenizers.

> @dirkgr how would you know which labels to use then? In my understanding, you still need to make a whole run through the data to build the labels vocabulary.

That's what this code does as it's written. But usually, the reader already knows the full vocabulary for tags, because it's a small, closed set. So it could just put that into the Vocabulary returned from get_vocabulary(), and then we don't have to run through the data before training. We just need to add a way of telling the training code that the vocabulary is already complete and there is nothing to be done.
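A sketch of that idea (hypothetical names throughout; in particular, the `is_complete` flag below is the "nothing to be done" signal the comment describes, not an existing AllenNLP attribute):

```python
class TagVocabulary:
    """Illustrative stand-in: a vocabulary the reader hands over fully
    formed, so training can skip the discovery pass entirely."""

    def __init__(self, tags):
        # Closed tag set: indices are assigned once, in a known order.
        self.tag_to_index = {tag: i for i, tag in enumerate(tags)}
        # Hypothetical flag telling the training code that the vocabulary
        # is already complete and there is nothing left to discover.
        self.is_complete = True


BIO_TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
vocab = TagVocabulary(BIO_TAGS)
assert vocab.is_complete
assert vocab.tag_to_index["B-LOC"] == 3
```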

@@ -229,6 +229,12 @@ def _read(self, file_path: str) -> Iterable[Instance]:
"""
raise NotImplementedError

def get_vocabulary(self) -> Optional[Vocabulary]:
"""Returns the vocabulary used in the created instances. By default, this
returns `None`, which causes the vocabulary to be automatically discovered
Contributor

@matt-gardner matt-gardner commented Apr 7, 2020


This isn't actually the behavior of the current code in train.py. But we can assume that we fix that to match this description. What I would like to wrap my head around is what happens in the following cases:

  1. A glove vocabulary
  2. A BERT QA model (no tag vocab needed)
  3. A glove tagging model
  4. A BERT tagging model

What do you imagine a person specifying, and where, and how does vocabulary creation happen? Putting this method here is less scary than I originally thought, but I'm still somewhat skeptical that we really gain what you're hoping. Outlining what code / parameters a person has to specify in the above cases would help clarify that.

@dirkgr
Member Author

1 (three alternative implementations):

def get_vocabulary(self):
    # if we write a utility function
    return Vocabulary.from_glove(d=100)

def get_vocabulary(self):
    # if we don't write a utility function
    vocabulary = Vocabulary.empty()
    with open(cached_path("http://.../vocab.txt")) as f:
        vocabulary.add_tokens_to_namespace([t.strip() for t in f], "tokens")
    return vocabulary

def get_vocabulary(self):
    # or we could write a utility function for loading from txt files
    return Vocabulary.from_txt("http://.../vocab.txt")

This is assuming that Glove ships with a vocab.txt where each line contains a vocab item, and the order matches the order of the vectors.

2:

def get_vocabulary(self):
    return Vocabulary.from_transformers(self.transformer_model)

3:

def get_vocabulary(self):
    vocabulary = Vocabulary.from_glove(d=100)
    vocabulary.add_tokens_to_namespace(self.possible_tags, "tags")
    return vocabulary

4:

def get_vocabulary(self):
    vocabulary = Vocabulary.from_transformers(self.transformer_model)
    vocabulary.add_tokens_to_namespace(self.possible_tags, "tags")
    return vocabulary

How does the current mechanism make sure that in the case of Glove, we get the right indices for the right tokens?

@matt-gardner
Contributor

For case 1, we typically don't load the whole glove vocabulary, because that's massive. We just load the embeddings for the tokens that we've seen, saving a ton of space. You can't do that without looking through your instances and computing a vocabulary.

For cases 3 and 4, if you're saying that the reader has to hard-code a list of tags / labels, this seems problematic. It means that I can't have a generic "classification csv file" dataset reader, because the reader would have to know the label set.
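The pruning in case 1 can be sketched like this (toy data in place of a real GloVe file; the real loader reads vectors from disk, but the selection logic is the same):

```python
# Toy "GloVe" table standing in for a real embeddings file: token -> vector.
glove = {
    "the": [0.1, 0.2],
    "cat": [0.3, 0.4],
    "sat": [0.5, 0.6],
    "aardvark": [0.7, 0.8],  # never appears in our data
}

# Tokens actually observed while iterating over the instances.
instances = [["the", "cat", "sat"], ["the", "cat"]]
seen = {token for instance in instances for token in instance}

# Load embeddings only for tokens we have seen, instead of the full table.
# This saving is exactly what requires a pass over the instances first.
pruned = {token: vec for token, vec in glove.items() if token in seen}

assert set(pruned) == {"the", "cat", "sat"}
assert "aardvark" not in pruned
```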

@bryant1410
Contributor

bryant1410 commented Apr 22, 2020

Do you think this can also solve #3960 (comment)? In that case, BucketBatchSampler needs a Vocabulary but only has a Dataset that doesn't necessarily have one:

self.vocab = data_source.vocab

@dirkgr
Member Author

dirkgr commented May 13, 2020

Keeping some notes: @roys174 is working on a project where he has to control which index gets assigned to which label. Returning a Vocabulary from the reader should allow us to do that.

@matt-gardner
Contributor

matt-gardner commented May 13, 2020

You can already do that by specifying the vocab manually. If you only want it for some namespaces, you can do that and extend the vocab.

There still could be benefits to having this specified in code rather than in a separate vocab file; I'm just saying it's not necessary, as we already handle this case.
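In outline, the workflow being described looks like this (a sketch of the mechanism, not the exact AllenNLP file format): write the label namespace out by hand so its indices are fixed by line order, then let discovery extend the vocab with the remaining namespaces.

```python
# Hand-written label file: the line number fixes the index of each label.
labels_txt = "contradiction\nentailment\nneutral\n"

# The manually specified namespace, with indices fixed by line order.
label_to_index = {label: i for i, label in enumerate(labels_txt.splitlines())}

# Namespaces discovered from the data extend the vocab without touching
# the hand-specified one.
vocab = {"labels": label_to_index}
vocab.setdefault("tokens", {}).update({"the": 0, "cat": 1})

assert vocab["labels"]["entailment"] == 1
assert vocab["tokens"]["cat"] == 1
```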

@dirkgr
Member Author

dirkgr commented May 13, 2020

So you would write a custom vocab.txt and specify it in the config?

@matt-gardner
Contributor

Yes, if I wanted to solve this problem with the current code, that's what I'd do.

@schmmd schmmd changed the base branch from master to main December 23, 2020 18:48
@dirkgr
Member Author

dirkgr commented Jul 13, 2021

We're handling this in a different way in Tango.

@dirkgr dirkgr closed this Jul 13, 2021
@dirkgr dirkgr deleted the VocabFromReader branch July 13, 2021 18:32