How to create a trie label? #3
Dear Knowledgator:

As I was reading the blog, I noticed that you say "We can represent possible generation outputs as a trie data structure, where the node is a token" and "We represent labels as a tree of tokens".

Is it possible that our labels are already a trie, and your model can select the correct label from the trie-like labels? For example:

{'emotion': ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],
 'attitude': ['positive', 'negative', 'neutral']}

Based on the content, we would say the label should be "emotion" - "love".
Hello, thank you for your interest in our project. You are talking about representing your labels as a tree-like structure at a high level, while we represent labels at the token level to help the model select the right tokens, i.e. those that belong to our label space, during generation. To combine both approaches, one option is to transform your labels into a textual format such as "emotion -> sadness" or "attitude-neutral" and then initialize the labels trie of our classifier object with them.
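For example, flattening a nested label dictionary into that textual format could look like the following minimal sketch (the `flatten_labels` helper and the separator are illustrative, not part of our library):

```python
def flatten_labels(nested, sep=" -> "):
    """Flatten {'emotion': ['sadness', ...]} into ['emotion -> sadness', ...]."""
    return [
        f"{parent}{sep}{child}"
        for parent, children in nested.items()
        for child in children
    ]

labels = flatten_labels({
    'emotion': ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],
    'attitude': ['positive', 'negative', 'neutral'],
})
print(labels[:2])  # ['emotion -> sadness', 'emotion -> joy']
```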
In fact, I did several tests.

```
labels = ['emotion-sadness', ...]
tokenized_labels = [[0, 13868, 18, 7, 9, 26, 655, 1], ...]
```

This is a bad example for the tree-like structure at a high level, because the trie created does not capture the right words and the right structure (see the tokenizer sketch after this comment). Here are two formats that I think work well for the tree-like structure at a high level:

```
labels = ['emotion sadness', ...]
tokenized_labels = [[0, 13868, 24784, 1], ...]

labels = ['emotion -> sadness', ...]
tokenized_labels = [[0, 13868, 3, 13114, 24784, 1], ...]
```

However, the different tree-like structures have different performance. With 'emotion sadness' etc.:

```
           precision  recall  f1-score  support
macro avg     0.4448  0.4126    0.3223     2000
```

With 'emotion -> sadness' etc.:

```
                    precision  recall  f1-score  support
emotion -> sadness    0.4155  0.8296    0.5537      581
```

The accuracy drops a lot compared with the example provided in https://blog.knowledgator.com/how-to-classify-text-into-millions-of-classes-68aee1de3802.
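For reference, the splits above can be reproduced with the T5 tokenizer (a sketch assuming the `t5-base` checkpoint; the leading 0 in my lists is presumably the prepended decoder start token, and the trailing 1 is `</s>`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

for label in ["emotion-sadness", "emotion sadness", "emotion -> sadness"]:
    ids = tokenizer(label).input_ids  # ends with the </s> token (id 1)
    print(f"{label!r}: {ids} -> {tokenizer.convert_ids_to_tokens(ids)}")
```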
It is amazing that you have done this experiment! It looks like the first way of writing labels fails because of tokenization, which does not capture the right words and just splits the text into less meaningful tokens. The last two examples perform better, and I think playing with prompts can increase the results further. Also, such formats can be unusual for a model, so it is recommended to additionally fine-tune the model on datasets that represent labels in this way. With some time, it is easy to artificially transform existing datasets into such a label format, as sketched below.
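A rough sketch of that transformation, assuming the `dair-ai/emotion` dataset from the Hugging Face hub (the dataset name and the `target` column are illustrative):

```python
from datasets import load_dataset

dataset = load_dataset("dair-ai/emotion", split="train")
names = dataset.features["label"].names  # ['sadness', 'joy', 'love', ...]

# Rewrite the integer class label as a hierarchical textual target
# suitable for fine-tuning a text-to-text model.
dataset = dataset.map(lambda x: {"target": f"emotion -> {names[x['label']]}"})
print(dataset[0]["target"])  # e.g. 'emotion -> sadness'
```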
For the wiki article title prediction task, I think you just used zero-shot with T5, right? I am still wondering why the unlimited classifier performs so well on the wiki article title prediction task (around 0.9 accuracy) while it only reaches 0.65 accuracy on the emotion classification task.
Yes, we used zero-shot with the T5 model for the wiki article title prediction. Emotion classification tasks are more subjective, so they are more difficult for zero-shot models. If you revisit the emotion dataset, it is sometimes challenging to unambiguously assign a text to one of the classes. Regarding wiki article title prediction, even though the task is presented as classification over 6 million labels, it is simpler for a model because the title is often the most frequent entity in the text. However, when we prompted T5 to generate a title without setting generation constraints, we got poor results. So, if a model can understand the semantics of the article, limiting the possible generation outputs to true wiki article titles can help it select the right title even from a large label space.
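As a rough illustration of that idea (a minimal sketch, not our classifier's actual implementation; the checkpoint, prompt, and labels are illustrative), generation can be restricted to a label trie with transformers' `prefix_allowed_tokens_fn`:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

labels = ["emotion -> sadness", "emotion -> joy", "emotion -> love"]

# Build a trie over the label token sequences. Each path is prefixed with
# the decoder start token (id 0 for T5) so it lines up with generated prefixes.
trie = {}
for label in labels:
    node = trie
    for tok in [model.config.decoder_start_token_id] + tokenizer(label).input_ids:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, prefix_ids):
    # Walk the trie along the tokens generated so far; the children of the
    # reached node are the only tokens allowed at the next step.
    node = trie
    for tok in prefix_ids.tolist():
        node = node.get(tok, {})
    return list(node.keys()) or [tokenizer.eos_token_id]

inputs = tokenizer("classify: I miss my old friends so much", return_tensors="pt")
out = model.generate(**inputs, prefix_allowed_tokens_fn=allowed_tokens, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # e.g. 'emotion -> sadness'
```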