
How to create a trie label? #3

Open
HuangruiChu opened this issue Apr 12, 2024 · 5 comments

Comments

@HuangruiChu

HuangruiChu commented Apr 12, 2024

Dear Knowledgator:

As I was reading the blog, I noticed that you say, "We can represent possible generation outputs as a trie data structure, where the node is a token," and also, "We represent labels as a tree of tokens."

Is it possible that our labels are already a trie and your model can select the correct label from the trie-like labels?

For example:

{'emotion': ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],
'attitude': ['positive', 'negative', 'neutral']}

Based on the content, we would say the label should be "emotion" -> "love".

@Ingvarstep
Contributor

Hello, thank you for your interest in our project.

You are talking about representing your labels as a tree-like structure at a high level, while we represent labels at the token level, to help the model select the right tokens from our label space during generation. To combine both approaches, one option is to transform your labels into a textual format such as "emotion -> sadness" or "attitude-neutral" and then initialize the trie of these labels with our classifier object.
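For instance, a minimal sketch (plain Python, not a helper from the library) of flattening the nested label dict into such "parent -> child" strings before building the trie:

```python
# Flatten a nested {parent: [children]} label dict into "parent -> child"
# strings, which can then be used to initialize the classifier's token trie.

labels = {
    "emotion": ["sadness", "joy", "love", "anger", "fear", "surprise"],
    "attitude": ["positive", "negative", "neutral"],
}

flat_labels = [
    f"{parent} -> {child}"
    for parent, children in labels.items()
    for child in children
]

print(flat_labels[0])   # emotion -> sadness
print(flat_labels[-1])  # attitude -> neutral
```

After prediction, splitting the returned string on " -> " recovers both the parent category and the child label.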

@HuangruiChu
Author

In fact, I ran several tests:

labels =
['emotion_sadness',
'emotion_joy',
'emotion_love',
'emotion_anger',
'emotion_fear',
'emotion_surprise',
'attitude_positive',
'attitude_negative',
'attitude_neutral']

tokenized_labels =
[[0, 13868, 834, 7, 9, 26, 655, 1],
[0, 13868, 834, 1927, 63, 1],
[0, 13868, 834, 5850, 15, 1],
[0, 13868, 834, 9, 9369, 1],
[0, 13868, 834, 89, 2741, 1],
[0, 13868, 834, 3042, 102, 7854, 1],
[0, 7525, 834, 26093, 1],
[0, 7525, 834, 31600, 1],
[0, 7525, 834, 8992, 8792, 1]]

['emotion-sadness',
'emotion-joy',
'emotion-love',
'emotion-anger',
'emotion-fear',
'emotion-surprise',
'attitude-positive',
'attitude-negative',
'attitude-neutral']

[[0, 13868, 18, 7, 9, 26, 655, 1],
[0, 13868, 18, 1927, 63, 1],
[0, 13868, 18, 5850, 15, 1],
[0, 13868, 18, 9, 9369, 1],
[0, 13868, 18, 89, 2741, 1],
[0, 13868, 18, 3042, 102, 7854, 1],
[0, 7525, 18, 26093, 1],
[0, 7525, 18, 31600, 1],
[0, 7525, 18, 8992, 8792, 1]]

These are bad examples for the tree-like structure at a high level, because the resulting trie does not capture the right words or the right structure. The emotion classification result drops sharply.

Here are two formats that I think work well for the tree-like structure at a high level:
['emotion sadness',
'emotion joy',
'emotion love',
'emotion anger',
'emotion fear',
'emotion surprise',
'attitude positive',
'attitude negative',
'attitude neutral']

[[0, 13868, 24784, 1],
[0, 13868, 3922, 1],
[0, 13868, 333, 1],
[0, 13868, 11213, 1],
[0, 13868, 2971, 1],
[0, 13868, 4158, 1],
[0, 7525, 1465, 1],
[0, 7525, 2841, 1],
[0, 7525, 7163, 1]]

['emotion -> sadness',
'emotion -> joy',
'emotion -> love',
'emotion -> anger',
'emotion -> fear',
'emotion -> surprise',
'attitude -> positive',
'attitude -> negative',
'attitude -> neutral']

[[0, 13868, 3, 13114, 24784, 1],
[0, 13868, 3, 13114, 3922, 1],
[0, 13868, 3, 13114, 333, 1],
[0, 13868, 3, 13114, 11213, 1],
[0, 13868, 3, 13114, 2971, 1],
[0, 13868, 3, 13114, 4158, 1],
[0, 7525, 3, 13114, 1465, 1],
[0, 7525, 3, 13114, 2841, 1],
[0, 7525, 3, 13114, 7163, 1]]
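To make the token-level view concrete, here is a small sketch (my own illustration, not the library's internals) that builds a trie from the "emotion sadness"-style token IDs above and queries which tokens are allowed next at each generation step:

```python
# Build a trie over tokenized labels; each node maps a token id to its subtree.

def build_trie(tokenized_labels):
    trie = {}
    for ids in tokenized_labels:
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
    return trie

def allowed_tokens(trie, prefix):
    # Tokens the model may generate after having produced `prefix` so far.
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return sorted(node)

# Token ids of the 'emotion sadness' style labels listed above.
tokenized_labels = [
    [0, 13868, 24784, 1], [0, 13868, 3922, 1], [0, 13868, 333, 1],
    [0, 13868, 11213, 1], [0, 13868, 2971, 1], [0, 13868, 4158, 1],
    [0, 7525, 1465, 1], [0, 7525, 2841, 1], [0, 7525, 7163, 1],
]

trie = build_trie(tokenized_labels)
print(allowed_tokens(trie, [0]))        # [7525, 13868] -- only 'attitude' or 'emotion'
print(allowed_tokens(trie, [0, 7525]))  # the three 'attitude' children
```

Because whole words map to single tokens in this format, the trie branches exactly at the parent/child boundary, which the underscore and hyphen variants fail to do.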

However, different high-level tree-like structures give different performance:

'emotion sadness' etc.
              precision    recall  f1-score   support

     sadness     0.6955    0.5542    0.6169       581
         joy     0.6667    0.0029    0.0057       695
        love     0.1325    0.9057    0.2311       159
       anger     0.5159    0.5309    0.5233       275
        fear     0.6585    0.4821    0.5567       224
    surprise     0.0000    0.0000    0.0000        66

    accuracy                         0.3610      2000
   macro avg     0.4448    0.4126    0.3223      2000
weighted avg     0.5889    0.3610    0.3339      2000

'emotion -> sadness' etc.
                     precision    recall  f1-score   support

  emotion -> sadness     0.4155    0.8296    0.5537       581
      emotion -> joy     0.8257    0.1295    0.2239       695
     emotion -> love     0.4444    0.0252    0.0476       159
    emotion -> anger     0.8000    0.0436    0.0828       275
     emotion -> fear     0.2868    0.7054    0.4077       224
 emotion -> surprise     0.1667    0.3939    0.2342        66

            accuracy                         0.3860      2000
           macro avg     0.4898    0.3545    0.2583      2000
        weighted avg     0.5906    0.3860    0.3072      2000

The accuracy drops a lot compared with the example provided in https://blog.knowledgator.com/how-to-classify-text-into-millions-of-classes-68aee1de3802

@Ingvarstep
Contributor

It is amazing that you have done this experiment!

It looks like the first formats fail because of tokenization: the underscore and hyphen variants are split into less meaningful subword tokens instead of the right words. The last two examples perform better, and I think experimenting with prompts could improve the results further.

Also, such formats can be unusual for a model, so it is recommended to additionally fine-tune the model on datasets that represent labels this way. With some effort, existing datasets can easily be transformed into such a label format.
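As a sketch of that transformation (hypothetical field names, assuming a flat {text, label} dataset):

```python
# Map each flat label to "parent -> child" so fine-tuning examples match the
# hierarchical label format discussed above.

label_space = {
    "emotion": ["sadness", "joy", "love", "anger", "fear", "surprise"],
    "attitude": ["positive", "negative", "neutral"],
}
child_to_parent = {c: p for p, cs in label_space.items() for c in cs}

def to_hierarchical(example):
    # example: {"text": ..., "label": "joy"} -> label becomes "emotion -> joy"
    parent = child_to_parent[example["label"]]
    return {**example, "label": f"{parent} -> {example['label']}"}

print(to_hierarchical({"text": "I feel great today", "label": "joy"}))
```

Applied over an existing emotion dataset, this yields training targets in exactly the format the constrained decoder will later be asked to produce.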

@HuangruiChu
Author

For the wiki article title prediction task, I think you just used zero-shot with T5, right?

I am still wondering why the unlimited classifier performs so well on wiki article title prediction (around 0.9 accuracy) while reaching only 0.65 accuracy on the emotion classification task.

@Ingvarstep
Contributor

Yes, we used zero-shot with the T5 model for the wiki article title prediction.

Emotion classification is more subjective, so it is more difficult for zero-shot models. If you revisit the emotion dataset, you will see that it is sometimes challenging to unambiguously assign a text to one of the classes.

Regarding wiki article title prediction: although this task is presented as classification over 6 million labels, it is simpler for a model because the title is often the most frequent entity in the text. However, when we prompted T5 to generate a title without setting generation constraints, we got poor results. So, if a model can understand the semantics of the article, limiting the possible generations to true wiki article titles helps it select the right title even from a huge label set.
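A toy illustration of that idea (mock scores standing in for the model's logits, words standing in for subword tokens; not the actual T5 setup): constraining each decoding step to the children of the current trie node guarantees the output is an exact title from the label set.

```python
# Greedy decoding restricted to a trie of valid titles: at each step, only
# children of the current trie node are candidates, so the final output is
# always an exact title from the list.

def build_trie(titles):
    trie = {}
    for title in titles:
        node = trie
        for tok in title.split():  # words stand in for subword tokens
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return trie

def mock_scores(prefix, candidates):
    # Stand-in for model logits; a real model would condition on `prefix`
    # and the article. Here we simply favor tokens appearing in the article.
    article = "the rocket launch was delayed by engine trouble"
    return {c: (1.0 if c in article.split() else 0.0) for c in candidates}

def constrained_greedy(trie):
    node, out = trie, []
    while node:
        scores = mock_scores(out, list(node))
        best = max(node, key=lambda tok: scores[tok])
        if best == "<eos>":
            break
        out.append(best)
        node = node[best]
    return " ".join(out)

titles = ["rocket launch", "weather report", "stock market"]
print(constrained_greedy(build_trie(titles)))  # rocket launch
```

Without the trie constraint, the same greedy scorer could emit any word sequence; with it, even a weak scorer can only choose among valid titles.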
