Umibench compares the performance of various models on two tasks: factuality detection (objective vs. subjective) and sentiment detection (positive, negative, or neutral).
It generates a leaderboard which is visible at the bottom of this page.
I will take care of everything. Please send:
- for a labeled dataset: a link to a place where I can download it.
- for a model: a URL to the API of the model. If an API key is necessary, please provide it. Otherwise, I will do my best to run the model at my own expense.
- Thesis_Titan (paper, api)
- TimeLMs (paper, api)
- gpt-3.5-turbo-basic-prompt (paper, api)
- OpenHermes-2-Mistral-7B-advanced-prompt (paper, api)
- gpt-3.5-turbo-advanced-prompt (paper, api)
- OpenHermes-2-Mistral-7B-basic-prompt (paper, api)
- umigon (paper, api)
Find supplementary information on these models here
- mpqa (paper, data source): a set of newswire articles on international politics annotated for factuality and sentiment
- kaggle-headlines (paper, data source): a set of headlines from US newspapers annotated for factuality
- xfact (paper, data source): a database of factual statements labeled for veracity by expert fact-checkers
- subjqa (paper, data source): a set of consumer reviews on goods from the “electronics” product category
- apple (paper, data source): a set of tweets mentioning Apple, annotated for neutral, positive, and negative sentiment
- alexa (paper, data source): a set of factual statements curated by the Amazon Alexa development team
- carblacac (paper, data source): a set of 200 tweets from anonymous individuals on their daily lives, annotated for negative and positive sentiment
- clef2023 (paper, data source): a set of phrases extracted from news articles annotated for factuality and subjectivity
Find supplementary information on each dataset here
Umigon and TimeLMs are models for sentiment analysis. We test them on factuality by considering that a prediction of "neutral sentiment" is equivalent to predicting an objective statement, while a prediction of positive or negative sentiment is equivalent to predicting a subjective statement.
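This mapping can be sketched as follows (the enum and method names are illustrative, not the actual classes of the Umibench codebase):

```java
public class SentimentToFactuality {

    enum Sentiment { POSITIVE, NEGATIVE, NEUTRAL }
    enum Factuality { OBJECTIVE, SUBJECTIVE }

    // A neutral sentiment prediction is treated as an objective statement;
    // a positive or negative prediction is treated as a subjective statement.
    static Factuality toFactuality(Sentiment s) {
        return s == Sentiment.NEUTRAL ? Factuality.OBJECTIVE : Factuality.SUBJECTIVE;
    }

    public static void main(String[] args) {
        System.out.println(toFactuality(Sentiment.NEUTRAL));   // OBJECTIVE
        System.out.println(toFactuality(Sentiment.NEGATIVE));  // SUBJECTIVE
    }
}
```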
Weighted F1 values:
Model | alexa | apple | carblacac | clef2023 | kaggle-headlines | mpqa | subjqa | xfact
---|---|---|---|---|---|---|---|---
OpenHermes-2-Mistral-7B-advanced-prompt | 0.986 | 0.609 | 0.961 | 0.641 | 0.738 | 0.784 | 0.974 | 0.792
OpenHermes-2-Mistral-7B-basic-prompt | 0.969 | 0.707 | 0.916 | 0.563 | 0.544 | 0.678 | 0.883 | 0.388
Thesis_Titan | 0.964 | 0.612 | 0.860 | 0.821 | 0.857 | 0.877 | 0.789 | 0.960
TimeLMs | 0.872 | 0.614 | 0.956 | 0.610 | 0.719 | 0.706 | 0.948 | 0.671
umigon | 0.960 | 0.656 | 0.788 | 0.606 | 0.945 | 0.783 | 0.957 | 0.978
The overall score for each model is the average of its weighted F1 scores across the datasets, weighted by the number of entries in each dataset.
 | umigon | Thesis_Titan | OpenHermes-2-Mistral-7B-advanced-prompt | TimeLMs | OpenHermes-2-Mistral-7B-basic-prompt
---|---|---|---|---|---
overall score | 0.907 | 0.906 | 0.790 | 0.699 | 0.517
rank | 1 | 2 | 3 | 4 | 5
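The aggregation amounts to a weighted average, which can be sketched as below. The dataset sizes and F1 values in the example are hypothetical placeholders, not the actual entry counts used by Umibench:

```java
public class OverallScore {

    // Weighted average of per-dataset F1 scores, weighted by dataset size.
    static double overallScore(double[] f1, int[] entries) {
        double weightedSum = 0;
        int total = 0;
        for (int i = 0; i < f1.length; i++) {
            weightedSum += f1[i] * entries[i];
            total += entries[i];
        }
        return weightedSum / total;
    }

    public static void main(String[] args) {
        // Hypothetical: two datasets of 200 and 800 entries, with F1 scores of 0.9 and 0.8.
        // The larger dataset dominates: (0.9 * 200 + 0.8 * 800) / 1000
        System.out.printf("%.2f%n", overallScore(new double[]{0.9, 0.8}, new int[]{200, 800}));
        // prints 0.82
    }
}
```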
Here we do not test the model Thesis_Titan, as it is a model for the task of factuality categorization, which cannot be extended or adapted to sentiment analysis.
Also, the models are tested against just one dataset: MPQA. As far as I know, this is the only annotated dataset in existence that makes a rigorous distinction between different sentiment valences while also annotating texts for their subjective or objective character.
Weighted F1 values:
Model | apple | carblacac | mpqa
---|---|---|---
OpenHermes-2-Mistral-7B-advanced-prompt | 0.598 | 0.867 | 0.847
OpenHermes-2-Mistral-7B-basic-prompt | 0.690 | 0.883 | 0.708
TimeLMs | 0.623 | 0.862 | 0.762
gpt-3.5-turbo-advanced-prompt | 0.724 | 0.820 | 0.867
gpt-3.5-turbo-basic-prompt | 0.663 | 0.832 | 0.827
umigon | 0.614 | 0.700 | 0.862
The overall score for each model is the average of its weighted F1 scores across the datasets, weighted by the number of entries in each dataset.
 | gpt-3.5-turbo-advanced-prompt | umigon | OpenHermes-2-Mistral-7B-advanced-prompt | gpt-3.5-turbo-basic-prompt | TimeLMs | OpenHermes-2-Mistral-7B-basic-prompt
---|---|---|---|---|---|---
overall score | 0.832 | 0.798 | 0.791 | 0.790 | 0.735 | 0.712
rank | 1 | 2 | 3 | 4 | 5 | 6
Umibench is programmed in Java.
- You need Java 17 or later.
- Clone this repo and open it in your favorite IDE.
- In the `private` directory, rename `example-properties.txt` to `properties.txt` and set the API keys in it.
- Navigate to the package `src/main/java/net/clementlevallois/umigon/eval/models` and open the classes where the API calls to Huggingface are made. Replace the endpoints with the endpoints of the models you want to test. Public endpoints don't have enough capacity; you need to spin up your own endpoints.
- Run the main class of the project (`Controller.java`).
Most if not all models for sentiment analysis classify the two following statements as SUBJECTIVE / NEGATIVE, which I call into question:
- "State terrorism is a political concept": should be OBJECTIVE, NOT SUBJECTIVE / NEGATIVE
- "State terrorism is not the best course of action": should be SUBJECTIVE / NEGATIVE
This distinction, which can appear subtle, is actually very important to maintain in the analysis of discourse in media, politics, and culture in general. Otherwise, the examination of opinions and debates on topics laden with a positive or negative factual valence will be tainted by this lack of distinction.
I developed Umigon, a model for sentiment analysis which strives to maintain this distinction. Annotated datasets that include labels on whether the text entries are objective from the point of view of the speaker are rare. Actually, only the MPQA dataset (listed above) maintains this distinction.
To extend the range of tests, I have included datasets that are carefully annotated for factuality: is a statement objective or subjective? The expectation is that a good model for sentiment analysis (one maintaining the distinction made above) has to perform well on this factuality test as well; otherwise, its definition of "sentiment" confuses actual sentiment with positively or negatively laden factual statements.
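The two statements discussed above can serve as a minimal regression probe for this distinction. A sketch, using an illustrative record type that is not part of the Umibench codebase:

```java
import java.util.List;

public class ProbeSentences {

    // Probe sentences paired with the labels a model honoring the
    // objective/subjective distinction should produce.
    record Probe(String text, String expectedLabel) {}

    static final List<Probe> PROBES = List.of(
        new Probe("State terrorism is a political concept", "OBJECTIVE"),
        new Probe("State terrorism is not the best course of action", "SUBJECTIVE / NEGATIVE")
    );

    public static void main(String[] args) {
        // Both sentences mention a negatively laden topic, yet only the
        // second one voices an opinion of the speaker.
        for (Probe p : PROBES) {
            System.out.println(p.expectedLabel() + " <- " + p.text());
        }
    }
}
```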
Clement Levallois, [email protected]
This readme file and the leaderboard it includes were generated on 2024-02-27T16:17:58.352201800.