All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
### Added
- Added more label synonyms for the DKHate dataset.
- Added support for `flax` models. Thanks to @versae for contributing!
- Changed the anonymisation procedure for the tweet datasets `angry-tweets` and `twitter-sent`, now replacing user names by `@USER` and links by `[LINK]`.
- Now removing all empty documents from datasets, and catching the `KeyError` raised when trying to remove empty documents from a dataset.
- Now explicitly removing empty tokenisations from the dataset.
### Fixed
- Now catching all `CUDA error` exceptions and treating them as out-of-memory errors. No harm is done if this is not the case, as the script will simply decrease the batch size until it reaches 1, and if the CUDA errors persist it will skip that benchmark.
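A minimal sketch of the batch-size backoff described above; `run_benchmark` is a hypothetical stand-in for the actual training/evaluation call, not part of the package's API:

```python
# Illustrative sketch of the batch-size backoff described above.
def benchmark_with_backoff(run_benchmark, batch_size: int = 32):
    """Retry `run_benchmark` with smaller batch sizes on CUDA errors."""
    while batch_size >= 1:
        try:
            return run_benchmark(batch_size=batch_size)
        except RuntimeError as exc:
            # Treat any CUDA error as running out of memory and retry.
            if "CUDA" not in str(exc):
                raise
            batch_size //= 2
    # Even a batch size of 1 failed, so this benchmark is skipped.
    return None
```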
### Fixed
- When benchmarking a token classification dataset with a model whose tokenizer does not have a fast variant yet, an error was raised, as the `word_ids` method of `BatchEncoding` objects only works when the tokenizer is fast. In that case the word IDs are now computed manually. This currently handles WordPiece and SentencePiece prefixes (i.e., `##` and `▁`), and will raise an error if the manual alignment of words and tokens fails (a rough sketch of this alignment is given below).
- Catch the CUDA error `CUDA error: CUBLAS_STATUS_ALLOC_FAILED`, which in this case is due to OOM.
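A rough, hypothetical sketch of the manual word/token alignment mentioned above; the function and its heuristics are illustrative, not the package's actual implementation:

```python
# Hypothetical sketch of manual word/token alignment for slow tokenizers.
def manual_word_ids(tokens: list[str], style: str = "wordpiece") -> list[int | None]:
    """Map each token to the index of the word it belongs to."""
    special = {"[CLS]", "[SEP]", "[PAD]", "<s>", "</s>", "<pad>"}
    word_ids: list[int | None] = []
    word_idx = -1
    for token in tokens:
        if token in special:
            word_ids.append(None)  # special tokens belong to no word
            continue
        if style == "wordpiece":
            # WordPiece continuation pieces start with "##".
            starts_new_word = not token.startswith("##")
        else:  # "sentencepiece"
            # SentencePiece pieces starting with "▁" begin a new word.
            starts_new_word = token.startswith("▁")
        if starts_new_word or word_idx < 0:
            word_idx += 1
        word_ids.append(word_idx)
    return word_ids

# manual_word_ids(["[CLS]", "bench", "##marking", "[SEP]"]) -> [None, 0, 0, None]
```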
- Deal with CUDA OOM errors when they occur on a replica, when multiple cores are used.
- Remove reference to `trainer` when a CUDA OOM error is dealt with.
- Only try to merge the `id2label` and `label2id` conversions if the model is finetuned. This caused some errors when a model was not finetuned but somehow still had conversion dictionaries.
- Deal with models with the tasks `feature-extraction` or `sentence-similarity` as if they were `fill-mask`, meaning that they are assumed to be merely pretrained models rather than finetuned ones.
- Fixed bug when evaluating a finetuned model.
### Changed
- Added progress bar description when evaluating models without finetuning them first.
- Lowered the package requirements to the earliest possible versions.
- Removed support for TensorFlow and Jax models, as they were not working properly anyway. They might be properly supported at a later point.
## [v1.4.0] - 2021-11-25
- Now also outputting aggregated metrics in the resulting `scandeval_benchmark_results.json` file. This JSON file now has the keys `raw_metrics` and `total`, with `raw_metrics` containing the previous (raw) scores, and `total` containing aggregated scores (means and standard errors).
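As a purely hypothetical illustration of this structure (the dataset key, metric names and numbers below are all made up), the contents of `scandeval_benchmark_results.json` might look roughly like:

```python
# Made-up illustration of the raw_metrics/total structure described above.
results = {
    "dane": {
        "raw_metrics": {"test_micro_f1": [81.2, 79.8, 80.5]},
        "total": {"test_micro_f1": 80.5, "test_micro_f1_se": 0.7},
    },
}
```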
- All training/evaluation progress bars are now removed when they are finished, and the training progress bar has no total anymore, as it was misleading.
### Fixed
- Removed `transformers` logging during evaluation as well.
### Changed
- Now only updating the list of benchmarks in the `Benchmark` class during initialisation, and also logging it. This should make subsequent calls to the `benchmark` method faster.
- Removed `transformers` logging properly.
- Set the number of warmup steps to the intended one pass over the training set, where previously it was effectively 8x that amount, due to gradient accumulation.
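A rough illustration of the warmup arithmetic in the entry above (the sample count, batch size and accumulation factor are made up):

```python
# Made-up numbers illustrating why warmup should count optimiser steps,
# not forward passes, when gradient accumulation is used.
num_train_samples = 8_000
per_device_batch_size = 4
gradient_accumulation_steps = 8  # effective batch size of 32

# One optimiser step consumes batch_size * accumulation samples, so one pass
# over the training set corresponds to this many warmup steps:
warmup_steps = num_train_samples // (per_device_batch_size * gradient_accumulation_steps)

# Counting forward passes instead overshoots by the accumulation factor:
wrong_warmup_steps = num_train_samples // per_device_batch_size  # 8x too many
```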
- Added the NER label synonyms `OBJORG=ORG`, `LOCPRS=LOC`, `LOCORG=LOC` and `ORGPRS=ORG`.
- Explicitly added `numpy` to the `install_requires` list. This is normally not a problem, as it is a requirement for other required packages, but that depends on the order in which the requirements are installed. Adding it explicitly avoids errors caused by such misordering.
- Indexing error during synonym setup of finetuned models.
- When a finetuned model has labels which are synonyms of each other, they are now properly treated as synonyms, where previously this caused the model to have misaligned `id2label` and `label2id` conversion dictionaries.
### Fixed
- Added the NER label synonyms `GPE_LOC=LOC`, `GPE_ORG=ORG`, `LOC/ORG=LOC`, `ORG/PRS=ORG` and `OBJ/ORG=ORG`, as Norwegian and Swedish models tend to use these.
### Fixed
- Fixed a bug in the label synonyms when benchmarking a finetuned SpaCy model for NER.
### Added
- Added label synonyms for NER benchmarking, which will enforce a fairer comparison of finetuned NER models, if the models have been trained on datasets with different labelling (e.g., `Person` instead of `PER`).
- Properly removed the Icelandic WikiANN-IS data files. The dataset was removed from the package, but the underlying files were still present in the repository.
### Added
- Added the Icelandic NER dataset MIM-GOLD-NER. This can now be loaded as `mim-gold-ner` in the `Benchmark` class and through the CLI.
- Removed the Icelandic WikiANN-IS dataset, as this has now been replaced by the MIM-GOLD-NER dataset.
- Added truncation and padding when tokenising token classification datasets.
- Missing dependency parsing tags.
### Fixed
- Reduce validation batch size if CUDA runs out of memory, rather than only reducing training batch size.
- Added Icelandic and Faroese translations of the Norwegian `NoReC` sentiment analysis dataset. These can be loaded as `norec-is` and `norec-fo`, respectively.
- When loading datasets with `load_dataset`, the result is now four dataframes, rather than dictionaries. As the data can be accessed in the same way as with dictionaries, this maintains backwards compatibility.
- If a finetuned NER model has been trained on NER tags not present among the ones in the dataset, these are either converted to `MISC` tags (if these are present in the dataset) or otherwise to `O` tags. This will make the benchmarking of diverse finetuned NER models fairer.
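A minimal sketch of that tag conversion, assuming BIO-style NER tags (the function name and signature are illustrative, not the package's actual code):

```python
# Illustrative sketch of the MISC/O fallback described above, assuming BIO tags.
def convert_unknown_tags(pred_tags: list[str], dataset_tags: set[str]) -> list[str]:
    converted = []
    for tag in pred_tags:
        if tag == "O" or tag in dataset_tags:
            converted.append(tag)
        else:
            # Keep the B-/I- prefix, fall back to MISC if the dataset has it,
            # and otherwise to the outside tag O.
            fallback = tag[:2] + "MISC"
            converted.append(fallback if fallback in dataset_tags else "O")
    return converted

# convert_unknown_tags(["B-GPE", "I-GPE", "O"], {"B-PER", "I-PER", "B-MISC", "I-MISC"})
# -> ["B-MISC", "I-MISC", "O"]
```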
- There was an error when a SpaCy model was benchmarked on a dataset that it was not trained on. It now raises an appropriate `InvalidBenchmark` exception, and will be skipped in the CLI and with the `Benchmark` class.
### Fixed
- Replaced abbreviations containing spaces in the SDT corpus, such as "o s v", with their proper versions, such as "o.s.v.".
### Fixed
- The URLs for the `wikiann-is` and `wikiann-fo` datasets were wrong and have been corrected.
- Added the Icelandic and Faroese WikiANN datasets, for NER evaluation. They can be loaded as `wikiann-is` and `wikiann-fo` in the CLI and via the `Benchmark` class.
- Added the Icelandic and Faroese parts of the Universal Dependencies datasets, containing POS and dependency parsing tags. They can be loaded as `idt-pos`, `idt-dep`, `fdt-pos` and `fdt-dep`, respectively.
- Added the Dataset for Linguistic Acceptability Judgments (DaLaJ) dataset, which is here used as a binary classification dataset, in which sentences have to be classified as correct Swedish or not. It can be loaded as `dalaj` in the CLI and via the `Benchmark` class.
- Added the ABSAbank-Imm dataset, which is an aspect-based sentiment analysis dataset in Swedish, namely the sentiment towards immigration. The original dataset featured a floating point score between 0 and 5, which has been reduced to a classical three-way classification (`negative`, `neutral` and `positive`). It can be loaded as `absabank-imm` in the CLI and via the `Benchmark` class.
- Added the POS and dependency parsing parts of the Swedish Dependency Treebank (SDT). They can be loaded as `sdt-pos` and `sdt-dep` in the CLI and via the `Benchmark` class.
- Added the Stockholm-Umeå Corpus 3.0 (SUC 3.0), a Swedish NER dataset. It can be loaded as `suc3` in the CLI and via the `Benchmark` class.
- Added abstract `NerBenchmark`, `PosBenchmark` and `DepBenchmark` classes, to ensure uniformity.
- Uniformised all the NER datasets. They now all only have the NER tags `PER`, `LOC`, `ORG` and `MISC`.
- Uniformised all the dependency parsing datasets. They now all only have the main dependency parsing tags, without the subtags (so `acl:cleft` has been changed to `acl`, for instance).
- Changed the columns in all text classification datasets to `text` and `label`, to make them more uniform.
- Upped the number of index tokens for dependency parsing from 100 to 512. This will need to be done better in the future, but is a fix for now.
- Added the random models `random-roberta-sequence-clf` and `random-roberta-token-clf` to the default list of model IDs when benchmarking all models.
- The lists of dependency tags in the `ndt-nb-dep` and `ndt-nn-dep` datasets were wrong. They have now been changed to include all the tags occurring in the training sets.
- The `europarl_sent` data folder has now been renamed to `europarl`, so that it can be loaded correctly with `load_dataset`.
- Added the Bokmål and Nynorsk POS and DEP parts of the Norwegian Dependency Treebank dataset (NDT). They can be loaded as `ndt-nb-pos`, `ndt-nn-pos`, `ndt-nb-dep` and `ndt-nn-dep`, respectively, from the CLI and the `Benchmark` class.
- Removed the `EuroparlSubj` and `TwitterSubj` datasets, as they were too easy and did not really differentiate models.
- Removed the abstract `SentimentClassificationBenchmark` and `BinaryClassificationBenchmark` classes, to simplify the classes. There is now only one `TextClassificationBenchmark`, which always evaluates with macro-F1.
- Changed the name of `europarl-sent` to `europarl`, as `europarl-subj` no longer exists.
- Changed the `nordial` dataset to the original 4-way classification dataset.
- Remove duplicate model IDs when calling the CLI or `Benchmark` class without any specified model IDs.
### Added
- Added the Bokmål and Nynorsk parts of the NorNE dataset, for named entity recognition. They can be loaded with the `norne-nb` and `norne-nn` names.
- There is now a `load_dataset` function, which can load any dataset using the dataset's name (the same name as in the CLI). For instance, `load_dataset('angry-tweets')` loads the `AngryTweets` dataset. This can be imported directly from the package: `from scandeval import load_dataset`. The individual dataset loading functions can still be imported as before; e.g., `from scandeval.datasets import load_angry_tweets`.
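A short usage sketch of the two import styles mentioned above:

```python
# The two equivalent ways of loading a dataset described in the entry above.
from scandeval import load_dataset
from scandeval.datasets import load_angry_tweets

dataset = load_dataset("angry-tweets")  # same data as load_angry_tweets()
```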
- Refactored folder structure with benchmarks and datasets.
- Separated `dane` and `dane-no-misc` into two distinct benchmark classes. The `dane-no-misc` dataset can now also be loaded with the `load_dataset` function.
### Added
- Added the Norwegian Review Corpus (NoReC), a sentiment classification dataset in Norwegian.
- Added the Bokmål/Nynorsk part of the Norwegian Dialect dataset (NorDial), a binary classification dataset in Norwegian.
- Changed the early stopping patience to `2 + 1000 // len(train)` from `2 + 250 // len(train)`, to allow more patience (and thus more stability) for smaller datasets.
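A quick worked example of the new patience formula (the dataset sizes are made up):

```python
# Worked example of the patience formula above; dataset sizes are made up.
def patience(num_train_samples: int) -> int:
    return 2 + 1000 // num_train_samples

print(patience(250))     # 6  -> small datasets get extra patience
print(patience(1000))    # 3
print(patience(10_000))  # 2  -> large datasets keep the base patience
```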
- Merged the `lcc1` and `lcc2` datasets into one `lcc` dataset, which is reasonable as they have been annotated by the same person. The `lcc2` dataset was too small to give reasonable benchmarking results.
- Renamed the `europarl2` dataset to `europarl_sent`.
- Removed the `europarl1` dataset, as it was too small to give reliable benchmarking results. This dataset could not simply be added to the `europarl2` dataset, as with the new `lcc` dataset, as the annotators are not the same.
- If errors occur during benchmarking, then garbage collect before skipping to the next benchmark, to avoid memory issues.
- An issue with `model_max_length` in the tokenizer meant that models with an ill-set value of `max_position_embeddings` could not be benchmarked. Now, if `model_max_length` is not set, the minimal value of the sizes in `max_model_input_sizes` will be used (which is usually 512).
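A minimal sketch of that fallback, assuming a HuggingFace-style tokenizer with the attributes named above (the threshold used to detect an unset value is a made-up heuristic, not the package's actual check):

```python
# Illustrative sketch of the model_max_length fallback described above.
def resolve_max_length(tokenizer) -> int:
    # Very large values typically mean model_max_length was never really set;
    # the 100_000 threshold here is only a heuristic for this sketch.
    max_len = getattr(tokenizer, "model_max_length", None)
    if max_len and max_len < 100_000:
        return max_len
    # Fall back to the smallest of the registered maximum input sizes,
    # which is usually 512.
    return min(tokenizer.max_model_input_sizes.values(), default=512)
```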
- Disabling CUDNN benchmark when using the `pytorch` framework, to enforce better reproducibility.
- Rather than bootstrapping the training dataset and using the results to compute an estimator of the standard deviation, the same training dataset is now used in all ten training runs, and the mean of these scores along with a confidence interval is outputted.
- Updated the model metadata fetching to the new HTML structure of the HuggingFace Hub.
- A random seed is now set for all libraries, via the `transformers.set_seed` function.
- Always update the list of all the benchmarks when calling the `Benchmark.benchmark` method, to allow for the possibility of setting new benchmark parameters after initialisation.
### Added
- The subjective/objective parts of the `TwitterSent` and `Europarl2` datasets have now been added as binary classification tasks, called `TwitterSubj` and `EuroparlSubj`, respectively. These can now be benchmarked with the `Benchmark` class and the CLI using the `twitter-subj` and `europarl-subj` names, respectively.
- Added an abstract `BinaryClassificationBenchmark`, to streamline the binary classification benchmark datasets, which now include the `DKHate`, `TwitterSubj` and `EuroparlSubj` datasets.
- Now catches `IndexError` during training.
### Fixed
- Now properly filters by language via the `language` argument in the CLI and the `Benchmark` class. As the HuggingFace Hub does not have a keyword for language, a search for a language also means that any other non-language tag with that name shows up in the results. These are now manually removed. This means it takes a few more seconds to compile the model list, but it will at least be accurate.
- In case `model_max_length` has not been set in a model configuration, it now defaults to the value of `max_position_embeddings`. This fixes a problem with some models not being able to be trained on datasets whose texts were too long.
- Now handles the case where a non-classification model, such as a seq-to-seq model, is being benchmarked on a classification dataset.
### Added
- All the benchmark classes and `Benchmark` now have a `benchmark` method, which does the same as the `__call__` method. This is primarily so that it shows up in the Sphinx documentation.
- Added the default `LABEL_0` and `LABEL_1` label synonyms for `NOT` and `OFF` in the `DKHate` benchmark.
- Added the possibility of benchmarking randomly initialised RoBERTa models, using the model IDs `random-roberta-sequence-clf` and `random-roberta-token-clf`.
### Added
- Added the separate `nb` (Norwegian Bokmål) and `nn` (Norwegian Nynorsk) language tags, on top of the general `no` (Norwegian).
- Added more multilingual models.
- SpaCy models were evaluated wrongly on the `dane-no-misc` dataset, as their `MISC` predictions were not replaced with `O` tags.
- When evaluating models finetuned for token classification on a text classification task, a `ValueError` was raised, rather than an `InvalidBenchmark` exception.
- If none of the model's labels are among the dataset's labels, and are not even synonyms of them, then an `InvalidBenchmark` is now raised. This prevents things like evaluating a finetuned sentiment model on a NER task.
- When `evaluate_train` was `True`, this previously evaluated the test set instead.
- Changed the `Benchmark` API. Now the constructor and the `__call__` method have the same arguments, except for the `model_id` and `dataset` arguments in `__call__`; the constructor sets the default values and the `__call__` method can change these for specific cases.
- Changed the benchmarking order. Now all datasets are benchmarked for a model before moving on to the next model.
- Renamed the `multilabel` argument to the more descriptive `two_labels`.
- Updated docstrings to be more accurate.
- Early stopping patience is now set to `2 + 250 // len(train)`, so that smaller datasets can enjoy a bit more patience, but if the dataset contains at least 250 samples then it will remain at the current patience of 2.
- Removed the `learning_rate`, `batch_size`, `warmup_steps` and `num_finetunings` arguments from the benchmarks. These are now fixed to 2e-5, 32, 25% of the training dataset and 10, respectively. Note that the batch size will still automatically decrease if the GPU runs out of memory.
- Models are now being trained for much longer, but with an early stopping callback with patience 2. This will enable a more uniform comparison between models that require a different number of finetuning epochs.
- There was a bug when evaluating a finetuned PyTorch model on a sequence classification task, if the model had only been trained on a proper subset of the labels present in the dataset.
- All individual benchmarks have been removed from `__init__.py`. They can still be imported using their individual modules, for instance `from scandeval.dane import DaneBenchmark`, but the idea is to use the general `Benchmark` class instead.
- Always ensure that a model can deal with the labels in the dataset when finetuning. If the model has not been trained on the label, then this will result in the model always getting that label wrong. For instance, this is the case for finetuned NER models not having been trained on MISC tags, if they are being evaluated on the DaNE dataset.
### Fixed
- Fixed bug when evaluating SpaCy models.
- Only removing objects at memory cleanup if they exist at all.
- When finetuning models, 10% of the training data is used to evaluate the models, and this is used to choose the best performing model across all the epochs trained. This allows for a fairer comparison, as some models degrade over time, while other models need a longer time to train.
- Uniformised the `_log_metrics` method for all benchmarks, which is now only defined in `BaseBenchmark`.
- Garbage collects when downsizing batch size, to not keep all the previous models in memory.
- Typos in logging.
- Fixed bug when `evaluate_train` was set to `False`.
- The bootstrapping of the datasets is now done properly. Previously the bootstrapped datasets were not converted to HuggingFace Dataset objects.
### Added
- It is possible to only evaluate on the test sets, to save some time. This can be done in the `Benchmark` class using the `evaluate_train` argument, and in the CLI with the `--evaluate_train` flag.
- Added a `progress_bar` argument to `Benchmark` to control whether progress bars should be shown, and added the `no_progress_bar` flag to the CLI for the same reason.
- Updated `epochs` and `warmup_steps` of all the datasets to something more reasonable, enabling better comparisons of the finetuned models.
- Changed the calculation of confidence intervals, which is now based on bootstrapping rather than the analytic approach. It will now evaluate ten times on the test set and compute a bootstrap estimate of the standard error, which is used to compute an interval around the score on the entire test set.
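A generic sketch of a bootstrap standard-error estimate of this kind (illustrative only, not the package's exact implementation; the scores below are made up):

```python
import numpy as np

# Generic bootstrap standard-error estimate from ten evaluation scores.
def bootstrap_standard_error(scores: list[float], n_boot: int = 1000, seed: int = 4242) -> float:
    rng = np.random.default_rng(seed)
    arr = np.asarray(scores)
    boot_means = [
        rng.choice(arr, size=len(arr), replace=True).mean() for _ in range(n_boot)
    ]
    return float(np.std(boot_means, ddof=1))

scores = [0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79, 0.83, 0.80]
se = bootstrap_standard_error(scores)
ci_radius = 1.96 * se  # roughly a 95% interval around the full test-set score
```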
- `RuntimeError`s occurring during training will now raise an `InvalidBenchmark` exception, which means that the CLI and the `Benchmark` class will skip it. This is for instance caused when `max_length` has not been specified in the model config, meaning that the tokeniser does not know how much to truncate.
- Now catching the error where tokenisation is not possible, due to the model having been trained on a different task than what is present in the dataset. E.g., if a generative model is benchmarked on a classification task.
### Fixed
- Now catching the error when the model's config does not align with the model class. When using the CLI or `Benchmark`, these will be skipped.
- Added confidence intervals for finetuned models, where there is a 95% likelihood that the true score would belong to the interval, given infinite data from the same distribution. In the case of "raw" pretrained models, this radius is added onto the existing interval, so that both the uncertainty in model initialisation and the sample size of the validation dataset affect the size of the interval.
- Added garbage collection after each benchmark, which will (hopefully) prevent memory leaking when benchmarking several models.
- New logo, including the Faroe Islands!
- Allow the possibility to include all languages and/or tasks in the CLI and the `Benchmark` class.
- Added Icelandic and Faroese to the default list of languages in the CLI and the `Benchmark` class.
- The default value for `task` is now all tasks, which also includes models that haven't been assigned any task on the HuggingFace Hub.
- If a model cannot be trained without running out of CUDA memory, even with a batch size of 1, then the model will be skipped in `Benchmark` and the CLI.
- A new model is initialised if CUDA runs out of memory, to ensure that we are not continuing to train the previous model.
- Dependency parsing now implemented properly as two-label classification, with associated UAS and LAS metric computations. Works for pretrained SpaCy models as well as finetuning general language models.
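A generic sketch of UAS/LAS computed over (head, relation) pairs, assuming the two predicted labels per token are the head index and the dependency relation (illustrative only, not the package's exact implementation):

```python
# Generic illustration of UAS/LAS over (head, relation) pairs per token.
def uas_las(gold: list[tuple[int, str]], pred: list[tuple[int, str]]) -> tuple[float, float]:
    assert len(gold) == len(pred)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return correct_heads / n, correct_both / n  # (UAS, LAS)

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
print(uas_las(gold, pred))  # (0.666..., 0.666...)
```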
- Reduces batch size if CUDA runs out of memory during evaluation.
- Loading of text classification datasets now working properly.
- The `W036` warning message from SpaCy is no longer shown.
- Raise `InvalidBenchmark` if a model cannot be loaded from the HuggingFace Hub.
- Added the part-of-speech tagging task from the Danish Dependency Treebank. Can be loaded with `load_ddt_pos` and used in `Benchmark` as `ddt-pos`.
- Added the dependency parsing task from the Danish Dependency Treebank. Can be loaded with `load_ddt_dep` and used in `Benchmark` as `ddt-dep`.
- Documentation section and link added to the `README`.
- The `Benchmark` class and the CLI now accept a `batch_size` argument.
- The `Benchmark` arguments `languages`, `tasks`, `model_ids` and `datasets` have been renamed to `language`, `task`, `model_id` and `dataset`, to keep them consistent with the CLI.
- When loading datasets, these will now be four dictionaries instead of lists, to allow for distinguishing features and labels.
- The `batch_size` argument can now only be among 1, 2, 4, 8, 16 and 32, and the corresponding gradient accumulation will be set to 32, 16, 8, 4, 2 and 1, respectively. This ensures that all finetuning is done using the same effective batch size, enabling fair comparisons.
- Batch sizes are automatically halved if the GPU runs out of memory, with gradient accumulation correspondingly doubled.
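A small sketch of that pairing, keeping the effective batch size fixed at 32 (the variable and function names are illustrative):

```python
# Illustrative pairing of batch size and gradient accumulation from the entry
# above, keeping the effective batch size fixed at 32.
EFFECTIVE_BATCH_SIZE = 32
ALLOWED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]

def gradient_accumulation(batch_size: int) -> int:
    assert batch_size in ALLOWED_BATCH_SIZES
    return EFFECTIVE_BATCH_SIZE // batch_size

for bs in ALLOWED_BATCH_SIZES:
    print(bs, gradient_accumulation(bs))  # (1, 32), (2, 16), ..., (32, 1)
```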
- Evaluation of `SpaCy` models on token classification tasks is more accurate.
- `README` typos fixed, and the image now renders correctly.
- First beta release
- Features Danish sentiment, hate speech detection and named entity recognition datasets for benchmarking