The Benchmark

Dataset

The dataset consists of 59 collections submitted by users, with 2,936,244 reviews in total.

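The models are compared on four metrics: log loss, R-squared, root-mean-square error (RMSE), and mean absolute error (MAE). Lower is better for log loss, RMSE, and MAE; higher is better for R-squared.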
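For readers unfamiliar with these metrics, here is a minimal sketch of their textbook definitions in plain Python. It is illustrative only: the function names and sample values are hypothetical, and the benchmark's own code may compute RMSE and MAE differently (for example on aggregated groups of reviews rather than on individual reviews).

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy.

    y_true: 1 if the card was recalled, 0 if it was forgotten.
    p_pred: predicted probability of recall for the same review.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical sample: four reviews with made-up outcomes and predictions.
outcomes = [1, 1, 0, 1]
predictions = [0.9, 0.8, 0.3, 0.6]
print(log_loss(outcomes, predictions))
print(rmse(outcomes, predictions))
print(mae(outcomes, predictions))
```

A model that always predicts a recall probability of 0.5 gets a log loss of ln 2 (about 0.693) under this definition, which is a useful reference point when reading the table.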

Model	Log loss	R-squared	RMSE	MAE
FSRS v4.5.1	0.37	0.73	4.0%	2.3%
LSTM	0.40	-0.58	6.3%	4.3%
FSRS v3.26.2	0.41	-1.76	7.0%	4.7%
SM-2	0.55	-29.55	18.5%	12.6%
Memrise	0.69	-51.50	18.0%	14.6%

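Note that negative values of R-squared are not the result of a bug. R-squared is negative whenever a model's predictions fit the data worse than a constant prediction equal to the mean of the observed values.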
FSRS v4.5.1 achieves the best result on every metric.
There were originally 66 collections. Two of them were so big that they crashed Google Colab due to a lack of RAM, and five were deemed outliers and therefore excluded.
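The note on negative R-squared can be illustrated with a short sketch. It uses the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares; the function name and all numbers are hypothetical and are not taken from the benchmark data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observed retention values and two sets of predictions.
observed = [0.9, 0.8, 0.7]
close_predictions = [0.88, 0.81, 0.72]
poor_predictions = [0.5, 0.9, 0.95]

print(r_squared(observed, close_predictions))  # positive, close to 1
print(r_squared(observed, poor_predictions))   # negative
```

Predicting the mean of the observed values for every point gives an R-squared of zero, so any model whose squared error is larger than that baseline ends up below zero, with no lower bound.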

Thanks to @Expertium, who conducted the benchmark experiment.

My fantastic research experience on spaced repetition algorithm: How did I publish a paper in ACMKDD as an undergraduate?

The largest open-source dataset on spaced repetition with time-series features: open-spaced-repetition/FSRS-Anki-20k