The Benchmark
The Benchmark Project: https://github.com/open-spaced-repetition/fsrs-benchmark
Spaced repetition algorithms are computer programs designed to help people schedule reviews of flashcards. A good spaced repetition algorithm helps you remember things more efficiently. Instead of cramming all at once, it spreads out your study sessions over time. To make this efficient, these algorithms try to understand how your memory works. They aim to predict when you're likely to forget something, so they can schedule a review just in time.
This benchmark is a tool to test how well different algorithms do at predicting your memory. It compares several algorithms to see which ones give the most accurate predictions.
The dataset for the FSRS benchmark comes from 20 thousand people who use Anki, a flashcard app. In total, this dataset contains information about ~1.5 billion reviews of flashcards. If you would like to obtain the full dataset, please contact Damien Elmes, the main Anki developer.
In the FSRS benchmark, we use a tool called TimeSeriesSplit, which is part of the sklearn machine-learning library. It helps us split the data by time: older reviews are used for training and newer reviews for testing. That way, we don't accidentally cheat by giving the algorithm future information it shouldn't have. In practice, we use past study sessions to predict future ones, which makes TimeSeriesSplit a good fit for our benchmark.
Note: the first split is excluded from evaluation. This is because the first split is used for training, and we don't want to evaluate an algorithm on the same data it was trained on.
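As a minimal sketch of how this works (not the benchmark's actual data pipeline; the toy array below just stands in for one user's time-ordered reviews):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for one user's review history, already sorted by time.
reviews = np.arange(10)

tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(reviews)):
    if i == 0:
        # The first split is used for training only (see the note above),
        # so we skip evaluating on it.
        continue
    print(f"split {i}: train on {train_idx}, test on {test_idx}")
```

Each successive split trains on a strictly earlier prefix of the history and tests on the reviews that follow it, so no future information leaks into training.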
We use two metrics in the FSRS benchmark to evaluate how well these algorithms work: log loss and a custom RMSE that we call RMSE (bins).
- Log Loss (also known as Binary Cross Entropy): Utilized primarily for its applicability in binary classification problems, log loss serves as a measure of the discrepancies between predicted probabilities of recall and review outcomes (1 or 0). It quantifies how well the algorithm approximates the true recall probabilities, making it an important metric for model evaluation in spaced repetition systems.
- Weighted Root Mean Square Error in Bins (RMSE (bins)): This is a metric engineered for the FSRS benchmark. In this approach, predictions and review outcomes are grouped into bins according to the predicted probabilities of recall. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These values are then weighted according to the sample size in each bin, and then the final weighted root mean square error is calculated. This metric provides a nuanced understanding of model performance across different probability ranges.
Smaller is better. If you are unsure which metric to look at, look at RMSE (bins). That value can be interpreted as "the average difference between the predicted probability of recalling a card and the measured probability". For example, if RMSE (bins) = 0.05, the algorithm is, on average, wrong by 5% when predicting the probability of recall.
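As a rough illustration of both metrics (a sketch following the descriptions above, not the benchmark's exact implementation; the choice of 20 equal-width bins is an assumption made for this example):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross entropy between review outcomes (1/0) and
    predicted probabilities of recall."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def rmse_bins(y_true, p_pred, n_bins=20):
    """Group predictions into bins by predicted probability, compare the
    mean prediction to the mean recall rate within each bin, and take the
    sample-size-weighted root mean square of the differences."""
    p = np.asarray(p_pred, dtype=float)
    y = np.asarray(y_true, dtype=float)
    bin_idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    sq_err, weights = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        n = int(mask.sum())
        if n == 0:
            continue
        sq_err.append((p[mask].mean() - y[mask].mean()) ** 2)
        weights.append(n)
    return float(np.sqrt(np.average(sq_err, weights=weights)))
```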
- FSRS v3: the first version of the FSRS algorithm that people actually used.
- FSRS v4: the upgraded version of FSRS, made better with help from the community.
- FSRS-4.5: a minor improvement over FSRS v4; the shape of the forgetting curve has been changed (see the sketch after this list). This benchmark also includes FSRS-4.5 with default parameters (obtained by running FSRS-4.5 on all 20 thousand collections) and FSRS-4.5 where only the first 4 parameters (the values of initial stability after the first review) are optimized and the rest are set to the defaults.
- FSRS rs: the Rust port of FSRS v4. It is simplified due to the limitations of the Rust-based deep learning framework. See also: https://github.com/open-spaced-repetition/fsrs-rs
- LSTM: a type of neural network that's often used for making predictions based on a sequence of data. It's a classic in the field of machine learning for time-related tasks. Our implementation includes 489 parameters.
- HLR: the model proposed by Duolingo. Its full name is Half-Life Regression; for more details, see the paper "A Trainable Spaced Repetition Model for Language Learning" (Settles & Meeder, 2016).
- SM-2: one of the early algorithms used by SuperMemo, the first spaced repetition software. It was developed more than 30 years ago, and it's still popular today: Anki's default algorithm is based on SM-2, and Mnemosyne uses it as well.
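For reference, here is a minimal sketch of the forgetting-curve change mentioned in the FSRS-4.5 item above, assuming the power-law forms published in the FSRS project materials; the function names are hypothetical helpers for this illustration:

```python
def retrievability_v4(t, s):
    # FSRS v4 forgetting curve: R(t, S) = (1 + t / (9 * S)) ** -1,
    # normalized so that R = 0.9 when t == S.
    return (1 + t / (9 * s)) ** -1

def retrievability_45(t, s, decay=-0.5):
    # FSRS-4.5 forgetting curve: R(t, S) = (1 + FACTOR * t / S) ** DECAY,
    # with FACTOR chosen so that R = 0.9 still holds when t == S
    # (FACTOR = 19/81 for DECAY = -0.5).
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay
```

Both curves pass through R = 0.9 at t = S; the FSRS-4.5 curve simply decays with a different exponent, which changes how fast recall probability falls off before and after that point.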
For more details about the FSRS algorithm, read this wiki page: The Algorithm.
Total number of users: 19,990.
Total number of reviews for evaluation: 708,151,820.
The following table presents the weighted means and the 99% confidence intervals. The best result is highlighted in bold. The metrics are weighted by the logarithm of the number of reviews in each collection (a sketch of this weighting follows the table).
| Algorithm | Log Loss | RMSE (bins) |
| --- | --- | --- |
| FSRS-4.5 | **0.346±0.0030** | **0.063±0.0008** |
| FSRS rs | 0.350±0.0031 | 0.066±0.0008 |
| FSRS v4 | 0.354±0.0033 | 0.074±0.0009 |
| FSRS-4.5 (only pretrain) | 0.361±0.0032 | 0.079±0.0009 |
| FSRS-4.5 (default parameters) | 0.376±0.0033 | 0.095±0.0011 |
| FSRS v3 | 0.416±0.0043 | 0.104±0.0014 |
| LSTM | 0.50±0.007 | 0.137±0.0018 |
| SM-2 | 0.68±0.011 | 0.210±0.0020 |
| HLR | 2.03±0.045 | 0.391±0.0040 |
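For illustration only, here is one way such a log-weighted mean and a 99% confidence interval could be computed. The log-of-review-count weighting is from the description above, but the normal-approximation CI and effective-sample-size formula are assumptions for this sketch; the benchmark's actual procedure may differ.

```python
import numpy as np

def weighted_mean_ci(metric_per_user, reviews_per_user, z=2.576):
    """Sketch: mean of a per-user metric weighted by log(review count),
    with a normal-approximation 99% CI (z = 2.576)."""
    x = np.asarray(metric_per_user, dtype=float)
    w = np.log(np.asarray(reviews_per_user, dtype=float))
    mean = np.average(x, weights=w)
    var = np.average((x - mean) ** 2, weights=w)
    # Kish's effective sample size for weighted data (an assumption here)
    n_eff = w.sum() ** 2 / np.sum(w ** 2)
    half_width = z * np.sqrt(var / n_eff)
    return mean, half_width
```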
For more information, please see:
- My representative paper at ACM KDD: A Stochastic Shortest Path Algorithm for Optimizing Spaced Repetition Scheduling
- My fantastic research experience with spaced repetition algorithms: How did I publish a paper in ACM KDD as an undergraduate?
- The largest open-source dataset on spaced repetition with time-series features: open-spaced-repetition/FSRS-Anki-20k