Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Ideas to further improve the accuracy of the algorithm #461

Closed
Expertium opened this issue Sep 14, 2023 · 277 comments
Closed
Labels
enhancement New feature or request

Comments

@Expertium
Copy link
Collaborator

Expertium commented Sep 14, 2023

I understand that FSRS v5 won't be released for a while, but I still decided to open this issue to collect good ideas.
There is already an issue for D, so I won't be sharing ideas related to D, granted, I don't have any anyway.
Idea 1: R-matrix.
I have already talked about it here, but I want to talk more about the details. Last time Sherlock tried it, there was an important flaw with grouping, and unfortunately Sherlock still hasn't tried it with improved grouping, so here's how I think it should be done:

For t, a formula similar to this should be used: math.pow(1.4, math.floor(math.log(x, 1.4))), 2)). This formula is used in the B-W Heatmap, and it should work well here too. Although, I suggest a slight modification: 1.2*math.pow(1.4, math.floor(math.log(x, 1.4))), 2)). This makes the rounded values closer to the original values on average. The problem with the first formula is that rounded values are always less than or equal to original values, so on average they are smaller than original values.

For n, I came up with this: math.ceil(math.floor((x+0.75)**0.75)**(1/0.75)).
Figure_1
Again, I ensured that on average, rounded values are close to the original one, so there is no over- or underestimation.

For lapse, just use min(x, 8). This way all cards with >=8 lapses will be put into the same category, as they all are leeches.

Pros: adding a self-correction mechanism to FSRS will allow it to correct its own bad predictions based on real data.
Cons: hard to implement (especiall on mobile), possible non-monotonicity of the final estimate of R, which relies both on theoretical value and the R-matrix value.

Idea 2: A new power function R=f(S, t).

def power_curve(delta_t, S):
    return (1 + delta_t / (4.26316 * S)) ** -0.5
next_t = max(1, round(float(4.26316 * (states[0] - (requestRetention ** 2) * states[0]) / (requestRetention ** 2))))

Here 4.26316 is just a constant that ensures that R=90% when S=t.
I have already tested it in v4.0.0 Beta, it decreases RMSE by roughly 22% for my decks.
image

Pros: very easy to implement and benchmark.
Cons: this deviates from theory even further, as in theory forgetting should be exponential.

@Expertium Expertium added the enhancement New feature or request label Sep 14, 2023
@Expertium
Copy link
Collaborator Author

Expertium commented Sep 15, 2023

Also, regarding S and siblings.
It's quite easy to add a new parameter, such that whenever a card is reviewed, all siblings get a bonus to their S, S_new = w*S. The issue is that it would require re-writing the entire optimizer code since, after that change, in order to determine the S of a card, just the card's history won't be enough, and now the optimizer needs to go through the history of all siblings of that card as well.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 16, 2023

Idea 2: A new power function R=f(S, t).

def power_curve(delta_t, S):
    return (1 + delta_t / (4.26316 * S)) ** -0.5
next_t = max(1, round(float(4.26316 * (states[0] - (requestRetention ** 2) * states[0]) / (requestRetention ** 2))))

Here 4.26316 is just a constant that ensures that R=90% when S=t. I have already tested it in v4.0.0 Beta, it decreases RMSE by roughly 22% for my decks. image

Pros: very easy to implement and benchmark. Cons: this deviates from theory even further, as in theory forgetting should be exponential.

@L-M-Sherlock @user1823 thoughts? I would like to see Sherlock benchmark this.

@user1823
Copy link
Collaborator

Currently, I have just one suggestion: Use the accurate value of the constant.

def power_curve(delta_t, S):
    return (1 + 19 * delta_t / (81 * S)) ** -0.5

If you are wondering how to calculate this, refer to the formula I shared here: #215 (comment) (obtained using Wolfram Alpha)

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 17, 2023

Thank you, I was wondering how to derive the value of that constant. Also, this comment of mine was wrong, as the shape does change. Though we still need to figure out the inverse function (that gives t when R and S are known) in its general form. If you can do that, that would be great!

@user1823
Copy link
Collaborator

Though we still need to figure out the inverse function (that gives t when R and S are known) in its general form. If you can do that, that would be great!

Well, this can be obtained by using basic algebra on the formula I shared previously.

But, I still calculated it for you:
$$t = S \times \frac{R^{1/f} - 1}{0.9^{1/f} - 1}$$

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 17, 2023

Thank you. @L-M-Sherlock what's your opinion on the following questions (user1823, you too)

  1. Should this be implemented at all?
  2. If yes, then should this be implemented now or in some future version, months from now?
  3. Should we make it a new optimizable parameter that is different for different users, or use an average value of f, the same value for everyone?

EDIT: it has to be fixed, otherwise it's impossible to accurately estimate initial S before the optimization starts.

@user1823
Copy link
Collaborator

In my opinion, the answers would depend upon the results of benchmark testing.

Make f an optimizable parameter. Then, look for answers to:

  • How much did the RMSE decrease?
  • What is the optimized value of f for different collections? Do they have similar values of f or does it differ significantly among different collections?

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 17, 2023

If FSRS wasn't being integrated into Anki, I would suggest implementing this later, because this would introduce more issue with compatibility between different versions and all of that. But since FSRS is being integrated into Anki, we might as well do some changes. It's kinda like using a new building material - you're not going to demolish an entire building just to use a new type of concrete to rebuild it, but if you have to build a new building anyway, you might as well use the new type of concrete.

How much did the RMSE decrease?

About 20-25% for my decks (fixed f), but of course we need a proper benchmark.

What is the optimized value of f for different collections? Do they have similar values of f or does it differ significantly among different collections?

Good point.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 17, 2023

Actually, I just realized that making f optimizable rather than fixed creates a major problem - initial S is estimated before the optimization starts, so it's impossible to estimate initial accurately S if f will be changed later.
While this rules out making f an optimizable parameter, it doesn't rule out using some fixed value of f. It would take a lot of time, but Sherlock could run the benchmark with several different values of f to find which one decreases RMSE the most.

@L-M-Sherlock
Copy link
Member

def power_curve(delta_t, S):
    return (1 + 19 * delta_t / (81 * S)) ** -0.5

I will test it in the benchmark soon.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 17, 2023

Idea 3: don't remove same-day reviews. Don't change S, but only change D.
Currently, the optimizer doesn't take same-day reviews into account because it's unclear how they would affect S, we don't have a formula for short-term memory. However, what if they only affect D? This could help with differentiating cards based on difficulty. In other words, when a same-day review occurs, only the formula for D should be used to modify the value of D. The value of S should remain unchanged. This would also mean adding a few more parameters, since the optimal parameters for D when it's a same-day review could be different from the parameters for when it's a review after >1d.
Pros: could improve estimation of difficulty.
Cons: would require major changes to all modules (optimizer, scheduler, helper).

@L-M-Sherlock
Copy link
Member

I will test it in the benchmark soon.

Algorithm Log Loss RMSE RMSE(bins)
FSRS v4 (old) 0.3830 0.3311 0.0543
FSRS v4 (new) 0.3788 0.3298 0.0476

p = 0.1 (No significant difference)

@Expertium
Copy link
Collaborator Author

That's very strange, are you sure you changed everything correctly?

  1. The power function in the section where initial S is estimated

  2. The power function in the main optimization section

  3. The inverse function (next_t)

@L-M-Sherlock
Copy link
Member

There is only one function for calculating the retention in fsrs-optimizer:

def power_forgetting_curve(t, s):
    return (1 + t / (9 * s)) ** -1

https://github.com/open-spaced-repetition/fsrs-optimizer/blob/b04bdcc3cf9e0e63c5c50f873a055f2effcfcaf1/src/fsrs_optimizer/fsrs_optimizer.py#L38-L39

So I only modified that function.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 18, 2023

Ah, ok. When I was testing it in an older version, I had to modify it in two places.
Assuming you didn't forget to modify the inverse function too, I'm surprised that the difference wasn't statistically significant. I would recommend running the benchmark with a few more values of the power, see user1823's formulas. I recommend trying 0.4 and 0.3.

@L-M-Sherlock
Copy link
Member

I think the effect of idea 2 depends on the degree of heterogeneity of user's data. It really improved the performance in some users' collections, but it also made some users' performance worse. Besides, if you are learning language, the immersion study beyond Anki would also flatten the forgetting curve.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 18, 2023

If the power could become an optimizable parameter, it would be much better and would adapt to each user. The issue is that I don't know how to combine that with our method of estimating initial S.

Actually, how about this: after the 4 initial values of S are estimated, we allow the optimizer to change them? Though choosing clampings is an issue.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 18, 2023

@L-M-Sherlock how about this: the initial S is estimated using f=0.5 (or 1), then 4 values are passed to the optimizer as parameters, with clampings (0.1, 365), and then the optimizer can change them, as well as f? This way the optimizer can use those estimates as initial guesses and update them as f changes.
image

@Expertium
Copy link
Collaborator Author

Related to dispersing siblings: https://github.com/thiswillbeyourgithub/AnnA_Anki_neuronal_Appendix
I don't know how computationally expensive this is aka how much it will slow Anki down and introduce lag, and it also doesn't seem like it wil be easy to integrate it into FSRS. Still, worth trying in the future.
For now I still would like Sherlock to try what I suggested in the comment above, regarding the power function.

@Expertium
Copy link
Collaborator Author

@L-M-Sherlock, I have a question. I have 2 ideas with a lot of potential.

  1. [Feature Request] Ideas to further improve the accuracy of the algorithm #461 (comment)
  2. [Feature Request] Ideas to further improve the accuracy of the algorithm #461 (comment)
    The first one is easier to implement, and also more likely to bear fruit.
    You could start working on them now and as a result make FSRS v5 and integrate it into Anki, so that when a stable verison of Anki is released, it will have FSRS v5; or you could wait and work on these ideas later, release FSRS v5 later, and when the new stable Anki version is released, it will have FSRS v4, for some time. Do you want to start working on this sooner or later?

@L-M-Sherlock
Copy link
Member

I plan to add fsrs-rs into fsrs-benchmark. It is in high priority, because I want to know the difference between these two implementation.

Then I have another research work to develop, where I need to collaborate with a PhD from the University of Minnesota. It would also take me a lot of time. So FSRS v5 wouldn't start in weeks.

@Vilhelm-Ian
Copy link

@L-M-Sherlock do you think there is a need for V5. V4 seems to be more than good enough.

@Expertium
Copy link
Collaborator Author

I'm not Sherlock, obviously, but I believe that if there is a way to make the algorithm more accurate, then why not?

@Vilhelm-Ian
Copy link

he seems to be working really hard :(

@Expertium
Copy link
Collaborator Author

Well, it's not the end of the world if he starts working on what I suggested above a few weeks or a few months later, just as long as he doesn't completely abandon improving accuracy.
Ideally, I would love it if FSRS got so accurate that RMSE would be ~1% for all users (or at least the vast majority of users). If most users had RMSE around 1%, I would say that at that point, the pursuit of accuracy is no longer important.

@Expertium
Copy link
Collaborator Author

Also, I've mentioned this before, but the correlation coefficient between log-loss and RMSE(bins) is only around 0.4-0.5 (according to my testing), and it's not uncommon to see a huge improvement in RMSE while log-loss barely changes. So ideally, we should find a way to optimize directly for RMSE rather than for log-loss. But I have no idea how to do that while still using the mini-batch approach.

@Expertium
Copy link
Collaborator Author

Expertium commented Sep 27, 2023

@L-M-Sherlock I apologize for the inconvenience, but can you help me?
I'm trying to make a custom loss function that mimics RMSE, but for some reason both train and test losses don't change during optimization.

class CustomLoss(nn.Module):

    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, predictions, labels):
        bins = 5
        zip_ab = torch.stack((predictions, labels), dim=1)
        counts = torch.zeros(bins)
        correct = torch.zeros(bins)
        prediction = torch.zeros(bins)
        for p, r in zip_ab:
            bin = min(int(p * bins), bins - 1)
            counts[bin] += 1
            correct[bin] += r
            prediction[bin] += p
        
        prediction_means = torch.tensor(prediction) / torch.tensor(counts)
        prediction_means = torch.nan_to_num(prediction_means, 0)
        correct_means = correct / counts
        correct_means = torch.nan_to_num(correct_means, 0)
        numerator = torch.sum(counts * (correct_means - prediction_means) ** 2)
        denominator = torch.sum(counts)
        RMSE = torch.sqrt(numerator/denominator)
        return RMSE.clamp(0.00001, 100)

And then replace self.loss_fn = nn.BCELoss(reduction='none') with self.loss_fn = CustomLoss(). I'm doing this in an older version, 4.0.0 Beta.
This is what happens:
image

EDIT: maybe it has something to do with the fact that this loss function is non-differentiable, but if that's the case, I would expect pytorch to raise an error and stop optimization.
EDIT 2: I think that the for loop is the problem, so now I'm trying to figure out how to get rid of it and use something else instead. It seems that due to problems with differentiation, it's impossible to use RMSE itself, but it may be possible to use something similar.
EDIT 3: I could use a different approach, the one Woz used, where instead of bins he just takes raw 1's and 0's and smoothes them using a moving average. But for that to work, values in the tensor must be sorted by predicted R, kinda like this:

([[0.0718, 0.0000],
[0.0719, 1.0000],
[0.1261, 0.0000],
[0.6392, 1.0000],
[0.6617, 1.0000],
[0.8616, 1.0000]])

And the sort() function is obviously not differentiable. So this doesn't work either.

@L-M-Sherlock
Copy link
Member

Use the mode.

I mean the new pretraining method is radical.

@Expertium
Copy link
Collaborator Author

The new method gives larger stabilities, but it's more accurate that way, so we should use it.
At the same time, we don't want the users to complain about long first intervals. Mode solves that problem since the distributions are right-skewed, and hence mode < median.

@L-M-Sherlock
Copy link
Member

Maybe it's time to adopt the idea of flatter power forgetting curve. It doesn't introduce any extra trainable parameters. And it could cooperate with the new pretraining method.

I want to delay the idea of considering short-term schedule because it requires a major changes of the scheduler in the Anki client. And the pattern of short-term memory is still unclear.

@L-M-Sherlock
Copy link
Member

Other finding: The benchmark doesn't include outlier filter, which also makes the initial stability too long. I will fix it soon.

@user1823
Copy link
Collaborator

Other finding: The benchmark doesn't include outlier filter, which also makes the initial stability too long. I will fix it soon.

You have already filtered the outliers in the dataset. Right?

If you add the filter to the benchmark too, you will be filtering them twice.

@L-M-Sherlock
Copy link
Member

You have already filtered the outliers in the dataset. Right?

Nope. I only filtered out reviews like manual reset and short-term reviews.

@Expertium
Copy link
Collaborator Author

I got back to my idea of correcting S based on the interval and grade. The idea is that if the delta_t/S ratio is high yet the grade is “Good” or “Easy”, S was underestimated. If the delta_t/S ratio is low yet the grade is “Hard” or “Again”, S was overestimated. Then we obtain a corrected estimate of S.
I tried this formula:

    def S_i(self, prev_s: Tensor, delta_t: Tensor, rating: Tensor) -> Tensor:
        ratio = (delta_t / prev_s).clamp(1/15, 15)
        # this seems to somehow be causing problems with the gradient somehow
        factor = torch.exp(self.w[18] * (rating - self.w[17]) * (ratio + self.w[19])).clamp(0.1, 10)
        # clamping here is to ensure that S doesn't change too much
        one = torch.ones_like(prev_s)
        factor = torch.where(rating == 1, torch.minimum(factor, one), torch.where(rating == 4, torch.maximum(factor, one), factor))
        # we can't allow S to increase if the grade is "Again" or decrease if the grade is "Easy", hence min and max
        corrected_s = prev_s * factor
        # S_i = S * e ^ (w_18 * (G - w_17) * (delta_t / S + w_19))
        return corrected_s.clamp(0.1, 36500)


            s_i = self.S_i(state[:,0], X[:, 0], X[:, 1])
            s_w = self.w[20] * state[:,0] + (1 - self.w[20]) * s_i
            s_w = s_w.clamp(0.1, 36500)
            r_w = power_forgetting_curve(X[:, 0], s_w)
            condition = X[:, 1] > 1
            new_s = torch.where(
                condition,
                self.stability_after_success(state[:,1], s_w, r_w, X[:, 1]),
                self.stability_after_failure(state[:,1], s_w, r_w),
            )

I tried different clampings, different initial parameters, etc. It didn't improve RMSE.
I also tried grade-derived R using my pretrain method (except that I changed the formula to allow R(Again)>0). It would then be combined with theoretical R using a weighted average, similar to what you can see in the code above. It didn't improve RMSE.
I also tried a novel loss function: https://arxiv.org/pdf/1908.06112.pdf. It didn't improve RMSE for any values of the hyperparameters.
image

@Expertium
Copy link
Collaborator Author

Btw, Sherlock, is there a way to use match/case with tensors?
Currently, I often use torch.where() inside another torch.where() inside another torch.where(), which is inconvenient. But from my limited testing, it seems like Python's match/case doesn't work with tensors.

@Expertium
Copy link
Collaborator Author

I also tried a different approach: instead of using my pretrain method, I let the optimizer find the optimal values of R(Again), R(Hard) and R(Good). It did not improve RMSE, in fact, according to the averages it's somehow worse.
image

@L-M-Sherlock
Copy link
Member

I often use torch.where() inside another torch.where() inside another torch.where(), which is inconvenient.

Could you provide an case for that?

@Expertium
Copy link
Collaborator Author

r_g = torch.where(X[:, 1] == 1, self.w[19], torch.where(X[:, 1] == 2, self.w[20], torch.where(X[:, 1] == 3, self.w[21], 1)))

@L-M-Sherlock
Copy link
Member

L-M-Sherlock commented Dec 16, 2023

It's better to assign the value via masking:

r_g = 1
r_g[X[:, 1] == 1] = self.w[19]
r_g[X[:, 1] == 2] = self.w[20]
r_g[X[:, 1] == 3] = self.w[21]

@Expertium
Copy link
Collaborator Author

Expertium commented Dec 16, 2023

Btw @L-M-Sherlock, I believe Anki 23.12 beta uses the median parameters, right? But since we have the mode estimator now, let's use the mode parameters. I suggest giving the mode parameters to Dae, so that he can add them to the next beta.

@L-M-Sherlock
Copy link
Member

I have updated the parameters in open-spaced-repetition/fsrs-rs#135

@Expertium
Copy link
Collaborator Author

r_g = 1
r_g[X[:, 1] == 1] = self.w[19]
r_g[X[:, 1] == 2] = self.w[20]
r_g[X[:, 1] == 3] = self.w[21]

Shouldn't the first line be r_g = torch.ones_like(X[:, 1])?

@Expertium
Copy link
Collaborator Author

Expertium commented Dec 16, 2023

I have an idea how to handle the situation with users using "Hard" instead of "Again": add a toggle "I use Hard as a failing grade" or something like that to Anki. Then, if the user enables that option, Hard will be assigned a value of 0 rather than 1 internally in FSRS, and it will affect the calculation of the loss, retention, RMSE, and all that. And there will be a new formula for new S, similar to the current formula for Again, but for Hard.
There are major problems with this, though:

  1. Since right now such a toggle doesn't exist, we cannot benchmark this idea
  2. This would add yet another setting that the user has to worry about, and we already have more settings than I would like (like SM-2 retention)
  3. This would make optimization more complicated

A simpler solution would be to just allow SInc for Hard to be <1, without changing anything else, but the issue is that Hard is assigned a value of 1 for the purpose of calculating the loss. We could come up with some method of determining whether Hard should be assigned a value of 0 or 1, but I don't know how. And even if I knew how, it would likely involve selecting some arbitrary threshold.
I could benchmark allowing SInc for Hard to be less than one while still assigning Hard the value of 1 for the loss, but I don't think that makes sense.

@Expertium
Copy link
Collaborator Author

I tested another idea: different SInc for Hard, Good and Easy. This can be seen as an extension of the idea of multiplying SInc by a constant that depends on the button. Instead, the formulas now use different parameters for f(S), f(D) and f(R).

    def SInc_Good(self, state: Tensor, r: Tensor) -> Tensor:
        SInc =  (
            1
            + torch.exp(self.w[8])
            * (11 - state[:, 1])
            * torch.pow(state[:, 0], -self.w[9])
            * (torch.exp((1 - r) * self.w[10]) - 1)
        )
        return SInc.clamp(1, 10)

    def SInc_Hard(self, state: Tensor, r: Tensor) -> Tensor:
        SInc =  (
            1
            + torch.exp(self.w[15])
            * (11 - state[:, 1])
            * torch.pow(state[:, 0], -self.w[16])
            * (torch.exp((1 - r) * self.w[17]) - 1)
        )
        return SInc.clamp(1, 10)

    def SInc_Easy(self, state: Tensor, r: Tensor) -> Tensor:
        SInc =  (
            1
            + torch.exp(self.w[18])
            * (11 - state[:, 1])
            * torch.pow(state[:, 0], -self.w[19])
            * (torch.exp((1 - r) * self.w[20]) - 1)
        )
        return SInc.clamp(1, 10)

It didn't improve RMSE.
image
I'm quite surprised. I'm used to my ideas not working, but I had high hopes for this one.

@Expertium
Copy link
Collaborator Author

I tested another idea involving R_g:

            surprise = -torch.log(1 - torch.abs(r - grade_derived_r))
            surprise = surprise.clamp(0, 10)
            new_d = state[:, 1] - self.w[6] * (X[:, 1] - 2.5) * surprise

Let's consider two scenarios: the user pressed "Good" when R was 90%, and the user pressed good when R was 1%, and R_g(Good) is 90%. In the latter case, it's far more surprising. You'd expect the user to press "Again". This means that our estimate of difficulty was way off, and we need to adjust it a lot. But in the first case, grade-derived R and theoretical R are very close, meaning that what we predicted is what we got, so we don't need to change difficulty.
It technically improved RMSE by a fraction of a %, but this difference was barely statistically significant, and it's not practically significant at all whatsoever.
image

@L-M-Sherlock
Copy link
Member

This issue has included too many comments. Most comments have been auto-hidden by GitHub. It's inconvenient to search them. Maybe we need a discord channel to discuss about the development of FSRS. @Expertium, are you in the Anki's discord server?

L-M-Sherlock added a commit to open-spaced-repetition/fsrs-rs that referenced this issue Dec 18, 2023
* Feat/flat power forgetting curve

* sqrt(count) as weights for pretrain

open-spaced-repetition/fsrs4anki#461 (comment)

* float eq

* use assert_approx_eq

* channel up

* make DECAY, FACTOR const and pub

* clippy fix

* pub(crate)

* fix test

---------

Co-authored-by: AsukaMinato <[email protected]>
@Expertium
Copy link
Collaborator Author

No. Can you give me a link?

@L-M-Sherlock
Copy link
Member

@Expertium
Copy link
Collaborator Author

Expertium commented Dec 18, 2023

Which channel should I use? Honestly, I think it's better to just make a new github issue.
I mean, it's an issue about very specific and technical FSRS stuff, not a general discussion of Anki. I think a new github issue is more suitable.

@Expertium
Copy link
Collaborator Author

Also, if you plan to add the new forgetting curve and change how pretrain works, I suggest renaming the version to v4.5. Also, you'll need to run the benchmark again to calculate new mode parameters.

@L-M-Sherlock
Copy link
Member

I have added the new forgetting curve in FSRS-rs and fsrs-optimizer. I'm benchmarking them in this PR: open-spaced-repetition/srs-benchmark#22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants