
Ratings should consider komi on small boards #45

Open
dexonsmith opened this issue Dec 12, 2023 · 38 comments

Comments

@dexonsmith (Contributor)

Ratings currently don't consider komi on small boards, but should, since usually it's the komi that changes (not the number of handicap stones) as handicap increases.

dexonsmith added a commit to dexonsmith/goratings that referenced this issue Dec 12, 2023
Add komi integration to the ratings math. This path can be turned on using:

- `--compute-handicap-via-komi-small` for 9x9 and 13x13 boards
- `--compute-handicap-via-komi-19x19` for 19x19 boards

For now, none of the scripts pass the necessary extra arguments (they assert
out if you pass those arguments).

Relates to online-go#45
@dexonsmith (Contributor Author)

dexonsmith commented Dec 12, 2023

I've started adding this to RatingsMath in a1486cb, but the individual scripts still need an update to send the board size, ruleset, and komi into calculate_handicap, so I can't test it out yet.

This also relates to #7, since the 19x19 version is essentially equivalent (at least, to the intent of #7, not sure about what's currently in the repo).

@dexonsmith (Contributor Author)

(The commit there uses the math from this WIP proposal: https://github.com/dexonsmith/online-go.com/blob/27bcbdc7699cf4fedc072336a4c36ab40897c876/doc/proposal-redesign-small-board-handicap-komi.md ... this is homework for the ratings part of the proposal.)

@dexonsmith (Contributor Author)

@anoek, the "ruleset" seems to be missing from the historical ratings database.

  • Is this possible to fix / get access to?
  • Does the komi in this database incorporate the handicap komi that AGA and Chinese rules add?

(See also the PR #46.)

Also, there are some rated games in there with massive handicaps. E.g., this 8-stone 9x9 game raised an adjusted_handicap < 50 assertion:

Processing OGS data
     243,876 /   15,123,682 games processed.   92.3s remaining
             size = 9
             komi = -2.5
         handicap = 8
       komi_bonus = 0
adjusted_handicap = 52.25

Indeed, an 8-stone handicap on a 9x9 board is a big advantage. Seems unnecessary to rate this game at all...
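For reference, a minimal sketch (hypothetical helper and scaling constants, not the actual goratings math) of how komi could fold into an effective handicap on small boards:

```python
# Hypothetical sketch: fold komi into a small-board handicap to get an
# "effective rank difference". The scaling factors (ranks per stone,
# points per rank) are illustrative assumptions, loosely matching the
# 4-6x (9x9) and 2.5-3x (13x13) figures discussed later in this thread.
RANKS_PER_STONE = {9: 5.0, 13: 2.75, 19: 1.0}
POINTS_PER_RANK = {9: 2.5, 13: 5.0, 19: 12.0}

def effective_rank_difference(size, handicap, komi, fair_komi=6.5):
    """Black's effective advantage, in 19x19-equivalent ranks."""
    stone_ranks = RANKS_PER_STONE[size] * handicap
    # komi below the fair value is a further advantage for black
    komi_ranks = (fair_komi - komi) / POINTS_PER_RANK[size]
    return stone_ranks + komi_ranks

# The 8-stone, komi -2.5 9x9 game from the log is a massive advantage:
adv = effective_rank_difference(9, 8, -2.5)
```

With these (assumed) constants the game above comes out around 44 effective ranks, which illustrates why such games arguably shouldn't be rated at all.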

@BHydden

BHydden commented Dec 12, 2023

anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛

@dexonsmith (Contributor Author)

> anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛

Thanks for the heads up :). Already chatted with him about this, and we need to do something for small boards. Rating adjustments are currently treating (and perhaps always have treated?) the small-board handicaps as "1-stone == 1-rank" (as-if 19x19), which is completely haywire.

@BHydden

BHydden commented Dec 12, 2023

Cool sounds good 👍 good luck ❤️ I agree 1 stone per rank on small boards is bonkers haha

@dexonsmith (Contributor Author)

Some data from running ./analysis/analyze_glicko2_one_game_at_a_time.py (hardcoding "japanese"):
compute-handicap-via-komi-baseline.txt
compute-handicap-via-komi-small.txt
compute-handicap-via-komi-19x19.txt
compute-handicap-via-komi-small+19x19.txt

Haven't looked closely, since I'm not sure I'll know how to interpret it.

Always dies for me with this traceback:

Traceback (most recent call last):
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/analyze_glicko2_one_game_at_a_time.py", line 116, in <module>
    tally.print()
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 140, in print
    self.print_self_reported_stats()
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 285, in print_self_reported_stats
    stats = self.get_self_reported_stats()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dexonsmith/Repos/online-go/goratings/analysis/util/TallyGameAnalytics.py", line 339, in get_self_reported_stats
    raise Exception('Failed to find self_repoted_account_links json file')
Exception: Failed to find self_repoted_account_links json file

@dexonsmith (Contributor Author)

dexonsmith commented Dec 12, 2023

Here are the compact stats:

| Algorithm name | Stronger wins | h0 | h1 | h2 | options |
|:---------------|--------------:|---:|---:|---:|:--------|
| glicko2_one_ga | 68.3% | 68.5% | 67.4% | 67.0% | baseline |
| glicko2_one_ga | 68.4% | 68.6% | 67.5% | 67.4% | small |
| glicko2_one_ga | 69.0% | 68.7% | 67.9% | 71.9% | 19x19 |
| glicko2_one_ga | 69.0% | 68.7% | 67.9% | 70.9% | both |

Interesting to see a modest improvement in the compact data for 19x19 but not much for small boards... could be the new math isn't quite right, or maybe for some (or all?) of the data the "handicap" value is storing a "handicap rank difference" (not stones) after all.

@dexonsmith (Contributor Author)

Other scripts are available as of 330bba9 (I didn't test them, but I think the updates are correct).

Also thought of another two possibilities:

  • Might be very few games with small board handicaps, so the data doesn't affect the total much
  • The dataset might have a mix of games from OGS and other sources. The other sources may be storing "handicap rank difference" in the handicap field, even if OGS is storing "handicap stones".

@dexonsmith (Contributor Author)

After 080b385, "both" gets:

| Algorithm name | Stronger wins | h0 | h1 | h2 |
|:---------------|--------------:|---:|---:|--------------:|
| glicko2_one_ga |         69.0% | 68.7% | 67.8% | 70.8% |

Pretty similar.

One thing I'd like to do is "skip" some games as unrate-able (i.e., does not affect the rating), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now though.

@dexonsmith (Contributor Author)

> One thing I'd like to do is "skip" some games as unrate-able (i.e., does not affect the rating), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now though.

I guess the way to do this is to skip them in the caller.

  • For 19x19, skip if handicap > 9 (shouldn't be any in the DB...)
  • For 13x13, skip if handicap > 5 (effective rank diff of 10-15, assuming scaling factor of 2.5-3x)
  • For 9x9, skip if handicap > 3 (effective rank diff of 12-18, assuming scaling factor of 4-6x... maybe even > 2 would be better)
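The per-size rule above could be sketched as follows (thresholds taken from the bullets; the function name is hypothetical, not an existing goratings API):

```python
# Hypothetical sketch of the proposed skip rule. Thresholds come from
# the bullet list above; the function is illustrative, not goratings API.
MAX_RATEABLE_HANDICAP = {19: 9, 13: 5, 9: 3}

def should_rate_game(size: int, handicap: int) -> bool:
    """Return False for games whose handicap is too large to rate sensibly."""
    limit = MAX_RATEABLE_HANDICAP.get(size)
    if limit is None:
        return True  # unknown board sizes: keep current behavior
    return handicap <= limit
```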

I'd want to skip these games both for the purposes of:

  • Adjusting ratings
  • Computing stats on how effective the ratings are

@anoek, I haven't looked yet at how to do this (hopefully I'll pry myself away and won't get to it for a few days), but curious if you have thoughts on (a) how this should be structured in the goratings code and (b) whether the number of skipped games is worth printing stats on.

@dexonsmith (Contributor Author)

Might also be interesting to see what happens to ratings if ALL small board handicap games are skipped... not the right end result, but a useful baseline.

@anoek (Member)

anoek commented Dec 12, 2023

Yeah, given how bad the handicaps are for 9x9 and 13x13, I'm tempted to just throw those away too. It might be worth the experiment of considering them like you're proposing, to see if it can be useful, but it might just be a detriment.

@dexonsmith (Contributor Author)

Looks like EGF and AGA datasets just have 19x19 games. Here are the compact results from running on them:

| Algorithm name | Stronger wins | h0 | h1 | h2 | dataset | options |
|:---------------|--------------:|---:|---:|---:|:--------|:--------|
| glicko2_one_ga | 68.5% | 69.1% | 68.2% | 66.5% | aga | baseline |
| glicko2_one_ga | 70.0% | 69.7% | 68.7% | 71.1% | aga | this branch |
| glicko2_one_ga | 69.5% | 68.7% | 68.6% | 72.0% | egf | baseline |
| glicko2_one_ga | 67.9% | 67.8% | 68.4% | 69.6% | egf | this branch |

Again, assuming Japanese rules for all of them.

What rules does EGF use?

@anoek (Member)

anoek commented Dec 12, 2023

AGA uses AGA rules, EGF I think uses Japanese? There might be some flexibility for both organizations, I'm not entirely sure.

@dexonsmith (Contributor Author)

Do you trust the komi values in the AGA and EGF datasets?

Note that the following assertion passes for all games in both datasets:

    assert game.komi == 0

(Makes sense for handicap games, but I imagine they use komi for even games?)

@anoek (Member)

anoek commented Dec 12, 2023

Yep I would trust them

@dexonsmith (Contributor Author)

Interesting. There are a couple of reasons for games to be ignored in the tallies:

        if (result.black_deviation > PROVISIONAL_DEVIATION_CUTOFF or
            result.white_deviation > PROVISIONAL_DEVIATION_CUTOFF):
            self.games_ignored += 1
            return

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

If I comment that out, I get these results for --aga (hardcoding AGA rules and komi=7.5 for even)

| Algorithm name | Stronger wins | h0 | h1 | h2 | options |
|:---------------|--------------:|---:|---:|---:|:--------|
| glicko2_one_ga | 68.6% | 68.2% | 68.2% | 69.8% | baseline |
| glicko2_one_ga | 79.6% | 72.3% | 74.5% | 83.0% | this branch |

git-blame tells me it has been that way since the initial commit in 68d0fff. Do you remember why we're ignoring those games?

@dexonsmith (Contributor Author)

> Yep I would trust them

But don't AGA rules say they use komi for even games? Do they skip that in tournaments?

@anoek (Member)

anoek commented Dec 12, 2023

Yep, I might be confused here, are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros. If there's a value in there, I have no reason not to trust the komi values they provided; as I understand it, these are database dumps directly from each organization's databases. If there are zeros, then that's not accurate, I'd wager, but if they provide a komi I reckon it's as accurate as it can be, given that there are probably humans filling out a lot of those values based on whatever tournaments the games came from.

@dexonsmith (Contributor Author)

dexonsmith commented Dec 12, 2023

Okay, this is:

  • --ogs (hardcoding Japanese, trusting komi)
  • ignoring PROVISIONAL_DEVIATION_CUTOFF (as in HEAD)
  • NOT ignoring effective handicap bigger than 1

EDIT: actually, I lost track of which data this is. Re-running.

@dexonsmith (Contributor Author)

> Yep, I might be confused here, are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros. If there's a value in there, I have no reason not to trust the komi values they provided; as I understand it, these are database dumps directly from each organization's databases. If there are zeros, then that's not accurate, I'd wager, but if they provide a komi I reckon it's as accurate as it can be, given that there are probably humans filling out a lot of those values based on whatever tournaments the games came from.

Yeah, both --aga and --egf have all zeroes for komi.

@dexonsmith (Contributor Author)

> Okay, this is:
>
>   • --ogs (hardcoding Japanese, trusting komi)
>   • ignoring PROVISIONAL_DEVIATION_CUTOFF (as in HEAD)
>   • NOT ignoring effective handicap bigger than 1
>
> EDIT: actually, I lost track of which data this is. Re-running.

Yeah, I had the wrong code commented out when I was just doing one of them, and got it backwards.

  • Ignoring PROVISIONAL_DEVIATION_CUTOFF has very little effect.
  • Ignoring effective handicap bigger than 1 has a huge effect.

Still interested in why these are being ignored.

(For looking at improvements to the small board analysis, I definitely need to look at effective handicap bigger than 1)

@dexonsmith (Contributor Author)

Update: you can ignore all my "baseline" numbers above :/. In my first commit on the branch, I somehow (???) corrupted the get_handicap_adjustment that the baseline measurements use. Reverted that mistake in 7435f26. Haven't rerun numbers yet.

@dexonsmith (Contributor Author)

> Haven't rerun numbers yet.

As of 608e551, skipping rating games with effective handicaps bigger than 9.

New numbers:

| Algorithm name | Stronger wins | h0 | h1 | h2 | dataset | options |
|:---------------|--------------:|---:|---:|---:|:--------|:--------|
| glicko2_one_ga | 68.8% | 68.6% | 68.7% | 69.3% | ogs | baseline |
| glicko2_one_ga | 69.0% | 68.7% | 67.8% | 71.2% | ogs | this branch |

@anoek (Member)

anoek commented Dec 13, 2023

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

This code, only used for the analytics part, says

If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run) - then include it in the stats. Otherwise discard the result for the purpose of our stats.

In other words, if in our game history we had a 2 stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing the difference in win rates between our handicap games and our non-handicap games when the appropriate handicap is used in a game.
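Worked numerically, the filter behaves like this (the integer rank encoding, 30 - kyu for kyu players and 29 + dan for dan players, is an assumption about how OGS maps ranks to numbers):

```python
# Worked example of the mismatch filter quoted above. The rank encoding
# (30 - kyu for kyu players, 29 + dan for dan players) is an assumption.
def is_mismatched(black_rank: int, white_rank: int, handicap: int) -> bool:
    return abs(black_rank + handicap - white_rank) > 1

one_kyu = 30 - 1    # black: 1 kyu -> 29
five_dan = 29 + 5   # white: 5 dan -> 34
# A 2-stone game between a 1k and a 5d is mismatched: |29 + 2 - 34| = 3 > 1
mismatched = is_mismatched(one_kyu, five_dan, 2)
```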

@dexonsmith (Contributor Author)

>         if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
>             self.games_ignored += 1
>             return
>
> This code, only used for the analytics part, says
>
> If this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run) - then include it in the stats. Otherwise discard the result for the purpose of our stats.
>
> In other words, if in our game history we had a 2 stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so to do that we're optimizing on minimizing the difference in win rates between our handicap games and our non-handicap games when the appropriate handicap is used in a game.

Okay, that makes sense. I see the value in seeing the curve fit with that data excluded.

But, if that data is just ignored (and we don't look at it anywhere), I feel like it can hide problems.

E.g., when I uncomment it on --ogs, it LOWERS the win rate of the stronger player. Wouldn't we expect including that data to increase the win rate? (Or maybe I'm not understanding what the "stronger player" metric is.)

@dexonsmith (Contributor Author)

dexonsmith commented Dec 13, 2023

Here's the data (re-run, since I don't trust the stuff I printed before finding the weird bug I inserted in the baseline):

| Algorithm name | Stronger wins | h0 | h1 | h2 | options | ignore? | judgement |
|:---------------|--------------:|---:|---:|---:|:--------|:--------|:----------|
| glicko2_one_ga | 68.8% | 68.6% | 68.7% | 69.3% | baseline | ignore | stones |
| glicko2_one_ga | 68.8% | 68.6% | 68.7% | 69.3% | baseline | ignore | rank difference |
| glicko2_one_ga | 69.0% | 68.7% | 67.8% | 71.2% | this branch | ignore | stones |
| glicko2_one_ga | 68.9% | 68.7% | 68.5% | 70.4% | this branch | ignore | rank difference |
| glicko2_one_ga | 61.8% | 60.4% | 67.6% | 68.6% | baseline | include | stones |
| glicko2_one_ga | 61.8% | 60.4% | 67.6% | 68.6% | baseline | include | rank difference |
| glicko2_one_ga | 61.9% | 60.4% | 66.8% | 70.3% | this branch | include | stones |
| glicko2_one_ga | 61.9% | 60.4% | 66.8% | 70.3% | this branch | include | rank difference |

Although, come to think of it, maybe the problem is the 9x9 and 13x13 games. Let me see what happens if I ignore based on computed handicap_rank_difference and report back.

EDIT: updated with the data, where the judgement is based on handicap_rank_difference (6b5a96b). Not much difference.

@dexonsmith (Contributor Author)

Also tried adding --size=19 to the command line, just to totally exclude small boards. Still seeing LOWER rates for "stronger wins" when including badly mismatched games. This just doesn't make sense to me. When games are badly mismatched, the stronger player should almost always win.

@dexonsmith (Contributor Author)

Interestingly, also found that compact stats are completely ignoring small boards:

        prediction = (
            self.prediction_cost[19][ALL][ALL][ALL] / max(1, self.count[19][ALL][ALL][ALL])
        )
        prediction_h0 = (
            self.prediction_cost[19][ALL][ALL][0] / max(1, self.count[19][ALL][ALL][0])
        )
        prediction_h1 = (
            self.prediction_cost[19][ALL][ALL][1] / max(1, self.count[19][ALL][ALL][1])
        )
        prediction_h2 = (
            self.prediction_cost[19][ALL][ALL][2] / max(1, self.count[19][ALL][ALL][2])
        )
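A hedged sketch of a size-inclusive variant (assuming the nested `prediction_cost[size][...][...][handicap]` dict layout shown above, and that `ALL` is whatever sentinel the tally code uses):

```python
# Hypothetical fix sketch: aggregate over all board sizes rather than
# hardcoding 19, assuming the nested dict layout shown above.
def combined_prediction(prediction_cost, count, ALL, handicap):
    sizes = (9, 13, 19)
    total_cost = sum(prediction_cost[s][ALL][ALL][handicap] for s in sizes)
    total_count = sum(count[s][ALL][ALL][handicap] for s in sizes)
    return total_cost / max(1, total_count)

# Tiny smoke test with a made-up tally: (1 + 2 + 3) / (10 + 20 + 30)
ALL = -1
cost = {s: {ALL: {ALL: {ALL: c}}} for s, c in ((9, 1.0), (13, 2.0), (19, 3.0))}
games = {s: {ALL: {ALL: {ALL: n}}} for s, n in ((9, 10), (13, 20), (19, 30))}
combined = combined_prediction(cost, games, ALL, ALL)
```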

@dexonsmith (Contributor Author)

> Interestingly, also found that compact stats are completely ignoring small boards:

This goes back to the initial commit in 51c291a.

@dexonsmith (Contributor Author)

Trying to page this back in after a few weeks away.

#51 landed, which adds the basic options. It fails to run because of some games with very large effective handicaps. See the comments there on what to do about it. (Some of which were implemented in the abandoned #46 that it replaced.)

I'm still super curious about the following:

  • When collating statistics on the rating quality, we skip games that are "mismatched" — where the effective rank difference between players (after incorporating handicap) is bigger than 1.
  • I'd expect that including mismatched games in the stats would make it MORE likely that the stronger player wins.
  • This is what happens on EGF and AGA data. If you include all games in statistics, we get way better numbers for "stronger player wins" (vs. skipping mismatched games).
  • This is NOT what happens on OGS data. If you include all games in statistics, we get way worse numbers for "stronger player wins" (vs. skipping mismatched games).
  • IIRC, this broken expectation happens if you only run --size=19... but I should double-check, since it has been a few weeks.

That suggests to me some sort of fundamental problem/weirdness with the OGS data.

  • Is it dominated by sandbaggers? (unlikely)
  • Does capping the rank at 25k give us incorrect data on who is stronger?
  • Are there lots of games where the handicap is recorded incorrectly?
  • ...?

I feel like this discrepancy is important to understand...

@anoek (Member)

anoek commented Jan 10, 2024

Just to note, we're talking about the statistics we use to gauge how well the parameters we've chosen for our rating system are performing, in particular the rating-to-ranking curve, since those are the parameters we really twiddle.

Here's my logic for discarding those other values:

If you have two players that are equal strength, then depending on komi, we expect black to win somewhere in the 50-56% range.

Ideally, if you have a player that is "3 ranks higher", aka "3 stones stronger" than their opponent, and they give their opponent a 3 stone handicap, then we'd expect black to win that same 50-56% of the time.

The primary goal of this repo is to tune the parameters used to fit the rating-to-ranking curve such that we minimize the divergence of the black win rate for handicap games from that of even games. That is to say, the win rates for all of our handicap 1, 2, 3, etc. games where the ranks were 1, 2, 3 apart (with white being the stronger player) should all have black winning ~50-56% of the time. If we have a skew, say black winning 70% of the time, we know our ranks are not the right distance apart. For example, if black was winning too much, it would mean the ranks were too close together, because on average white is giving too much handicap for how strong their rank claims to be.

Now say we have a handicap 2 game but the rank difference is 5, with white being the stronger player; then we'd expect black to win with some probability way less than 50%. Hence we don't include such games in the statistics we use to measure how well our ranking system is at determining good handicap values, since they'd just skew the results.

I could see an argument that if the distribution of games with mismatched rank/handicap combinations were somewhat normal, then including those values would still improve the average. But I've been operating under the assumption that it's not normal and has a bit of structure to it: we have years of data of games played under different rating and ranking systems, so you've got bias coming from those systems, and you also have manually specified handicap games which, anecdotally, I suspect are biased towards not providing enough handicap. Hence, throwing all those values out for our purposes here.
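The tuning signal described above can be sketched as follows (the data shapes and numbers are made up for illustration, not from the actual datasets):

```python
# Sketch of the tuning signal described above: black's win rate in
# properly handicapped games should track the even-game baseline.
def black_win_rate(black_won_flags):
    return sum(black_won_flags) / len(black_won_flags)

def divergence(handicap_results, even_results):
    # the quantity the curve-fit tuning tries to minimize
    return abs(black_win_rate(handicap_results) - black_win_rate(even_results))

# Made-up results: even games ~53% black wins, handicap games ~70%,
# which would indicate the ranks are spaced too closely.
even = [True] * 53 + [False] * 47
handicap = [True] * 70 + [False] * 30
skew = divergence(handicap, even)
```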

@dexonsmith (Contributor Author)

dexonsmith commented Jan 10, 2024

> Now say we have a handicap 2 game but the rank difference is 5, with white being the stronger player; then we'd expect black to win with some probability way less than 50%.

I agree that is what we'd expect, but it's not what I'm seeing with OGS data. (With EGF and AGA data, that's indeed what we see...)

Instead, with OGS data, including those results makes the stronger player win less often than when excluding them. See the "stronger wins" column from the 8-row table above with an "ignore?" column.

  • "ignore?=ignore" rows exclude games like you mention. Stronger wins 69% of the time. (Code in-tree.)
  • "ignore?=include" rows include games like you mention. Stronger wins 62% of the time. (Commented out the "ignore" code.)

That's what I'm puzzled about. The "ignore?=include" rows should have stronger winning 80% (or something), not 62%.

@dexonsmith (Contributor Author)

When I get a chance (maybe not until the weekend?) I'll reproduce and post a patch which adds a command-line option to include those results, so you can review and reproduce yourself.

@anoek (Member)

anoek commented Jan 11, 2024

Ok, as per usual I'm a little out of sync and adding to the confusion. There are a few stats. The one that I've pretty much exclusively cared about optimizing on is the handicap performance, the first set of numbers. Excluding mismatched rank/handicap games from that is important. But the code you are talking about has nothing to do with this, so what I wrote above is moot, sorry for the confusion.

You're looking at the stronger-player-wins stat. For that stat we're not targeting ~50%; I think we'd optimally target consistency, so 62% vs. 69% isn't important in itself, what matters is whether it's 62% across the board or 69% across the board. That said, I'm pretty sure I wasn't optimizing on that; it was more of a curiosity and benchmarking thing.

But on to the actual question, what was the purpose of this block of code and should it remain there, and why when we remove it does our stronger win stat go down and not up?

        if abs(result.black_rank + result.game.handicap - result.white_rank) > 1:
            self.games_ignored += 1
            return

As for why it's there, it's probably there because I wanted those individual numbers to be comparable to one another and exclude some of the inherent bias that might exist of inadvertent or purposeful sandbagging or helium bagging.

There's also a published EGF stat https://en.wikipedia.org/wiki/Go_ranks_and_ratings that notes that in their rating system a 1k vs 2k in an even game has a 71.3% chance of winning, so while not directly comparable because we're using fractional rankings here and looking at 0 <= R < 1 as opposed to something like round(Black) + handicap - round(White) == 1, it's probably something of a sanity check I was using, noting that the value is somewhat close.

HOWEVER, the real elephant in the room, the thing that doesn't pass the smell test, the bug: why, when we remove that, does our win rate for stronger players not go up? Pretty sure it's that the value being displayed is not in fact the stronger win rate at all, like it should be, but rather some number we get out of prediction_cost, so all those values are wrong. Specifically, in https://github.com/online-go/goratings/blob/master/analysis/util/TallyGameAnalytics.py#L144-L155 we're reading from prediction_cost instead of what I think should be predicted_outcome.

Changing that produces some results that align more with our intuition, also the expected values seem better.

As for if that condition should remain or not, I think that just comes down to what we're hoping to learn from that stat in the first place. It's interesting, but unclear if it's particularly useful for tuning.
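The distinction between the two quantities can be illustrated like this (generic names, not the exact goratings fields):

```python
import math

# Illustration of the suspected mix-up: "prediction cost" is a log-loss,
# while the stronger-player-wins stat should be a plain accuracy average.
# Names and numbers here are generic, not the actual goratings fields.
def prediction_cost(p_stronger_wins: float, stronger_won: bool) -> float:
    # cross-entropy: small when a confident prediction was correct
    p = p_stronger_wins if stronger_won else 1.0 - p_stronger_wins
    return -math.log(p)

games = [(0.75, True), (0.75, True), (0.60, False)]
win_rate = sum(1.0 for _, won in games if won) / len(games)
avg_cost = sum(prediction_cost(p, won) for p, won in games) / len(games)
# Averaging the cost column and formatting it as a percentage would not
# produce the stronger-player win rate: the two numbers differ.
```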

@dexonsmith (Contributor Author)

Thanks, that's helpful!

@dexonsmith (Contributor Author)

A few updates.

2-3 more pull requests out:

I had a look at "black wins" results with those merged.

  • Stats get closer to 50% for strong players.
  • Stats drift away a little for weaker players.

I think that's good? We expect handicaps to be most accurate for strong players, who play most consistently.

here's the baseline (after applying those patches):
baseline.txt

here's the result with --handicap-rank-difference-{19x19,small} (after applying those patches):
options.txt
