Ratings should consider komi on small boards #45
Add komi integration to the ratings math. This path can be turned on using:

- `--compute-handicap-via-komi-small` for 9x9 and 13x13 boards
- `--compute-handicap-via-komi-19x19` for 19x19 boards

For now, none of the scripts pass the necessary extra arguments (they assert out if you pass those arguments).

Relates to online-go#45
I've started adding this to RatingsMath in a1486cb, but the individual scripts still need an update to send the board size, ruleset, and komi into This also relates to #7, since the 19x19 version is essentially equivalent (at least, to the intent of #7, not sure about what's currently in the repo). |
(The commit there uses the math from this WIP proposal: https://github.com/dexonsmith/online-go.com/blob/27bcbdc7699cf4fedc072336a4c36ab40897c876/doc/proposal-redesign-small-board-handicap-komi.md ... this is homework for the ratings part of the proposal.) |
@anoek, the "ruleset" seems to be missing from the historical ratings database.
(See also the PR #46.) Also, there are some rated games in there with massive handicaps. E.g., this 8-stone 9x9 game raised an
Indeed, an 8-stone handicap on a 9x9 board is a big advantage. Seems unnecessary to rate this game at all... |
anoek doesn't change the ratings calculations lightly. He also usually handles that himself and does a lot of simming before pushing anything. Just a heads up before you put too much work into this ❤️ love the work you've been doing recently, wouldn't want you to get discouraged 😛 |
Thanks for the heads up :). Already chatted with him about this, and we need to do something for small boards. Ratings adjustments currently treat (and perhaps always have treated?) small board handicaps as "1-stone == 1-rank" (as if 19x19), which is completely haywire.
Cool sounds good 👍 good luck ❤️ I agree 1 stone per rank on small boards is bonkers haha |
Some data from running Haven't looked closely, since I'm not sure I'll know how to interpret it. Always dies for me with this traceback:
Here are the compact stats:
Interesting to see a modest improvement in the compact data for 19x19 but not much for small boards... could be the new math isn't quite right, or maybe for some (or all?) of the data the "handicap" value is storing a "handicap rank difference" (not stones) after all. |
Other scripts available as of 330bba9 (I didn't test them but think the updates are correct). Also thought of another two possibilities:
After 080b385, "both" gets:
Pretty similar. One thing I'd like to do is "skip" some games as unrate-able (i.e., does not affect the rating), say if the rank adjustment is bigger than 20. Or 9. Or something. Not sure there's an easy way to do that right now though. |
I guess the way to do this is to skip them in the caller.
I'd want to skip these games both for the purposes of:
@anoek, I haven't looked at how to do this yet (hopefully I'll pry myself away and won't get to it for a few days), but I'm curious if you have thoughts on (a) how this should be structured in the goratings code and (b) whether the number of skipped games is worth printing stats on.
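As a rough illustration of "skip them in the caller", here is a minimal sketch. It assumes each game already carries a computed effective handicap; the `Game` record, the `split_rateable` helper, and the cutoff of 9 are hypothetical and are not taken from the goratings code. Returning the skipped games as well makes it easy to print a count of them.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical cutoff; the thread floats 20 or 9 as candidates.
MAX_EFFECTIVE_HANDICAP = 9.0


@dataclass
class Game:
    """Hypothetical stand-in for a game record (not the goratings type)."""
    game_id: int
    effective_handicap: float  # rank adjustment, assumed computed elsewhere


def split_rateable(
    games: Iterable[Game],
    max_handicap: float = MAX_EFFECTIVE_HANDICAP,
) -> Tuple[List[Game], List[Game]]:
    """Split games into (rated, skipped) by effective handicap size."""
    rated: List[Game] = []
    skipped: List[Game] = []
    for game in games:
        if abs(game.effective_handicap) <= max_handicap:
            rated.append(game)
        else:
            # Unrate-able: the caller never passes this game to the ratings math.
            skipped.append(game)
    return rated, skipped


if __name__ == "__main__":
    sample = [Game(1, 0.0), Game(2, 3.0), Game(3, 24.0)]
    rated, skipped = split_rateable(sample)
    print(f"rated={len(rated)} skipped={len(skipped)}")
```

Keeping the filter outside the ratings math means the same split can be reused for both the rating updates and the stats.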
Might also be interesting to see what happens to ratings if ALL small board handicap games are skipped... not the right end result, but a useful baseline. |
Yeah given how bad the handicaps are for 9x9 and 13x13, I too am tempted to just throw those away too. Might be worth the experiment to consider them like you're proposing too to see if it can be useful, but it might just be a detriment. |
Looks like EGF and AGA datasets just have 19x19 games. Here are the compact results from running on them:
Again, assuming Japanese rules for all of them. What rules does EGF use? |
AGA uses AGA rules, EGF I think uses Japanese? There might be some flexibility for both organizations, I'm not entirely sure. |
Do you trust the komi values in the AGA and EGF datasets? Note that the following assertion passes for all games in both datasets:
(Makes sense for handicap games, but I imagine they use komi for even games?) |
Yep I would trust them |
Interesting. There are a couple of reasons for games to be ignored in the tallies
If I comment that out, I get these results for
git-blame tells me it has been that way since the initial commit in 68d0fff. Do you remember why we're ignoring those games? |
But don't AGA rules say they use komi for even games? Do they skip that in tournaments? |
Yep, I might be confused here. Are you seeing zeros for komi? That'd be weird. I thought your prior assertion ensured there were no zeros. If there's a value in there, I have no reason not to trust the komi values they provided; as I understand it, these are database dumps directly from each organization's database. If there are zeros, then that's not accurate, I'd wager, but if they provide a komi, I reckon that's as accurate as it can be, given that there are probably humans filling out a lot of those values based on whatever tournament the game came from.
Okay, this is:
EDIT: actually, I lost track of which data this is. Re-running. |
Yeah, both |
Yeah, I had the wrong code commented out when I was just doing one of them, and got it backwards.
Still interested in why these are being ignored. (For looking at improvements to the small board analysis, I definitely need to look at effective handicaps bigger than 1.)
Update: you can ignore all my "baseline" numbers above :/. In my first commit on the branch, I somehow (???) corrupted the |
As of 608e551, skipping rating games with effective handicaps bigger than 9. New numbers:
This code, only used for the analytics part, says: if this is a game between two people with an appropriate handicap (as judged by the outcome of our choice of rating parameters this run), then include it in the stats. Otherwise, discard the result for the purpose of our stats. In other words, if in our game history we had a 2-stone handicap game between a 5d and a 1kyu, we don't want to tally this into our stats because it'll skew the results. The purpose of these stats is to judge how good our curve fit between rating and ranking is, so we're optimizing on minimizing the difference in win rates between our handicap games and our non-handicap games when the appropriate handicap is used in a game.
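The inclusion rule described above could be sketched roughly as follows. This is not the repository's code: the numeric rank scale (1kyu = 29, 1d = 30, 5d = 34), the function name, and the half-rank tolerance are all assumptions made for illustration.

```python
def handicap_is_appropriate(
    black_rank: float,
    white_rank: float,
    handicap: int,
    tolerance: float = 0.5,  # assumed: accept if within half a rank
) -> bool:
    """Return True if the handicap roughly matches the rank gap.

    Ranks are on a hypothetical numeric scale where a higher number is
    stronger and one unit is one rank; white is expected to be the
    stronger player in a handicap game.
    """
    rank_gap = white_rank - black_rank
    return abs(rank_gap - handicap) <= tolerance


if __name__ == "__main__":
    # The 2-stone game between a 5d (white) and a 1kyu (black) from the
    # comment above: the gap is 5 ranks, so it is left out of the stats.
    print(handicap_is_appropriate(29.0, 34.0, 2))  # False
```

Only games for which this returns `True` would be tallied; everything else is dropped from the stats but still rated.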
Okay, that makes sense. I see the value in seeing the curve fit with that data excluded. But, if that data is just ignored (and we don't look at it anywhere), I feel like it can hide problems. E.g., when I uncomment it on |
Here's the data (re-run, since I don't trust the stuff I printed before finding the weird bug I inserted in the baseline):
Although, come to think of it, maybe the problem is the 9x9 and 13x13 games. Let me see what happens if I ignore based on computed EDIT: updated with the data, where the judgement is based on |
Also tried adding |
Interestingly, also found that compact stats are completely ignoring small boards:
This goes back to the initial commit in 51c291a. |
Trying to page this back in after a few weeks away. #51 landed, which is the basic options. It fails to run because of some very large effective handicap games. See the comments there on what to do about it. (Some of which were implemented in the abandoned #46 that it replaced.) I'm still super curious about the following:
That suggests to me some sort of fundamental problem/weirdness with the OGS data.
I feel like this discrepancy is important to understand... |
Just to note, we're talking about the statistics we use to gauge how well the parameters we've chosen for our rating system are performing, in particular our rating-to-ranking curve, since those are the ones we really twiddle. Here's my logic for discarding those other values:

If you have two players of equal strength, then depending on komi, we expect black to win somewhere in the 50-56% range. Ideally, if you have a player that is "3 ranks higher", aka "3 stones stronger", than their opponent, and they give their opponent a 3-stone handicap, then we'd expect black to win that same 50-56% of the time.

The primary goal of this repo is to tune the parameters used to fit the rating-to-ranking curve such that we minimize the divergence of the black win rate for handicap games from that of even games. That is to say, for all of our handicap 1, 2, 3, etc. games where the ranks were 1, 2, 3 apart with white being the stronger player, black should be winning ~50-56% of the time. If we have a skew, say black winning 70% of the time or something, we know our ranks are not the right distance apart. For example, if black was winning too much, it would mean the ranks were too close, because on average white is giving too much handicap for how strong their rank claims they are.

Now say we have a handicap 2 game but the rank difference is 5, with white being the stronger player. Then we'd expect black to win with some small chance, way less than 50%, which is why we don't include such games in the statistics we use to measure how good our ranking system is at determining good handicap values - it'd just skew the results.
I could see an argument made that if the distribution of games where you had mismatched rank/handicap combinations was somewhat normal, then including those values would still improve the average. But I've been operating under the assumption that it's not normal and has a bit of structure to it: we have years of data of games played with different rating and ranking systems, so you've got bias coming from those systems, and then you also have manually specified handicap games which, anecdotally, I suspect are biased towards not providing enough handicap. Hence, throwing all those values out for our purposes here.
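To make the target concrete, here is a small sketch of the statistic being described: the black win rate per handicap level, and how far each level diverges from the even-game rate. The input format (a list of `(handicap, black_won)` pairs, already filtered to games with an appropriate handicap) and both function names are assumptions for illustration, not the repo's actual data structures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def black_win_rates(games: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Black win rate for each handicap level (0 means an even game)."""
    wins: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for handicap, black_won in games:
        totals[handicap] += 1
        wins[handicap] += int(black_won)
    return {h: wins[h] / totals[h] for h in totals}


def divergence_from_even(rates: Dict[int, float]) -> Dict[int, float]:
    """Difference between each handicap level's rate and the even-game rate.

    A well-tuned rating-to-ranking curve should keep these close to zero.
    """
    even_rate = rates[0]
    return {h: rate - even_rate for h, rate in rates.items() if h != 0}


if __name__ == "__main__":
    sample = [(0, True), (0, False), (2, True), (2, True), (2, True), (2, False)]
    rates = black_win_rates(sample)
    print(rates, divergence_from_even(rates))
```

A positive divergence at some handicap level means black is winning more often than in even games there, which by the reasoning above suggests the ranks are too close together.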
I agree that is what we'd expect, but it's not what I'm seeing with OGS data. (With EGF and AGA data, that's indeed what we see...) Instead, with OGS data, including those results makes the stronger player win less often than when excluding them. See the "stronger wins" column from the 8-row table above with an "ignore?" column.
That's what I'm puzzled about. The "ignore?=include" rows should have stronger winning 80% (or something), not 62%. |
When I get a chance (maybe not until the weekend?) I'll reproduce and post a patch which adds a command-line option to include those results, so you can review and reproduce yourself. |
Ok, as per usual I'm a little out of sync and adding to the confusion. There's a few stats. The one that I've pretty much exclusively cared about for optimizing on is the handicap performance, the first set of numbers. Excluding mismatched rank/handicap games from that is important. But the code you are talking about has nothing to do with this, so what I wrote above is moot, sorry for the confusion.

You're looking at the stronger player wins stat. For that stat we're not targeting ~50%; I think we'd optimally target consistency, so 62% vs 69% isn't important, it's that it's 62% across the board or 69% across the board. That said, pretty sure I wasn't optimizing on that; it was more just a curiosity and benchmarking thing, I think.

But on to the actual question: what was the purpose of this block of code, should it remain there, and why, when we remove it, does our stronger win stat go down and not up?
As for why it's there, it's probably there because I wanted those individual numbers to be comparable to one another and exclude some of the inherent bias that might exist of inadvertent or purposeful sandbagging or helium bagging. There's also a published EGF stat https://en.wikipedia.org/wiki/Go_ranks_and_ratings that notes that in their rating system a 1k vs 2k in an even game has a 71.3% chance of winning, so while not directly comparable because we're using fractional rankings here and looking at

HOWEVER

The real elephant in the room, the thing that doesn't pass the smell test, the bug: why, when you remove that, does our win rate for stronger players not go up? Pretty sure it's that the value being displayed is not in fact the stronger win rate at all like it should be but rather some number we get out of Changing that produces some results that align more with our intuition, also the expected values seem better.

As for if that condition should remain or not, I think that just comes down to what we're hoping to learn from that stat in the first place. It's interesting, but unclear if it's particularly useful for tuning.
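For reference, a direct "stronger player wins" rate can be computed straight from game results, which gives a sanity check against whatever value the stats display. The sketch below is an assumption-laden illustration, not the repo's implementation: it compares raw numeric ranks (higher is stronger), ignores handicap entirely, and skips games between equally ranked players.

```python
from typing import Iterable, Optional, Tuple


def stronger_win_rate(
    games: Iterable[Tuple[float, float, bool]],
) -> Optional[float]:
    """Fraction of games won by the higher-ranked player.

    Each game is a hypothetical (black_rank, white_rank, black_won) tuple.
    Returns None if no game has a strictly stronger player.
    """
    stronger_wins = 0
    counted = 0
    for black_rank, white_rank, black_won in games:
        if black_rank == white_rank:
            continue  # no stronger player, so the game says nothing here
        counted += 1
        black_is_stronger = black_rank > white_rank
        if black_is_stronger == black_won:
            stronger_wins += 1
    return stronger_wins / counted if counted else None


if __name__ == "__main__":
    sample = [(30.0, 28.0, True), (25.0, 27.0, True), (25.0, 27.0, False)]
    print(stronger_win_rate(sample))
```

Running this once over all games and once over only the games that pass the mismatch filter would show directly whether including mismatched games raises the rate, as intuition suggests it should.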
Thanks, that's helpful! |
A few updates. 2-3 more pull requests out:
I had a look at "black wins" results with those merged.
I think that's good? We expect handicaps to be most accurate for strong players, who play most consistently.

Here's the baseline (after applying those patches):

Here's the result with
Ratings currently don't consider komi on small boards, but should, since usually it's the komi that changes (not the number of handicap stones) as handicap increases.