Popularity is broken when the wikidata entry points to a wikipedia redirect #59
Comments
[Copied from @hyanwong's comment in #49]
So we can either change the link in Wikidata, or somehow keep redirects in the Wikipedia dump and follow those (or both)? I seem to remember that I did a bit of redirect-following in my original wikimedia parser, but I don't think I considered this case on Wikipedia. Yan
Interestingly, if you look at the links on https://www.wikidata.org/wiki/Q311417, next to 'en', there is a little arrow symbol. If you hover over it, it says "intentional sitelink to redirect". I'm not sure what the reasoning is, as in most cases, the link is to the main page with no redirects.
The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure we can get the redirect information other than by actually making HTTP requests for those pages, which is too painful/slow.
[Copied from @hyanwong's comment in #49]
Indeed, but perhaps the |
Yes, it has a |
Good point. This makes the logic a bit more convoluted, but I think it is probably worth doing. As a half-way house, we could check in the SQL dump but, rather than locating the proper page name via the Wikipedia API, simply emit a warning that the name is a redirect and won't be used for popularity.
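A minimal sketch of that half-way house, in case it helps the discussion. It assumes the redirect titles have already been pulled out of the SQL dump into a set; `title_for_popularity`, `redirect_titles` and the `Q_EXAMPLE` id are hypothetical names for illustration, not anything in the current code.

```python
import logging

logger = logging.getLogger("popularity")


def title_for_popularity(qid, enwiki_title, redirect_titles):
    """Return the title to look up in the page counts, or None for a redirect.

    redirect_titles is assumed to be a set of enwiki titles that the SQL
    dump flags as redirects (hypothetical input, built elsewhere).
    """
    if enwiki_title in redirect_titles:
        # No attempt to resolve the target here: just warn and skip.
        logger.warning(
            "%s: enwiki sitelink %r is a redirect; not used for popularity",
            qid,
            enwiki_title,
        )
        return None
    return enwiki_title


# Illustrative data only.
redirect_titles = {"Leopardus pajeros"}
skipped = title_for_popularity("Q_EXAMPLE", "Leopardus pajeros", redirect_titles)
kept = title_for_popularity("Q_EXAMPLE", "Pampas cat", redirect_titles)
```

Everything stays offline with this approach; the cost is that affected taxa simply get no popularity until the redirect is resolved some other way.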
Yes, TBH, I don't like the idea of making http requests during processing. Today, everything happens offline, which is nice. What we could do is have a separate process which looks for all such entries and figures out the redirects, then saves the results in a mapping file that we commit. Presumably, it wouldn't change that often. Then
Yes, we could start there. Would be interesting to see how common of an issue it is today.
Yes, I agree this is much nicer.
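For the committed mapping file idea, something like the following could work. This is only a sketch: the JSON format, the file name `enwiki_redirects.json` and the helper names are all assumptions, and the separate process that actually figures out the redirect targets is not shown.

```python
import json
import tempfile
from pathlib import Path


def save_redirect_map(mapping, path):
    """Write {redirect title: canonical title} as sorted JSON for stable diffs."""
    text = json.dumps(mapping, indent=2, sort_keys=True, ensure_ascii=False)
    Path(path).write_text(text + "\n", encoding="utf-8")


def load_redirect_map(path):
    """Load the mapping; a missing file just means no known redirects."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def resolve_title(title, redirect_map):
    """Return the canonical title if the sitelink is a known redirect."""
    return redirect_map.get(title, title)


# Round trip through a temporary directory, with illustrative data only.
with tempfile.TemporaryDirectory() as tmp:
    map_path = Path(tmp) / "enwiki_redirects.json"  # hypothetical file name
    save_redirect_map({"Leopardus pajeros": "Pampas cat"}, map_path)
    redirect_map = load_redirect_map(map_path)

resolved = resolve_title("Leopardus pajeros", redirect_map)
unchanged = resolve_title("Pampas cat", redirect_map)
```

Sorting the keys keeps the committed file diff-friendly when the mapping is regenerated, and the main processing only ever reads the file, so it remains fully offline.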
I made a change to include the There are 10334 entries in our filtered dump that have this set to 1. But note that this includes all entries in our filtered wikidata dump, which has all taxons & vernaculars. So in practice, it's likely a much smaller set of redirects that actually affect us. We can get better stats once we work on this in the popularity logic in |
Great, thanks @davidebbo. 10334 sounds quite a lot, but as you say, only a subset will be relevant to us, fortunately.
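To get a rough idea of how many of those flagged entries are relevant, one option is to intersect them with the titles the popularity step actually looks up. A small sketch, assuming both sets are already available in memory; `redirect_stats` and its inputs are hypothetical names.

```python
def redirect_stats(used_titles, redirect_titles):
    """Return (sorted affected titles, share of used titles that are redirects).

    used_titles: enwiki titles the popularity logic looks up (assumed input).
    redirect_titles: titles flagged as redirects in the dump (assumed input).
    """
    used = set(used_titles)
    affected = used & set(redirect_titles)
    share = len(affected) / len(used) if used else 0.0
    return sorted(affected), share


# Illustrative data only, not real counts.
affected, share = redirect_stats(
    ["Pampas cat", "Leopardus pajeros", "Lion", "Tiger"],
    {"Leopardus pajeros", "Some unrelated redirect"},
)
```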
[This is a spin-off issue from #49]
Another weird case: Leopardus pajeros is the Pampas cat.
The problem is that the wikidata entry has the English link pointing to Leopardus_pajeros (which is a redirect), instead of to the main Pampas cat page:
"enwiki": { "title": "Leopardus pajeros" }
So we end up looking up Leopardus pajeros in the Page Count file and not finding anything, because all the hits are recorded against the Pampas cat entry.