Popularity is broken when the wikidata entry points to a wikipedia redirect #59

Open
davidebbo opened this issue May 12, 2024 · 9 comments

Comments

@davidebbo
Collaborator

davidebbo commented May 12, 2024

[This is a spin-off issue from #49]

Another weird case: Leopardus pajeros is the Pampas cat.

The problem is that the wikidata entry has the english link linking to Leopardus_pajeros (which is a redirect), instead of to the main Pampas cat page: "enwiki": { "title": "Leopardus pajeros" }. So we end up looking up Leopardus pajeros in the Page Count file and find nothing, because all the hits are recorded under the Pampas cat entry.
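To make the failure concrete, here is a minimal sketch of the lookup miss. The `page_views` dict, its count, and the `lookup_views` helper are hypothetical stand-ins for illustration, not the project's actual data or code; only the sitelink title comes from the Wikidata entry quoted above.

```python
# Hypothetical page view data, keyed by the canonical Wikipedia title.
# The count is a made-up placeholder, not a real statistic.
page_views = {"Pampas cat": 12345}

# Sitelink as quoted from the Wikidata entry.
sitelinks = {"enwiki": {"title": "Leopardus pajeros"}}


def lookup_views(title, views):
    """Return the view count for a title, or None when the title is absent."""
    return views.get(title)


# The sitelink title is the redirect, so the lookup finds nothing.
title = sitelinks["enwiki"]["title"]
views_for_taxon = lookup_views(title, page_views)
```

The views exist, but only under the redirect target, so the taxon ends up with no popularity signal.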

@davidebbo
Collaborator Author

davidebbo commented May 12, 2024

[Copied from @hyanwong's comment in #49]

has the english link linking to Leopardus_pajeros (which is a redirect)

So we can either change the link in Wikidata and/or somehow keep redirects in the wikipedia dump, and follow those?

I seem to remember doing a bit of redirect-following in my original wikimedia parser, but I don't think I considered this case on wikipedia.

Yan

@davidebbo
Collaborator Author

change the link in Wikidata

Interestingly, if you look at the links on https://www.wikidata.org/wiki/Q311417, next to 'en', there is a little arrow symbol. If you hover over it, it says "intentional sitelink to redirect". I'm not sure what the reasoning is, as in most cases, the link is to the main page with no redirects.

somehow keep redirects in the wikipedia dump, and follow those

The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure if we can get the redirect information, other than by actually making http requests to it, which is too painful/slow.

@davidebbo
Collaborator Author

davidebbo commented May 12, 2024

[Copied from @hyanwong's comment in #49]

The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure if we can get the redirect information, other than by actually making http requests to it, which is too painful/slow.

Indeed, but perhaps the enwiki-latest-page.sql.gz file contains information on redirects?

@davidebbo
Collaborator Author

Indeed, but perhaps the enwiki-latest-page.sql.gz file contains information on redirects?

Yes, it has a page_is_redirect Boolean field (https://www.mediawiki.org/wiki/Manual:Page_table#page_is_redirect). I don't think it gives the redirection target, but at least if we had to make a request, that would greatly reduce the number of cases where it's needed.
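As a rough sketch of how that flag could be used, assuming the page rows have already been parsed into dicts keyed by the MediaWiki column names (`page_title`, `page_namespace`, `page_is_redirect`). The `redirect_titles` function and the sample rows are hypothetical; how the filtered dump is actually read is not shown here.

```python
from typing import Iterable, Set


def redirect_titles(rows: Iterable[dict]) -> Set[str]:
    """Collect main-namespace titles whose page_is_redirect flag is set."""
    return {
        row["page_title"]
        for row in rows
        if int(row["page_namespace"]) == 0 and int(row["page_is_redirect"]) == 1
    }


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"page_title": "Leopardus_pajeros", "page_namespace": 0, "page_is_redirect": 1},
    {"page_title": "Pampas_cat", "page_namespace": 0, "page_is_redirect": 0},
]

redirects = redirect_titles(sample_rows)
```

Only titles in this set would ever need a follow-up lookup to find the redirect target.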

@hyanwong
Member

hyanwong commented May 13, 2024

that would greatly reduce the number of cases where it's needed.

Good point. This makes the logic a bit more convoluted, doesn't it, but I think it is probably worth doing. As a half-way house we could check in the SQL dump but rather than locate the proper pagename via the wikipedia API, we could simply emit a warning that the name is a redirect, and won't be used for popularity.
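A minimal sketch of that half-way house, assuming we already have a set of titles flagged as redirects. The `usable_for_popularity` helper and the logger name are hypothetical, not part of the existing code.

```python
import logging

logger = logging.getLogger("popularity")


def usable_for_popularity(title, redirects):
    """Return False and warn when the title is a known redirect."""
    if title in redirects:
        logger.warning("'%s' is a redirect and won't be used for popularity", title)
        return False
    return True


# Hypothetical set of redirect titles taken from the SQL dump.
known_redirects = {"Leopardus pajeros"}

keep_redirect = usable_for_popularity("Leopardus pajeros", known_redirects)
keep_canonical = usable_for_popularity("Pampas cat", known_redirects)
```

This keeps everything offline and only reports the affected names, without trying to resolve them.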

@davidebbo
Collaborator Author

This makes the logic a bit more convoluted, doesn't it, but I think it is probably worth doing.

Yes, TBH, I don't like the idea of making http requests during processing. Today, everything happens offline, which is nice.

What we could do is have a separate process which looks for all such entries and figures out the redirects, then saves the results in a mapping file that we commit. Presumably, it wouldn't change that often. Then CSV_base_table_creator can just rely on this file to go to the correct entry when processing both page views and page sizes.
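Roughly, the consuming side could look like the sketch below. It assumes the committed mapping file is a two-column CSV; the column names `redirect_title` and `target_title`, and both helper functions, are made up for illustration. An in-memory string stands in for the file.

```python
import csv
import io


def load_redirect_map(fileobj):
    """Read a CSV of redirect_title,target_title pairs into a dict."""
    reader = csv.DictReader(fileobj)
    return {row["redirect_title"]: row["target_title"] for row in reader}


def resolve_title(title, redirect_map):
    """Return the redirect target if one is recorded, else the title unchanged."""
    return redirect_map.get(title, title)


# Stand-in for the committed mapping file (hypothetical format).
mapping_csv = io.StringIO(
    "redirect_title,target_title\n"
    "Leopardus pajeros,Pampas cat\n"
)
redirect_map = load_redirect_map(mapping_csv)

resolved = resolve_title("Leopardus pajeros", redirect_map)
```

Titles without an entry pass through unchanged, so the same call could be used for both the page view and page size lookups.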

As a half-way house we could check in the SQL dump but rather than locate the proper pagename via the wikipedia API, we could simply emit a warning that the name is a redirect, and won't be used for popularity.

Yes, we could start there. Would be interesting to see how common of an issue it is today.

@hyanwong
Member

Today, everything happens offline, which is nice.

Yes, I agree this is much nicer.

@davidebbo
Collaborator Author

I made a change to include the page_is_redirect column in the filtered SQL dump.

There are 10334 entries in our filtered dump that have this set to 1. But note that this includes all entries in our filtered wikidata dump, which has all taxa & vernaculars. So in practice, it's likely a much smaller set of redirects that actually affect us. We can get better stats once we work on this in the popularity logic in CSV_base_table_creator.
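For a rough idea of how the smaller number could be obtained, assuming we have the set of redirect titles and the set of titles actually looked up for popularity. Both sets and the `count_relevant_redirects` helper below are hypothetical examples.

```python
def count_relevant_redirects(redirects, titles_in_use):
    """Count redirect titles that are actually looked up for popularity."""
    return len(set(redirects) & set(titles_in_use))


# Hypothetical inputs, for illustration only.
all_redirects = {"Leopardus_pajeros", "Some_other_redirect"}
titles_used_for_popularity = {"Leopardus_pajeros", "Panthera_leo"}

relevant = count_relevant_redirects(all_redirects, titles_used_for_popularity)
```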

@hyanwong
Member

Great, thanks @davidebbo. 10334 sounds quite a lot, but as you say, only a subset will be relevant to us, fortunately.


2 participants