Somehow mark / remove categories with associated page of same name #2

Open
rlaemmel opened this issue Jun 9, 2013 · 1 comment
rlaemmel commented Jun 9, 2013

During extraction I notice that many categories pop up that are sort of proxies for actual members. For instance, there is a category XML and the actual page XML. Same for many other pages.

Clearly, Wikipedia is using categories here in a special way to collect pages related to a language. I would think that such categories should be removed during extraction.

As a first step, such categories should be marked. If our experience suggests that such categories should typically be removed, we could even add an "option" to the extraction dialog so that such categories are auto-excluded.

Beware: the category name and the page name sometimes differ in upper and lower case, as in Category:Troff vs. just "troff" (although Wikipedia's API seems to understand "Troff" and maps it to "troff").
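
To make the proposal concrete, here is a minimal sketch of how the marking and the optional auto-exclusion could work. It is not taken from the extractor; the function names (`normalize_title`, `mark_proxy_categories`, `filter_categories`) and the `exclude_proxies` flag are hypothetical. It assumes category and page titles are already available as plain strings, and that comparing titles with the first letter capitalized and underscores treated as spaces is enough to cover cases like Category:Troff vs. "troff".

```python
def normalize_title(title: str) -> str:
    """Normalize a title for comparison (assumed rules, not the extractor's)."""
    # Drop the "Category:" namespace prefix if present.
    if title.lower().startswith("category:"):
        title = title[len("category:"):]
    # Treat underscores and spaces as equivalent and trim whitespace.
    title = title.replace("_", " ").strip()
    # Ignore the case of the first letter only ("troff" vs. "Troff").
    return title[:1].upper() + title[1:]


def mark_proxy_categories(categories, pages):
    """Map each category to True if a page of the same name exists."""
    page_keys = {normalize_title(p) for p in pages}
    return {c: normalize_title(c) in page_keys for c in categories}


def filter_categories(categories, pages, exclude_proxies=False):
    """Return the categories, dropping marked ones only if requested."""
    marks = mark_proxy_categories(categories, pages)
    return [c for c in categories if not (exclude_proxies and marks[c])]


if __name__ == "__main__":
    cats = ["Category:XML", "Category:Troff", "Category:Markup languages"]
    pages = ["XML", "troff", "HTML"]
    print(mark_proxy_categories(cats, pages))
    print(filter_categories(cats, pages, exclude_proxies=True))
```

Keeping the marking separate from the filtering matches the two steps above: categories are only marked at first, and the exclusion stays an opt-in switch until we know it is usually wanted.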

@ghost ghost assigned dmosen Jun 9, 2013
rlaemmel commented Jun 9, 2013

Just a related observation.
There is this sort of category:
http://en.wikipedia.org/wiki/Category:Wikipedia_categories_named_after_programming_languages
That's just for programming languages, indeed.
In fact, I doubt that this category is complete.
