Somehow mark / remove categories with associated page of same name #2

rlaemmel · 2013-06-09T13:17:26Z

During extraction I notice that many categories pop up that are sort of proxies for actual members. For instance, there is a category XML and the actual page XML. Same for many other pages.

Clearly, Wikipedia is using categories here in a special way to collect pages related to a language. I would think that such categories should be removed during extraction.

As a first step, such categories should be marked. If our experience suggests that such categories should typically be removed, then we could even add an "option" to the extraction dialog so that such categories are auto-excluded.

Beware: sometimes the category name and the page name differ in upper and lower case, as in Category:Troff vs. just "troff" (though Wikipedia's API seems to understand "Troff" and maps it to "troff").
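
A minimal sketch of the marking step, in Python, just to make the idea concrete. The function names and the normalization rule (drop the "Category:" prefix, treat underscores as spaces, ignore the case of the first letter) are assumptions for illustration, not the extractor's actual code. Only the first letter is normalized, so names that differ in case elsewhere would still count as different.

```python
def normalize_title(title):
    """Normalize a category or page title for comparison (assumed rule)."""
    name = title.strip()
    # Drop the namespace prefix if present, e.g. "Category:Troff" -> "Troff".
    if name.lower().startswith("category:"):
        name = name[len("category:"):]
    # Treat underscores as spaces, as in Wikipedia URLs.
    name = name.replace("_", " ").strip()
    # Ignore the case of the first letter only ("troff" vs. "Troff").
    return name[:1].upper() + name[1:]


def mark_eponymous_categories(categories, page_titles):
    """Map each category to True if a page of the same name exists."""
    pages = {normalize_title(t) for t in page_titles}
    return {c: normalize_title(c) in pages for c in categories}


def exclude_marked(categories, marks):
    """Hypothetical auto-exclude option: keep only unmarked categories."""
    return [c for c in categories if not marks[c]]
```

For example, calling `mark_eponymous_categories` with the categories `Category:XML` and `Category:Troff` and the pages `XML` and `troff` would mark both categories, and `exclude_marked` would then drop them.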

rlaemmel · 2013-06-09T13:28:43Z

Just a related observation.
There is this sort of category:
http://en.wikipedia.org/wiki/Category:Wikipedia_categories_named_after_programming_languages
That's just for programming languages, indeed.
In fact, I doubt that this category is complete.

ghost assigned dmosen Jun 9, 2013
