During extraction I notice that many categories pop up that are proxies of a sort for actual member pages. For instance, there is a category XML and also the actual page XML. The same holds for many other pages.
Clearly, Wikipedia is using categories here in a special way to collect pages related to a language. I would think that such categories should be removed during extraction.
As a first step, such categories should be marked. If our experience suggests that such categories should typically be removed, then we could even add an "option" to the extraction dialog so that such categories are auto-excluded.
Beware: sometimes the category name and the page name differ in upper and lower case, as in Category:Troff vs. just "troff" (Wikipedia's API seems to understand "Troff", though, and maps it to "troff").
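To make the proposal concrete, here is a minimal sketch of how the marking step could look. It assumes the extractor already has the category names and page titles as plain strings; the function names (`normalize_title`, `mark_proxy_categories`, `exclude_proxies`) are hypothetical and not part of the existing code. The normalization only ignores the case of the first letter, which is an assumption based on the Troff/troff observation above.

```python
def normalize_title(title: str) -> str:
    """Normalize a title for comparison (hypothetical helper).

    Strips a leading 'Category:' prefix, treats underscores as spaces,
    and upper-cases the first letter so that 'troff' and 'Troff' compare
    equal. The rest of the title stays case-sensitive.
    """
    prefix = "Category:"
    if title.startswith(prefix):
        title = title[len(prefix):]
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def mark_proxy_categories(categories, pages):
    """Map each category name to True if a page with the same
    normalized title exists, i.e. the category looks like a proxy."""
    page_keys = {normalize_title(p) for p in pages}
    return {c: normalize_title(c) in page_keys for c in categories}


def exclude_proxies(categories, pages):
    """Return the categories that are not marked as proxies; this is
    what an auto-exclude option in the dialog could apply."""
    marks = mark_proxy_categories(categories, pages)
    return [c for c in categories if not marks[c]]


if __name__ == "__main__":
    cats = ["Category:XML", "Category:Troff", "Category:Markup languages"]
    pages = ["XML", "troff"]
    for name, is_proxy in mark_proxy_categories(cats, pages).items():
        print(name, "proxy" if is_proxy else "keep")
```

Keeping marking and exclusion as separate steps matches the suggestion above: first only mark the categories, and wire up the exclusion later if experience shows that it is the usual choice.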