GSOC2013_Progress_Kasun
Type inference to extend coverage Project Proposal
Sources for type inference. The list is based on the comments by Aleksander Pohl on the project proposal
- Setup the public clone of Extraction Framework
- Setting up the extraction-framework code in the IDEA IDE, building the code, etc.
- Working on the Issue#33
- Familiarize with Scala, git, and IDEA
- Integrate the Airpedia triples classes in 31 languages into the Dbpedia-links.
- Integrate triples obtained from the Wikipedia infoboxes, introductory sentences, categories and direct mapping between Wikipedia and Cyc to the Dbpedia-links. Aleksander's classification outputs (not completed)
- Identify Wikipedia leaf categories #Issue16. Investigate the YAGO approach; read the YAGO paper again
- Mail discussion thread on choosing the source data for leaf category identification (link to mail thread)
- Method of leaf category identification
- get all parent categories
- get all child categories
- subtract the parent categories ("1") from the child categories ("2"); the result is all the leaf categories.
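The steps above amount to a set difference. A minimal sketch in Python, assuming the parent-child relationships are already available as (parent, child) name pairs; the function name and the sample category names are hypothetical and not taken from the project code:

```python
def find_leaf_categories(edges):
    """Return categories that appear as a child but never as a parent."""
    parents = {parent for parent, _ in edges}   # step 1: all parent categories
    children = {child for _, child in edges}    # step 2: all child categories
    return children - parents                   # step 3: remove set 1 from set 2


# Hypothetical sample data, for illustration only.
sample_edges = [
    ("Science", "Physics"),
    ("Science", "Chemistry"),
    ("Physics", "Optics"),
]
leaves = find_leaf_categories(sample_edges)
print(sorted(leaves))  # ['Chemistry', 'Optics']
```

Note that a category with neither a parent nor a child never appears in the pair list, so this sketch cannot report it.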
- Processing Wikipedia categories #issue17: save the parent-child relationships of the categories to a MySQL database in order to address the requirement of #issue17
- Created tables
- Node Table
CREATE TABLE IF NOT EXISTS node (
    node_id int(10) NOT NULL AUTO_INCREMENT,
    category_name varchar(40) NOT NULL,
    is_leaf tinyint(1) NOT NULL,
    is_prominent tinyint(1) NOT NULL,
    score_interlang double DEFAULT NULL,
    score_edit_histo double NOT NULL,
    PRIMARY KEY (node_id),
    UNIQUE KEY category_name (category_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
- Edge Table
CREATE TABLE IF NOT EXISTS edges (
    parent_id int(10) NOT NULL,
    child_id int(10) NOT NULL,
    PRIMARY KEY (parent_id, child_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
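As an illustration of how these two tables could be filled and used, here is a rough sketch with Python's built-in `sqlite3` module. It is not the project's loader: SQLite stands in for MySQL so the example runs without a server, the score and `is_prominent` columns are left out for brevity, and the helper names (`node_id`, `store_edges`, `mark_leaves`) and sample categories are hypothetical.

```python
import sqlite3

# SQLite stand-in for the MySQL tables above (an assumption made so the
# sketch runs without a database server); column names follow the schema.
SCHEMA = """
CREATE TABLE node (
    node_id INTEGER PRIMARY KEY AUTOINCREMENT,
    category_name TEXT NOT NULL UNIQUE,
    is_leaf INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE edges (
    parent_id INTEGER NOT NULL,
    child_id INTEGER NOT NULL,
    PRIMARY KEY (parent_id, child_id)
);
"""


def node_id(conn, name):
    """Insert the category if it is new and return its node_id."""
    conn.execute("INSERT OR IGNORE INTO node (category_name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT node_id FROM node WHERE category_name = ?", (name,)
    ).fetchone()
    return row[0]


def store_edges(conn, pairs):
    """Store (parent, child) category name pairs in the edges table."""
    for parent, child in pairs:
        conn.execute(
            "INSERT OR IGNORE INTO edges (parent_id, child_id) VALUES (?, ?)",
            (node_id(conn, parent), node_id(conn, child)),
        )


def mark_leaves(conn):
    """Set is_leaf = 1 for every node that never occurs as a parent."""
    conn.execute(
        "UPDATE node SET is_leaf = CASE WHEN node_id IN "
        "(SELECT parent_id FROM edges) THEN 0 ELSE 1 END"
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical sample pairs, for illustration only.
store_edges(
    conn,
    [("Science", "Physics"), ("Science", "Chemistry"), ("Physics", "Optics")],
)
mark_leaves(conn)
leaf_names = [
    row[0]
    for row in conn.execute(
        "SELECT category_name FROM node WHERE is_leaf = 1 ORDER BY category_name"
    )
]
print(leaf_names)  # ['Chemistry', 'Optics']
```

The composite primary key on `edges` is what makes `INSERT OR IGNORE` skip duplicate parent-child pairs here; in MySQL the comparable statement would be `INSERT IGNORE`.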
The approach to leaf node detection and finding parent-child relationships mentioned in the 2nd week was abandoned for the following reasons.
- "categories that don't have a broader category are not included in skos_categories dump". Evidence for this claim is discussed here: 1- issue#16, 2- Mail Archive
- Data freshness issues, since the Dbpedia dumps are nearly 1 year old, and the unavailability of synchronized sub-dumps for data analysis
New Approach: Wikipedia Category table
Category (cat_id, cat_title, cat_pages, cat_subcats, cat_files, cat_hidden)
cat_pages excludes pages in subcategories, but besides the actual article pages it also counts other pages such as talk: pages, template: pages, etc. A way to filter these unnecessary pages out of the statistics still needs to be found.
Some hints about category usage
- Some of the selected categories have cat_pages=0, i.e. these categories are not used
- Some of the selected categories have cat_pages>10000; these are possibly administrative categories or higher nodes of the category graph.
- Selecting on cat_subcats=0 returns all categories that don’t have subcategories.
Use of Category table for selection of leaf nodes
A query such as the one below would be used to find possible leaf node candidates, given the optimum “threshold”
SELECT * FROM category
WHERE cat_subcats = 0 AND cat_pages > 0 AND cat_pages < threshold;
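One way to run this query with a concrete threshold is to pass the value as a bound parameter. The sketch below uses Python's built-in `sqlite3` module with an in-memory copy of the category table; the rows, the `leaf_candidates` helper and the threshold of 100 are hypothetical, and the real data would come from the Wikipedia SQL dump loaded into MySQL.

```python
import sqlite3

# Same conditions as the query above; only two columns are selected to keep
# the printed result short.
LEAF_CANDIDATE_QUERY = """
SELECT cat_title, cat_pages FROM category
WHERE cat_subcats = 0 AND cat_pages > 0 AND cat_pages < ?
ORDER BY cat_title
"""


def leaf_candidates(conn, threshold):
    """Return (cat_title, cat_pages) rows for possible leaf categories."""
    return conn.execute(LEAF_CANDIDATE_QUERY, (threshold,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (cat_id INTEGER PRIMARY KEY, cat_title TEXT, "
    "cat_pages INTEGER, cat_subcats INTEGER, cat_files INTEGER, cat_hidden INTEGER)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO category (cat_title, cat_pages, cat_subcats, cat_files, cat_hidden) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Unused_category", 0, 0, 0, 0),   # cat_pages = 0: not used
        ("Small_leaf", 12, 0, 0, 0),       # candidate leaf
        ("Large_leaf", 15000, 0, 0, 1),    # above the threshold
        ("Inner_node", 40, 3, 0, 0),       # has subcategories
    ],
)
print(leaf_candidates(conn, 100))  # [('Small_leaf', 12)]
```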
Here are my threshold calculations. They show the threshold values and the count of categories having fewer pages than each threshold value (adhering to the above SQL query).
A suitable threshold value needs to be selected.
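A table of this kind (threshold against number of categories below it) can be computed in one pass over the cat_pages values. The following is a small sketch, not the calculation actually used: the function name, the sample values and the candidate thresholds are all hypothetical, and the values are assumed to come from categories with cat_subcats = 0.

```python
from bisect import bisect_left


def counts_below_thresholds(cat_pages_values, thresholds):
    """For each threshold, count categories with 0 < cat_pages < threshold."""
    # Keep only used categories; the cat_subcats = 0 filter is assumed to have
    # been applied already when the values were selected.
    pages = sorted(p for p in cat_pages_values if p > 0)
    # bisect_left gives the number of sorted values strictly below t.
    return {t: bisect_left(pages, t) for t in thresholds}


# Hypothetical cat_pages values for categories with cat_subcats = 0.
sample_pages = [0, 1, 3, 3, 8, 25, 120, 15000]
table = counts_below_thresholds(sample_pages, [5, 10, 100, 1000])
for threshold, count in table.items():
    print(threshold, count)
```

Plotting or tabulating these counts for a range of thresholds is one way to see where the number of candidates stops growing quickly, which may help when picking the value.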
More details on using the Wikipedia Category and Categorylinks SQL dumps are drafted [here](https://docs.google.com/document/d/1kXhaQu4UrEKX-v1DPwC6V2Sk9SNTDIwvgDtOZX5bZgk/edit?usp=sharing)
- [Wikipedia Data Dump dated 2013/06/04](http://dumps.wikimedia.org/enwiki/20130604/) was used for the above-mentioned work
Identification of which categories are administrative and how they are distributed according to 'cat_pages'