GSOC2013_Progress_Kasun

Proposal

Type inference to extend coverage Project Proposal

Sources for type inference. The list is based on the comments by Aleksander Pohl on the project proposal

Project Updates

Warm Up period (27th May - 16th June)

Set up the public clone of the Extraction Framework
Set up the extraction-framework code in the IDEA IDE, built the code, etc.
Working on Issue #33
Familiarize with Scala, git, and IDEA

Week 1 (17th June- 23rd June)

Integrate the Airpedia triples classes in 31 languages into the Dbpedia-links.
Integrate triples obtained from the Wikipedia infoboxes, introductory sentences, categories and the direct mapping between Wikipedia and Cyc into the Dbpedia-links, i.e. Aleksander's classification outputs (not completed).

Week 2 (24th June- 30th June)

Identify Wikipedia leaf categories #Issue16
Investigate the YAGO approach, read the YAGO paper again
Mail discussion thread on choosing the source data for leaf category identification: Link to mail thread
Method of leaf category identification

1. Get all parent categories.
2. Get all child categories.
3. Subtract "1" from "2"; the result is the set of all leaf categories.
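
The three steps above amount to a set difference. A minimal sketch, assuming the parent-child relationships are available as (parent, child) pairs; the category names and the `leaf_categories` helper are hypothetical and only for illustration:

```python
# Hypothetical (parent, child) category pairs; the names are illustrative only.
edges = [
    ("Science", "Physics"),
    ("Science", "Chemistry"),
    ("Physics", "Quantum_mechanics"),
]


def leaf_categories(edges):
    """Return the categories that appear as a child but never as a parent."""
    parents = {parent for parent, _ in edges}   # step 1: all parent categories
    children = {child for _, child in edges}    # step 2: all child categories
    return children - parents                   # step 3: children that are not parents


leaves = leaf_categories(edges)
print(sorted(leaves))
```

Note that a category with no parent and no children never shows up in the pairs at all, so this sketch cannot see it.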

Processing Wikipedia categories #issue17
Save the parent-child relationships of the categories to a MySQL database in order to address the requirement of #issue17
Created tables
Node Table

CREATE TABLE IF NOT EXISTS node (
  node_id int(10) NOT NULL AUTO_INCREMENT,
  category_name varchar(40) NOT NULL,
  is_leaf tinyint(1) NOT NULL,
  is_prominent tinyint(1) NOT NULL,
  score_interlang double DEFAULT NULL,
  score_edit_histo double NOT NULL,
  PRIMARY KEY (node_id),
  UNIQUE KEY category_name (category_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

Edge Table

CREATE TABLE IF NOT EXISTS edges (
  parent_id int(10) NOT NULL,
  child_id int(10) NOT NULL,
  PRIMARY KEY (parent_id, child_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
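
To illustrate how the two tables could work together, here is a rough sketch that stores parent-child pairs and then sets `is_leaf`. It is an assumption-laden stand-in, not the project's code: it uses Python's built-in `sqlite3` with an in-memory database instead of MySQL, keeps only a subset of the `node` columns, and the `add_edge` helper and category names are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables (simplified column set).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    node_id INTEGER PRIMARY KEY AUTOINCREMENT,
    category_name TEXT NOT NULL UNIQUE,
    is_leaf INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE edges (
    parent_id INTEGER NOT NULL,
    child_id INTEGER NOT NULL,
    PRIMARY KEY (parent_id, child_id)
);
""")


def add_edge(conn, parent, child):
    """Insert both categories if they are new, then link parent -> child."""
    ids = []
    for name in (parent, child):
        conn.execute(
            "INSERT OR IGNORE INTO node (category_name) VALUES (?)", (name,)
        )
        row = conn.execute(
            "SELECT node_id FROM node WHERE category_name = ?", (name,)
        ).fetchone()
        ids.append(row[0])
    conn.execute(
        "INSERT OR IGNORE INTO edges (parent_id, child_id) VALUES (?, ?)",
        tuple(ids),
    )


# Hypothetical sample data.
for parent, child in [
    ("Science", "Physics"),
    ("Science", "Chemistry"),
    ("Physics", "Quantum_mechanics"),
]:
    add_edge(conn, parent, child)

# A node is a leaf when it never appears as a parent in the edges table.
conn.execute(
    "UPDATE node SET is_leaf = 1 "
    "WHERE node_id NOT IN (SELECT parent_id FROM edges)"
)

leaf_names = [
    row[0]
    for row in conn.execute(
        "SELECT category_name FROM node WHERE is_leaf = 1 ORDER BY category_name"
    )
]
print(leaf_names)
```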

Week 3 (1st July- 7th July)

The approach to leaf node detection and finding parent-child relationships described in the 2nd week was abandoned for the following reasons.

"categories that don't have a broader category are not included in skos_categories dump"
Evidence for this claim is discussed here: 1- issue#16, 2- Mail Archive
Data freshness issues: the Dbpedia dumps are nearly 1 year old, and synchronized sub-dumps are not available for data analysis

New Approach: Wikipedia Category table
Category (cat_id, cat_title, cat_pages, cat_subcats, cat_files, cat_hidden)

cat_pages - excludes pages in subcategories, but it counts other pages such as talk: pages, template: pages etc. along with the actual article pages. A way to filter these unnecessary pages out of the statistics still needs to be found.
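
One possible way to get an article-only count, offered purely as a sketch and not as the method the project settled on, is to join the Categorylinks data with the page table and keep only pages in namespace 0 (the main article namespace in MediaWiki). The example below assumes simplified versions of the `page` and `categorylinks` tables, loads hypothetical rows into an in-memory SQLite database, and uses a hypothetical `article_count` helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the MediaWiki page and categorylinks tables.
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER NOT NULL,
    page_title TEXT NOT NULL
);
CREATE TABLE categorylinks (
    cl_from INTEGER NOT NULL,  -- page_id of the member page
    cl_to TEXT NOT NULL        -- title of the category
);
""")

# Hypothetical rows: two articles, one talk page, one template.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 0, "Electron"),
    (2, 0, "Photon"),
    (3, 1, "Electron"),        # namespace 1 = Talk
    (4, 10, "Particle_box"),   # namespace 10 = Template
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Particles"), (2, "Particles"), (3, "Particles"), (4, "Particles"),
])

ARTICLE_NAMESPACE = 0


def article_count(conn, category):
    """Count only main-namespace pages that belong to the given category."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ? AND p.page_namespace = ?
        """,
        (category, ARTICLE_NAMESPACE),
    ).fetchone()
    return row[0]


particles_articles = article_count(conn, "Particles")
print(particles_articles)
```

Here the category has four member pages, but only the two in namespace 0 are counted.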

Some hints about categories usage

Some of the selected categories have cat_pages=0, i.e. these categories are not used.
Some of the selected categories have cat_pages>10000; these are possibly administrative categories or higher nodes of the category graph.
Filtering on cat_subcats=0 returns all categories that don't have subcategories.

Use of Category table for Selection of leaf nodes

A query such as the one below can be used to find possible leaf node candidates, given the optimum “threshold”:
SELECT * FROM category WHERE cat_subcats=0 AND cat_pages>0 AND cat_pages<threshold;
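
As a rough illustration of running this query with the threshold passed as a parameter, here is a sketch against an in-memory SQLite table. The rows, the `leaf_candidates` helper and the value 10000 are assumptions made for the example; the real data would come from the Wikipedia Category dump loaded into MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the Wikipedia category table.
conn.execute(
    "CREATE TABLE category ("
    "cat_id INTEGER PRIMARY KEY, cat_title TEXT, "
    "cat_pages INTEGER, cat_subcats INTEGER)"
)
# Hypothetical rows covering the cases discussed above.
conn.executemany("INSERT INTO category VALUES (?, ?, ?, ?)", [
    (1, "Unused_category", 0, 0),            # cat_pages = 0: not used
    (2, "Small_leaf", 12, 0),                # plausible leaf candidate
    (3, "Huge_maintenance_list", 25000, 0),  # above the threshold
    (4, "Inner_node", 40, 3),                # has subcategories
])


def leaf_candidates(conn, threshold):
    """Return titles of categories with no subcategories and 0 < cat_pages < threshold."""
    rows = conn.execute(
        "SELECT cat_title FROM category "
        "WHERE cat_subcats = 0 AND cat_pages > 0 AND cat_pages < ? "
        "ORDER BY cat_title",
        (threshold,),
    )
    return [row[0] for row in rows]


candidates = leaf_candidates(conn, 10000)
print(candidates)
```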

Here are my threshold calculations. They show the threshold values and the count of categories having fewer pages than each threshold value (adhering to the above SQL query).

A suitable threshold value needs to be selected.
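
A table of this kind could be produced along the following lines. This is only a sketch: the `cat_pages` values and the candidate thresholds are made up for illustration, and `counts_per_threshold` is a hypothetical helper, not part of the project code.

```python
def counts_per_threshold(page_counts, thresholds):
    """For each threshold, count the categories with 0 < cat_pages < threshold."""
    return {
        threshold: sum(1 for pages in page_counts if 0 < pages < threshold)
        for threshold in thresholds
    }


# Hypothetical cat_pages values of categories that have cat_subcats = 0.
page_counts = [0, 3, 8, 15, 120, 900, 25000]

table = counts_per_threshold(page_counts, [10, 100, 1000, 10000])
for threshold, count in table.items():
    print(threshold, count)
```

Comparing how the count grows from one threshold to the next is one way to judge where administrative categories start to dominate.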

More details on using the Wikipedia Category and Categorylinks SQL dumps are drafted [here](https://docs.google.com/document/d/1kXhaQu4UrEKX-v1DPwC6V2Sk9SNTDIwvgDtOZX5bZgk/edit?usp=sharing)

The [Wikipedia Data Dump dated 2013/06/04](http://dumps.wikimedia.org/enwiki/20130604/) was used for the above-mentioned work

Week 4 (8th July- 14th July)

Identify which categories are administrative and how they are distributed according to 'cat_pages'
