GSoC2022_Progress_Celian_RINGWALD
DBpedia provides monthly releases produced by the DBpedia Extraction Framework. They are composed of various data artifacts that mainly stem from the wiki dumps. However, some of them also rely on API calls for rendering dynamic content, which is the case for the DBpedia abstracts. Today, the large amount of data requested from the APIs cannot be extracted entirely within a month. We suggest solving this issue with a four-step strategy:
- a study based on the data recorded during the last abstract extraction
- testing and implementing the use of the TextExtracts extension, and improving error management
- reducing the number of possible calls
- integrating into the framework the possibility of calling more than one API

Each step of the project will be developed in a new dedicated GitHub branch of the DBpedia Extraction Framework, which can be documented and used for working on the project.
- Link to the project seed : https://forum.dbpedia.org/t/developing-a-new-dbpedia-abstract-extraction-gsoc2022/1620
- Link to the proposal : https://summerofcode.withgoogle.com/media/user/8129e10aed83/proposal/d8mMiYASojjvUVPv.pdf
- Project Tracker : https://docs.google.com/spreadsheets/d/1kMGiDM71Qz4cZNdw86UfpqIDR6dIrieeeOV8b3PimQk/edit#gid=1703594761
- Link to Pull request : https://github.com/dbpedia/extraction-framework/pull/740
- Final Report : https://docs.google.com/document/d/10xvukZVeKNA1n_VT_q2pWtuPWznEl2Hz/edit?usp=sharing&ouid=104536663383791851600&rtpof=true&sd=true
- Mykola Medynskyi
- Marvin Hofer
- Dimitris Kontokostas
I am Célian Ringwald, a research engineer in charge of the French DBpedia chapter at Inria, in the Wimmics team. My research topics are mainly related to NLP and Semantic Web questions. Having access to Wikipedia abstracts (and more broadly to Wikipedia's textual content) through DBpedia is a very important milestone from my perspective.
- Initiate a working space: GitHub fork + wiki preparation
- First experiments with MediaWiki on Docker > https://github.com/datalogism/mediawiki_docker
- First meeting : 8th June
- Kick-off meeting: how to compare it / metrics reports
- Focus on English and French chapter
- Find a solution for avoiding data traffic jam
- Using Marvin for testing
- Done during the week:
- Parallel benchmark: parallelizing seems useless due to rate limits
- On the whole dataset for el, sh, ro, tr:
extractor | parallel-process | time |
---|---|---|
nif | 1 | 9h |
plain | 1 | 3h40 |
nif | 2 | 3h55 |
plain | 2 | 3h |
nif | 4 | 4 |
plain | 4 | 3h38 |
=> It seems to be faster, but does it yield more data?
- Monday : GSOC Meeting 2
- Creation of a script that builds a test set of pages from clickstream data for a given language: https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/bash/create_custom_sample.sh
- First TestSuite development for testing abstract extraction : https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract2.scala
- https://github.com/datalogism/mediawiki_docker > Ok
- But loading can take very long on a large wiki > "English Wikipedia" would need 6 months to be loaded ....
- output line by line
- count the number of FailedIOException / FailedOutOfMemoryError / FailedNullPointerException
- from the most viewed pages to the least viewed
- Results are the same > extraction failed after the first 10% of the wiki pages
=> Can we have a view of the current rate limit? NO
=> Do the different MediaWiki API endpoints share the same limits? YES
=> Is the https://en.wikipedia.org/api/rest_v1/ API a better solution? YES
=> javax.xml.stream.XMLStreamException: ParseError at [row,col]:[517851,1]" Message: expected <title>, found => It was caused by special characters in URIs
- First tests on request parameters: retry-after / maxlag / User-Agent & gzip options
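
As an illustration of the request parameters tested above, here is a minimal Python sketch (not the framework's Scala code) of a polite Action API request for a plain-text abstract through the TextExtracts extension. The bot name, the contact address and the helper names `build_extract_request` and `decode_body` are hypothetical, and the `maxlag` value of 5 is only an example.

```python
import gzip
from urllib.parse import urlencode
from urllib.request import Request

# Action API endpoint of the English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_request(title, maxlag=5,
                          user_agent="ExampleAbstractBot/0.1 (contact@example.org)"):
    """Build (without sending) a request for the intro of one page."""
    params = {
        "action": "query",
        "prop": "extracts",   # provided by the TextExtracts extension
        "exintro": 1,         # only the section before the first heading
        "explaintext": 1,     # plain text instead of HTML
        "redirects": 1,       # follow redirect pages
        "format": "json",
        "maxlag": maxlag,     # ask the server to refuse the call when replicas lag
        "titles": title,
    }
    headers = {
        "User-Agent": user_agent,    # hypothetical identity, replace with a real contact
        "Accept-Encoding": "gzip",   # ask for a compressed answer
    }
    return Request(f"{API_URL}?{urlencode(params)}", headers=headers)


def decode_body(raw, content_encoding):
    """urllib does not decompress gzip answers itself, so do it here."""
    if (content_encoding or "").lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")
```

The request object can then be passed to `urllib.request.urlopen`, and the `Content-Encoding` response header tells `decode_body` whether the answer was compressed.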
- Romanian MediaWiki clone:
  - Handling of a problem caused by a wrong original language parameter > NEEDS TO BE "RO"
  - Problem concerning the Campaigns extension
- I added the 2 following parameters:
  - I needed to fix the universal config: max lag / User agent
- Test of the new API:
  - I created a script that calls the REST API on the main dataset I created, following the guidelines and more specifically the rate limit
  - I recorded the results here
  - I aggregated them here: in a nutshell > we got almost everything, but we received some disambiguation errors (the only type of error recorded) > problem related to the redirect pages parameter?
  - I created a dedicated log appender here; it writes to the log file defined by the logdir path, made of JSON rows, one for each call
  - First lesson learned: we need to provide a User-Agent header parameter > not really good news, because depending on our calling process we can be banned.
  - Plain abstract problem: only a part of the processes is found in the stats for plain abstracts
  - Reason: we cannot run both the plain and HTML tests together without causing it...
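
The "one JSON row per call" log format mentioned above can be sketched as follows. This is a hedged Python illustration, not the Scala appender of the framework: the field names `uri`, `status` and `elapsed_ms` and the logger name `api_calls` are assumptions chosen for the example.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        row = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes of the record.
        for key in ("uri", "status", "elapsed_ms"):  # hypothetical field names
            if hasattr(record, key):
                row[key] = getattr(record, key)
        return json.dumps(row, ensure_ascii=False)


def make_call_logger(path):
    """Return a logger that appends one JSON row per API call to `path`."""
    logger = logging.getLogger("api_calls")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger
```

A call would then be recorded with something like `logger.info("fetched", extra={"uri": page_uri, "status": 200, "elapsed_ms": 120})`, which keeps the file easy to aggregate line by line afterwards.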
- Romanian wiki:
  - The page IDs given in the dump are not the same as the IDs in the original wiki
  - Running the 1000-page sample froze my app
  - I got a 100% success rate on 100 pages BUT... I deleted it by mistake (shame on me)
- Some facts about the wiki clone:
  - The abstract is generally not affected by it, but: Wikipedia loads external data from Wikidata in infoboxes (here a Romanian example) and also in Wikibase models, such as the Authority control one
  - As mentioned on this page: it is not possible to get it from a MediaWiki clone; to get it we must also mirror Wikidata ....
- Using MediaWiki only as a parser:
  - Parsing of the entire French dump with the MediaWiki clone done in 4 days
- Experiment records : https://docs.google.com/spreadsheets/d/1JCMozvQ7oC_AkDuoCS1ZNlasaaUm8tWg/edit#gid=1770381193
- Midway report: https://docs.google.com/document/d/101OvYuKvD4o9UPmLvuDkN5hgfupyfpVO/edit#
- MediaWiki clone documentation OK: https://github.com/datalogism/mediawiki_docker
- Test implementation of a "retry-after"-aware pipeline > see MediaWikiConnector4.scala
- Comparison of the plain text answers of the old and the new APIs
- Implementation and test of the gzip parameter
- MediaWikiConnector3.scala
- HTML answer parsing OK > readInAbstractHTML
- but problems with the parameter "redirect=true" and OutputStreamWriter
- Midway report - HTML content part
- Different structures of HTML
- Links parsing problem solved
- Test on the EN sample of 1000 pages > only 988 abstracts OK, with parsing errors such as: http://dbpedia.org/resource/Javier_Bardem http://dbpedia.org/ontology/abstract "Javier Ángel Encinas Bardem ("},"2":{"wt":"lang"}},"i":0}}]}' id="mwDQ">Spanish: ; born 1 March 1969) is a Spanish actor. Known for his roles in and foreign films, he has received , including an , a , and a . Bardem won the for his performance as the assassin in the ' modern western drama film (2007). He also received critical acclaim for his roles in films such as (1992), (1995), (1997), (2002), and (2004). He has also starred in 's romantic drama (2008), 's spy film (2012), 's drama (2013), 's film (2017), 's mystery drama (2018) and 's science fiction drama (2021). Bardem's other Oscar-nominated performances include 's (2000), 's (2010), and 's (2021). He is the first Spanish actor to be nominated for an Academy Award ( for Before Night Falls in 2001), as well as the first and only Spanish actor to win one ( for in 2008). He is also the recipient of a , two , and six . In January 2018, Bardem became ambassador of for the protection of ."@en .
- Adaptation of the HTMLNifExtractor / WikipediaNifExtractor / LinkExtractor
- Parsing error caused by the data-mw attributes included in the REST API answer still not fixed
- Commit all the changes to the https://github.com/datalogism/extraction-framework/tree/gsoc-celian_clean branch
- Benchmark the APIs with different configurations on a sample of 1000 English pages > cf LastTestMadeOnEN sheet
- Merging different MWC old API implementations (retry-after, max-lag incrementation process) into MediaWikiConnector2.scala
- Fix the parsing error of the REST API answer by deleting the data-mw attributes in the getJsoupDoc function of HtmlNifExtractor.scala
- Parsing of the new HTML structure is now OK > we are able to run the entire NIF extraction if needed with WikipediaNifExtractor2
- Clean-up of the WikipediaNifExtractor code by creating an abstract class WikipediaNifExtractor, extended for the REST API case in WikipediaNifExtractor2
- Clean-up of the MediaWikiConnectors code by creating an abstract class MediaWikiConnectorAbstract, extended for the MWC API and for the REST MWC API
- Pushing all API parameters into the config files : extraction.nif.abstracts.properties and extraction.plain.abstracts.properties
- Correction of parsing errors due to data-mw parsing with Jsoup: HTML entities were parsed beforehand, which produced single quotes inside the data-mw attributes. I first fixed it with an unstable regular expression, and I finally found a way to fix it simply, by deleting an html placed in the wrong place...
- Correction of another parsing error in the plain abstract extraction process: in some cases the ?> header is mysteriously returned without its first characters...
- Test of every function on the RO, FR and EN languages
- Cleaning the code and final pull request
- End of report writing
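
To make the data-mw clean-up described in this log more concrete, here is a small Python sketch that drops `data-mw` attributes with the standard library HTML parser instead of a regular expression. It is only an illustration under assumptions: the actual fix lives in the Scala `getJsoupDoc` function and relies on Jsoup, and the names `AttributeStripper` and `strip_attributes` are hypothetical. Comments and doctype declarations are dropped by this simplified version.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit HTML while leaving out the attributes listed in `drop`."""

    def __init__(self, drop=("data-mw",)):
        super().__init__(convert_charrefs=True)
        self.drop = set(drop)
        self.out = []

    def _open(self, tag, attrs, close=""):
        # Attribute values arrive already unescaped, so escape them again.
        kept = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name not in self.drop
        )
        self.out.append(f"<{tag}{kept}{close}>")

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, close="/")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def strip_attributes(html, drop=("data-mw",)):
    """Return `html` without the attributes named in `drop`."""
    parser = AttributeStripper(drop)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Because the attribute is removed by a parser that understands quoting, JSON payloads containing quotes or entities inside `data-mw` cannot leak into the extracted text, which is the failure visible in the Javier_Bardem example.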
- A benchmark of the different possible API configurations
- I created a MediaWiki clone to provide a local API
- I adapted the "old API" to the guidelines, allowing us to avoid the rate limit problems we had at the beginning of the project
- I also added to the "old API" a better retry-after and an incremental maxlag mechanism
- I implemented a way to use the MediaWiki REST API in the DIEF, and I solved the different problems related to the new structure of its answers
- Concerning the "old Wikimedia" API, I didn't implement a solution integrating the generator (cf. the rate limit paragraph of the guidelines), because of the DIEF process: each Wikipedia article is processed one by one to enable parallelization.
- I didn't work at all on a smart update strategy as described in the proposal. It is still possible to design a solution that takes the last release into account
- For the moment the choice of API is set in the configuration; we could imagine dynamically calling one API or the other depending on response times...
- The REST API could also imitate a maxlag mechanism if we control and adjust the rate of request calls
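
The retry-after and incremental maxlag mechanisms listed in this summary can be pictured with the following Python sketch. It is an illustration under assumptions, not the code of MediaWikiConnector2.scala: `fetch` is a hypothetical callable returning `(status, headers, body)`, and "incremental" is read here as raising the `maxlag` value after every throttled answer.

```python
import time


def is_throttled(status, body):
    """True for HTTP 429 or for the Action API's JSON `maxlag` error."""
    return status == 429 or body.get("error", {}).get("code") == "maxlag"


def fetch_with_backoff(fetch, title, maxlag=5, maxlag_step=5,
                       max_attempts=4, default_wait=5.0, sleep=time.sleep):
    """Call `fetch(title, maxlag)` until it succeeds or attempts run out."""
    for _ in range(max_attempts):
        status, headers, body = fetch(title, maxlag)
        if not is_throttled(status, body):
            return body
        # Honour the Retry-After header when it holds a number of seconds;
        # fall back to a default pause when it is missing or is a date.
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait
        sleep(wait)
        # Assumption: tolerate more replication lag on the next attempt.
        maxlag += maxlag_step
    raise RuntimeError(f"giving up on {title!r} after {max_attempts} attempts")
```

Passing `sleep` and `fetch` as parameters keeps the loop independent of the network, so the waiting policy can be checked on recorded answers before running it against the live API.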