Quickstarter for DBpedia Spotlight models

Update, January 2022

The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, downloaded from the DBpedia Databus. Catalan, Finnish, Lithuanian, and Romanian were added to the list of languages for which models can be created.

Update, January 2016

This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Running the biggest model (English) takes around 2h on a single machine with around 32GB of RAM. We recommend running this script on an SSD with around 100GB of free space.

Requirements

  • Git
  • Maven 3
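
Before starting a long build it can help to confirm that the required tools are installed and that the working directory has enough free space. The following is a minimal sketch, not part of model-quickstarter: the function names `missing_tools` and `free_gb` are hypothetical, `mvn` is assumed to be the Maven executable on your PATH, and the 100GB threshold is taken from the January 2016 recommendation above. It does not check the Maven version.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of model-quickstarter.

# Print every tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Print the free space, in whole GB, of the filesystem holding a directory
# (defaults to the current directory). Uses the POSIX output format of df.
free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Warn (without aborting) if Git or Maven is missing.
missing=$(missing_tools git mvn)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
fi

# Warn if less than the recommended 100GB is available here.
if [ "$(free_gb .)" -lt 100 ]; then
    echo "Less than 100GB free in $(pwd)" >&2
fi
```

The checks only print warnings, so the script can be run safely from any directory before starting the model creation.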

Spotlight model creation

Use this tool to create a DBpedia Spotlight model for your language.

  1. docker run -it dbpediaspotlight/model-quickstarter bash

    To generate the models outside the container, map volumes for the folders `/model-quickstarter/wdir`, `/model-quickstarter/data` and `/model-quickstarter/models`, e.g.
     docker run -v /home/user/data/model/wdir:/model-quickstarter/wdir -v /home/user/data/model/data:/model-quickstarter/data -v /home/user/data/model/models:/model-quickstarter/models -it dbpediaspotlight/model-quickstarter bash
    
  2. cd model-quickstarter/

  3. Copy & paste one of the following commands to begin the corresponding language model creation process.

| Language | Language code | Locator code | Analyzer+Stemmer language prefix | Command |
|---|---|---|---|---|
| Catalan | ca | ES | Catalan | `./index_db.sh wdir ca_ES ca/stopwords.list Catalan models/ca` |
| Danish | da | DK | Danish | `./index_db.sh wdir da_DK da/stopwords.list Danish models/da` |
| German | de | DE | German | `./index_db.sh -b de/ignore.list wdir de_DE de/stopwords.list German models/de` |
| English | en | US | English | `./index_db.sh -b en/ignore.list wdir en_US en/stopwords.list English models/en` |
| Spanish | es | ES | Spanish | `./index_db.sh -b es/ignore.list wdir es_ES es/stopwords.list Spanish models/es` |
| Finnish | fi | FI | Finnish | `./index_db.sh wdir fi_FI fi/stopwords.list Finnish models/fi` |
| French | fr | FR | French | `./index_db.sh -b fr/ignore.list wdir fr_FR fr/stopwords.list French models/fr` |
| Hungarian | hu | HU | Hungarian | `./index_db.sh wdir hu_HU hu/stopwords.list Hungarian models/hu` |
| Italian | it | IT | Italian | `./index_db.sh wdir it_IT it/stopwords.list Italian models/it` |
| Lithuanian | lt | LT | Lithuanian | `./index_db.sh wdir lt_LT lt/stopwords.list Lithuanian models/lt` |
| Dutch | nl | NL | Dutch | `./index_db.sh -b nl/ignore.list wdir nl_NL nl/stopwords.list Dutch models/nl` |
| Norwegian | no | NO | Norwegian | `./index_db.sh -b no/ignore.list wdir no_NO no/stopwords.list Norwegian models/no` |
| Portuguese | pt | BR | Portuguese | `./index_db.sh -b pt/ignore.list wdir pt_BR pt/stopwords.list Portuguese models/pt` |
| Romanian | ro | RO | Romanian | `./index_db.sh wdir ro_RO ro/stopwords.list Romanian models/ro` |
| Russian | ru | RU | Russian | `./index_db.sh wdir ru_RU ru/stopwords.list Russian models/ru` |
| Swedish | sv | SE | Swedish | `./index_db.sh -b sv/ignore.list wdir sv_SE sv/stopwords.list Swedish models/sv` |
| Turkish | tr | TR | Turkish | `./index_db.sh -b tr/ignore.list wdir tr_TR tr/stopwords.list Turkish models/tr` |
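
The commands in the steps above follow a regular pattern, which the sketch below makes explicit. It is only an illustration: `docker_run_cmd` and `index_cmd` are hypothetical helper names that are not part of model-quickstarter, and both functions only print a command rather than run it. The argument order for `index_db.sh` is inferred from the table rows, not from the script's own documentation.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of model-quickstarter.

# Print the docker command from step 1 with the three container folders
# mapped to subfolders of a single host directory.
docker_run_cmd() {
    base=$1
    printf 'docker run'
    for dir in wdir data models; do
        printf ' -v %s/%s:/model-quickstarter/%s' "$base" "$dir" "$dir"
    done
    printf ' -it dbpediaspotlight/model-quickstarter bash\n'
}

# Print the index_db.sh command for one table row.
# Arguments: language code, locator code, analyzer name, and "yes" when the
# row passes an ignore list with -b (defaults to "no").
index_cmd() {
    lang=$1 locator=$2 analyzer=$3 has_ignore=${4:-no}
    opts=""
    if [ "$has_ignore" = "yes" ]; then
        opts=" -b $lang/ignore.list"
    fi
    printf './index_db.sh%s wdir %s_%s %s/stopwords.list %s models/%s\n' \
        "$opts" "$lang" "$locator" "$lang" "$analyzer" "$lang"
}

# Reproduce the volume-mapping example and two rows of the table.
docker_run_cmd /home/user/data/model
index_cmd en US English yes
index_cmd fi FI Finnish
```

Printing the command instead of executing it lets you compare the result with the corresponding table row before running anything.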

Datasets

You can find pre-built datasets created using the model-quickstarter here:

Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper.

@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}