Quickstarter for DBpedia Spotlight models

Update, January 2022

The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, downloaded from the DBpedia Databus. Catalan, Finnish, Lithuanian, and Romanian were added to the list of languages for which models can be created.

Update, January 2016

This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Running the biggest model (English) takes around 2h on a single machine with around 32GB of RAM. We recommend running this script on an SSD with around 100GB of free space.
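The disk-space recommendation above can be checked before starting a build. The following is only a minimal sketch, assuming a POSIX `df`; the helper name `has_free_gb` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of model-quickstarter.
# Succeeds when the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
  local dir=$1 min_gb=$2 avail_kb
  # df -Pk prints one POSIX-format line per filesystem; column 4 is the
  # available space in 1024-byte blocks.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# 100 GB is the figure recommended above for the English model.
if has_free_gb . 100; then
  echo "enough free disk space"
else
  echo "less than 100 GB free in $(pwd)"
fi
```

Run it from the directory that will hold `wdir`, since that is where the space is needed.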

Requirements

Git
Maven 3
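
A quick way to confirm that both requirements are on the `PATH` is sketched below. The helper `check_tools` is a hypothetical name introduced for illustration; it only checks that the commands exist and does not verify that Maven is version 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of model-quickstarter.
# Reports every named tool that is missing from PATH.
# Returns 0 when all tools are found, 1 otherwise.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# git and mvn are the commands provided by the requirements listed above.
if check_tools git mvn; then
  echo "all requirements found"
else
  echo "install the missing tools before building a model"
fi
```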

Spotlight model creation

You can use this tool to create DBpedia Spotlight models for your language.

Start a shell inside the Docker image: `docker run -it dbpediaspotlight/model-quickstarter bash`
- If you want to keep the generated models outside the container, map volumes for the folders `/model-quickstarter/wdir`, `/model-quickstarter/data` and `/model-quickstarter/models`, e.g.
```
docker run -v /home/user/data/model/wdir:/model-quickstarter/wdir -v /home/user/data/model/data:/model-quickstarter/data -v /home/user/data/model/models:/model-quickstarter/models -it dbpediaspotlight/model-quickstarter bash
```
Inside the container, change to the tool directory: `cd model-quickstarter/`

Copy and paste one of the following commands to start building the model for the corresponding language.

| Language | Language code | Locator code | Analyzer+Stemmer language prefix | Command |
| --- | --- | --- | --- | --- |
| Catalan | ca | ES | Catalan | `./index_db.sh wdir ca_ES ca/stopwords.list Catalan models/ca` |
| Danish | da | DK | Danish | `./index_db.sh wdir da_DK da/stopwords.list Danish models/da` |
| German | de | DE | German | `./index_db.sh -b de/ignore.list wdir de_DE de/stopwords.list German models/de` |
| English | en | US | English | `./index_db.sh -b en/ignore.list wdir en_US en/stopwords.list English models/en` |
| Spanish | es | ES | Spanish | `./index_db.sh -b es/ignore.list wdir es_ES es/stopwords.list Spanish models/es` |
| Finnish | fi | FI | Finnish | `./index_db.sh wdir fi_FI fi/stopwords.list Finnish models/fi` |
| French | fr | FR | French | `./index_db.sh -b fr/ignore.list wdir fr_FR fr/stopwords.list French models/fr` |
| Hungarian | hu | HU | Hungarian | `./index_db.sh wdir hu_HU hu/stopwords.list Hungarian models/hu` |
| Italian | it | IT | Italian | `./index_db.sh wdir it_IT it/stopwords.list Italian models/it` |
| Lithuanian | lt | LT | Lithuanian | `./index_db.sh wdir lt_LT lt/stopwords.list Lithuanian models/lt` |
| Dutch | nl | NL | Dutch | `./index_db.sh -b nl/ignore.list wdir nl_NL nl/stopwords.list Dutch models/nl` |
| Norwegian | no | NO | Norwegian | `./index_db.sh -b no/ignore.list wdir no_NO no/stopwords.list Norwegian models/no` |
| Portuguese | pt | BR | Portuguese | `./index_db.sh -b pt/ignore.list wdir pt_BR pt/stopwords.list Portuguese models/pt` |
| Romanian | ro | RO | Romanian | `./index_db.sh wdir ro_RO ro/stopwords.list Romanian models/ro` |
| Russian | ru | RU | Russian | `./index_db.sh wdir ru_RU ru/stopwords.list Russian models/ru` |
| Swedish | sv | SE | Swedish | `./index_db.sh -b sv/ignore.list wdir sv_SE sv/stopwords.list Swedish models/sv` |
| Turkish | tr | TR | Turkish | `./index_db.sh -b tr/ignore.list wdir tr_TR tr/stopwords.list Turkish models/tr` |
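
Every command in the table follows the same pattern, built from the language code, the locator code and the analyzer name, with an optional `-b <code>/ignore.list` for some languages. The sketch below only prints the command for a given row so the pattern is easier to see; `index_cmd` is a hypothetical helper, and the argument layout is inferred from the table rows rather than from the documentation of `index_db.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of model-quickstarter.
# Prints the index_db.sh invocation for one table row.
#   $1 language code (e.g. en)
#   $2 locator code (e.g. US)
#   $3 analyzer/stemmer name (e.g. English)
#   $4 pass the word "ignore" for rows that use -b <code>/ignore.list
index_cmd() {
  local code=$1 locator=$2 analyzer=$3 ignore=${4:-}
  local opts=""
  if [ "$ignore" = "ignore" ]; then
    opts="-b ${code}/ignore.list "
  fi
  echo "./index_db.sh ${opts}wdir ${code}_${locator} ${code}/stopwords.list ${analyzer} models/${code}"
}

# Reproduce the English and Catalan rows of the table.
index_cmd en US English ignore
index_cmd ca ES Catalan
```

The helper prints the command instead of running it, so the output can be compared with the table before anything is executed inside the container.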

Datasets

You can find pre-built datasets created using the model-quickstarter here:

Pre-built Spotlight models
Raw counts

Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper.

@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
