Data extraction from MediaWiki pages made easy.
Metafacture-Mediawiki is a plugin for Metafacture. It provides modules for extracting information from MediaWiki pages such as Wikipedia articles. Currently, modules for extracting links and templates exist. Adding new extraction modules is easy.
The plugin relies on the excellent Sweble wikitext parser for parsing wikitext into abstract syntax trees.
- Extracts basic metadata about pages from MediaWiki XML documents
- Extracts simple information from wikitext using regular expressions (fast but not suitable for complex tasks)
- Wraps the Sweble wikitext parser for conveniently parsing wikitext into an abstract syntax tree within a Flux flow
- Extracts links and templates from abstract syntax trees created by Sweble and turns them into a Metafacture event stream
- Makes writing additional extraction modules easy
- Supports running multiple extraction modules hassle-free
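To make the regular-expression approach from the feature list concrete, here is a minimal sketch in plain Java. It is not the plugin's own code or API: the class `WikiLinkSketch`, the method `extractLinkTargets`, and the pattern itself are hypothetical names chosen for illustration. It only assumes the common wikitext link forms `[[Target]]` and `[[Target|label]]`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Metafacture-Mediawiki.
final class WikiLinkSketch {

    // Matches [[Target]] and [[Target|label]]; group 1 captures the target.
    // Brackets are excluded from target and label, so nested links in a label
    // are not handled by this pattern.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|[^\\[\\]]*)?\\]\\]");

    // Returns the link targets in the order they appear in the wikitext.
    static List<String> extractLinkTargets(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = LINK.matcher(wikitext);
        while (matcher.find()) {
            targets.add(matcher.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        String wikitext = "See [[Berlin]] and [[Germany|the country]].";
        System.out.println(extractLinkTargets(wikitext));
    }
}
```

A pattern like this is quick to run but does not understand nesting, which is why constructs such as templates inside links or links inside image captions are better handled on the abstract syntax tree produced by the parser.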
Metafacture-Mediawiki can be used as a plugin in the Metafacture distribution or as a Java library in your own programs.
The plugin can be downloaded from the releases page. Drop the plugin jar into the /plugins folder of the metafacture-runner to use it.
Metafacture-Mediawiki is available on Maven Central. To use it, add the following dependency declaration to your pom.xml:
<dependency>
    <groupId>org.culturegraph</groupId>
    <artifactId>metafacture-mediawiki</artifactId>
    <version>4.0.0</version>
</dependency>
Additionally, you need to add the metafacture-core package as a dependency:
<dependency>
    <groupId>org.culturegraph</groupId>
    <artifactId>metafacture-core</artifactId>
    <version>4.0.0</version>
</dependency>
Our integration server automatically publishes successful builds of the master branch as snapshot versions to the Sonatype OSS repository.
The documentation of Metafacture-Mediawiki can be found in the Wiki.
Copyright 2013, 2015 Deutsche Nationalbibliothek.
Metafacture-Mediawiki is distributed under the Apache 2.0 License.