Xoff edited this page Mar 20, 2013 · 4 revisions

Metafacture-mediawiki is a plugin for Metafacture.

Modules

The modules in Metafacture-mediawiki can be divided into three groups.

Base modules

These modules parse MediaWiki XML and wikitext. They create and augment WikiPage objects.

  • WikiXmlHandler parses a MediaWiki XML document and emits a WikiPage object for every page found
  • WikiTextParser uses Sweble to parse the wikitext in a WikiPage object and attaches the abstract syntax tree (AST) to the object
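
The "one object per page" behaviour described for WikiXmlHandler can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the plugin's Java code: the `WikiPage` dataclass, `local_name` and `iter_pages` are hypothetical names, and the sample dump is a hand-made fragment. The wikitext parsing step (Sweble) is not shown; the `ast` field is only a placeholder for it.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import BinaryIO, Iterator, Optional


@dataclass
class WikiPage:
    """Hypothetical stand-in for the plugin's WikiPage object."""
    title: str
    wikitext: str
    ast: Optional[object] = None  # would be attached later by a wikitext parser


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source: BinaryIO) -> Iterator[WikiPage]:
    """Yield one WikiPage for every <page> element in a MediaWiki XML dump."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield WikiPage(title=title, wikitext=text)
        elem.clear()  # release the subtree; real dumps can be very large


# Hand-made sample input (assumption: a heavily reduced export document).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Foo</title><revision><text>See [[Bar]].</text></revision></page>
  <page><title>Bar</title><revision><text>{{Stub}}</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Streaming with `iterparse` keeps memory use flat, which matters for full Wikipedia dumps; whether the plugin itself streams in the same way is not stated on this page.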

Extractors

Please note: Extractors are called analyzers in the code. The code will be updated with the next major revision (see issue #2), but until then the documentation is ahead of the code.

The extractors extract information from the different representations of a wiki page held in a WikiPage object and turn this information into a Metafacture event stream.

  • AuthorityLinkExtractor extracts authority file links (GND, LOC, IMDB, VIAF) from Wikipedia articles
  • LinkExtractor extracts all internal links in a wiki page from an AST
  • SimpleLinkExtractor extracts links from a wiki page using regular expressions
  • TemplateExtractor extracts all templates from a wiki page whose names match a pattern
  • MultiExtractor runs a list of extractors and merges the results into a single record. Additionally, it makes sure that each extractor receives a WikiPage containing the representations of the wikitext it requires.
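
To make the ideas behind SimpleLinkExtractor, TemplateExtractor and MultiExtractor more concrete, here is a rough Python sketch. It is not the plugin's implementation: the regular expressions, the function names and the representation of a record as a list of `(name, value)` literals are all assumptions chosen for illustration, and the expressions ignore nested markup and other wikitext edge cases. The part of MultiExtractor that prepares the required page representations for each extractor is not modelled.

```python
import re
from typing import Callable, List, Tuple

Literal = Tuple[str, str]                      # hypothetical (name, value) pair
Extractor = Callable[[str], List[Literal]]     # wikitext in, literals out

# [[Target]], [[Target|label]] or [[Target#section]]; captures the target only.
LINK_RE = re.compile(r"\[\[([^\]\|#]+)[^\]]*\]\]")
# {{Name}} or {{Name|args...}}; captures the template name only.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\||\}\})")


def extract_links(wikitext: str) -> List[Literal]:
    """Regex-based internal link extraction."""
    return [("link", m.group(1).strip()) for m in LINK_RE.finditer(wikitext)]


def make_template_extractor(name_pattern: str) -> Extractor:
    """Build an extractor that keeps templates whose names match the pattern."""
    name_re = re.compile(name_pattern)

    def extract(wikitext: str) -> List[Literal]:
        return [("template", name)
                for name in TEMPLATE_RE.findall(wikitext)
                if name_re.fullmatch(name)]

    return extract


def run_all(extractors: List[Extractor], wikitext: str) -> List[Literal]:
    """Run every extractor and merge the results into a single record."""
    record: List[Literal] = []
    for extractor in extractors:
        record.extend(extractor(wikitext))
    return record


TEXT = ("{{Infobox person|name=Ada}} Ada worked with "
        "[[Charles Babbage|Babbage]] on the [[Analytical Engine]]. {{Stub}}")

record = run_all([extract_links, make_template_extractor(r"Infobox.*")], TEXT)
```

Regular expressions are fast but cannot handle nested constructs reliably, which is presumably why the plugin also offers the AST-based LinkExtractor.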

Utility modules

These modules help when working with WikiPage objects.

  • AstToJson adds a serialised representation of an AST to a WikiPage object
  • JsonToAst reconstructs an AST from a serialised representation and adds it to a WikiPage object
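
The round trip performed by these two modules can be sketched as follows. This is a Python illustration under simplifying assumptions, not the plugin's code: the `Node` class is a hypothetical three-field tree, whereas the AST produced by Sweble is considerably richer, and the JSON layout used here is invented for the example.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node with a kind, an optional value and children."""
    kind: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def ast_to_json(node: Node) -> str:
    """Serialise a tree of Node objects to a JSON string."""
    def to_dict(n: Node) -> dict:
        return {"kind": n.kind,
                "value": n.value,
                "children": [to_dict(c) for c in n.children]}
    return json.dumps(to_dict(node))


def json_to_ast(text: str) -> Node:
    """Rebuild the tree of Node objects from its JSON form."""
    def from_dict(d: dict) -> Node:
        return Node(d["kind"], d["value"],
                    [from_dict(c) for c in d["children"]])
    return from_dict(json.loads(text))


ast = Node("page", children=[
    Node("text", "See "),
    Node("link", "Bar"),
])
restored = json_to_ast(ast_to_json(ast))
```

Storing the serialised tree alongside the page allows the comparatively expensive wikitext parsing step to be skipped on later runs, provided the serialisation is lossless for the node types in use.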

Tutorials

Be the first to write a tutorial!
