
Toolchain: OAI-PMH for Open Journal Systems

Mark Jordan edited this page Mar 20, 2017 · 12 revisions

Overview

This toolchain allows the creation of Islandora import packages consisting of metadata and PDF files retrieved from Open Journal Systems via the OAI-PMH protocol. The resulting Islandora import packages can then be ingested into Islandora using the standard Islandora Batch module.

The toolchain creates valid Islandora import packages but is primarily intended to demonstrate the creation of toolchains for platforms that support OAI-PMH. Currently the toolchain is limited to retrieving one PDF file for each OJS article. It does not retrieve supplementary files attached to articles, but could be made to do so if someone wanted to use the toolchain in production.

OAI-PMH provides a mechanism for harvesting metadata about items but not for identifying the files that make up each item. Consequently, MIK must find the link to each item's PDF file (in the case of this toolchain) by scraping the item's HTML page. The toolchain's filegetter class finds the file's URL, and the writer class retrieves the file and saves it as part of the Islandora import package. Filegetter and writer classes for other platforms that allow harvesting of metadata via OAI-PMH should be relatively easy to write. Use cases are welcome.
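The scraping step can be pictured with a small sketch. It is written in Python rather than MIK's PHP, uses only the standard library, and is not the toolchain's actual filegetter code. The rule it applies (take the first link whose visible text is "PDF") and the names PdfLinkFinder and find_pdf_url are assumptions made for illustration; real OJS themes may mark up their PDF links differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect the href of every <a> element whose visible text is 'PDF'.

    Assumption: the article page labels its PDF link with the text 'PDF'.
    This is a hypothetical rule, not MIK's actual logic.
    """

    def __init__(self):
        super().__init__()
        self.pdf_links = []
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _current_href as None and are skipped.
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            if "".join(self._current_text).strip().upper() == "PDF":
                self.pdf_links.append(self._current_href)
            self._current_href = None


def find_pdf_url(page_url, html):
    """Return the absolute URL of the first PDF link on the page, or None."""
    finder = PdfLinkFinder()
    finder.feed(html)
    if not finder.pdf_links:
        return None
    # Resolve relative links against the URL of the article page.
    return urljoin(page_url, finder.pdf_links[0])
```

A separate step, corresponding to the writer class, would then download the URL that find_pdf_url returns and save the file into the import package.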

Preparing the content files

All content added to Islandora import packages by this toolchain comes from the remote OJS instance, so there is no need to prepare content.

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [SYSTEM], [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections, and of the entries within each section, does not matter.
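As a quick sanity check before running MIK, a script along the following lines can confirm that a configuration file declares all eight sections. It is a sketch in Python using the standard configparser module and is not part of MIK; the function name missing_sections is invented here. MIK itself reads the file with PHP's INI parser, whose quoting rules differ from configparser's, so the sketch only inspects section names and does not interpret values.

```python
import configparser

# The eight sections listed above.
REQUIRED_SECTIONS = [
    "SYSTEM", "CONFIG", "FETCHER", "METADATA_PARSER",
    "FILE_GETTER", "WRITER", "MANIPULATORS", "LOGGING",
]


def missing_sections(ini_text):
    """Return the required section names that are absent from ini_text."""
    # interpolation=None keeps '%' characters in values from being treated
    # as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]
```

An empty list means every section is present, including an intentionally empty one such as [MANIPULATORS].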

The SYSTEM section

This section of the configuration file sets or overrides configuration settings for PHP and the various third-party PHP components used by MIK. It can contain the following entries:

  • date_default_timezone: Optional. Provide a default timezone if date.timezone is null in the PHP INI. You will know if you need to use this setting because Monolog will throw MIK exceptions and halt MIK. Set to one of the valid PHP timezone values listed at http://php.net/manual/en/timezones.php.
  • verify_ca: Optional. OSX's default PHP configuration uses Apple's Secure Transport rather than OpenSSL, causing issues with Certificate Authority verification in Guzzle requests against websites that use HTTPS. This setting allows Guzzle to override CA verification. You will know if you need to use this setting because Guzzle will write entries in your mik.log complaining about CA verification. Set to false to ignore CA verification.

Note: if you set verify_ca to false, MIK no longer verifies the remote website's certificate, so the HTTPS connection is not protected against someone impersonating that site. Use at your own risk.

Example

[SYSTEM]
date_default_timezone = 'America/Vancouver'
verify_ca = false

The CONFIG section

Key-value pairs of configuration entries in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they can't be dynamically derived at runtime). Therefore, all entries in this section are optional.

Example

[CONFIG]
config_id = oai-ojs-demo
last_updated_on = "2015-12-20"
last_update_by = "Mark Jordan"

The FETCHER section

This section of the configuration file contains the following entries:

  • class: Must be 'Oaipmh'.
  • oai_endpoint: Full URL to the source OJS instance's OAI-PMH endpoint.
  • from: Optional; a date in either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format that defines the start date in a selective harvest. Date-based harvests are described in the OAI-PMH spec.
  • until: Optional; a date in either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format that defines the end date in a selective harvest.
  • set_spec: Optional; the set spec that limits the OAI harvest to a specific set.
  • temp_directory: Full path to the directory where the fetcher writes data for use later in the toolchain.
  • use_cache: Optional; set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).

Example

[FETCHER]
class = Oaipmh
oai_endpoint = "http://journals.sfu.ca/present/index.php/demojournal/oai"
set_spec = demojournal:ART
from = 1990-01-01
until = 2010-12-31
temp_directory = "/tmp/oaitest_temp"
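To show how these entries relate to the OAI-PMH protocol, the sketch below builds a ListRecords request URL from the same values. It is Python, not MIK's PHP, and it is not how the Oaipmh fetcher is implemented; the function name build_list_records_url and the choice of the oai_dc metadata prefix are assumptions. The verb, metadataPrefix, from, until, and set request parameters are those defined by the OAI-PMH specification.

```python
import re
from urllib.parse import urlencode

# The two date formats accepted for from and until:
# YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ.
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def build_list_records_url(oai_endpoint, from_date=None, until=None,
                           set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords URL for a selective harvest (illustrative only)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    for name, value in (("from", from_date), ("until", until)):
        if value is not None:
            if not _DATE_PATTERN.fullmatch(value):
                raise ValueError(f"{name} has an unsupported date format: {value}")
            params[name] = value
    if set_spec is not None:
        params["set"] = set_spec
    # urlencode percent-encodes reserved characters such as the colon in a set spec.
    return oai_endpoint + "?" + urlencode(params)


# Values taken from the [FETCHER] example above.
url = build_list_records_url(
    "http://journals.sfu.ca/present/index.php/demojournal/oai",
    from_date="1990-01-01",
    until="2010-12-31",
    set_spec="demojournal:ART",
)
print(url)
```

Omitting from, until, and set_spec yields a request for every record the endpoint exposes, which mirrors leaving those optional entries out of the configuration file.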

The METADATA_PARSER section

This section of the toolchain's configuration file contains the following entries:

  • class: Must be 'dc\OaiToDc'.
  • xslt_path: Full or relative (to the mik script) path to the XSLT file used to transform the Dublin Core metadata retrieved during the OAI-PMH harvest into MODS. Use "extras/scripts/oai_to_dc.xsl" unless you want to use your own custom XSLT stylesheet.

Example

[METADATA_PARSER]
class = dc\OaiToDc
xslt_path = "extras/scripts/oai_to_dc.xsl"

The Dublin Core metadata retrieved from OJS is transformed into MODS by Islandora automatically on ingest.
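For orientation, the sketch below shows what the simple Dublin Core in a harvested oai_dc record looks like and how its elements can be read. It is Python with the standard library's ElementTree, not the XSLT or PHP that MIK uses, and the sample record and the function name dc_elements are invented for illustration. The namespace URIs are the standard oai_dc and Dublin Core element namespaces.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# An invented record, shaped like the metadata payload of an OAI-PMH response.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A sample article</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2009-05-01</dc:date>
</oai_dc:dc>
"""


def dc_elements(record_xml):
    """Map each Dublin Core element name to the list of its values."""
    root = ET.fromstring(record_xml.strip())
    values = {}
    for child in root:
        # ElementTree tags look like '{namespace-uri}local-name'.
        namespace, _, local_name = child.tag[1:].partition("}")
        if namespace == DC_NAMESPACE:
            values.setdefault(local_name, []).append((child.text or "").strip())
    return values


print(dc_elements(SAMPLE_RECORD))
```

Repeated elements such as dc:creator are kept as lists because Dublin Core allows any element to occur more than once.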

The FILE_GETTER section

This section of the toolchain's configuration file contains the following entries:

  • class: Must be 'OaipmhOjsPdf'.
  • temp_directory: Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.

Example

[FILE_GETTER]
class = OaipmhOjsPdf
temp_directory = "/tmp/oaitest_temp"

The WRITER section

This section of the toolchain's configuration file contains the following entries:

  • class: Must be 'Oaipmh'.
  • output_directory: The full path to the directory where output packages are written.

Example

[WRITER]
class = Oaipmh
output_directory = "/tmp/oaitest_output"

The MANIPULATORS section

This toolchain currently does not use any manipulators. Leave this section of the .ini file empty.

If you have a use case, please file an issue.

Example

[MANIPULATORS]

The LOGGING section

This section of the toolchain's configuration file contains the following entries:

  • path_to_log: The full path to the standard log generated by MIK.

Example

[LOGGING]
path_to_log = "/tmp/oaitest_output/mik.log"