Toolchain: CONTENTdm compound PDFs

Mark Jordan edited this page Feb 8, 2016 · 12 revisions

Overview

This toolchain creates Islandora import packages from CONTENTdm objects that are composed of page-level PDF files. The resulting packages, which consist of one MODS XML file and one multipage PDF per object, can then be ingested into Islandora using the Islandora Batch module.

Preparing the content files

Not applicable to this toolchain, since all content is retrieved from the CONTENTdm server.

Preparing the configuration file

All MIK configuration files are standard INI files which contain the following sections: [CONFIG], [FETCHER], [METADATA_PARSER], [FILE_GETTER], [WRITER], [MANIPULATORS], and [LOGGING]. Entries are required unless indicated otherwise below.

Commented lines begin with a semicolon. Values that contain whitespace or special characters (equals, semicolon, etc.) should be wrapped in double quotation marks. If in doubt, use the quotation marks. The order of the sections, and of the entries within each section, does not matter.
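
The comment and quoting rules above can be illustrated with a short fragment. The two entries are taken from the [FETCHER] example later on this page; only the comments are added here.

```ini
; A commented line begins with a semicolon.
[FETCHER]
; A simple value with no whitespace or special characters needs no quotes.
alias = ecucals
; This value contains an equals sign, so it is wrapped in double quotes.
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
```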

The CONFIG section

Key-value pairs in this section are simply written to the top of the log file specified in the [LOGGING] section's path_to_log setting. You can add whatever values you want, but they are static (that is, they cannot be derived dynamically at runtime). All entries in this section are optional.

Example

[CONFIG]
config_id = compound_pdf_test
last_updated_on = "2016-02-01"
last_update_by = "Mark Jordan"

The FETCHER section

This section of the configuration file must contain the following entries:

  • class: Must be 'Cdm'.
  • alias: The CONTENTdm alias (collection string) for the source collection, without the leading /.
  • temp_directory: Full path to the directory where the fetchers write data for use later in the toolchain.
  • ws_url: The full URL to your CONTENTdm server's web services API endpoint.
  • use_cache: Optional; set to false in automated tests (in other words, you will not need to use this unless you are writing automated tests for this fetcher).
  • record_key: Must be 'pointer'.

Example

[FETCHER]
class = Cdm
; The alias of the CONTENTdm collection.
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
temp_directory = "/tmp/ecucals_temp"
; 'record_key' should always be 'pointer' for CONTENTdm fetchers.
record_key = pointer

The METADATA_PARSER section

This section of this toolchain's configuration file contains the following entries:

  • class: Must be 'mods\CdmToMods'.
  • alias: The CONTENTdm alias (collection string) for the source collection, without the leading /.
  • ws_url: The full URL to your CONTENTdm server's web services API endpoint.
  • mapping_csv_path: The path, either full or relative to the mik script, where the metadata mappings file is located.
  • include_migrated_from_uri: If set to 'true', adds an <identifier> element to the object's MODS XML that indicates the source object's reference URL in CONTENTdm. An example of this element is <identifier type="uri" invalid="yes" displayLabel="Migrated From">http://content.lib.sfu.ca/cdm/ref/collection/CT_1930-34/id/17583</identifier>
  • repeatable_wrapper_elements: By default, MIK reduces repeated top-level wrapper MODS elements (same element name with the same attributes) to a single instance of the element. This setting lets you indicate which elements may be repeated (i.e., occur more than once) in your MODS. The most common use for this setting is to allow repeated <extension> elements.

Example

[METADATA_PARSER]
class = mods\CdmToMods
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
; Path to the csv file that contains the CONTENTdm to MODS mappings.
mapping_csv_path = 'ecucals.csv'
; Include the "migrated from" URI in your generated metadata (e.g., MODS).
include_migrated_from_uri = TRUE
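
The example above does not show repeatable_wrapper_elements. A minimal sketch of allowing repeated <extension> elements follows. It assumes that this setting uses the same multivalued [] syntax as postwritehooks[] and datastreams[] elsewhere on this page, which this page does not confirm.

```ini
[METADATA_PARSER]
class = mods\CdmToMods
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
mapping_csv_path = 'ecucals.csv'
; Assumed syntax: one entry per wrapper element that is allowed to repeat.
repeatable_wrapper_elements[] = extension
```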

The FILE_GETTER section

This section of this toolchain's configuration file contains the following entries:

  • class: Must be 'CdmPhpDocuments'.
  • temp_directory: Full path to the directory where the file getter will write data for use later in the toolchain. Can be the same as the temp_directory value used in the [FETCHER] section.
  • ws_url: The full URL to your CONTENTdm server's web services API endpoint.
  • utils_url: The full URL to your CONTENTdm server's web "utilities" directory. More information is available in the API entries listed under the "CONTENTdm Website API Reference — utils" section of the CONTENTdm API documentation.

Example

[FILE_GETTER]
class = CdmPhpDocuments
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
utils_url = "http://content.lib.sfu.ca/utils/"
temp_directory = "/tmp/vanpunk_temp"

The WRITER section

This section of this toolchain's configuration file contains the following entries:

  • class: Must be 'CdmPhpDocuments'.
  • output_directory: The full path to the directory where output packages are written.
  • postwritehooks: Optional. A multivalued list of post-write hook scripts. Values have two parts, the full path to the PHP, Python, or shell executable, and the full path to the script itself.
  • datastreams: Optional. Valid values are 'MODS', 'OBJ', or both. If defined, only the indicated datastream files will be generated. If not defined, MIK will create both the MODS XML file and the OBJ file. Most useful for testing metadata generation, for example datastreams[] = "MODS", which would tell MIK to generate only a MODS XML file for each object.

Example

[WRITER]
class = CdmPhpDocuments
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
output_directory = "/tmp/vanpunk_output"
; Leave blank for Cdm single file objects (the MIK writer assigns the filename).
metadata_filename =
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
; datastreams[] = MODS
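
Because postwritehooks is multivalued, each hook gets its own postwritehooks[] line. The sketch below registers two hooks and restricts output to MODS only, as might be done while testing metadata mappings. The second hook's script path is a hypothetical placeholder, not a script known to ship with MIK.

```ini
[WRITER]
class = CdmPhpDocuments
alias = ecucals
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
output_directory = "/tmp/vanpunk_output"
metadata_filename =
; Each value is the path to the executable followed by the path to the script.
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
; Hypothetical second hook, shown only to illustrate a multivalued list.
postwritehooks[] = "/usr/bin/python /path/to/hypothetical_hook.py"
; Generate only the MODS XML file for each object.
datastreams[] = "MODS"
```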

The MANIPULATORS section

This section of this toolchain's configuration file defines which manipulators should be used. Multiple manipulators can be defined for each type (fetchermanipulators, filegettermanipulators, metadatamanipulators) as illustrated below. The value of each entry is the manipulator class name plus any pipe-separated parameters that the manipulator may require. Entries in this section are optional.

Example

[MANIPULATORS]
; You must use the CdmCompound fetcher manipulator with this toolchain,
; specifying 'Document-PDF' as its parameter.
fetchermanipulators[] = "CdmCompound|Document-PDF"
; fetchermanipulators[] = "RandomSet|50"
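
To define more than one manipulator of a type, repeat the entry with []. In the sketch below, both fetcher manipulators come from the example above. The metadatamanipulators line uses a hypothetical class name and parameters, included only to show the pipe-separated form for another manipulator type.

```ini
[MANIPULATORS]
; Required for this toolchain.
fetchermanipulators[] = "CdmCompound|Document-PDF"
; A second fetcher manipulator; the text after the pipe is its parameter.
fetchermanipulators[] = "RandomSet|50"
; Hypothetical class name and parameters, for illustration only.
metadatamanipulators[] = "HypotheticalManipulator|first_param|second_param"
```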

Manipulators that you may find useful with this toolchain include:

The LOGGING section

This section of this toolchain's configuration file contains the following entries:

  • path_to_log: The full path to the standard log generated by MIK.
  • path_to_manipulator_log: The full path to the log that the manipulators write status and error messages to.

Example

[LOGGING]
; Full path to the general MIK log file.
path_to_log = "/tmp/ecucals_output/mik.log"
; Full path to log file for manipulators.
path_to_manipulator_log = "/tmp/ecucals_output/manipulator.log"