Information for developers

Mark Jordan edited this page Feb 5, 2019 · 40 revisions

This guide is still being developed. Let us know if you have any questions!

Overview

We want MIK to be useful to as many people as possible. MIK is designed to be extended and enhanced, and offers a number of ways that users can contribute. This guide provides information for developers who want to contribute to MIK or who can help us fix bugs.

Our CONTRIBUTING.md file provides some general information on reporting issues, opening pull requests, etc.

Autoloading classes

MIK uses Composer's vendor/autoload.php. When creating classes (or modifying the namespaces of existing classes), you will need to regenerate the autoload files. To do this, run the following from the command-line:

composer dump-autoload

composer update

Coding standards

Use the PSR-2 coding standard for any new code. You can check your work using PHP Code Sniffer by issuing the following command from within the mik directory:

vendor/bin/phpcs --standard=PSR2 yourfile.php

Running tests

From within the mik directory, run:

./vendor/bin/phpunit

Some tests may be skipped, but you should not see any failures:

OK, but incomplete, skipped, or risky tests!
Tests: 56, Assertions: 86, Skipped: 1.

Generating sample configurations and data

MIK provides a utility script that allows you to generate configuration (an .ini file and a mappings file) and sample data (a CSV metadata file and images) for use in development and testing, and as part of pull requests (where you are expected to provide configuration and sample data to assist in code testing and review).

For example, to generate sample configuration and input data for single-file objects, run within your MIK directory:

php extras/scripts/samplecontentgenerator/generate.php --id single-test /tmp/single

This will produce output in /tmp/single that looks like this:

/tmp/single/
├── file1.jpg
├── file2.jpg
├── file3.jpg
├── file4.jpg
├── file5.jpg
├── single-test.ini
├── single-test_mappings.csv
└── single-test_metadata.csv

If you run the following command:

php extras/scripts/samplecontentgenerator/generate.php -m compound --id compound-test /tmp/compound

You will get output in /tmp/compound like this:

/tmp/compound/
├── compoundobject1
│   ├── image_01.tif
│   └── image_02.tif
├── compoundobject2
│   ├── image_01.tif
│   └── image_02.tif
├── compoundobject3
│   ├── image_01.tif
│   └── image_02.tif
├── compoundobject4
│   ├── image_01.tif
│   └── image_02.tif
├── compoundobject5
│   ├── image_01.tif
│   └── image_02.tif
├── compound-test.ini
├── compound-test_mappings.csv
└── compound-test_metadata.csv

Other supported content models are 'books' and 'newspapers'. Run php extras/scripts/samplecontentgenerator/generate.php --help for more information.

Extending MIK

From a high-level workflow perspective, MIK loops through a set of records and writes out an Islandora ingest package for each one.

At the code level, the mik script gets a list of records from a Fetcher, and for each record, invokes a Writer, which assembles the metadata and any children (e.g., pages of a newspaper issue) of the object described in the metadata record. The Writer's writePackages() method generates the MODS (via a MetadataParser) and any content files (via a FileGetter), and writes everything out to an Islandora ingest package. The values in the .ini file tell these components where to get their input data, what manipulators to apply, where to log things, and where to write out the packages.

MIK can be extended by using standard object-oriented PHP development techniques. The various components pictured in the overview diagram represent groups of object-oriented PHP code:

(Diagram: MIK details)

MIK provides multiple fetcher classes, metadata parser classes, file getter classes, and writer classes, plus multiple manipulator classes. A representation of the class layout (with some classes removed for brevity) is:

mik/src
├── config
│   ├── CdmConfig.php
│   ├── Config.php
│   └── CsvConfig.php
├── exceptions
│   └── MikErrorException.php
├── fetchermanipulators
│   ├── CdmByDmDate.php
│   ├── Fetchermanipulator.php
│   ├── RandomSet.php
│   └── SpecificSet.php
├── fetchers
│   ├── Cdm.php
│   ├── Csv.php
│   ├── Fetcher.php
│   └── Oaipmh.php
├── filegettermanipulators
│   ├── CdmSingleFile.php
│   ├── CdmSingleFileVanpunk.php
│   └── Filegettermanipulator.php
├── filegetters
│   ├── CdmBooks.php
│   ├── CsvNewspapers.php
│   ├── FileGetter.php
│   └── OaipmhXpath.php
├── filemanipulators
│   ├── FileManipulator.php
│   └── ThumbnailFromCDM.php
├── metadatamanipulators
│   ├── MetadataManipulator.php
│   ├── NormalizeDate.php
│   ├── PiratizeAbstract.php
│   └── SimpleReplace.php
├── metadataparsers
│   ├── dc
│   │   ├── CdmToDc.php
│   │   ├── Dc.php
│   │   └── OaiToDc.php
│   ├── MetadataParser.php
│   └── mods
│       ├── CdmToMods.php
│       ├── CsvToMods.php
│       └── Mods.php
└── writers
    ├── CdmCompound.php
    ├── CsvSingleFile.php
    ├── Oaipmh.php
    └── Writer.php

As described below, making MIK work with new sources of data involves writing new subclasses of fetchers, file getters, metadata parsers, writers, and manipulators.

Post-write hook and shutdown scripts are not classes, so they do not live within the class tree. They can be located anywhere (but are by convention located in MIK's extras/scripts subdirectory), since you refer to them explicitly in your .ini file.

Configure before you code

MIK is highly configurable. Before you write any code to extend MIK, determine if what you want to do can be accomplished via settings in your .ini file.

In particular, MIK's manipulators, which are registered in an .ini file's [MANIPULATORS] section, can add a lot of extra functionality to toolchains. Many manipulators take parameters, which lets you further customize them. The overview provides links to the wiki pages for some manipulators. Take a look in the source code directories for the fetcher manipulators and metadata manipulators.

The MIK Cookbook also documents many features and capabilities that might meet your needs.

Writing manipulators

We may get rid of file manipulators: https://github.com/MarcusBarnes/mik/issues/117

Manipulators are plugins that can "manipulate" the behaviour of fetchers, metadata parsers, and other core MIK classes via configuration in the .ini file. All the code for a manipulator is encapsulated in a single PHP class file. Manipulators are registered in the [MANIPULATORS] section of the MIK configuration file, and may take parameters. A manipulator's signature identifies the group it belongs to, followed by an equals sign, followed by the manipulator's name and parameters, which are delimited by the pipe symbol (|). The first entry is always the name of the manipulator. In the following example, the "RandomSet" manipulator is registered with the parameters "10" (the size of the random set) and "randomset.txt" (the file to save the set's members to):

[MANIPULATORS]
fetchermanipulators[] = "RandomSet|10|randomset.txt"
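As a rough illustration of the signature format, a registered signature can be split on the pipe symbol to recover the manipulator's name and its parameters. This is a standalone sketch, not MIK's own parsing code, and the helper name parseManipulatorSignature() is made up for this example:

```php
<?php

/**
 * Hypothetical helper (not part of MIK): splits a manipulator signature
 * from the [MANIPULATORS] section into its pipe-delimited parts.
 */
function parseManipulatorSignature($signature)
{
    return explode('|', $signature);
}

$manipulator_settings = parseManipulatorSignature('RandomSet|10|randomset.txt');

// Position 0 holds the manipulator's name; the remaining entries are its parameters.
$name = $manipulator_settings[0];
$parameters = array_slice($manipulator_settings, 1);
```

Split this way, every parameter arrives as a string, so a numeric setting such as the set size may need to be cast before it is used in arithmetic.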

Within the manipulator class, the manipulator's parameters (or "settings") are passed to the __construct() method as the $manipulator_settings array. The first entry in the array ($manipulator_settings[0]) is the manipulator's name, so you can ignore it:

class RandomSet extends FetcherManipulator
{
    /**
     * @var int $setSize - The size of the random set.
     */
    public $setSize;

    /**
     * Create a new RandomSet fetchermanipulator Instance.
     *
     * @param array $settings
     *   All of the settings from the .ini file.
     *
     * @param array $manipulator_settings
     *   An array of all of the settings for the current manipulator,
     *   with the manipulator class name in the first position, the string
     *   indicating the set size in the second, and the optional
     *   output filename in the third.
     */
    public function __construct($settings, $manipulator_settings)
    {
        $this->setSize = $manipulator_settings[1];
        if (isset($manipulator_settings[2])) {
            $this->outputFile = $manipulator_settings[2];
            $now = date("F j, Y, g:i a");
            $message = "# Output of the MIK Random Set fetcher manipulator, generated $now" . PHP_EOL;
            if (file_exists($this->outputFile)) {
                $message = PHP_EOL . $message;
            }
            file_put_contents($this->outputFile, $message, FILE_APPEND);
        }
        // To get the value of $onWindows.
        parent::__construct();
    }

    // The manipulate() method and the rest of the class are omitted here.
}

Note that if you add a new manipulator, you must run composer dump-autoload.

Fetcher manipulators

Fetcher manipulators must implement a manipulate() method, whose signature is:

    /** 
     * @param array $all_records
     *   All of the records from the fetcher.
     *
     * @return array $filtered_records
     *   An array of records that pass the test(s) defined in the fetcher manipulator.
     */
    public function manipulate($all_records)
    {
    }
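A minimal sketch of a manipulate() implementation, assuming the records can be treated as a plain PHP array. The class name KeepFirstN is hypothetical, and to keep the sketch runnable on its own it does not extend MIK's fetcher manipulator base class, which a real manipulator would do:

```php
<?php

class KeepFirstN
{
    /** @var int The number of records to keep. */
    public $limit;

    public function __construct($settings, $manipulator_settings)
    {
        // Entry 0 is the manipulator's name; entry 1 is its first parameter.
        $this->limit = (int) $manipulator_settings[1];
    }

    /**
     * @param array $all_records
     *   All of the records from the fetcher.
     *
     * @return array
     *   The first $limit records.
     */
    public function manipulate($all_records)
    {
        return array_slice($all_records, 0, $this->limit);
    }
}

$manipulator = new KeepFirstN(array(), array('KeepFirstN', '2'));
$filtered_records = $manipulator->manipulate(array('rec1', 'rec2', 'rec3'));
```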

For examples, see the classes in the fetchermanipulators source directory, such as RandomSet.php and SpecificSet.php.

Metadata manipulators

Metadata manipulator classes must implement the manipulate() method, whose signature is:

    /**
     * General manipulate wrapper method.
     *
     *  @param string $input
     *     The XML fragment to be manipulated.
     *
     * @return string
     *     One of the manipulated XML fragment, the original input XML if the
     *     input is not the fragment we are interested in, or an empty string.
     */
    public function manipulate($input)
    {
    }
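The three possible return values can be seen in a small sketch. It assumes the manipulator is only interested in MODS-style titleInfo fragments; the class name TrimTitleWhitespace is hypothetical, and the class is kept standalone rather than extending MIK's metadata manipulator base class:

```php
<?php

class TrimTitleWhitespace
{
    /**
     * @param string $input
     *   The XML fragment to be manipulated.
     *
     * @return string
     *   The manipulated fragment, or the original input if it is not
     *   a well-formed <titleInfo> fragment.
     */
    public function manipulate($input)
    {
        $dom = new \DOMDocument();
        // Return the input untouched if it is empty or not well-formed XML.
        if ($input === '' || !@$dom->loadXML($input)) {
            return $input;
        }
        // Only <titleInfo> fragments are of interest to this manipulator.
        if ($dom->documentElement->nodeName !== 'titleInfo') {
            return $input;
        }
        foreach ($dom->getElementsByTagName('title') as $title) {
            $trimmed = trim($title->textContent);
            // Replace the element's children with a single trimmed text node.
            while ($title->firstChild) {
                $title->removeChild($title->firstChild);
            }
            $title->appendChild($dom->createTextNode($trimmed));
        }
        return $dom->saveXML($dom->documentElement);
    }
}

$manipulator = new TrimTitleWhitespace();
$output = $manipulator->manipulate('<titleInfo><title>  A title  </title></titleInfo>');
$unchanged = $manipulator->manipulate('<abstract>Some text</abstract>');
```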

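For examples, see the classes in the metadatamanipulators source directory, such as SimpleReplace.php and NormalizeDate.php.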

Writing post-write hook scripts

Post-write hook scripts apply actions to each ingest package immediately after it is written out by MIK. More information is available in the overview. They do not typically use MIK's core fetcher, file getter, metadata parser, and writer classes, although they may if necessary.

Sample:

<?php

/**
 * All post-write hook scripts get the record key as the first parameter, a comma-
 * separated list of children record keys as the second, and the path to the MIK .ini
 * file as the third.
 *
 * This is a sample script that writes some data to a file.
 */

$record_key = trim($argv[1]);
$children_record_keys_string = trim($argv[2]);
$children_record_keys = explode(',', $children_record_keys_string);
$config_path = trim($argv[3]);

$config = parse_ini_file($config_path, true);

// Write some data from the parameters to a file.
file_put_contents('/tmp/task1.txt', "Record key from task1.php: $record_key\n", FILE_APPEND);
file_put_contents('/tmp/task1.txt', "Children record key from task1.php: " . implode(',', $children_record_keys) . "\n", FILE_APPEND);
file_put_contents('/tmp/task1.txt', "Output directory from MIK config: " . $config['WRITER']['output_directory'] . "\n", FILE_APPEND);
file_put_contents('/tmp/task1.txt', "Sample post-write hook script has finished\n", FILE_APPEND);

Developers at Louisiana State University Library have contributed a post-write hook script that uses the Saxon XSLT processor to apply some XSLT 2.0 stylesheets to MODS XML files. They have provided a brief overview. The script itself is fairly simple, but it applies a wide range of cleanup stylesheets to the MODS files.

Post-write hook scripts can be difficult to troubleshoot, since they run as background processes and do not write to STDOUT or STDERR. Generally speaking, dumping variables to a file while troubleshooting is the best approach. You can also test a hook as a standalone script by running it on the command line and passing it its three expected arguments: 1) a record key, 2) a comma-separated list of child record keys (or, if there are no children, an empty string), and 3) the path to MIK's .ini file. For example, if your post-write hook script is called pwhs.php, you can run it as:

php pwhs.php image01 '' /my/mik.ini

The same script run using children record keys as its second argument is:

php pwhs.php image01 image12,image13,image14 /my/mik.ini
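The "dump variables to a file" approach mentioned above might look like the following sketch. The helper name dump_to_file() and the debug file location are made up for this example; neither is part of MIK:

```php
<?php

/**
 * Hypothetical troubleshooting helper (not part of MIK): appends a
 * labelled dump of a variable to a file.
 */
function dump_to_file($label, $value, $path)
{
    $line = $label . ': ' . var_export($value, true) . PHP_EOL;
    file_put_contents($path, $line, FILE_APPEND);
}

// Write to a uniquely named file in the system's temporary directory.
$debug_file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'pwhs_debug_' . uniqid() . '.txt';

dump_to_file('record_key', 'image01', $debug_file);
dump_to_file('children', array('image12', 'image13'), $debug_file);
```

Because the hook runs in the background, checking the debug file after MIK finishes is usually the quickest way to confirm which values the script actually received.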

Writing shutdown scripts

Shutdown scripts run after MIK has written all of the ingest packages. More information is available in the overview. Like post-write hook scripts, they do not typically use MIK's core fetcher, file getter, metadata parser, and writer classes, although they may if necessary.

<?php

/**
 * Shutdown hook script for MIK that deletes the contents of the
 * temp_directory defined in the MIK .ini file.
 */

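$config_path = trim($argv[1]);
$config = parse_ini_file($config_path, true);
$temp_dir = $config['FETCHER']['temp_directory'];

delete_temp_files($temp_dir);

/**
 * Deletes the files directly inside $temp_dir. Subdirectories are left alone.
 */
function delete_temp_files($temp_dir)
{
    $temp_files = glob($temp_dir . '/*');
    foreach ($temp_files as $temp_file) {
        // unlink() cannot remove directories, so skip anything that is not a file.
        if (is_file($temp_file)) {
            unlink($temp_file);
        }
    }
}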

Unlike post-write hook scripts, shutdown scripts are run as foreground processes. Shutdown scripts take only one parameter, the path to the .ini file. They are also easier to debug than post-write hook scripts. You can run a shutdown script easily outside of MIK by passing the path to the .ini file as a parameter. For example, if your script is called shutdown.php, you can run it as:

php shutdown.php /my/mik.ini

Writing fetchers, file getters, metadata parsers, and writers

Adding only a single toolchain component

MIK has been designed so that it is possible to add new fetchers, file getters, metadata parsers, and writers that can be used with components from existing toolchains. A good illustration of this is the file getters used with the OAI-PMH toolchains. Most repositories that provide an OAI-PMH gateway follow the standard closely. However, the OAI-PMH specification only deals with how metadata (such as Dublin Core) about objects in a repository should be shared; it is completely silent on how PDFs, image files, and other non-metadata content for the objects described in the Dublin Core metadata can be retrieved. To make MIK capable of migrating content from a new OAI-PMH compliant repository platform, all you need to do is add a specialized file getter. The fetcher, metadata parser, and writer that are part of the other OAI-PMH toolchains should work.

To illustrate this, we provide a sample OAI-PMH file getter that is hard-coded to retrieve files from Islandora. We do not need to write a separate fetcher, metadata parser or writer. Here is the code for the file getter, with some inline comments pointing out the Islandora-specific code:

<?php

/**
 * This filegetter is for use in OAI-PMH toolchains that harvest content from
 * Islandora sites. Will harvest the datastreams listed in the config option
 * [FILE_GETTER] datastream_ids.
 *
 * Intended as an example of a specialized, repository-specific filegetter
 * and is primarily for use in workshops and other training or testing situations.
 */

namespace mik\filegetters;

use mik\exceptions\MikErrorException;
use Monolog\Logger;

class OaipmhIslandoraObj extends FileGetter
{
    /**
     * @var array $settings - configuration settings from configuration class.
     */
    public $settings;

    /**
     * Create a new OAI Single File Fetcher Instance.
     * @param array $settings configuration settings.
     */
    public function __construct($settings)
    {
        $this->settings = $settings['FILE_GETTER'];
        $this->fetcher = new \mik\fetchers\Oaipmh($settings);
        $this->temp_directory = $this->settings['temp_directory'];

        // Set up logger.
        $this->pathToLog = $settings['LOGGING']['path_to_log'];
        $this->log = new \Monolog\Logger('OaipmhIslandoraObj filegetter');
        $this->logStreamHandler = new \Monolog\Handler\StreamHandler($this->pathToLog,
            Logger::ERROR);
        $this->log->pushHandler($this->logStreamHandler);

        $this->oai_endpoint = $settings['FETCHER']['oai_endpoint'];
        // The list of datastreams to download is specific to this filegetter
        // so we need to define that list in a new config option.
        $this->datastreamIds = $settings['FILE_GETTER']['datastream_ids'];
    }

    /**
     * Placeholder method needed because it's called in the main loop in mik.
     */
    public function getChildren($record_key)
    {
        return array();
    }

    /**
     * Get the URL for the datastream (OBJ, PDF, image, etc.).
     *
     * @param string $record_key
     *
     * @return string $ds_url
     */
    public function getFilePath($record_key)
    {
        // Get the OAI record from the temp directory.
        $raw_metadata_path = $this->settings['temp_directory'] . DIRECTORY_SEPARATOR . $record_key . '.metadata';
        $dom = new \DOMDocument;
        $xml = file_get_contents($raw_metadata_path);
        $dom->loadXML($xml);

        // There will only be one oai:identifier element. Islandora's OAI identifiers look like
        // oai:digital.lib.sfu.ca:foo_123, where 'foo_123' corresponds to the object's PID 'foo:123'.
        $identifier = $dom->getElementsByTagNameNS('http://www.openarchives.org/OAI/2.0/', 'identifier')->item(0);
        $raw_pid = preg_replace('#.*:#', '', trim($identifier->nodeValue));
        $pid = preg_replace('/_/', ':', $raw_pid);

        // Get bits that make up the Islandora instances host plus port. Assumes that the OAI-PMH
        // endpoint is on the same host as the datastream files.
        $islandora_url_info = parse_url($this->oai_endpoint);
        if (isset($islandora_url_info['port'])) {
            // parse_url() returns the port without its leading colon.
            $port = ':' . $islandora_url_info['port'];
        } else {
            $port = '';
        }
        $islandora_host = $islandora_url_info['scheme'] . '://' . $islandora_url_info['host'] . $port;

        // Assemble the URL of each datastream listed in the config and return on the first one
        // that is available. We loop through DSIDs because not all Islandora content models
        // require an OBJ datastream, e.g., PDF, video and audio content models.
        foreach ($this->datastreamIds as $dsid) {
            $ds_url = $islandora_host . '/islandora/object/' . $pid . '/datastream/' . $dsid . '/download';
            // HEAD is probably more efficient than the default GET.
            stream_context_set_default(array('http' => array('method' => 'HEAD')));
            $headers = get_headers($ds_url, 1);
            if ($headers[0] == 'HTTP/1.1 200 OK') {
                return $ds_url;
            }
        }

        // If no datastreams listed in $this->datastreamIds are available, return false.
        return false;
    }
}
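To make the identifier-to-URL steps in getFilePath() concrete, here is a standalone sketch that runs the same string operations on sample values. The OAI identifier and endpoint below are made up for illustration, and the sketch joins the port to the host with a colon:

```php
<?php

// Made-up sample values for illustration only.
$oai_identifier = 'oai:digital.lib.sfu.ca:foo_123';
$oai_endpoint = 'http://example.org:8080/oai2';

// Strip everything up to and including the last colon, then turn the
// underscore into the colon that Islandora PIDs use.
$raw_pid = preg_replace('#.*:#', '', trim($oai_identifier));
$pid = preg_replace('/_/', ':', $raw_pid);

// Rebuild scheme://host[:port] from the OAI-PMH endpoint.
$parts = parse_url($oai_endpoint);
$port = isset($parts['port']) ? ':' . $parts['port'] : '';
$host = $parts['scheme'] . '://' . $parts['host'] . $port;

// Assemble the download URL for the OBJ datastream.
$ds_url = $host . '/islandora/object/' . $pid . '/datastream/OBJ/download';
```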

The .ini configuration options for this file getter are:

[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oaitest_temp"
datastream_ids[] = OBJ
datastream_ids[] = PDF
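The repeated datastream_ids[] lines become a PHP array once the .ini file is parsed. The sketch below assumes the configuration is read with PHP's built-in INI parser with sections enabled, as the sample hook scripts in this guide do with parse_ini_file(); it uses parse_ini_string() so that it runs without a file on disk:

```php
<?php

$ini = <<<'INI'
[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oaitest_temp"
datastream_ids[] = OBJ
datastream_ids[] = PDF
INI;

// The second argument enables section processing, so options are
// grouped under their [SECTION] name.
$settings = parse_ini_string($ini, true);

// Repeated "key[] = value" lines are collected into an indexed array.
$datastream_ids = $settings['FILE_GETTER']['datastream_ids'];
```

This is why the file getter's constructor can read $settings['FILE_GETTER']['datastream_ids'] and loop over it directly.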

Adding multiple toolchain components

In most cases, you will only have to write one new component, as explained above. In others, you may need to write multiple new components. As an example, pretend that we want a toolchain that fetches from a source not addressed in MIK's current set of toolchains, and that generates content for ingestion into some other repository platform.

To illustrate how this is done, we provide sample components of a CSV to JSON toolchain that work with existing CSV toolchain components. This toolchain does not produce Islandora ingest packages; it produces ingest packages for a hypothetical repository platform called Friday (Jason, get it?), which is like Islandora but uses JSON documents instead of MODS XML to store metadata. The new components, all of which appear in the sample .ini file below, are the json\CsvToJson metadata parser, the CsvSingleFileJson writer, and the SplitRepeatedValuesInJson metadata manipulator.

We do not need to write fetcher and file getter classes for this CSV to JSON toolchain because the existing CSV fetcher and single-file file getter can be used. The ability to reuse these components is evident in the sample .ini file for our new toolchain:

; MIK configuration file for the demonstration CSV to JSON toolchain.

; This toolchain is intended to illustrate how to extend MIK to create
; output that differs from Islandora ingest packages. In this case, the
; metadata files are in serialized JSON format, not XML. Uses the existing
; Csv fetcher and CsvSingleFile filegetter.

; This toolchain is not intended to be used in production.

[SYSTEM]

[CONFIG]
config_id = MIK CSV to JSON test
last_updated_on = "2016-10-27"
last_update_by = "Mark Jordan"

[FETCHER]
class = Csv
input_file = "tutorial_metadata.csv"
temp_directory = "/tmp/csv_to_json_temp"
record_key = Identifier

[METADATA_PARSER]
class = json\CsvToJson
; No mappings file; CSV column headings are used as the keys in the JSON.

[FILE_GETTER]
class = CsvSingleFile
input_directory = "/home/mark/Downloads/mik_tutorial_data"
temp_directory = "/tmp/csv_to_json_temp"
file_name_field = File

[WRITER]
class = CsvSingleFileJson
output_directory = "/tmp/csv_to_json_output"
preserve_content_filenames = true

[MANIPULATORS]
metadatamanipulators[] = "SplitRepeatedValuesInJson|Subjects|;"

[LOGGING]
path_to_log = "/tmp/csv_to_json_output/mik.log"
path_to_manipulator_log= "/tmp/csv_to_json_output/manipulator.log"
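The following sketch shows the general idea of using CSV column headings as JSON keys. It is not the code of the CsvToJson parser or of the SplitRepeatedValuesInJson manipulator; the column headings and row values are made up, and the splitting step is only one plausible reading of the "SplitRepeatedValuesInJson|Subjects|;" signature:

```php
<?php

// Made-up column headings and row values for illustration.
$headings = array('Identifier', 'Title', 'Subjects');
$row = array('img001', 'Harbour at dusk', 'Boats;Harbours');

// Use the column headings as the keys of the record.
$record = array_combine($headings, $row);

// Assumed behaviour: split the named field on the delimiter into a
// list of separate values.
$record['Subjects'] = explode(';', $record['Subjects']);

// Serialize the record as the JSON metadata document.
$json = json_encode($record);
```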

Required properties and functions in subclassed components

As described earlier, the mik script gets a list of records from a Fetcher, and for each record, invokes a Writer, which assembles the metadata and any children of the object described in the metadata record. The Writer's writePackages() method generates the MODS or DC XML via the MetadataParser's metadata() method and any content files via a FileGetter's getFilePath() or equivalent method, and writes everything out to an Islandora ingest package.

If you are writing a new component for an existing toolchain, or writing an entirely new toolchain, the reference below will guide you in determining which methods the components should implement.

Fetchers

    /**
     * Return an array of records. For CONTENTdm toolchains,
     *   this will be all of the records in a collection. For CSV toolchains,
     *   this will be all of the rows of data with a unique index.
     *
     * @param $limit int
     *   Optional. If present, only the first $limit records will be
     *   returned.
     *
     * @return object|array Either an array of records (e.g., Cdm)
     * or an object containing an array of records (e.g. Csv).
     */
    public function getRecords($limit = null)
    {
    }

    /**
     * Implements fetchers\Fetcher::getNumRecs.
     *
     * Returns the number of records in the set returned
     *   by getRecords().
     *
     * @return total number of records
     */
    public function getNumRecs()
    {
    }

    /**
     * Implements fetchers\Fetcher::getItemInfo
     *
     * Returns a hashed array or object containing a record's fields.
     *
     * @param string $recordKey the unique record_key
     *   For CSV, this will be the unique id assigned to a row of data.
     *   For Cdm, this will be the CONTENTdm pointer for the record.
     *
     * @return object The record.
     */
    public function getItemInfo($recordKey)
    {
    }
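As a sketch of how these three methods fit together, here is a hypothetical fetcher backed by an in-memory array. The class name ArrayFetcher and its constructor are assumptions for this example; a real fetcher would extend MIK's Fetcher class and read its input from the source named in the .ini file:

```php
<?php

class ArrayFetcher
{
    /** @var array Records indexed by their record key. */
    private $records;

    public function __construct(array $records)
    {
        $this->records = $records;
    }

    // Returns all records, or only the first $limit records if a limit is given.
    public function getRecords($limit = null)
    {
        $records = array_values($this->records);
        return $limit === null ? $records : array_slice($records, 0, $limit);
    }

    // Returns the total number of records available.
    public function getNumRecs()
    {
        return count($this->records);
    }

    // Returns a single record as an object, looked up by its record key.
    public function getItemInfo($recordKey)
    {
        return (object) $this->records[$recordKey];
    }
}

$fetcher = new ArrayFetcher(array(
    'img001' => array('Identifier' => 'img001', 'Title' => 'First'),
    'img002' => array('Identifier' => 'img002', 'Title' => 'Second'),
));
$first = $fetcher->getRecords(1);
$info = $fetcher->getItemInfo('img002');
```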

Fetchers may also implement an applyFetchermanipulators() method, which applies all fetcher manipulators registered in the [MANIPULATORS] section of the toolchain's .ini file.

Writers

    /**
     * Writes files and folders that make up the Islandora ingest package.
     *
     * @param string $metadata
     *   The XML file that is to be written.
     *
     * @param array $pages
     *   An array of page record keys. Should be an empty array if there are no children.
     *
     * @param string $record_id
     *   The unique key for this object.
     */
    public function writePackages($metadata, $pages, $record_id)
    {
    }

    /**
     * Writes the metadata file for the Islandora ingest package.
     *
     * @param string $metadata
     *   The XML file that is to be written.
     *
     * @param string $path
     *   The destination path for the metadata file.
     *
     * @param bool $overwrite
     *   Whether or not to overwrite the file if it exists.
     */
    public function writeMetadataFile($metadata, $path, $overwrite = true)
    {
    }
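One way the $overwrite flag could be honoured is sketched below as a standalone function. The function name writeMetadataFileSketch() and its boolean return value are assumptions made for this example; the method signature above does not document a return value:

```php
<?php

/**
 * Hypothetical standalone version of writeMetadataFile().
 *
 * @return bool
 *   True if the file was written, false if it was skipped or the write failed.
 */
function writeMetadataFileSketch($metadata, $path, $overwrite = true)
{
    // Leave an existing file alone when overwriting is turned off.
    if (!$overwrite && file_exists($path)) {
        return false;
    }
    return file_put_contents($path, $metadata) !== false;
}

$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'mik_sketch_' . uniqid() . '.xml';
$first_write = writeMetadataFileSketch('<mods>first</mods>', $path);
$second_write = writeMetadataFileSketch('<mods>second</mods>', $path, false);
$contents = file_get_contents($path);
unlink($path);
```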

Metadataparsers

    /**
     * Gets the serialized metadata file (MODS, DC, etc.) for a specific object.
     *
     * @param string $record_key
     *   The object's record key.
     *
     * @return string
     *   The serialized (XML, JSON, etc.) version of the object's metadata.
     */
    public function metadata($record_key)
    {
    }

Filegetters

    /**
     * Array of record keys for children of the current object
     * (e.g., pages of a newspaper issue). Filegetters for content
     * models that do not have children (PDF, image, etc.) must implement
     * this method and return an empty array.
     *
     * @param string $record_key
     *
     * @return array
     */
    public function getChildren($record_key)
    {
    }

The name of the method that returns locations of files varies across toolchains. Some FileGetter classes that deal with single file objects (CsvSingleFile and the Oaipmh* filegetters) implement a getFilePath() method:

    /**
     * @param string $record_key
     *
     * @return string $path_to_file
     */
    public function getFilePath($record_key)
    {
        // CsvSingleFile
    }
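A minimal sketch of a single-file file getter that provides both methods follows. The class name SingleFileGetterSketch, its constructor arguments, and the choice to return false for an unknown record key are assumptions for this example; a real file getter would extend MIK's FileGetter class and take its values from the .ini settings:

```php
<?php

class SingleFileGetterSketch
{
    /** @var string Directory that holds the content files. */
    private $inputDirectory;

    /** @var array File names indexed by record key. */
    private $fileNames;

    public function __construct($input_directory, array $file_names)
    {
        $this->inputDirectory = $input_directory;
        $this->fileNames = $file_names;
    }

    // Single-file objects have no children, so return an empty array.
    public function getChildren($record_key)
    {
        return array();
    }

    // Returns the path to the record's file, or false if none is known.
    public function getFilePath($record_key)
    {
        if (!isset($this->fileNames[$record_key])) {
            return false;
        }
        return $this->inputDirectory . DIRECTORY_SEPARATOR . $this->fileNames[$record_key];
    }
}

$getter = new SingleFileGetterSketch('/tmp/single', array('rec1' => 'file1.jpg'));
$path_to_file = $getter->getFilePath('rec1');
$missing = $getter->getFilePath('rec9');
```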

There is little consistency in other filegetter subclasses, since the exact method is hard-coded in the toolchain's Writer subclass. To make matters even more confusing, some filegetters have functions that identify master files and files retrieved from a remote server (e.g., Cdm* filegetters), while others have functions that are inconsistently named but do essentially the same thing (CsvCompound's getCpdSourcePath(), CsvBook's getBookSourcePath(), and CsvNewspaper's getIssueSourcePath()).

Using configuration options in your subclasses

@todo: explain how .ini file options can be accessed within classes.

Understanding MIK's log files

While developing for MIK you will need to refer to the log files it writes out. This cookbook entry provides a detailed overview of the log files.

Writing tests

Writing tests is easy for CSV toolchains but hard for toolchains that interact with remote systems such as CONTENTdm and OAI-PMH providers.

If you are writing tests that invoke the CSV fetcher, don't forget to use the 'use_cache = false' config option. If your tests appear to be using a different input CSV than the one specified, you probably need this option.

An end-to-end test class for the CsvToJson toolchain (testing the fetcher, metadata parser, and filegetter plus writer) is in tests/CsvToJsonToolchainTest.php.

To run the tests, run the following two commands in your mik directory. The first runs every test except those in the inputvalidators group; the second runs only that group:

phpunit --exclude-group inputvalidators --bootstrap vendor/autoload.php tests

phpunit --group inputvalidators --bootstrap vendor/autoload.php tests

Utility functions

Currently, MIK provides a utility function to log a variable's value, intended for use during development and troubleshooting.
