diff --git a/search/search_index.json b/search/search_index.json index 5a64799..619198f 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"#oaipmh-scythe-oai-pmh-for-humans","title":"oaipmh-scythe: OAI-PMH for Humans","text":"

This is a community-maintained fork of the original sickle.

CI Docs Meta

oaipmh-scythe is a lightweight OAI-PMH client library written in Python. It has been designed for conveniently retrieving data from OAI interfaces the Pythonic way:

>>> from oaipmh_scythe import Scythe\n>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
"},{"location":"#features","title":"Features","text":""},{"location":"#requirements","title":"Requirements","text":"

Python >= 3.8

"},{"location":"#installation","title":"Installation","text":"
python -m pip install oaipmh-scythe\n
"},{"location":"#documentation","title":"Documentation","text":"

The documentation is made with Material for MkDocs and is hosted by GitHub Pages.

"},{"location":"#license","title":"License","text":"

oaipmh-scythe is distributed under the terms of the BSD license.

"},{"location":"api/","title":"API","text":""},{"location":"api/#the-scythe-client","title":"The Scythe Client","text":"

Client for harvesting OAI interfaces.

Use it like this:

>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n

:param endpoint: The endpoint of the OAI interface. :param http_method: Method used for requests (GET or POST, default: GET). :param protocol_version: The OAI protocol version. :param iterator: The type of the returned iterator (default: :class:sickle.iterator.OAIItemIterator) :param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will use the value from the retry-after header (if present) and will wait the specified number of seconds between retries. :param retry_status_codes: HTTP status codes to retry (default will only retry on 503) :param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found on the response (defaults to 60 seconds) :param class_mapping: A dictionary that maps OAI verbs to classes representing OAI items. If not provided, :data:sickle.app.DEFAULT_CLASS_MAPPING will be used. :param encoding: Can be used to override the encoding used when decoding the server response. If not specified, requests will use the encoding returned by the server in the content-type header. However, if the charset information is missing, requests will fallback to 'ISO-8859-1'. :param request_args: Arguments to be passed to requests when issuing HTTP requests. Useful examples are auth=('username', 'password') for basic auth-protected endpoints or timeout=<int>. See the documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>_ for all available parameters.

Source code in src/oaipmh_scythe/app.py
class Scythe:\n    \"\"\"Client for harvesting OAI interfaces.\n\n    Use it like this:\n\n        >>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n        >>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n        >>> records.next()\n        <Record oai:zenodo.org:4574771>\n\n    :param endpoint: The endpoint of the OAI interface.\n    :param http_method: Method used for requests (GET or POST, default: GET).\n    :param protocol_version: The OAI protocol version.\n    :param iterator: The type of the returned iterator\n           (default: :class:`sickle.iterator.OAIItemIterator`)\n    :param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will\n                        use the value from the retry-after header (if present) and will wait the specified number of\n                        seconds between retries.\n    :param retry_status_codes: HTTP status codes to retry (default will only retry on 503)\n    :param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found\n                                on the response (defaults to 60 seconds)\n    :param class_mapping: A dictionary that maps OAI verbs to classes representing\n                          OAI items. If not provided,\n                          :data:`sickle.app.DEFAULT_CLASS_MAPPING` will be used.\n    :param encoding:     Can be used to override the encoding used when decoding\n                         the server response. If not specified, `requests` will\n                         use the encoding returned by the server in the\n                         `content-type` header. However, if the `charset`\n                         information is missing, `requests` will fallback to\n                         `'ISO-8859-1'`.\n    :param request_args: Arguments to be passed to requests when issuing HTTP\n                         requests. 
Useful examples are `auth=('username', 'password')`\n                         for basic auth-protected endpoints or `timeout=<int>`.\n                         See the `documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>`_\n                         for all available parameters.\n    \"\"\"\n\n    def __init__(\n        self,\n        endpoint: str,\n        http_method: str = \"GET\",\n        protocol_version: str = \"2.0\",\n        iterator: BaseOAIIterator = OAIItemIterator,\n        max_retries: int = 0,\n        retry_status_codes: Iterable[int] | None = None,\n        default_retry_after: int = 60,\n        class_mapping: dict[str, OAIItem] | None = None,\n        encoding: str | None = None,\n        timeout: int = 60,\n        **request_args: str,\n    ):\n        self.endpoint = endpoint\n        if http_method not in (\"GET\", \"POST\"):\n            raise ValueError(\"Invalid HTTP method: %s! Must be GET or POST.\")\n        if protocol_version not in (\"2.0\", \"1.0\"):\n            raise ValueError(\"Invalid protocol version: %s! 
Must be 1.0 or 2.0.\")\n        self.http_method = http_method\n        self.protocol_version = protocol_version\n        if inspect.isclass(iterator) and issubclass(iterator, BaseOAIIterator):\n            self.iterator = iterator\n        else:\n            raise TypeError(\"Argument 'iterator' must be subclass of %s\" % BaseOAIIterator.__name__)\n        self.max_retries = max_retries\n        self.retry_status_codes = retry_status_codes or (503,)\n        self.default_retry_after = default_retry_after\n        self.oai_namespace = OAI_NAMESPACE % self.protocol_version\n        self.class_mapping = class_mapping or DEFAULT_CLASS_MAP\n        self.encoding = encoding\n        self.timeout = timeout\n        self.request_args = request_args\n\n    def harvest(self, **kwargs: str) -> OAIResponse:\n        \"\"\"Make HTTP requests to the OAI server.\n\n        :param kwargs: OAI HTTP parameters.\n        \"\"\"\n        http_response = self._request(kwargs)\n        for _ in range(self.max_retries):\n            if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n                retry_after = self.get_retry_after(http_response)\n                logger.warning(\"HTTP %d! 
Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n                time.sleep(retry_after)\n                http_response = self._request(kwargs)\n        http_response.raise_for_status()\n        if self.encoding:\n            http_response.encoding = self.encoding\n        return OAIResponse(http_response, params=kwargs)\n\n    def _request(self, kwargs: str) -> Response:\n        if self.http_method == \"GET\":\n            return requests.get(self.endpoint, timeout=self.timeout, params=kwargs, **self.request_args)\n        return requests.post(self.endpoint, data=kwargs, timeout=self.timeout, **self.request_args)\n\n    def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListRecords request.\n\n        :param ignore_deleted: If set to :obj:`True`, the resulting\n                              iterator will skip records flagged as deleted.\n        \"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListRecords\"})\n        return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n    def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListIdentifiers request.\n\n        :param ignore_deleted: If set to :obj:`True`, the resulting\n                              iterator will skip records flagged as deleted.\n        \"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListIdentifiers\"})\n        return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n    def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListSets request.\"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListSets\"})\n        return self.iterator(self, params)\n\n    def identify(self) -> Identify:\n        \"\"\"Issue an Identify request.\"\"\"\n        params = {\"verb\": \"Identify\"}\n        return Identify(self.harvest(**params))\n\n    def 
get_record(self, **kwargs: str) -> Record:\n        \"\"\"Issue a GetRecord request.\"\"\"\n        params = kwargs\n        params.update({\"verb\": \"GetRecord\"})\n        record = self.iterator(self, params).next()\n        return record\n\n    def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListMetadataFormats request.\"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListMetadataFormats\"})\n        return self.iterator(self, params)\n\n    def ListRecords(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\"ListRecords is deprecated, use list_records instead\", DeprecationWarning, stacklevel=2)\n        return self.list_records(ignore_deleted, **kwargs)\n\n    def ListIdentifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\"ListIdentifiers is deprecated, use list_identifiers instead\", DeprecationWarning, stacklevel=2)\n        return self.list_identifiers(ignore_deleted, **kwargs)\n\n    def ListSets(self, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\"ListSets is deprecated, use list_sets instead\", DeprecationWarning, stacklevel=2)\n        return self.list_sets(**kwargs)\n\n    def Identify(self) -> Identify:\n        warnings.warn(\"Identify is deprecated, use identify instead\", DeprecationWarning, stacklevel=2)\n        return self.identify()\n\n    def GetRecord(self, **kwargs: str) -> Record:\n        warnings.warn(\"GetRecord is deprecated, use get_record instead\", DeprecationWarning, stacklevel=2)\n        return self.get_record(**kwargs)\n\n    def ListMetadataFormats(self, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\n            \"ListMetadataFormats is deprecated, use list_metadataformats instead\", DeprecationWarning, stacklevel=2\n        )\n        return self.list_metadataformats(**kwargs)\n\n    def get_retry_after(self, http_response: Response) -> int:\n        
if http_response.status_code == 503:\n            try:\n                return int(http_response.headers.get(\"retry-after\"))\n            except TypeError:\n                return self.default_retry_after\n        return self.default_retry_after\n\n    @staticmethod\n    def _is_error_code(status_code: int) -> bool:\n        return status_code >= 400\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.get_record","title":"get_record(**kwargs)","text":"

Issue a GetRecord request.

Source code in src/oaipmh_scythe/app.py
def get_record(self, **kwargs: str) -> Record:\n    \"\"\"Issue a GetRecord request.\"\"\"\n    params = kwargs\n    params.update({\"verb\": \"GetRecord\"})\n    record = self.iterator(self, params).next()\n    return record\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.harvest","title":"harvest(**kwargs)","text":"

Make HTTP requests to the OAI server.

:param kwargs: OAI HTTP parameters.

Source code in src/oaipmh_scythe/app.py
def harvest(self, **kwargs: str) -> OAIResponse:\n    \"\"\"Make HTTP requests to the OAI server.\n\n    :param kwargs: OAI HTTP parameters.\n    \"\"\"\n    http_response = self._request(kwargs)\n    for _ in range(self.max_retries):\n        if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n            retry_after = self.get_retry_after(http_response)\n            logger.warning(\"HTTP %d! Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n            time.sleep(retry_after)\n            http_response = self._request(kwargs)\n    http_response.raise_for_status()\n    if self.encoding:\n        http_response.encoding = self.encoding\n    return OAIResponse(http_response, params=kwargs)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.identify","title":"identify()","text":"

Issue an Identify request.

Source code in src/oaipmh_scythe/app.py
def identify(self) -> Identify:\n    \"\"\"Issue an Identify request.\"\"\"\n    params = {\"verb\": \"Identify\"}\n    return Identify(self.harvest(**params))\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_identifiers","title":"list_identifiers(ignore_deleted=False, **kwargs)","text":"

Issue a ListIdentifiers request.

:param ignore_deleted: If set to :obj:True, the resulting iterator will skip records flagged as deleted.

Source code in src/oaipmh_scythe/app.py
def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListIdentifiers request.\n\n    :param ignore_deleted: If set to :obj:`True`, the resulting\n                          iterator will skip records flagged as deleted.\n    \"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListIdentifiers\"})\n    return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_metadataformats","title":"list_metadataformats(**kwargs)","text":"

Issue a ListMetadataFormats request.

Source code in src/oaipmh_scythe/app.py
def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListMetadataFormats request.\"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListMetadataFormats\"})\n    return self.iterator(self, params)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_records","title":"list_records(ignore_deleted=False, **kwargs)","text":"

Issue a ListRecords request.

:param ignore_deleted: If set to :obj:True, the resulting iterator will skip records flagged as deleted.

Source code in src/oaipmh_scythe/app.py
def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListRecords request.\n\n    :param ignore_deleted: If set to :obj:`True`, the resulting\n                          iterator will skip records flagged as deleted.\n    \"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListRecords\"})\n    return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_sets","title":"list_sets(**kwargs)","text":"

Issue a ListSets request.

Source code in src/oaipmh_scythe/app.py
def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListSets request.\"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListSets\"})\n    return self.iterator(self, params)\n
"},{"location":"api/#working-with-oai-responses","title":"Working with OAI Responses","text":""},{"location":"api/#iterating-over-oai-items","title":"Iterating over OAI Items","text":"

Bases: BaseOAIIterator

Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.

Can be used to conveniently iterate through the records of a repository.

:param scythe: The Scythe object that issued the first request. :param params: The OAI arguments. :type params: dict :param ignore_deleted: Flag for whether to ignore deleted records.

Source code in src/oaipmh_scythe/iterator.py
class OAIItemIterator(BaseOAIIterator):\n    \"\"\"Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.\n\n    Can be used to conveniently iterate through the records of a repository.\n\n    :param scythe: The Scythe object that issued the first request.\n    :param params: The OAI arguments.\n    :type params:  dict\n    :param ignore_deleted: Flag for whether to ignore deleted records.\n    \"\"\"\n\n    def __init__(self, scythe: Scythe, params: dict[str, str], ignore_deleted: bool = False) -> None:\n        self.mapper = scythe.class_mapping[params.get(\"verb\")]\n        self.element = VERBS_ELEMENTS[params.get(\"verb\")]\n        super().__init__(scythe, params, ignore_deleted)\n\n    def _next_response(self):\n        super()._next_response()\n        self._items = self.oai_response.xml.iterfind(\".//\" + self.scythe.oai_namespace + self.element)\n\n    def next(self):\n        \"\"\"Return the next record/header/set.\"\"\"\n        while True:\n            for item in self._items:\n                mapped = self.mapper(item)\n                if self.ignore_deleted and mapped.deleted:\n                    continue\n                return mapped\n            if self.resumption_token and self.resumption_token.token:\n                self._next_response()\n            else:\n                raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIItemIterator.next","title":"next()","text":"

Return the next record/header/set.

Source code in src/oaipmh_scythe/iterator.py
def next(self):\n    \"\"\"Return the next record/header/set.\"\"\"\n    while True:\n        for item in self._items:\n            mapped = self.mapper(item)\n            if self.ignore_deleted and mapped.deleted:\n                continue\n            return mapped\n        if self.resumption_token and self.resumption_token.token:\n            self._next_response()\n        else:\n            raise StopIteration\n
"},{"location":"api/#iterating-over-oai-responses","title":"Iterating over OAI Responses","text":"

Bases: BaseOAIIterator

Iterator over OAI responses.

Source code in src/oaipmh_scythe/iterator.py
class OAIResponseIterator(BaseOAIIterator):\n    \"\"\"Iterator over OAI responses.\"\"\"\n\n    def next(self):\n        \"\"\"Return the next response.\"\"\"\n        while True:\n            if self.oai_response:\n                response = self.oai_response\n                self.oai_response = None\n                return response\n            elif self.resumption_token and self.resumption_token.token:\n                self._next_response()\n            else:\n                raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIResponseIterator.next","title":"next()","text":"

Return the next response.

Source code in src/oaipmh_scythe/iterator.py
def next(self):\n    \"\"\"Return the next response.\"\"\"\n    while True:\n        if self.oai_response:\n            response = self.oai_response\n            self.oai_response = None\n            return response\n        elif self.resumption_token and self.resumption_token.token:\n            self._next_response()\n        else:\n            raise StopIteration\n
"},{"location":"api/#classes-for-oai-items","title":"Classes for OAI Items","text":""},{"location":"api/#identify","title":"Identify","text":""},{"location":"api/#record","title":"Record","text":"

Record objects represent single OAI records.

Bases: OAIItem

Represents an OAI record.

:param record_element: The XML element 'record'. :type record_element: :class:lxml.etree._Element :param strip_ns: Flag for whether to remove the namespaces from the element names.

Source code in src/oaipmh_scythe/models.py
class Record(OAIItem):\n    \"\"\"Represents an OAI record.\n\n    :param record_element: The XML element 'record'.\n    :type record_element: :class:`lxml.etree._Element`\n    :param strip_ns: Flag for whether to remove the namespaces from the\n                     element names.\n    \"\"\"\n\n    def __init__(self, record_element: etree._Element, strip_ns: bool = True) -> None:\n        super().__init__(record_element, strip_ns=strip_ns)\n        self.header = Header(self.xml.find(\".//\" + self._oai_namespace + \"header\"))\n        self.deleted = self.header.deleted\n        if not self.deleted:\n            self.metadata = self.get_metadata()\n\n    def __repr__(self) -> str:\n        if self.header.deleted:\n            return f\"<Record {self.header.identifier} [deleted]>\"\n        return f\"<Record {self.header.identifier}>\"\n\n    def __iter__(self):\n        return iter(self.metadata.items())\n\n    def get_metadata(self):\n        # We want to get record/metadata/<container>/*\n        # <container> would be the element ``dc``\n        # in the ``oai_dc`` case.\n        return xml_to_dict(\n            self.xml.find(\".//\" + self._oai_namespace + \"metadata\").getchildren()[0],\n            strip_ns=self._strip_ns,\n        )\n
"},{"location":"api/#header","title":"Header","text":""},{"location":"api/#set","title":"Set","text":""},{"location":"api/#metadataformat","title":"MetadataFormat","text":""},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#changelog","title":"Changelog","text":""},{"location":"changelog/#unreleased","title":"Unreleased","text":""},{"location":"changelog/#070-2020-05-17","title":"0.7.0 (2020-05-17)","text":""},{"location":"changelog/#065-2020-01-12","title":"0.6.5 (2020-01-12)","text":""},{"location":"changelog/#064-2018-10-02","title":"0.6.4 (2018-10-02)","text":""},{"location":"changelog/#063-2018-04-08","title":"0.6.3 (2018-04-08)","text":""},{"location":"changelog/#062-2017-08-11","title":"0.6.2 (2017-08-11)","text":""},{"location":"changelog/#061-2016-11-13","title":"0.6.1 (2016-11-13)","text":""},{"location":"changelog/#05-2015-11-12","title":"0.5 (2015-11-12)","text":""},{"location":"changelog/#04-2015-05-31","title":"0.4 (2015-05-31)","text":""},{"location":"changelog/#03-2013-04-17","title":"0.3 (2013-04-17)","text":""},{"location":"changelog/#02-2013-02-26","title":"0.2 (2013-02-26)","text":""},{"location":"changelog/#01-2013-02-20","title":"0.1 (2013-02-20)","text":"

First public release.

"},{"location":"credits/","title":"Credits","text":""},{"location":"customizing/","title":"Harvesting other Metadata Formats than OAI-DC","text":"

By default, oaipmh-scythe's mapping of the record XML into Python dictionaries is tailored to work only with Dublin-Core-encoded metadata payloads. Other formats most probably won't be mapped correctly, especially if they are more hierarchically structured than Dublin Core.

In case you want to harvest these more complex formats, you have to write your own record model class by subclassing the default implementation that unpacks the metadata XML:

from oaipmh_scythe.models import Record\n\nclass MyRecord(Record):\n    # Your XML unpacking implementation goes here.\n    pass\n

Note

Take a look at the implementation of oaipmh_scythe.models.Record to get an idea of how to do this.

Next, associate your implementation with OAI verbs in the oaipmh_scythe.app.Scythe object. In this case, we want the oaipmh_scythe.app.Scythe object to use our implementation to represent items returned by ListRecords and GetRecord responses:

scythe = Scythe('http://...')\nscythe.class_mapping['ListRecords'] = MyRecord\nscythe.class_mapping['GetRecord'] = MyRecord\n

If you need to rewrite all item implementations, you can also provide a complete mapping to the oaipmh_scythe.app.Scythe object at instantiation:

my_mapping = {\n    'ListRecords': MyRecord,\n    'GetRecord': MyRecord,\n    # ...\n}\n\nscythe = Scythe('https://...', class_mapping=my_mapping)\n
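
As a rough illustration of what such an unpacking implementation might do, the sketch below converts a nested XML payload into nested dictionaries using only the standard library. The function name element_to_dict and the small MODS-like sample are assumptions made for this example and are not part of oaipmh-scythe; a Record subclass could plausibly call something similar from an overridden get_metadata method.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    # Hypothetical helper (not part of oaipmh-scythe): recursively turn an
    # XML element into nested dicts, dropping namespace prefixes from tags.
    children = list(element)
    if not children:
        # Leaf element: return its text content.
        return (element.text or '').strip()
    result = {}
    for child in children:
        tag = child.tag.split('}', 1)[-1]
        # Repeated child elements are collected in a list under one key.
        result.setdefault(tag, []).append(element_to_dict(child))
    return result


# Made-up sample payload that is more deeply nested than Dublin Core.
SAMPLE = '''<mods xmlns='http://www.loc.gov/mods/v3'>
  <titleInfo><title>An Example Title</title></titleInfo>
  <name><namePart>Doe, Jane</namePart></name>
  <name><namePart>Roe, Richard</namePart></name>
</mods>'''

metadata = element_to_dict(ET.fromstring(SAMPLE))
print(metadata['name'])
```

Unlike a flat mapping, this keeps the hierarchy of the payload, so both name elements stay distinct entries instead of being merged.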
"},{"location":"development/","title":"Development","text":""},{"location":"license/","title":"License","text":"

Copyright (c) 2013 by Mathias Loesch.

Some rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright\n  notice, this list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above\n  copyright notice, this list of conditions and the following\n  disclaimer in the documentation and/or other materials provided\n  with the distribution.\n\n* The names of the contributors may not be used to endorse or\n  promote products derived from this software without specific\n  prior written permission.\n

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

"},{"location":"oaipmh/","title":"OAI-PMH Primer","text":"

This section gives a basic overview of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). For more detailed information, please refer to the protocol specification.

"},{"location":"oaipmh/#glossary-of-important-oai-pmh-concepts","title":"Glossary of Important OAI-PMH Concepts","text":"

Repository

A repository is a server-side application that exposes metadata via OAI-PMH.

Harvester

OAI-PMH client applications like oaipmh-scythe are called harvesters.

record

A record is the XML-encoded container for the metadata of a single publication item. It consists of a header and a metadata section.

header

The record header contains a unique identifier and a datestamp.

metadata

The record metadata contains the publication metadata in a defined metadata format.

set

A structure for grouping records for selective harvesting.

harvesting

The process of requesting records from the repository by the harvester.
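
To make the record, header and metadata terms concrete, here is a minimal sketch that parses a hand-written record with the standard library. The sample XML and the identifier oai:example.org:1 are invented for illustration; only the two namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's {uri}tag notation.
OAI = '{http://www.openarchives.org/OAI/2.0/}'
DC = '{http://purl.org/dc/elements/1.1/}'

# Invented sample record: a header plus an OAI-DC metadata section.
SAMPLE_RECORD = '''<record xmlns='http://www.openarchives.org/OAI/2.0/'>
  <header>
    <identifier>oai:example.org:1</identifier>
    <datestamp>2020-01-01</datestamp>
    <setSpec>demo</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc='http://www.openarchives.org/OAI/2.0/oai_dc/'
               xmlns:dc='http://purl.org/dc/elements/1.1/'>
      <dc:title>An Example Title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>'''

record = ET.fromstring(SAMPLE_RECORD)

# The header carries the unique identifier and the datestamp.
header = record.find(OAI + 'header')
identifier = header.findtext(OAI + 'identifier')
datestamp = header.findtext(OAI + 'datestamp')

# The metadata section carries the publication metadata itself.
title = record.find(OAI + 'metadata').findtext('.//' + DC + 'title')

print(identifier, datestamp, title)
```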

"},{"location":"oaipmh/#oai-verbs","title":"OAI Verbs","text":"

OAI-PMH features six main API methods (so-called \"OAI verbs\") that can be issued by harvesters. Some verbs can be combined with further arguments:

Identify

Returns information about the repository. Arguments: None.

GetRecord

Returns a single record. Arguments:

ListRecords

Returns the records in the repository in batches (possibly filtered by a timestamp or a set). Arguments:

ListIdentifiers

Like ListRecords but returns only the record headers.

ListSets

Returns the list of sets supported by this repository. Arguments: None

ListMetadataFormats

Returns the list of metadata formats supported by this repository. Arguments: None

"},{"location":"oaipmh/#metadata-formats","title":"Metadata Formats","text":"

OAI interfaces may expose metadata records in multiple metadata formats. These formats are identified by so-called \"metadata prefixes\". For instance, the prefix oai_dc refers to the OAI-DC format, which by definition has to be exposed by every valid OAI interface. OAI-DC is based on the 15 metadata elements specified in the Dublin Core Metadata Element Set.

Note

oaipmh-scythe only supports the OAI-DC format out of the box. See the section on customizing for information on how to extend oaipmh-scythe for retrieving metadata in other formats.
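
The following sketch shows how metadata prefixes appear in a ListMetadataFormats response. The response body is a shortened, hand-written sample and the second format (marc21) is only an assumed example of what a repository might offer; the parsing uses the standard library rather than oaipmh-scythe.

```python
import xml.etree.ElementTree as ET

OAI = '{http://www.openarchives.org/OAI/2.0/}'

# Hand-written, shortened sample of a ListMetadataFormats response.
SAMPLE_RESPONSE = '''<OAI-PMH xmlns='http://www.openarchives.org/OAI/2.0/'>
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>marc21</metadataPrefix>
      <schema>http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd</schema>
      <metadataNamespace>http://www.loc.gov/MARC21/slim</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>'''

root = ET.fromstring(SAMPLE_RESPONSE)

# Collect the prefix of every advertised metadata format.
prefixes = [
    fmt.findtext(OAI + 'metadataPrefix')
    for fmt in root.iter(OAI + 'metadataFormat')
]
print(prefixes)
```

Any of the collected prefixes could then be passed as the metadataPrefix argument of a harvesting request.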

"},{"location":"tutorial/","title":"Tutorial","text":"

This section gives a brief overview of how to use oaipmh-scythe for querying OAI interfaces.

"},{"location":"tutorial/#initialize-an-oai-interface","title":"Initialize an OAI Interface","text":"

To make a connection to an OAI interface, you need to import the Scythe class:

>>> from oaipmh_scythe import Scythe\n

Next, you can initialize the connection by passing it the base URL. In our example, we use the OAI interface of Zenodo:

>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n
"},{"location":"tutorial/#issuing-requests","title":"Issuing Requests","text":"

oaipmh-scythe provides methods for each of the six OAI verbs (ListRecords, GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers). Start with a ListRecords request:

>>> records = scythe.ListRecords(metadataPrefix='oai_dc')\n

Note that all keyword arguments you provide to this function are passed to the OAI interface as HTTP parameters. Therefore the example request would send the parameters verb=ListRecords&metadataPrefix=oai_dc. We can add additional parameters, like, for example, an OAI set:

>>> records = scythe.ListRecords(metadataPrefix=\"oai_dc\", set=\"driver\")\n
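
To illustrate how keyword arguments end up as HTTP parameters, the sketch below builds the query string by hand with the standard library. The helper build_oai_url is hypothetical and not part of oaipmh-scythe, which delegates this step to requests; the parameter order in a real request may also differ.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, **params):
    # Hypothetical helper (not part of oaipmh-scythe): combine the OAI verb
    # and any further arguments into one GET URL.
    query = {'verb': verb}
    query.update(params)
    return base_url + '?' + urlencode(query)


url = build_oai_url(
    'https://zenodo.org/oai2d',
    'ListRecords',
    metadataPrefix='oai_dc',
    set='driver',
)
print(url)
```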
"},{"location":"tutorial/#consecutive-harvesting","title":"Consecutive Harvesting","text":"

Since most OAI verbs yield more than one element, their respective Scythe methods return iterator objects which can be used to iterate over the records of a repository:

>>> records = scythe.ListRecords(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n

Note that this works with all verbs that return more than one element. These are: ListRecords(), ListIdentifiers(), ListSets(), and ListMetadataFormats().

The following example shows how to iterate over the headers returned by ListIdentifiers:

>>> headers = scythe.ListIdentifiers(metadataPrefix=\"oai_dc\")\n>>> headers.next()\n<Header oai:eprints.rclis.org:4088>\n

Iterating over the sets returned by ListSets works similarly:

>>> sets = scythe.ListSets()\n>>> sets.next()\n<Set Status = In Press>\n
"},{"location":"tutorial/#using-the-from-parameter","title":"Using the from Parameter","text":"

If you need to perform selective harvesting by date using the from parameter, you may face the problem that from is a reserved word in Python:

>>> records = scythe.ListRecords(metadataPrefix=\"oai_dc\", from=\"2012-12-12\")\n  File \"<stdin>\", line 1\n    records = scythe.ListRecords(metadataPrefix=\"oai_dc\", from=\"2012-12-12\")\n                                                              ^\nSyntaxError: invalid syntax\n

Fortunately, you can circumvent this problem by using a dictionary together with the ** operator:

>>> records = scythe.ListRecords(\n...             **{'metadataPrefix': 'oai_dc',\n...             'from': '2012-12-12'\n...            })\n
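
The same trick can be tried without a network connection. In the sketch below, list_records is a stand-in function written for this example (it is not the Scythe method) that simply returns the parameters it receives, which shows that the reserved word arrives intact as a dictionary key:

```python
def list_records(**kwargs):
    # Stand-in for a client method (not the real Scythe.list_records):
    # it returns the parameters that would be sent to the server.
    params = dict(kwargs)
    params['verb'] = 'ListRecords'
    return params


# 'from' cannot be written as a keyword argument, but it is a valid
# dictionary key and can be unpacked with the ** operator.
date_range = {
    'metadataPrefix': 'oai_dc',
    'from': '2012-12-12',
    'until': '2012-12-31',
}
params = list_records(**date_range)
print(params['from'], params['until'])
```

Building the dictionary first also makes it easy to add or drop arguments such as until depending on the harvesting run.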
"},{"location":"tutorial/#getting-a-single-record","title":"Getting a Single Record","text":"

OAI-PMH allows you to get a single record by using the GetRecord verb:

>>> scythe.GetRecord(identifier='oai:eprints.rclis.org:4088',\n...                  metadataPrefix='oai_dc')\n<Record oai:eprints.rclis.org:4088>\n
"},{"location":"tutorial/#harvesting-oai-items-vs-oai-responses","title":"Harvesting OAI Items vs. OAI Responses","text":"

Scythe supports two harvesting modes that differ in the type of the returned objects. The default mode returns OAI-specific items (records, headers etc.) encoded as Python objects as seen earlier. If you want to save the whole XML response returned by the server, you have to pass the oaipmh_scythe.iterator.OAIResponseIterator during the instantiation of the Scythe object:

>>> from oaipmh_scythe.iterator import OAIResponseIterator\n>>> scythe = Scythe('http://elis.da.ulcc.ac.uk/cgi/oai2', iterator=OAIResponseIterator)\n>>> responses = scythe.ListRecords(metadataPrefix='oai_dc')\n>>> responses.next()\n<OAIResponse ListRecords>\n

You could then save the returned responses to disk:

>>> with open(\"response.xml\", \"w\") as f:\n...     f.write(responses.next().raw.encode(\"utf8\"))\n
"},{"location":"tutorial/#ignoring-deleted-records","title":"Ignoring Deleted Records","text":"

The ListRecords() and ListIdentifiers() methods accept an optional parameter ignore_deleted. If set to True, the returned OAIItemIterator will skip deleted records/headers:

>>> records = scythe.ListRecords(metadataPrefix=\"oai_dc\", ignore_deleted=True)\n

Note

This works only using the oaipmh_scythe.iterator.OAIItemIterator. If you use the oaipmh_scythe.iterator.OAIResponseIterator, the resulting OAI responses will still contain the deleted records.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"#oaipmh-scythe-oai-pmh-for-humans","title":"oaipmh-scythe: OAI-PMH for Humans","text":"

This is a community-maintained fork of the original sickle.


oaipmh-scythe is a lightweight OAI-PMH client library written in Python. It has been designed for conveniently retrieving data from OAI interfaces the Pythonic way:

>>> from oaipmh_scythe import Scythe\n>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
"},{"location":"#features","title":"Features","text":""},{"location":"#requirements","title":"Requirements","text":"

Python >= 3.8

"},{"location":"#installation","title":"Installation","text":"
python -m pip install oaipmh-scythe\n
"},{"location":"#documentation","title":"Documentation","text":"

The documentation is made with Material for MkDocs and is hosted by GitHub Pages.

"},{"location":"#license","title":"License","text":"

oaipmh-scythe is distributed under the terms of the BSD license.

"},{"location":"api/","title":"API","text":""},{"location":"api/#the-scythe-client","title":"The Scythe Client","text":"

Client for harvesting OAI interfaces.

Use it like this:

>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n

:param endpoint: The endpoint of the OAI interface. :param http_method: Method used for requests (GET or POST, default: GET). :param protocol_version: The OAI protocol version. :param iterator: The type of the returned iterator (default: :class:sickle.iterator.OAIItemIterator) :param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will use the value from the retry-after header (if present) and will wait the specified number of seconds between retries. :param retry_status_codes: HTTP status codes to retry (default will only retry on 503) :param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found on the response (defaults to 60 seconds) :param class_mapping: A dictionary that maps OAI verbs to classes representing OAI items. If not provided, :data:sickle.app.DEFAULT_CLASS_MAPPING will be used. :param encoding: Can be used to override the encoding used when decoding the server response. If not specified, requests will use the encoding returned by the server in the content-type header. However, if the charset information is missing, requests will fallback to 'ISO-8859-1'. :param request_args: Arguments to be passed to requests when issuing HTTP requests. Useful examples are auth=('username', 'password') for basic auth-protected endpoints or timeout=<int>. See the documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>_ for all available parameters.

Source code in src/oaipmh_scythe/app.py
class Scythe:\n    \"\"\"Client for harvesting OAI interfaces.\n\n    Use it like this:\n\n        >>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n        >>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n        >>> records.next()\n        <Record oai:zenodo.org:4574771>\n\n    :param endpoint: The endpoint of the OAI interface.\n    :param http_method: Method used for requests (GET or POST, default: GET).\n    :param protocol_version: The OAI protocol version.\n    :param iterator: The type of the returned iterator\n           (default: :class:`sickle.iterator.OAIItemIterator`)\n    :param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will\n                        use the value from the retry-after header (if present) and will wait the specified number of\n                        seconds between retries.\n    :param retry_status_codes: HTTP status codes to retry (default will only retry on 503)\n    :param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found\n                                on the response (defaults to 60 seconds)\n    :param class_mapping: A dictionary that maps OAI verbs to classes representing\n                          OAI items. If not provided,\n                          :data:`sickle.app.DEFAULT_CLASS_MAPPING` will be used.\n    :param encoding:     Can be used to override the encoding used when decoding\n                         the server response. If not specified, `requests` will\n                         use the encoding returned by the server in the\n                         `content-type` header. However, if the `charset`\n                         information is missing, `requests` will fallback to\n                         `'ISO-8859-1'`.\n    :param request_args: Arguments to be passed to requests when issuing HTTP\n                         requests. 
Useful examples are `auth=('username', 'password')`\n                         for basic auth-protected endpoints or `timeout=<int>`.\n                         See the `documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>`_\n                         for all available parameters.\n    \"\"\"\n\n    def __init__(\n        self,\n        endpoint: str,\n        http_method: str = \"GET\",\n        protocol_version: str = \"2.0\",\n        iterator: BaseOAIIterator = OAIItemIterator,\n        max_retries: int = 0,\n        retry_status_codes: Iterable[int] | None = None,\n        default_retry_after: int = 60,\n        class_mapping: dict[str, OAIItem] | None = None,\n        encoding: str | None = None,\n        timeout: int = 60,\n        **request_args: str,\n    ):\n        self.endpoint = endpoint\n        if http_method not in (\"GET\", \"POST\"):\n            raise ValueError(\"Invalid HTTP method: %s! Must be GET or POST.\")\n        if protocol_version not in (\"2.0\", \"1.0\"):\n            raise ValueError(\"Invalid protocol version: %s! 
Must be 1.0 or 2.0.\")\n        self.http_method = http_method\n        self.protocol_version = protocol_version\n        if inspect.isclass(iterator) and issubclass(iterator, BaseOAIIterator):\n            self.iterator = iterator\n        else:\n            raise TypeError(\"Argument 'iterator' must be subclass of %s\" % BaseOAIIterator.__name__)\n        self.max_retries = max_retries\n        self.retry_status_codes = retry_status_codes or (503,)\n        self.default_retry_after = default_retry_after\n        self.oai_namespace = OAI_NAMESPACE % self.protocol_version\n        self.class_mapping = class_mapping or DEFAULT_CLASS_MAP\n        self.encoding = encoding\n        self.timeout = timeout\n        self.request_args = request_args\n\n    def harvest(self, **kwargs: str) -> OAIResponse:\n        \"\"\"Make HTTP requests to the OAI server.\n\n        :param kwargs: OAI HTTP parameters.\n        \"\"\"\n        http_response = self._request(kwargs)\n        for _ in range(self.max_retries):\n            if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n                retry_after = self.get_retry_after(http_response)\n                logger.warning(\"HTTP %d! 
Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n                time.sleep(retry_after)\n                http_response = self._request(kwargs)\n        http_response.raise_for_status()\n        if self.encoding:\n            http_response.encoding = self.encoding\n        return OAIResponse(http_response, params=kwargs)\n\n    def _request(self, kwargs: str) -> Response:\n        if self.http_method == \"GET\":\n            return requests.get(self.endpoint, timeout=self.timeout, params=kwargs, **self.request_args)\n        return requests.post(self.endpoint, data=kwargs, timeout=self.timeout, **self.request_args)\n\n    def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListRecords request.\n\n        :param ignore_deleted: If set to :obj:`True`, the resulting\n                              iterator will skip records flagged as deleted.\n        \"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListRecords\"})\n        return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n    def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListIdentifiers request.\n\n        :param ignore_deleted: If set to :obj:`True`, the resulting\n                              iterator will skip records flagged as deleted.\n        \"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListIdentifiers\"})\n        return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n    def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListSets request.\"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListSets\"})\n        return self.iterator(self, params)\n\n    def identify(self) -> Identify:\n        \"\"\"Issue an Identify request.\"\"\"\n        params = {\"verb\": \"Identify\"}\n        return Identify(self.harvest(**params))\n\n    def 
get_record(self, **kwargs: str) -> Record:\n        \"\"\"Issue a GetRecord request.\"\"\"\n        params = kwargs\n        params.update({\"verb\": \"GetRecord\"})\n        record = self.iterator(self, params).next()\n        return record\n\n    def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n        \"\"\"Issue a ListMetadataFormats request.\"\"\"\n        params = kwargs\n        params.update({\"verb\": \"ListMetadataFormats\"})\n        return self.iterator(self, params)\n\n    def ListRecords(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\"ListRecords is deprecated, use list_records instead\", DeprecationWarning, stacklevel=2)\n        return self.list_records(ignore_deleted, **kwargs)\n\n    def ListIdentifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\"ListIdentifiers is deprecated, use list_identifiers instead\", DeprecationWarning, stacklevel=2)\n        return self.list_identifiers(ignore_deleted, **kwargs)\n\n    def ListSets(self, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\"ListSets is deprecated, use list_sets instead\", DeprecationWarning, stacklevel=2)\n        return self.list_sets(**kwargs)\n\n    def Identify(self) -> Identify:\n        warnings.warn(\"Identify is deprecated, use identify instead\", DeprecationWarning, stacklevel=2)\n        return self.identify()\n\n    def GetRecord(self, **kwargs: str) -> Record:\n        warnings.warn(\"GetRecord is deprecated, use get_record instead\", DeprecationWarning, stacklevel=2)\n        return self.get_record(**kwargs)\n\n    def ListMetadataFormats(self, **kwargs: str) -> BaseOAIIterator:\n        warnings.warn(\n            \"ListMetadataFormats is deprecated, use list_metadataformats instead\", DeprecationWarning, stacklevel=2\n        )\n        return self.list_metadataformats(**kwargs)\n\n    def get_retry_after(self, http_response: Response) -> int:\n        
if http_response.status_code == 503:\n            try:\n                return int(http_response.headers.get(\"retry-after\"))\n            except TypeError:\n                return self.default_retry_after\n        return self.default_retry_after\n\n    @staticmethod\n    def _is_error_code(status_code: int) -> bool:\n        return status_code >= 400\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.get_record","title":"get_record(**kwargs)","text":"

Issue a GetRecord request.

Source code in src/oaipmh_scythe/app.py
def get_record(self, **kwargs: str) -> Record:\n    \"\"\"Issue a GetRecord request.\"\"\"\n    params = kwargs\n    params.update({\"verb\": \"GetRecord\"})\n    record = self.iterator(self, params).next()\n    return record\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.harvest","title":"harvest(**kwargs)","text":"

Make HTTP requests to the OAI server.

:param kwargs: OAI HTTP parameters.

Source code in src/oaipmh_scythe/app.py
def harvest(self, **kwargs: str) -> OAIResponse:\n    \"\"\"Make HTTP requests to the OAI server.\n\n    :param kwargs: OAI HTTP parameters.\n    \"\"\"\n    http_response = self._request(kwargs)\n    for _ in range(self.max_retries):\n        if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n            retry_after = self.get_retry_after(http_response)\n            logger.warning(\"HTTP %d! Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n            time.sleep(retry_after)\n            http_response = self._request(kwargs)\n    http_response.raise_for_status()\n    if self.encoding:\n        http_response.encoding = self.encoding\n    return OAIResponse(http_response, params=kwargs)\n
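The retry loop above waits for the number of seconds given in the retry-after header. As a minimal sketch, not the library's implementation, the helper below shows one way such a header could be parsed defensively. HTTP allows the value to be either a number of seconds or an HTTP date, so both are handled. The name parse_retry_after and its headers, default and now parameters are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(headers, default=60, now=None):
    """Return the number of seconds to wait before retrying (hypothetical helper)."""
    value = headers.get("retry-after")
    if value is None:
        # No header at all: fall back to the default delay.
        return default
    try:
        # Most servers send a plain number of seconds.
        return max(0, int(value))
    except ValueError:
        pass
    try:
        # Otherwise try to read the value as an HTTP date.
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Neither a number nor a date: fall back to the default delay.
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))
```

With a requests response, the headers argument would be http_response.headers, which looks up header names case-insensitively.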
"},{"location":"api/#oaipmh_scythe.app.Scythe.identify","title":"identify()","text":"

Issue an Identify request.

Source code in src/oaipmh_scythe/app.py
def identify(self) -> Identify:\n    \"\"\"Issue an Identify request.\"\"\"\n    params = {\"verb\": \"Identify\"}\n    return Identify(self.harvest(**params))\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_identifiers","title":"list_identifiers(ignore_deleted=False, **kwargs)","text":"

Issue a ListIdentifiers request.

:param ignore_deleted: If set to :obj:True, the resulting iterator will skip records flagged as deleted.

Source code in src/oaipmh_scythe/app.py
def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListIdentifiers request.\n\n    :param ignore_deleted: If set to :obj:`True`, the resulting\n                          iterator will skip records flagged as deleted.\n    \"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListIdentifiers\"})\n    return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_metadataformats","title":"list_metadataformats(**kwargs)","text":"

Issue a ListMetadataFormats request.

Source code in src/oaipmh_scythe/app.py
def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListMetadataFormats request.\"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListMetadataFormats\"})\n    return self.iterator(self, params)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_records","title":"list_records(ignore_deleted=False, **kwargs)","text":"

Issue a ListRecords request.

:param ignore_deleted: If set to :obj:True, the resulting iterator will skip records flagged as deleted.

Source code in src/oaipmh_scythe/app.py
def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListRecords request.\n\n    :param ignore_deleted: If set to :obj:`True`, the resulting\n                          iterator will skip records flagged as deleted.\n    \"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListRecords\"})\n    return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_sets","title":"list_sets(**kwargs)","text":"

Issue a ListSets request.

Source code in src/oaipmh_scythe/app.py
def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n    \"\"\"Issue a ListSets request.\"\"\"\n    params = kwargs\n    params.update({\"verb\": \"ListSets\"})\n    return self.iterator(self, params)\n
"},{"location":"api/#working-with-oai-responses","title":"Working with OAI Responses","text":""},{"location":"api/#iterating-over-oai-items","title":"Iterating over OAI Items","text":"

Bases: BaseOAIIterator

Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.

Can be used to conveniently iterate through the records of a repository.

:param scythe: The Scythe object that issued the first request. :param params: The OAI arguments. :type params: dict :param ignore_deleted: Flag for whether to ignore deleted records.

Source code in src/oaipmh_scythe/iterator.py
class OAIItemIterator(BaseOAIIterator):\n    \"\"\"Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.\n\n    Can be used to conveniently iterate through the records of a repository.\n\n    :param scythe: The Scythe object that issued the first request.\n    :param params: The OAI arguments.\n    :type params:  dict\n    :param ignore_deleted: Flag for whether to ignore deleted records.\n    \"\"\"\n\n    def __init__(self, scythe: Scythe, params: dict[str, str], ignore_deleted: bool = False) -> None:\n        self.mapper = scythe.class_mapping[params.get(\"verb\")]\n        self.element = VERBS_ELEMENTS[params.get(\"verb\")]\n        super().__init__(scythe, params, ignore_deleted)\n\n    def _next_response(self):\n        super()._next_response()\n        self._items = self.oai_response.xml.iterfind(\".//\" + self.scythe.oai_namespace + self.element)\n\n    def next(self):\n        \"\"\"Return the next record/header/set.\"\"\"\n        while True:\n            for item in self._items:\n                mapped = self.mapper(item)\n                if self.ignore_deleted and mapped.deleted:\n                    continue\n                return mapped\n            if self.resumption_token and self.resumption_token.token:\n                self._next_response()\n            else:\n                raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIItemIterator.next","title":"next()","text":"

Return the next record/header/set.

Source code in src/oaipmh_scythe/iterator.py
def next(self):\n    \"\"\"Return the next record/header/set.\"\"\"\n    while True:\n        for item in self._items:\n            mapped = self.mapper(item)\n            if self.ignore_deleted and mapped.deleted:\n                continue\n            return mapped\n        if self.resumption_token and self.resumption_token.token:\n            self._next_response()\n        else:\n            raise StopIteration\n
"},{"location":"api/#iterating-over-oai-responses","title":"Iterating over OAI Responses","text":"

Bases: BaseOAIIterator

Iterator over OAI responses.

Source code in src/oaipmh_scythe/iterator.py
class OAIResponseIterator(BaseOAIIterator):\n    \"\"\"Iterator over OAI responses.\"\"\"\n\n    def next(self):\n        \"\"\"Return the next response.\"\"\"\n        while True:\n            if self.oai_response:\n                response = self.oai_response\n                self.oai_response = None\n                return response\n            elif self.resumption_token and self.resumption_token.token:\n                self._next_response()\n            else:\n                raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIResponseIterator.next","title":"next()","text":"

Return the next response.

Source code in src/oaipmh_scythe/iterator.py
def next(self):\n    \"\"\"Return the next response.\"\"\"\n    while True:\n        if self.oai_response:\n            response = self.oai_response\n            self.oai_response = None\n            return response\n        elif self.resumption_token and self.resumption_token.token:\n            self._next_response()\n        else:\n            raise StopIteration\n
"},{"location":"api/#classes-for-oai-items","title":"Classes for OAI Items","text":""},{"location":"api/#identify","title":"Identify","text":""},{"location":"api/#record","title":"Record","text":"

Record objects represent single OAI records.

Bases: OAIItem

Represents an OAI record.

:param record_element: The XML element 'record'. :type record_element: :class:lxml.etree._Element :param strip_ns: Flag for whether to remove the namespaces from the element names.

Source code in src/oaipmh_scythe/models.py
class Record(OAIItem):\n    \"\"\"Represents an OAI record.\n\n    :param record_element: The XML element 'record'.\n    :type record_element: :class:`lxml.etree._Element`\n    :param strip_ns: Flag for whether to remove the namespaces from the\n                     element names.\n    \"\"\"\n\n    def __init__(self, record_element: etree._Element, strip_ns: bool = True) -> None:\n        super().__init__(record_element, strip_ns=strip_ns)\n        self.header = Header(self.xml.find(\".//\" + self._oai_namespace + \"header\"))\n        self.deleted = self.header.deleted\n        if not self.deleted:\n            self.metadata = self.get_metadata()\n\n    def __repr__(self) -> str:\n        if self.header.deleted:\n            return f\"<Record {self.header.identifier} [deleted]>\"\n        return f\"<Record {self.header.identifier}>\"\n\n    def __iter__(self):\n        return iter(self.metadata.items())\n\n    def get_metadata(self):\n        # We want to get record/metadata/<container>/*\n        # <container> would be the element ``dc``\n        # in the ``oai_dc`` case.\n        return xml_to_dict(\n            self.xml.find(\".//\" + self._oai_namespace + \"metadata\").getchildren()[0],\n            strip_ns=self._strip_ns,\n        )\n
"},{"location":"api/#header","title":"Header","text":""},{"location":"api/#set","title":"Set","text":""},{"location":"api/#metadataformat","title":"MetadataFormat","text":""},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#changelog","title":"Changelog","text":""},{"location":"changelog/#unreleased","title":"Unreleased","text":""},{"location":"changelog/#070-2020-05-17","title":"0.7.0 (2020-05-17)","text":""},{"location":"changelog/#065-2020-01-12","title":"0.6.5 (2020-01-12)","text":""},{"location":"changelog/#064-2018-10-02","title":"0.6.4 (2018-10-02)","text":""},{"location":"changelog/#063-2018-04-08","title":"0.6.3 (2018-04-08)","text":""},{"location":"changelog/#062-2017-08-11","title":"0.6.2 (2017-08-11)","text":""},{"location":"changelog/#061-2016-11-13","title":"0.6.1 (2016-11-13)","text":""},{"location":"changelog/#05-2015-11-12","title":"0.5 (2015-11-12)","text":""},{"location":"changelog/#04-2015-05-31","title":"0.4 (2015-05-31)","text":""},{"location":"changelog/#03-2013-04-17","title":"0.3 (2013-04-17)","text":""},{"location":"changelog/#02-2013-02-26","title":"0.2 (2013-02-26)","text":""},{"location":"changelog/#01-2013-02-20","title":"0.1 (2013-02-20)","text":"

First public release.

"},{"location":"credits/","title":"Credits","text":""},{"location":"customizing/","title":"Harvesting other Metadata Formats than OAI-DC","text":"

By default, oaipmh-scythe's mapping of the record XML into Python dictionaries is tailored to work only with Dublin-Core-encoded metadata payloads. Other formats most probably won't be mapped correctly, especially if they are more hierarchically structured than Dublin Core.

In case you want to harvest these more complex formats, you have to write your own record model class by subclassing the default implementation that unpacks the metadata XML:

from oaipmh_scythe.models import Record\n\nclass MyRecord(Record):\n    # Your XML unpacking implementation goes here.\n    pass\n

Note

Take a look at the implementation of oaipmh_scythe.models.Record to get an idea of how to do this.
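To give a rough idea of what such an unpacking implementation might do, here is a minimal sketch that turns nested XML into nested dictionaries. It is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree instead of lxml, the helper names strip_namespace and element_to_dict are made up for illustration, and the MODS snippet is only sample input.

```python
import xml.etree.ElementTree as ET


def strip_namespace(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(element):
    """Recursively convert an element into nested dicts of lists (hypothetical helper)."""
    children = list(element)
    if not children:
        # Leaf element: keep only its text.
        return (element.text or "").strip()
    result = {}
    for child in children:
        # Repeated child elements are collected in a list under the same key.
        result.setdefault(strip_namespace(child.tag), []).append(element_to_dict(child))
    return result


sample = ET.fromstring(
    "<mods xmlns='http://www.loc.gov/mods/v3'>"
    "<titleInfo><title>Example</title></titleInfo>"
    "<name><namePart>Doe, Jane</namePart></name>"
    "<name><namePart>Roe, Richard</namePart></name>"
    "</mods>"
)
metadata = element_to_dict(sample)
print(metadata)
```

A custom record class could call a function like this from its own get_metadata() method on the first child of the metadata element. Note that this sketch ignores XML attributes.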

Next, associate your implementation with OAI verbs in the oaipmh_scythe.app.Scythe object. In this case, we want the oaipmh_scythe.app.Scythe object to use our implementation to represent items returned by ListRecords and GetRecord responses:

scythe = Scythe('http://...')\nscythe.class_mapping['ListRecords'] = MyRecord\nscythe.class_mapping['GetRecord'] = MyRecord\n

If you need to rewrite all item implementations, you can also provide a complete mapping to the oaipmh_scythe.app.Scythe object at instantiation:

my_mapping = {\n    'ListRecords': MyRecord,\n    'GetRecord': MyRecord,\n    # ...\n}\n\nscythe = Scythe('https://...', class_mapping=my_mapping)\n
"},{"location":"development/","title":"Development","text":""},{"location":"license/","title":"License","text":"

Copyright (c) 2013 by Mathias Loesch.

Some rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright\n  notice, this list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above\n  copyright notice, this list of conditions and the following\n  disclaimer in the documentation and/or other materials provided\n  with the distribution.\n\n* The names of the contributors may not be used to endorse or\n  promote products derived from this software without specific\n  prior written permission.\n

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

"},{"location":"oaipmh/","title":"OAI-PMH Primer","text":"

This section gives a basic overview of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). For more detailed information, please refer to the protocol specification.

"},{"location":"oaipmh/#glossary-of-important-oai-pmh-concepts","title":"Glossary of Important OAI-PMH Concepts","text":"

Repository

A repository is a server-side application that exposes metadata via OAI-PMH.

Harvester

OAI-PMH client applications like Sickle are called harvesters.

record

A record is the XML-encoded container for the metadata of a single publication item. It consists of a header and a metadata section.

header

The record header contains a unique identifier and a datestamp.

metadata

The record metadata contains the publication metadata in a defined metadata format.

set

A structure for grouping records for selective harvesting.

harvesting

The process of requesting records from the repository by the harvester.

"},{"location":"oaipmh/#oai-verbs","title":"OAI Verbs","text":"

OAI-PMH features six main API methods (so-called \"OAI verbs\") that can be issued by harvesters. Some verbs can be combined with further arguments:

Identify

Returns information about the repository. Arguments: None.

GetRecord

Returns a single record. Arguments:

ListRecords

Returns the records in the repository in batches (possibly filtered by a timestamp or a set). Arguments:

ListIdentifiers

Like ListRecords but returns only the record headers.

ListSets

Returns the list of sets supported by this repository. Arguments: None

ListMetadataFormats

Returns the list of metadata formats supported by this repository. Arguments: None

"},{"location":"oaipmh/#metadata-formats","title":"Metadata Formats","text":"

OAI interfaces may expose metadata records in multiple metadata formats. These formats are identified by so-called \"metadata prefixes\". For instance, the prefix oai_dc refers to the OAI-DC format, which by definition has to be exposed by every valid OAI interface. OAI-DC is based on the 15 metadata elements specified in the Dublin Core Metadata Element Set.

Note

oaipmh-scythe only supports the OAI-DC format out of the box. See the section on customizing for information on how to extend oaipmh-scythe for retrieving metadata in other formats.

"},{"location":"tutorial/","title":"Tutorial","text":"

This section gives a brief overview of how to use oaipmh-scythe for querying OAI interfaces.

"},{"location":"tutorial/#initialize-an-oai-interface","title":"Initialize an OAI Interface","text":"

To make a connection to an OAI interface, you need to import the Scythe class:

from oaipmh_scythe import Scythe\n

Next, you can initialize the connection by passing it the base URL. In our example, we use the OAI interface of Zenodo:

scythe = Scythe(\"https://zenodo.org/oai2d\")\n
"},{"location":"tutorial/#issuing-requests","title":"Issuing Requests","text":"

oaipmh-scythe provides methods for each of the six OAI verbs (ListRecords, GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers).

Start with a ListRecords request:

records = scythe.list_records(metadataPrefix=\"oai_dc\")\n

Note that all keyword arguments you provide to this function are passed to the OAI interface as HTTP parameters. Therefore, the example request would send the parameters verb=ListRecords&metadataPrefix=oai_dc. We can add additional parameters, like, for example, an OAI set:

records = scythe.list_records(metadataPrefix=\"oai_dc\", set=\"user-cfa\")\n
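To make the mapping from keyword arguments to HTTP parameters concrete, here is a minimal sketch that builds such a query string with the standard library. It does not call oaipmh-scythe. The helper name build_query is an assumption for illustration, and the parameter order is simply what urlencode produces here, not necessarily the order the library sends.

```python
from urllib.parse import urlencode


def build_query(verb: str, **kwargs: str) -> str:
    """Return the query string for an OAI-PMH request (hypothetical helper)."""
    # The verb is sent as one more HTTP parameter next to the keyword arguments.
    params = {"verb": verb, **kwargs}
    return urlencode(params)


query = build_query("ListRecords", metadataPrefix="oai_dc", set="user-cfa")
print(query)
```

Appending this string to the base URL after a question mark gives the kind of GET request a harvester issues.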
"},{"location":"tutorial/#consecutive-harvesting","title":"Consecutive Harvesting","text":"

Since most OAI verbs yield more than one element, their respective Scythe methods return iterator objects which can be used to iterate over the records of a repository:

records = scythe.list_records(metadataPrefix=\"oai_dc\")\nrecords.next()\n# <Record oai:zenodo.org:4574771>\n

Note that this works with all verbs that return more than one element. These are: list_records(), list_identifiers(), list_sets(), and list_metadataformats().

The following example shows how to iterate over the headers returned by ListIdentifiers:

headers = scythe.list_identifiers(metadataPrefix=\"oai_dc\")\nheaders.next()\n# <Header oai:zenodo.org:4574771>\n

Iterating over the sets returned by ListSets works similarly:

sets = scythe.list_sets()\nsets.next()\n# <Set European Middleware Initiative>\n
"},{"location":"tutorial/#using-the-from-parameter","title":"Using the from Parameter","text":"

If you need to perform selective harvesting by date using the from parameter, you may face the problem that from is a reserved word in Python:

>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", from=\"2023-10-10\")\n  File \"<stdin>\", line 1\n    records = scythe.list_records(metadataPrefix=\"oai_dc\", from=\"2023-10-10\")\n                                                           ^^^^\nSyntaxError: invalid syntax\n

Fortunately, you can circumvent this problem by using a dictionary together with the ** operator:

>>> records = scythe.list_records(**{\"metadataPrefix\": \"oai_dc\", \"from\": \"2023-10-10\"})\n
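If you build such dictionaries often, a small helper can keep the reserved word out of your call sites. The following is a minimal sketch: the function name selective_params and its parameter names are assumptions, while from, until and set are the selective-harvesting arguments defined by OAI-PMH.

```python
from typing import Dict, Optional


def selective_params(
    metadata_prefix: str,
    from_date: Optional[str] = None,
    until: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> Dict[str, str]:
    """Build the keyword dictionary for a selective harvest (hypothetical helper)."""
    params = {"metadataPrefix": metadata_prefix}
    optional = {"from": from_date, "until": until, "set": set_spec}
    # Only include the arguments that were actually given.
    params.update({key: value for key, value in optional.items() if value is not None})
    return params


params = selective_params("oai_dc", from_date="2023-10-10")
print(params)
```

The result can then be unpacked in the same way, for example scythe.list_records(**params).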
"},{"location":"tutorial/#getting-a-single-record","title":"Getting a Single Record","text":"

OAI-PMH allows you to get a single record by using the GetRecord verb:

>>> scythe.get_record(identifier=\"oai:zenodo.org:4574771\", metadataPrefix=\"oai_dc\")\n<Record oai:zenodo.org:4574771>\n
"},{"location":"tutorial/#harvesting-oai-items-vs-oai-responses","title":"Harvesting OAI Items vs. OAI Responses","text":"

Scythe supports two harvesting modes that differ in the type of the returned objects. The default mode returns OAI-specific items (records, headers etc.) encoded as Python objects as seen earlier. If you want to save the whole XML response returned by the server, you have to pass the oaipmh_scythe.iterator.OAIResponseIterator during the instantiation of the Scythe object:

>>> from oaipmh_scythe.iterator import OAIResponseIterator\n>>> scythe = Scythe(\"https://zenodo.org/oai2d\", iterator=OAIResponseIterator)\n>>> responses = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> responses.next()\n<OAIResponse ListRecords>\n

You could then save the returned responses to disk:

>>> with open(\"response.xml\", \"wb\") as f:\n...     f.write(responses.next().raw.encode(\"utf8\"))\n
"},{"location":"tutorial/#ignoring-deleted-records","title":"Ignoring Deleted Records","text":"

The list_records() and list_identifiers() methods accept an optional parameter ignore_deleted. If set to True, the returned OAIItemIterator will skip deleted records/headers:

>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", ignore_deleted=True)\n

Note

This works only using the oaipmh_scythe.iterator.OAIItemIterator. If you use the oaipmh_scythe.iterator.OAIResponseIterator, the resulting OAI responses will still contain the deleted records.

"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 257ea05..4a0bba4 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/tutorial/index.html b/tutorial/index.html index 27e8dff..a075380 100644 --- a/tutorial/index.html +++ b/tutorial/index.html @@ -531,68 +531,64 @@

Tutorial

This section gives a brief overview of how to use oaipmh-scythe for querying OAI interfaces.

Initialize an OAI Interface

To make a connection to an OAI interface, you need to import the Scythe class:

-
>>> from oaipmh_scythe import Scythe
+
from oaipmh_scythe import Scythe
 

Next, you can initialize the connection by passing it the base URL. In our example, we use the OAI interface of Zenodo:

-
>>> scythe = Scythe("https://zenodo.org/oai2d")
+
scythe = Scythe("https://zenodo.org/oai2d")
 

Issuing Requests

oaipmh-scythe provides methods for each of the six OAI verbs (ListRecords, -GetRecord, Idenitfy, ListSets, ListMetadataFormats, ListIdentifiers). -Start with a ListRecords request:

-
>>> records = scythe.ListRecords(metadataPrefix='oai_dc')
+GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers).

+

Start with a ListRecords request:

+
records = scythe.list_records(metadataPrefix="oai_dc")
 

Note that all keyword arguments you provide to this function are passed -to the OAI interface as HTTP parameters. Therefore the example request +to the OAI interface as HTTP parameters. Therefore, the example request would send the parameters verb=ListRecords&metadataPrefix=oai_dc. We can add additional parameters, like, for example, an OAI set:

-
>>> records = scythe.ListRecords(metadataPrefix="oai_dc", set="driver")
+
records = scythe.list_records(metadataPrefix="oai_dc", set="user-cfa")
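
To make the mapping from keyword arguments to HTTP parameters concrete, here is a minimal sketch that uses only the standard library. The helper `build_query` is hypothetical and not part of oaipmh-scythe; it only mimics the mapping described above and sends no request.

```python
from urllib.parse import urlencode


def build_query(verb: str, **kwargs: str) -> str:
    """Hypothetical helper: combine the OAI verb with the keyword arguments."""
    # The verb comes first, followed by the keyword arguments in call order.
    params = {"verb": verb, **kwargs}
    return urlencode(params)


query = build_query("ListRecords", metadataPrefix="oai_dc", set="user-cfa")
print(query)  # verb=ListRecords&metadataPrefix=oai_dc&set=user-cfa
```

The actual request construction inside the library may differ in detail, for example in parameter order or encoding.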
 

Consecutive Harvesting

Since most OAI verbs yield more than one element, their respective Scythe methods return iterator objects which can be used to iterate over the records of a repository:

-
>>> records = scythe.ListRecords(metadataPrefix="oai_dc")
->>> records.next()
-<Record oai:oai:zenodo.org:4574771>
+
records = scythe.list_records(metadataPrefix="oai_dc")
+records.next()
+# <Record oai:zenodo.org:4574771>
 

Note that this works with all verbs that return more than one element. -These are: [ListRecords()][oaipmh_scythe.app.Scythe.ListRecords], -[ListIdentifiers()][oaipmh_scythe.app.Scythe.ListIdentifiers], [ListSets()][oaipmh_scythe.app.Scythe.ListSets], -and [ListMetadataFormats()][oaipmh_scythe.app.Scythe.ListMetadataFormats].

+These are: list_records(), +list_identifiers(), list_sets(), +and list_metadataformats().

The following example shows how to iterate over the headers returned by ListIdentifiers:

-
>>> headers = scythe.ListIdentifiers(metadataPrefix="oai_dc")
->>> headers.next()
-<Header oai:eprints.rclis.org:4088>
+
headers = scythe.list_identifiers(metadataPrefix="oai_dc")
+headers.next()
+# <Header oai:zenodo.org:4574771>
 

Iterating over the sets returned by ListSets works similarly:

-
>>> sets = scythe.ListSets()
->>> sets.next()
-<Set Status = In Press>
+
sets = scythe.list_sets()
+sets.next()
+# <Set European Middleware Initiative>
 

Using the from Parameter

If you need to perform selective harvesting by date using the from parameter, you may face the problem that from is a reserved word in Python:

-
>>> records = scythe.ListRecords(metadataPrefix="oai_dc", from="2012-12-12")
+
>>> records = scythe.list_records(metadataPrefix="oai_dc", from="2023-10-10")
   File "<stdin>", line 1
-    records = scythe.ListRecords(metadataPrefix="oai_dc", from="2012-12-12")
-                                                              ^
+    records = scythe.list_records(metadataPrefix="oai_dc", from="2023-10-10")
+                                                           ^^^^
 SyntaxError: invalid syntax
 

Fortunately, you can circumvent this problem by using a dictionary together with the ** operator:

-
>>> records = scythe.ListRecords(
-...             **{'metadataPrefix': 'oai_dc',
-...             'from': '2012-12-12'
-...            })
+
>>> records = scythe.list_records(**{"metadataPrefix": "oai_dc", "from": "2023-10-10"})
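
It can be convenient to build the parameter dictionary first and unpack it afterwards. The sketch below runs offline: `list_records_stub` is a hypothetical stand-in for a method that accepts arbitrary keyword arguments, not a function of the library. The `until` parameter is the standard OAI-PMH counterpart of `from`.

```python
def list_records_stub(**kwargs):
    """Hypothetical stand-in: return the keyword arguments it received."""
    return kwargs


# "from" is a reserved word in Python, but it is a valid dictionary key.
params = {
    "metadataPrefix": "oai_dc",
    "from": "2023-10-10",
    "until": "2023-10-31",
}

received = list_records_stub(**params)
print(received["from"])  # 2023-10-10
```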
 

Getting a Single Record

OAI-PMH allows you to get a single record by using the GetRecord verb:

-
>>> scythe.GetRecord(identifier='oai:eprints.rclis.org:4088',
-...                  metadataPrefix='oai_dc')
-<Record oai:eprints.rclis.org:4088>
+
>>> scythe.get_record(identifier="oai:zenodo.org:4574771", metadataPrefix="oai_dc")
+<Record oai:zenodo.org:4574771>
 

Harvesting OAI Items vs. OAI Responses

Scythe supports two harvesting modes that differ in the type of the @@ -601,20 +597,21 @@

Harvesting OAI Items vs. OAI Resp you want to save the whole XML response returned by the server, you have to pass the oaipmh_scythe.iterator.OAIResponseIterator during the instantiation of the Scythe object:

-
>>> scythe = Scythe('http://elis.da.ulcc.ac.uk/cgi/oai2', iterator=OAIResponseIterator)
->>> responses = Scythe.ListRecords(metadataPrefix='oai_dc')
->>> responses.next()
-<OAIResponse ListRecords>
+
>>> from oaipmh_scythe.iterator import OAIResponseIterator
+>>> scythe = Scythe("https://zenodo.org/oai2d", iterator=OAIResponseIterator)
+>>> responses = scythe.list_records(metadataPrefix="oai_dc")
+>>> responses.next()
+<OAIResponse ListRecords>
 

You could then save the returned responses to disk:

>>> with open("response.xml", "wb") as f:
 ...     f.write(responses.next().raw.encode("utf8"))
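
To store every response rather than just one, a loop along the following lines might work. This is a sketch under assumptions: `save_responses` is a hypothetical helper, the responses are assumed to be consumable in a `for` loop, and each one is assumed to expose its XML as a string in `raw`, as in the example above.

```python
from pathlib import Path
from typing import Iterable, List


def save_responses(responses: Iterable, directory: str) -> List[Path]:
    """Hypothetical helper: write each response's raw XML to a numbered file."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, response in enumerate(responses, start=1):
        path = out_dir / f"response-{number:04d}.xml"
        # Assumes response.raw is a str; write_text handles the UTF-8 encoding.
        path.write_text(response.raw, encoding="utf-8")
        written.append(path)
    return written
```

Called as `save_responses(responses, "harvest")`, it would leave files such as `harvest/response-0001.xml` on disk, one per server response.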
 

Ignoring Deleted Records

-

The [ListRecords()][oaipmh_scythe.app.Scythe.ListRecords] and +

The list_records() and list_identifiers() methods accept an optional parameter ignore_deleted. If set to True, the returned OAIItemIterator will skip deleted records/headers:

-
>>> records = scythe.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)
+
>>> records = scythe.list_records(metadataPrefix="oai_dc", ignore_deleted=True)
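
Conceptually, ignore_deleted=True behaves like a filter over the returned items. The sketch below is not the library's implementation: `skip_deleted` is a hypothetical helper, and it assumes that each item exposes a boolean `deleted` attribute. Plain stand-in objects are used so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def skip_deleted(items: Iterable) -> Iterator:
    """Hypothetical helper: yield only items not flagged as deleted."""
    for item in items:
        # Items without a `deleted` attribute are treated as not deleted.
        if not getattr(item, "deleted", False):
            yield item


# Stand-in items; real records or headers would come from a Scythe iterator.
items = [
    SimpleNamespace(identifier="oai:example.org:1", deleted=False),
    SimpleNamespace(identifier="oai:example.org:2", deleted=True),
    SimpleNamespace(identifier="oai:example.org:3", deleted=False),
]

kept = [item.identifier for item in skip_deleted(items)]
print(kept)  # ['oai:example.org:1', 'oai:example.org:3']
```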
 

Note