diff --git a/search/search_index.json b/search/search_index.json index 5a64799..619198f 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"#oaipmh-scythe-oai-pmh-for-humans","title":"oaipmh-scythe: OAI-PMH for Humans","text":"
This is a community maintained fork of the original sickle.
oaipmh-scythe is a lightweight OAI-PMH client library written in Python. It has been designed for conveniently retrieving data from OAI interfaces the Pythonic way:
>>> from oaipmh_scythe import Scythe\n>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
"},{"location":"#features","title":"Features","text":"Python >= 3.8
"},{"location":"#installation","title":"Installation","text":"python -m pip install oaipmh-scythe\n
"},{"location":"#documentation","title":"Documentation","text":"The documentation is made with Material for MkDocs and is hosted by GitHub Pages.
"},{"location":"#license","title":"License","text":"oaipmh-scythe is distributed under the terms of the BSD license.
"},{"location":"api/","title":"API","text":""},{"location":"api/#the-scythe-client","title":"The Scythe Client","text":"Client for harvesting OAI interfaces.
Use it like this:
>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
:param endpoint: The endpoint of the OAI interface.
:param http_method: Method used for requests (GET or POST, default: GET).
:param protocol_version: The OAI protocol version.
:param iterator: The type of the returned iterator (default: :class:sickle.iterator.OAIItemIterator)
:param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will use the value from the retry-after header (if present) and will wait the specified number of seconds between retries.
:param retry_status_codes: HTTP status codes to retry (default will only retry on 503)
:param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found on the response (defaults to 60 seconds)
:param class_mapping: A dictionary that maps OAI verbs to classes representing OAI items. If not provided, :data:sickle.app.DEFAULT_CLASS_MAPPING will be used.
:param encoding: Can be used to override the encoding used when decoding the server response. If not specified, requests will use the encoding returned by the server in the content-type header. However, if the charset information is missing, requests will fallback to 'ISO-8859-1'.
:param request_args: Arguments to be passed to requests when issuing HTTP requests. Useful examples are auth=('username', 'password') for basic auth-protected endpoints or timeout=<int>. See the documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>_ for all available parameters.
src/oaipmh_scythe/app.py
class Scythe:\n \"\"\"Client for harvesting OAI interfaces.\n\n Use it like this:\n\n >>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n >>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n >>> records.next()\n <Record oai:zenodo.org:4574771>\n\n :param endpoint: The endpoint of the OAI interface.\n :param http_method: Method used for requests (GET or POST, default: GET).\n :param protocol_version: The OAI protocol version.\n :param iterator: The type of the returned iterator\n (default: :class:`sickle.iterator.OAIItemIterator`)\n :param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will\n use the value from the retry-after header (if present) and will wait the specified number of\n seconds between retries.\n :param retry_status_codes: HTTP status codes to retry (default will only retry on 503)\n :param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found\n on the response (defaults to 60 seconds)\n :param class_mapping: A dictionary that maps OAI verbs to classes representing\n OAI items. If not provided,\n :data:`sickle.app.DEFAULT_CLASS_MAPPING` will be used.\n :param encoding: Can be used to override the encoding used when decoding\n the server response. If not specified, `requests` will\n use the encoding returned by the server in the\n `content-type` header. However, if the `charset`\n information is missing, `requests` will fallback to\n `'ISO-8859-1'`.\n :param request_args: Arguments to be passed to requests when issuing HTTP\n requests. 
Useful examples are `auth=('username', 'password')`\n for basic auth-protected endpoints or `timeout=<int>`.\n See the `documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>`_\n for all available parameters.\n \"\"\"\n\n def __init__(\n self,\n endpoint: str,\n http_method: str = \"GET\",\n protocol_version: str = \"2.0\",\n iterator: BaseOAIIterator = OAIItemIterator,\n max_retries: int = 0,\n retry_status_codes: Iterable[int] | None = None,\n default_retry_after: int = 60,\n class_mapping: dict[str, OAIItem] | None = None,\n encoding: str | None = None,\n timeout: int = 60,\n **request_args: str,\n ):\n self.endpoint = endpoint\n if http_method not in (\"GET\", \"POST\"):\n raise ValueError(\"Invalid HTTP method: %s! Must be GET or POST.\")\n if protocol_version not in (\"2.0\", \"1.0\"):\n raise ValueError(\"Invalid protocol version: %s! Must be 1.0 or 2.0.\")\n self.http_method = http_method\n self.protocol_version = protocol_version\n if inspect.isclass(iterator) and issubclass(iterator, BaseOAIIterator):\n self.iterator = iterator\n else:\n raise TypeError(\"Argument 'iterator' must be subclass of %s\" % BaseOAIIterator.__name__)\n self.max_retries = max_retries\n self.retry_status_codes = retry_status_codes or (503,)\n self.default_retry_after = default_retry_after\n self.oai_namespace = OAI_NAMESPACE % self.protocol_version\n self.class_mapping = class_mapping or DEFAULT_CLASS_MAP\n self.encoding = encoding\n self.timeout = timeout\n self.request_args = request_args\n\n def harvest(self, **kwargs: str) -> OAIResponse:\n \"\"\"Make HTTP requests to the OAI server.\n\n :param kwargs: OAI HTTP parameters.\n \"\"\"\n http_response = self._request(kwargs)\n for _ in range(self.max_retries):\n if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n retry_after = self.get_retry_after(http_response)\n logger.warning(\"HTTP %d! 
Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n time.sleep(retry_after)\n http_response = self._request(kwargs)\n http_response.raise_for_status()\n if self.encoding:\n http_response.encoding = self.encoding\n return OAIResponse(http_response, params=kwargs)\n\n def _request(self, kwargs: str) -> Response:\n if self.http_method == \"GET\":\n return requests.get(self.endpoint, timeout=self.timeout, params=kwargs, **self.request_args)\n return requests.post(self.endpoint, data=kwargs, timeout=self.timeout, **self.request_args)\n\n def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListRecords request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListRecords\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListIdentifiers request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListIdentifiers\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListSets request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListSets\"})\n return self.iterator(self, params)\n\n def identify(self) -> Identify:\n \"\"\"Issue an Identify request.\"\"\"\n params = {\"verb\": \"Identify\"}\n return Identify(self.harvest(**params))\n\n def get_record(self, **kwargs: str) -> Record:\n \"\"\"Issue a GetRecord request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"GetRecord\"})\n record = self.iterator(self, params).next()\n return record\n\n def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a 
ListMetadataFormats request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListMetadataFormats\"})\n return self.iterator(self, params)\n\n def ListRecords(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\"ListRecords is deprecated, use list_records instead\", DeprecationWarning, stacklevel=2)\n return self.list_records(ignore_deleted, **kwargs)\n\n def ListIdentifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\"ListIdentifiers is deprecated, use list_identifiers instead\", DeprecationWarning, stacklevel=2)\n return self.list_identifiers(ignore_deleted, **kwargs)\n\n def ListSets(self, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\"ListSets is deprecated, use list_sets instead\", DeprecationWarning, stacklevel=2)\n return self.list_sets(**kwargs)\n\n def Identify(self) -> Identify:\n warnings.warn(\"Identify is deprecated, use identify instead\", DeprecationWarning, stacklevel=2)\n return self.identify()\n\n def GetRecord(self, **kwargs: str) -> Record:\n warnings.warn(\"GetRecord is deprecated, use get_record instead\", DeprecationWarning, stacklevel=2)\n return self.get_record(**kwargs)\n\n def ListMetadataFormats(self, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\n \"ListMetadataFormats is deprecated, use list_metadataformats instead\", DeprecationWarning, stacklevel=2\n )\n return self.list_metadataformats(**kwargs)\n\n def get_retry_after(self, http_response: Response) -> int:\n if http_response.status_code == 503:\n try:\n return int(http_response.headers.get(\"retry-after\"))\n except TypeError:\n return self.default_retry_after\n return self.default_retry_after\n\n @staticmethod\n def _is_error_code(status_code: int) -> bool:\n return status_code >= 400\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.get_record","title":"get_record(**kwargs)
","text":"Issue a GetRecord request.
Source code in src/oaipmh_scythe/app.py
def get_record(self, **kwargs: str) -> Record:\n \"\"\"Issue a GetRecord request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"GetRecord\"})\n record = self.iterator(self, params).next()\n return record\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.harvest","title":"harvest(**kwargs)
","text":"Make HTTP requests to the OAI server.
:param kwargs: OAI HTTP parameters.
Source code in src/oaipmh_scythe/app.py
def harvest(self, **kwargs: str) -> OAIResponse:\n \"\"\"Make HTTP requests to the OAI server.\n\n :param kwargs: OAI HTTP parameters.\n \"\"\"\n http_response = self._request(kwargs)\n for _ in range(self.max_retries):\n if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n retry_after = self.get_retry_after(http_response)\n logger.warning(\"HTTP %d! Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n time.sleep(retry_after)\n http_response = self._request(kwargs)\n http_response.raise_for_status()\n if self.encoding:\n http_response.encoding = self.encoding\n return OAIResponse(http_response, params=kwargs)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.identify","title":"identify()
","text":"Issue an Identify request.
Source code in src/oaipmh_scythe/app.py
def identify(self) -> Identify:\n \"\"\"Issue an Identify request.\"\"\"\n params = {\"verb\": \"Identify\"}\n return Identify(self.harvest(**params))\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_identifiers","title":"list_identifiers(ignore_deleted=False, **kwargs)
","text":"Issue a ListIdentifiers request.
:param ignore_deleted: If set to :obj:True, the resulting iterator will skip records flagged as deleted.
src/oaipmh_scythe/app.py
def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListIdentifiers request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListIdentifiers\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_metadataformats","title":"list_metadataformats(**kwargs)
","text":"Issue a ListMetadataFormats request.
Source code in src/oaipmh_scythe/app.py
def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListMetadataFormats request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListMetadataFormats\"})\n return self.iterator(self, params)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_records","title":"list_records(ignore_deleted=False, **kwargs)
","text":"Issue a ListRecords request.
:param ignore_deleted: If set to :obj:True, the resulting iterator will skip records flagged as deleted.
src/oaipmh_scythe/app.py
def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListRecords request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListRecords\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_sets","title":"list_sets(**kwargs)
","text":"Issue a ListSets request.
Source code in src/oaipmh_scythe/app.py
def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListSets request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListSets\"})\n return self.iterator(self, params)\n
"},{"location":"api/#working-with-oai-responses","title":"Working with OAI Responses","text":""},{"location":"api/#iterating-over-oai-items","title":"Iterating over OAI Items","text":" Bases: BaseOAIIterator
Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.
Can be used to conveniently iterate through the records of a repository.
:param scythe: The Scythe object that issued the first request. :param params: The OAI arguments. :type params: dict :param ignore_deleted: Flag for whether to ignore deleted records.
Source code in src/oaipmh_scythe/iterator.py
class OAIItemIterator(BaseOAIIterator):\n \"\"\"Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.\n\n Can be used to conveniently iterate through the records of a repository.\n\n :param scythe: The Scythe object that issued the first request.\n :param params: The OAI arguments.\n :type params: dict\n :param ignore_deleted: Flag for whether to ignore deleted records.\n \"\"\"\n\n def __init__(self, scythe: Scythe, params: dict[str, str], ignore_deleted: bool = False) -> None:\n self.mapper = scythe.class_mapping[params.get(\"verb\")]\n self.element = VERBS_ELEMENTS[params.get(\"verb\")]\n super().__init__(scythe, params, ignore_deleted)\n\n def _next_response(self):\n super()._next_response()\n self._items = self.oai_response.xml.iterfind(\".//\" + self.scythe.oai_namespace + self.element)\n\n def next(self):\n \"\"\"Return the next record/header/set.\"\"\"\n while True:\n for item in self._items:\n mapped = self.mapper(item)\n if self.ignore_deleted and mapped.deleted:\n continue\n return mapped\n if self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIItemIterator.next","title":"next()
","text":"Return the next record/header/set.
Source code in src/oaipmh_scythe/iterator.py
def next(self):\n \"\"\"Return the next record/header/set.\"\"\"\n while True:\n for item in self._items:\n mapped = self.mapper(item)\n if self.ignore_deleted and mapped.deleted:\n continue\n return mapped\n if self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#iterating-over-oai-responses","title":"Iterating over OAI Responses","text":" Bases: BaseOAIIterator
Iterator over OAI responses.
Source code in src/oaipmh_scythe/iterator.py
class OAIResponseIterator(BaseOAIIterator):\n \"\"\"Iterator over OAI responses.\"\"\"\n\n def next(self):\n \"\"\"Return the next response.\"\"\"\n while True:\n if self.oai_response:\n response = self.oai_response\n self.oai_response = None\n return response\n elif self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIResponseIterator.next","title":"next()
","text":"Return the next response.
Source code in src/oaipmh_scythe/iterator.py
def next(self):\n \"\"\"Return the next response.\"\"\"\n while True:\n if self.oai_response:\n response = self.oai_response\n self.oai_response = None\n return response\n elif self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#classes-for-oai-items","title":"Classes for OAI Items","text":""},{"location":"api/#identify","title":"Identify","text":""},{"location":"api/#record","title":"Record","text":"Record objects represent single OAI records.
Bases: OAIItem
Represents an OAI record.
:param record_element: The XML element 'record'. :type record_element: :class:lxml.etree._Element
:param strip_ns: Flag for whether to remove the namespaces from the element names.
src/oaipmh_scythe/models.py
class Record(OAIItem):\n \"\"\"Represents an OAI record.\n\n :param record_element: The XML element 'record'.\n :type record_element: :class:`lxml.etree._Element`\n :param strip_ns: Flag for whether to remove the namespaces from the\n element names.\n \"\"\"\n\n def __init__(self, record_element: etree._Element, strip_ns: bool = True) -> None:\n super().__init__(record_element, strip_ns=strip_ns)\n self.header = Header(self.xml.find(\".//\" + self._oai_namespace + \"header\"))\n self.deleted = self.header.deleted\n if not self.deleted:\n self.metadata = self.get_metadata()\n\n def __repr__(self) -> str:\n if self.header.deleted:\n return f\"<Record {self.header.identifier} [deleted]>\"\n return f\"<Record {self.header.identifier}>\"\n\n def __iter__(self):\n return iter(self.metadata.items())\n\n def get_metadata(self):\n # We want to get record/metadata/<container>/*\n # <container> would be the element ``dc``\n # in the ``oai_dc`` case.\n return xml_to_dict(\n self.xml.find(\".//\" + self._oai_namespace + \"metadata\").getchildren()[0],\n strip_ns=self._strip_ns,\n )\n
"},{"location":"api/#header","title":"Header","text":""},{"location":"api/#set","title":"Set","text":""},{"location":"api/#metadataformat","title":"MetadataFormat","text":""},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#changelog","title":"Changelog","text":""},{"location":"changelog/#unreleased","title":"Unreleased","text":"Record.get_metadata()
) to make subclassing easier (mloesch/sickle#38)
max_retries parameter now refers to no. of retries, not counting the initial request anymore
First public release.
"},{"location":"credits/","title":"Credits","text":"By default, oaipmh-scythe's mapping of the record XML into Python dictionaries is tailored to work only with Dublin-Core-encoded metadata payloads. Other formats most probably won't be mapped correctly, especially if they are more hierarchically structured than Dublin Core.
In case you want to harvest these more complex formats, you have to write your own record model class by subclassing the default implementation that unpacks the metadata XML:
from oaipmh_scythe.models import Record\n\nclass MyRecord(Record):\n # Your XML unpacking implementation goes here.\n pass\n
Note
Take a look at the implementation of oaipmh_scythe.models.Record to get an idea of how to do this.
Next, associate your implementation with OAI verbs in the oaipmh_scythe.app.Scythe object. In this case, we want the oaipmh_scythe.app.Scythe object to use our implementation to represent items returned by ListRecords and GetRecord responses:
scythe = Scythe('http://...')\nscythe.class_mapping['ListRecords'] = MyRecord\nscythe.class_mapping['GetRecord'] = MyRecord\n
If you need to rewrite all item implementations, you can also provide a complete mapping to the oaipmh_scythe.app.Scythe object at instantiation:
my_mapping = {\n 'ListRecords': MyRecord,\n 'GetRecord': MyRecord,\n # ...\n}\n\nscythe = Scythe('https://...', class_mapping=my_mapping)\n
"},{"location":"development/","title":"Development","text":""},{"location":"license/","title":"License","text":"Copyright (c) 2013 by Mathias Loesch.
Some rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright\n notice, this list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above\n copyright notice, this list of conditions and the following\n disclaimer in the documentation and/or other materials provided\n with the distribution.\n\n* The names of the contributors may not be used to endorse or\n promote products derived from this software without specific\n prior written permission.\n
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"},{"location":"oaipmh/","title":"OAI-PMH Primer","text":"This section gives a basic overview of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). For more detailed information, please refer to the protocol specification.
"},{"location":"oaipmh/#glossary-of-important-oai-pmh-concepts","title":"Glossary of Important OAI-PMH Concepts","text":"Repository
A repository is a server-side application that exposes metadata via OAI-PMH.
Harvester
OAI-PMH client applications like Sickle are called harvesters.
record
A record is the XML-encoded container for the metadata of a single publication item. It consists of a header and a metadata section.
header
The record header contains a unique identifier and a datestamp.
metadata
The record metadata contains the publication metadata in a defined metadata format.
set
A structure for grouping records for selective harvesting.
harvesting
The process of requesting records from the repository by the harvester.
"},{"location":"oaipmh/#oai-verbs","title":"OAI Verbs","text":"OAI-PMH features six main API methods (so-called \"OAI verbs\") that can be issued by harvesters. Some verbs can be combined with further arguments:
Identify
Returns information about the repository. Arguments: None.
GetRecord
Returns a single record. Arguments:
identifier (the unique identifier of the record, required)
metadataPrefix (the prefix identifying the metadata format, required)
ListRecords
Returns the records in the repository in batches (possibly filtered by a timestamp or a set). Arguments:
metadataPrefix (the prefix identifying the metadata format, required)
from (the earliest timestamp of the records, optional)
until (the latest timestamp of the records, optional)
set (a set for selective harvesting, optional)
resumptionToken (used for getting the next result batch if the number of records returned by the previous request exceeds the repository's maximum batch size, exclusive)
ListIdentifiers
Like ListRecords but returns only the record headers.
ListSets
Returns the list of sets supported by this repository. Arguments: resumptionToken (exclusive; used when the set list is returned in batches).
ListMetadataFormats
Returns the list of metadata formats supported by this repository. Arguments: identifier (optional; restricts the list to the formats available for that record).
"},{"location":"oaipmh/#metadata-formats","title":"Metadata Formats","text":"OAI interfaces may expose metadata records in multiple metadata formats. These formats are identified by so-called \"metadata prefixes\". For instance, the prefix oai_dc refers to the OAI-DC format, which by definition has to be exposed by every valid OAI interface. OAI-DC is based on the 15 metadata elements specified in the Dublin Core Metadata Element Set.
Note
oaipmh-scythe only supports the OAI-DC format out of the box. See the section on customizing for information on how to extend oaipmh-scythe for retrieving metadata in other formats.
"},{"location":"tutorial/","title":"Tutorial","text":"This section gives a brief overview of how to use oaipmh-scythe for querying OAI interfaces.
"},{"location":"tutorial/#initialize-an-oai-interface","title":"Initialize an OAI Interface","text":"To make a connection to an OAI interface, you need to import the Scythe class:
>>> from oaipmh_scythe import Scythe\n
Next, you can initialize the connection by passing it the base URL. In our example, we use the OAI interface of Zenodo:
>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n
"},{"location":"tutorial/#issuing-requests","title":"Issuing Requests","text":"oaipmh-scythe provides methods for each of the six OAI verbs (ListRecords, GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers). Start with a ListRecords request:
>>> records = scythe.list_records(metadataPrefix='oai_dc')\n
Note that all keyword arguments you provide to this function are passed to the OAI interface as HTTP parameters. Therefore the example request would send the parameters verb=ListRecords&metadataPrefix=oai_dc. We can add additional parameters, such as an OAI set:
>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", set=\"driver\")\n
"},{"location":"tutorial/#consecutive-harvesting","title":"Consecutive Harvesting","text":"Since most OAI verbs yield more than one element, their respective Scythe methods return iterator objects which can be used to iterate over the records of a repository:
>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
Note that this works with the methods for all verbs that return more than one element: list_records(), list_identifiers(), list_sets(), and list_metadataformats().
The following example shows how to iterate over the headers returned by ListIdentifiers:
>>> headers = scythe.list_identifiers(metadataPrefix=\"oai_dc\")\n>>> headers.next()\n<Header oai:eprints.rclis.org:4088>\n
Iterating over the sets returned by ListSets works similarly:
>>> sets = scythe.list_sets()\n>>> sets.next()\n<Set Status = In Press>\n
"},{"location":"tutorial/#using-the-from-parameter","title":"Using the from Parameter","text":"If you need to perform selective harvesting by date using the from parameter, you may face the problem that from is a reserved word in Python:
>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", from=\"2012-12-12\")\n File \"<stdin>\", line 1\n records = scythe.list_records(metadataPrefix=\"oai_dc\", from=\"2012-12-12\")\n ^\nSyntaxError: invalid syntax\n
Fortunately, you can circumvent this problem by using a dictionary together with the ** operator:
>>> records = scythe.list_records(\n... **{'metadataPrefix': 'oai_dc',\n... 'from': '2012-12-12'\n... })\n
"},{"location":"tutorial/#getting-a-single-record","title":"Getting a Single Record","text":"OAI-PMH allows you to get a single record by using the GetRecord verb:
>>> scythe.get_record(identifier='oai:eprints.rclis.org:4088',\n... metadataPrefix='oai_dc')\n<Record oai:eprints.rclis.org:4088>\n
"},{"location":"tutorial/#harvesting-oai-items-vs-oai-responses","title":"Harvesting OAI Items vs. OAI Responses","text":"Scythe supports two harvesting modes that differ in the type of the returned objects. The default mode returns OAI-specific items (records, headers etc.) encoded as Python objects as seen earlier. If you want to save the whole XML response returned by the server, you have to pass the oaipmh_scythe.iterator.OAIResponseIterator during the instantiation of the Scythe object:
>>> from oaipmh_scythe.iterator import OAIResponseIterator\n>>> scythe = Scythe('http://elis.da.ulcc.ac.uk/cgi/oai2', iterator=OAIResponseIterator)\n>>> responses = scythe.list_records(metadataPrefix='oai_dc')\n>>> responses.next()\n<OAIResponse ListRecords>\n
You could then save the returned responses to disk:
>>> with open(\"response.xml\", \"wb\") as f:\n... f.write(responses.next().raw.encode(\"utf8\"))\n
"},{"location":"tutorial/#ignoring-deleted-records","title":"Ignoring Deleted Records","text":"The list_records() and list_identifiers() methods accept an optional parameter ignore_deleted. If set to True, the returned OAIItemIterator will skip deleted records/headers:
>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", ignore_deleted=True)\n
Note
This works only using the oaipmh_scythe.iterator.OAIItemIterator. If you use the oaipmh_scythe.iterator.OAIResponseIterator, the resulting OAI responses will still contain the deleted records.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"#oaipmh-scythe-oai-pmh-for-humans","title":"oaipmh-scythe: OAI-PMH for Humans","text":"This is a community maintained fork of the original sickle.
oaipmh-scythe is a lightweight OAI-PMH client library written in Python. It has been designed for conveniently retrieving data from OAI interfaces the Pythonic way:
>>> from oaipmh_scythe import Scythe\n>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
"},{"location":"#features","title":"Features","text":"Python >= 3.8
"},{"location":"#installation","title":"Installation","text":"python -m pip install oaipmh-scythe\n
"},{"location":"#documentation","title":"Documentation","text":"The documentation is made with Material for MkDocs and is hosted by GitHub Pages.
"},{"location":"#license","title":"License","text":"oaipmh-scythe is distributed under the terms of the BSD license.
"},{"location":"api/","title":"API","text":""},{"location":"api/#the-scythe-client","title":"The Scythe Client","text":"Client for harvesting OAI interfaces.
Use it like this:
>>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n>>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> records.next()\n<Record oai:zenodo.org:4574771>\n
:param endpoint: The endpoint of the OAI interface.
:param http_method: Method used for requests (GET or POST, default: GET).
:param protocol_version: The OAI protocol version.
:param iterator: The type of the returned iterator (default: :class:sickle.iterator.OAIItemIterator)
:param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will use the value from the retry-after header (if present) and will wait the specified number of seconds between retries.
:param retry_status_codes: HTTP status codes to retry (default will only retry on 503)
:param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found on the response (defaults to 60 seconds)
:param class_mapping: A dictionary that maps OAI verbs to classes representing OAI items. If not provided, :data:sickle.app.DEFAULT_CLASS_MAPPING will be used.
:param encoding: Can be used to override the encoding used when decoding the server response. If not specified, requests will use the encoding returned by the server in the content-type header. However, if the charset information is missing, requests will fallback to 'ISO-8859-1'.
:param request_args: Arguments to be passed to requests when issuing HTTP requests. Useful examples are auth=('username', 'password') for basic auth-protected endpoints or timeout=<int>. See the documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>_ for all available parameters.
src/oaipmh_scythe/app.py
class Scythe:\n \"\"\"Client for harvesting OAI interfaces.\n\n Use it like this:\n\n >>> scythe = Scythe(\"https://zenodo.org/oai2d\")\n >>> records = scythe.list_records(metadataPrefix=\"oai_dc\")\n >>> records.next()\n <Record oai:zenodo.org:4574771>\n\n :param endpoint: The endpoint of the OAI interface.\n :param http_method: Method used for requests (GET or POST, default: GET).\n :param protocol_version: The OAI protocol version.\n :param iterator: The type of the returned iterator\n (default: :class:`sickle.iterator.OAIItemIterator`)\n :param max_retries: Number of retry attempts if an HTTP request fails (default: 0 = request only once). Sickle will\n use the value from the retry-after header (if present) and will wait the specified number of\n seconds between retries.\n :param retry_status_codes: HTTP status codes to retry (default will only retry on 503)\n :param default_retry_after: default number of seconds to wait between retries in case no retry-after header is found\n on the response (defaults to 60 seconds)\n :param class_mapping: A dictionary that maps OAI verbs to classes representing\n OAI items. If not provided,\n :data:`sickle.app.DEFAULT_CLASS_MAPPING` will be used.\n :param encoding: Can be used to override the encoding used when decoding\n the server response. If not specified, `requests` will\n use the encoding returned by the server in the\n `content-type` header. However, if the `charset`\n information is missing, `requests` will fallback to\n `'ISO-8859-1'`.\n :param request_args: Arguments to be passed to requests when issuing HTTP\n requests. 
Useful examples are `auth=('username', 'password')`\n for basic auth-protected endpoints or `timeout=<int>`.\n See the `documentation of requests <http://docs.python-requests.org/en/master/api/#main-interface>`_\n for all available parameters.\n \"\"\"\n\n def __init__(\n self,\n endpoint: str,\n http_method: str = \"GET\",\n protocol_version: str = \"2.0\",\n iterator: BaseOAIIterator = OAIItemIterator,\n max_retries: int = 0,\n retry_status_codes: Iterable[int] | None = None,\n default_retry_after: int = 60,\n class_mapping: dict[str, OAIItem] | None = None,\n encoding: str | None = None,\n timeout: int = 60,\n **request_args: str,\n ):\n self.endpoint = endpoint\n if http_method not in (\"GET\", \"POST\"):\n raise ValueError(\"Invalid HTTP method: %s! Must be GET or POST.\")\n if protocol_version not in (\"2.0\", \"1.0\"):\n raise ValueError(\"Invalid protocol version: %s! Must be 1.0 or 2.0.\")\n self.http_method = http_method\n self.protocol_version = protocol_version\n if inspect.isclass(iterator) and issubclass(iterator, BaseOAIIterator):\n self.iterator = iterator\n else:\n raise TypeError(\"Argument 'iterator' must be subclass of %s\" % BaseOAIIterator.__name__)\n self.max_retries = max_retries\n self.retry_status_codes = retry_status_codes or (503,)\n self.default_retry_after = default_retry_after\n self.oai_namespace = OAI_NAMESPACE % self.protocol_version\n self.class_mapping = class_mapping or DEFAULT_CLASS_MAP\n self.encoding = encoding\n self.timeout = timeout\n self.request_args = request_args\n\n def harvest(self, **kwargs: str) -> OAIResponse:\n \"\"\"Make HTTP requests to the OAI server.\n\n :param kwargs: OAI HTTP parameters.\n \"\"\"\n http_response = self._request(kwargs)\n for _ in range(self.max_retries):\n if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n retry_after = self.get_retry_after(http_response)\n logger.warning(\"HTTP %d! 
Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n time.sleep(retry_after)\n http_response = self._request(kwargs)\n http_response.raise_for_status()\n if self.encoding:\n http_response.encoding = self.encoding\n return OAIResponse(http_response, params=kwargs)\n\n def _request(self, kwargs: str) -> Response:\n if self.http_method == \"GET\":\n return requests.get(self.endpoint, timeout=self.timeout, params=kwargs, **self.request_args)\n return requests.post(self.endpoint, data=kwargs, timeout=self.timeout, **self.request_args)\n\n def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListRecords request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListRecords\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListIdentifiers request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListIdentifiers\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n\n def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListSets request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListSets\"})\n return self.iterator(self, params)\n\n def identify(self) -> Identify:\n \"\"\"Issue an Identify request.\"\"\"\n params = {\"verb\": \"Identify\"}\n return Identify(self.harvest(**params))\n\n def get_record(self, **kwargs: str) -> Record:\n \"\"\"Issue a GetRecord request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"GetRecord\"})\n record = self.iterator(self, params).next()\n return record\n\n def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a 
ListMetadataFormats request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListMetadataFormats\"})\n return self.iterator(self, params)\n\n def ListRecords(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\"ListRecords is deprecated, use list_records instead\", DeprecationWarning, stacklevel=2)\n return self.list_records(ignore_deleted, **kwargs)\n\n def ListIdentifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\"ListIdentifiers is deprecated, use list_identifiers instead\", DeprecationWarning, stacklevel=2)\n return self.list_identifiers(ignore_deleted, **kwargs)\n\n def ListSets(self, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\"ListSets is deprecated, use list_sets instead\", DeprecationWarning, stacklevel=2)\n return self.list_sets(**kwargs)\n\n def Identify(self) -> Identify:\n warnings.warn(\"Identify is deprecated, use identify instead\", DeprecationWarning, stacklevel=2)\n return self.identify()\n\n def GetRecord(self, **kwargs: str) -> Record:\n warnings.warn(\"GetRecord is deprecated, use get_record instead\", DeprecationWarning, stacklevel=2)\n return self.get_record(**kwargs)\n\n def ListMetadataFormats(self, **kwargs: str) -> BaseOAIIterator:\n warnings.warn(\n \"ListMetadataFormats is deprecated, use list_metadataformats instead\", DeprecationWarning, stacklevel=2\n )\n return self.list_metadataformats(**kwargs)\n\n def get_retry_after(self, http_response: Response) -> int:\n if http_response.status_code == 503:\n try:\n return int(http_response.headers.get(\"retry-after\"))\n except TypeError:\n return self.default_retry_after\n return self.default_retry_after\n\n @staticmethod\n def _is_error_code(status_code: int) -> bool:\n return status_code >= 400\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.get_record","title":"get_record(**kwargs)
","text":"Issue a GetRecord request.
Source code in src/oaipmh_scythe/app.py
def get_record(self, **kwargs: str) -> Record:\n \"\"\"Issue a GetRecord request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"GetRecord\"})\n record = self.iterator(self, params).next()\n return record\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.harvest","title":"harvest(**kwargs)
","text":"Make HTTP requests to the OAI server.
:param kwargs: OAI HTTP parameters.
Source code in src/oaipmh_scythe/app.py
def harvest(self, **kwargs: str) -> OAIResponse:\n \"\"\"Make HTTP requests to the OAI server.\n\n :param kwargs: OAI HTTP parameters.\n \"\"\"\n http_response = self._request(kwargs)\n for _ in range(self.max_retries):\n if self._is_error_code(http_response.status_code) and http_response.status_code in self.retry_status_codes:\n retry_after = self.get_retry_after(http_response)\n logger.warning(\"HTTP %d! Retrying after %d seconds...\" % (http_response.status_code, retry_after))\n time.sleep(retry_after)\n http_response = self._request(kwargs)\n http_response.raise_for_status()\n if self.encoding:\n http_response.encoding = self.encoding\n return OAIResponse(http_response, params=kwargs)\n
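The retry loop above waits for the number of seconds reported by the server before repeating the request. The following is a minimal sketch of that fallback logic in isolation; `retry_delay` is a hypothetical helper, not part of oaipmh-scythe. It additionally assumes that a server may send `Retry-After` as an HTTP date, which `int()` rejects with a `ValueError`, and treats that case like a missing header:

```python
def retry_delay(headers, default_retry_after=60):
    """Return the seconds to wait before retrying (hypothetical helper).

    Falls back to ``default_retry_after`` when the header is missing
    (``int(None)`` raises TypeError) or is not an integer, e.g. an
    HTTP date (``int("Wed, ...")`` raises ValueError).
    """
    # A plain dict is case-sensitive; the headers object returned by
    # requests is case-insensitive, so the key spelling matters only here.
    value = headers.get("retry-after")
    try:
        return int(value)
    except (TypeError, ValueError):
        return default_retry_after


print(retry_delay({"retry-after": "120"}))
print(retry_delay({}))
```

Catching both exception types keeps a harvest running on the default delay when a repository answers with a date instead of a number of seconds.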
"},{"location":"api/#oaipmh_scythe.app.Scythe.identify","title":"identify()
","text":"Issue an Identify request.
Source code in src/oaipmh_scythe/app.py
def identify(self) -> Identify:\n \"\"\"Issue an Identify request.\"\"\"\n params = {\"verb\": \"Identify\"}\n return Identify(self.harvest(**params))\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_identifiers","title":"list_identifiers(ignore_deleted=False, **kwargs)
","text":"Issue a ListIdentifiers request.
:param ignore_deleted: If set to :obj:True
, the resulting iterator will skip records flagged as deleted.
src/oaipmh_scythe/app.py
def list_identifiers(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListIdentifiers request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListIdentifiers\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_metadataformats","title":"list_metadataformats(**kwargs)
","text":"Issue a ListMetadataFormats request.
Source code in src/oaipmh_scythe/app.py
def list_metadataformats(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListMetadataFormats request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListMetadataFormats\"})\n return self.iterator(self, params)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_records","title":"list_records(ignore_deleted=False, **kwargs)
","text":"Issue a ListRecords request.
:param ignore_deleted: If set to :obj:True
, the resulting iterator will skip records flagged as deleted.
src/oaipmh_scythe/app.py
def list_records(self, ignore_deleted: bool = False, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListRecords request.\n\n :param ignore_deleted: If set to :obj:`True`, the resulting\n iterator will skip records flagged as deleted.\n \"\"\"\n params = kwargs\n params.update({\"verb\": \"ListRecords\"})\n return self.iterator(self, params, ignore_deleted=ignore_deleted)\n
"},{"location":"api/#oaipmh_scythe.app.Scythe.list_sets","title":"list_sets(**kwargs)
","text":"Issue a ListSets request.
Source code in src/oaipmh_scythe/app.py
def list_sets(self, **kwargs: str) -> BaseOAIIterator:\n \"\"\"Issue a ListSets request.\"\"\"\n params = kwargs\n params.update({\"verb\": \"ListSets\"})\n return self.iterator(self, params)\n
"},{"location":"api/#working-with-oai-responses","title":"Working with OAI Responses","text":""},{"location":"api/#iterating-over-oai-items","title":"Iterating over OAI Items","text":" Bases: BaseOAIIterator
Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.
Can be used to conveniently iterate through the records of a repository.
:param scythe: The Scythe object that issued the first request. :param params: The OAI arguments. :type params: dict :param ignore_deleted: Flag for whether to ignore deleted records.
Source code in src/oaipmh_scythe/iterator.py
class OAIItemIterator(BaseOAIIterator):\n \"\"\"Iterator over OAI records/identifiers/sets transparently aggregated via OAI-PMH.\n\n Can be used to conveniently iterate through the records of a repository.\n\n :param scythe: The Scythe object that issued the first request.\n :param params: The OAI arguments.\n :type params: dict\n :param ignore_deleted: Flag for whether to ignore deleted records.\n \"\"\"\n\n def __init__(self, scythe: Scythe, params: dict[str, str], ignore_deleted: bool = False) -> None:\n self.mapper = scythe.class_mapping[params.get(\"verb\")]\n self.element = VERBS_ELEMENTS[params.get(\"verb\")]\n super().__init__(scythe, params, ignore_deleted)\n\n def _next_response(self):\n super()._next_response()\n self._items = self.oai_response.xml.iterfind(\".//\" + self.scythe.oai_namespace + self.element)\n\n def next(self):\n \"\"\"Return the next record/header/set.\"\"\"\n while True:\n for item in self._items:\n mapped = self.mapper(item)\n if self.ignore_deleted and mapped.deleted:\n continue\n return mapped\n if self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
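The iterator hides the batching that OAI-PMH performs with resumption tokens. As a rough illustration of the idea, and not the library's implementation, the sketch below pages through canned batches. `iterate_items`, `fake_fetch_page`, and `PAGES` are hypothetical names that stand in for real HTTP requests:

```python
# Canned batches keyed by the resumption token that requests them.
# Each value is (items, token for the next batch or None).
PAGES = {
    None: (["rec-1", "rec-2"], "token-a"),
    "token-a": (["rec-3"], None),
}


def fake_fetch_page(params):
    """Stand-in for an HTTP request: return (items, next_token)."""
    return PAGES[params.get("resumptionToken")]


def iterate_items(fetch_page, params):
    """Yield items across batches until no resumption token is returned."""
    items, token = fetch_page(params)
    while True:
        yield from items
        if not token:
            return
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        items, token = fetch_page({"verb": params["verb"], "resumptionToken": token})


records = list(iterate_items(fake_fetch_page, {"verb": "ListRecords", "metadataPrefix": "oai_dc"}))
print(records)
```

The caller sees a single stream of items, which is the behaviour described above as "transparently aggregated".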
"},{"location":"api/#oaipmh_scythe.iterator.OAIItemIterator.next","title":"next()
","text":"Return the next record/header/set.
Source code in src/oaipmh_scythe/iterator.py
def next(self):\n \"\"\"Return the next record/header/set.\"\"\"\n while True:\n for item in self._items:\n mapped = self.mapper(item)\n if self.ignore_deleted and mapped.deleted:\n continue\n return mapped\n if self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#iterating-over-oai-responses","title":"Iterating over OAI Responses","text":" Bases: BaseOAIIterator
Iterator over OAI responses.
Source code in src/oaipmh_scythe/iterator.py
class OAIResponseIterator(BaseOAIIterator):\n \"\"\"Iterator over OAI responses.\"\"\"\n\n def next(self):\n \"\"\"Return the next response.\"\"\"\n while True:\n if self.oai_response:\n response = self.oai_response\n self.oai_response = None\n return response\n elif self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#oaipmh_scythe.iterator.OAIResponseIterator.next","title":"next()
","text":"Return the next response.
Source code in src/oaipmh_scythe/iterator.py
def next(self):\n \"\"\"Return the next response.\"\"\"\n while True:\n if self.oai_response:\n response = self.oai_response\n self.oai_response = None\n return response\n elif self.resumption_token and self.resumption_token.token:\n self._next_response()\n else:\n raise StopIteration\n
"},{"location":"api/#classes-for-oai-items","title":"Classes for OAI Items","text":""},{"location":"api/#identify","title":"Identify","text":""},{"location":"api/#record","title":"Record","text":"Record objects represent single OAI records.
Bases: OAIItem
Represents an OAI record.
:param record_element: The XML element 'record'. :type record_element: :class:lxml.etree._Element
:param strip_ns: Flag for whether to remove the namespaces from the element names.
src/oaipmh_scythe/models.py
class Record(OAIItem):\n \"\"\"Represents an OAI record.\n\n :param record_element: The XML element 'record'.\n :type record_element: :class:`lxml.etree._Element`\n :param strip_ns: Flag for whether to remove the namespaces from the\n element names.\n \"\"\"\n\n def __init__(self, record_element: etree._Element, strip_ns: bool = True) -> None:\n super().__init__(record_element, strip_ns=strip_ns)\n self.header = Header(self.xml.find(\".//\" + self._oai_namespace + \"header\"))\n self.deleted = self.header.deleted\n if not self.deleted:\n self.metadata = self.get_metadata()\n\n def __repr__(self) -> str:\n if self.header.deleted:\n return f\"<Record {self.header.identifier} [deleted]>\"\n return f\"<Record {self.header.identifier}>\"\n\n def __iter__(self):\n return iter(self.metadata.items())\n\n def get_metadata(self):\n # We want to get record/metadata/<container>/*\n # <container> would be the element ``dc``\n # in the ``oai_dc`` case.\n return xml_to_dict(\n self.xml.find(\".//\" + self._oai_namespace + \"metadata\").getchildren()[0],\n strip_ns=self._strip_ns,\n )\n
"},{"location":"api/#header","title":"Header","text":""},{"location":"api/#set","title":"Set","text":""},{"location":"api/#metadataformat","title":"MetadataFormat","text":""},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#changelog","title":"Changelog","text":""},{"location":"changelog/#unreleased","title":"Unreleased","text":"Record.get_metadata()
) to make subclassing easier (mloesch/sickle#38)
max_retries parameter now refers to the number of retries and no longer counts the initial request
First public release.
"},{"location":"credits/","title":"Credits","text":"By default, oaipmh-scythe's mapping of the record XML into Python dictionaries is tailored to work only with Dublin-Core-encoded metadata payloads. Other formats most probably won't be mapped correctly, especially if they are more hierarchically structured than Dublin Core.
In case you want to harvest these more complex formats, you have to write your own record model class by subclassing the default implementation that unpacks the metadata XML:
from oaipmh_scythe.models import Record\n\nclass MyRecord(Record):\n # Your XML unpacking implementation goes here.\n pass\n
Note
Take a look at the implementation of oaipmh_scythe.models.Record to get an idea of how to do this.
Next, associate your implementation with OAI verbs in the oaipmh_scythe.app.Scythe object. In this case, we want the oaipmh_scythe.app.Scythe object to use our implementation to represent items returned by ListRecords and GetRecord responses:
scythe = Scythe('http://...')\nscythe.class_mapping['ListRecords'] = MyRecord\nscythe.class_mapping['GetRecord'] = MyRecord\n
If you need to rewrite all item implementations, you can also provide a complete mapping to the oaipmh_scythe.app.Scythe object at instantiation:
my_mapping = {\n 'ListRecords': MyRecord,\n 'GetRecord': MyRecord,\n # ...\n}\n\nscythe = Scythe('https://...', class_mapping=my_mapping)\n
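What an "XML unpacking implementation" for a hierarchical format might look like is sketched below using only the standard library. `nested_xml_to_dict` and the `SAMPLE` payload are hypothetical and not part of oaipmh-scythe; in a real subclass, a function like this could be called from an overridden `get_metadata()` method:

```python
import xml.etree.ElementTree as ET


def nested_xml_to_dict(element):
    """Recursively map an element to {tag: [value, ...]}, keeping the hierarchy.

    Leaf elements contribute their text; elements with children contribute
    a nested dictionary built the same way.
    """
    result = {}
    for child in element:
        tag = child.tag.split("}")[-1]  # drop a "{namespace}" prefix if present
        value = nested_xml_to_dict(child) if len(child) else (child.text or "").strip()
        result.setdefault(tag, []).append(value)
    return result


# Invented payload with nested creator elements, for illustration only.
SAMPLE = """
<metadata>
  <title>Example</title>
  <creator><name>Doe, Jane</name><affiliation>Example University</affiliation></creator>
  <creator><name>Roe, Richard</name></creator>
</metadata>
"""

metadata = nested_xml_to_dict(ET.fromstring(SAMPLE))
print(metadata["creator"][0]["name"])
```

Values are always collected in lists so that repeated elements, such as several creators, are handled uniformly.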
"},{"location":"development/","title":"Development","text":""},{"location":"license/","title":"License","text":"Copyright (c) 2013 by Mathias Loesch.
Some rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright\n notice, this list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above\n copyright notice, this list of conditions and the following\n disclaimer in the documentation and/or other materials provided\n with the distribution.\n\n* The names of the contributors may not be used to endorse or\n promote products derived from this software without specific\n prior written permission.\n
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"},{"location":"oaipmh/","title":"OAI-PMH Primer","text":"This section gives a basic overview of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). For more detailed information, please refer to the protocol specification.
"},{"location":"oaipmh/#glossary-of-important-oai-pmh-concepts","title":"Glossary of Important OAI-PMH Concepts","text":"Repository
A repository is a server-side application that exposes metadata via OAI-PMH.
Harvester
OAI-PMH client applications like oaipmh-scythe are called harvesters.
record
A record is the XML-encoded container for the metadata of a single publication item. It consists of a header and a metadata section.
header
The record header contains a unique identifier and a datestamp.
metadata
The record metadata contains the publication metadata in a defined metadata format.
set
A structure for grouping records for selective harvesting.
harvesting
The process of requesting records from the repository by the harvester.
"},{"location":"oaipmh/#oai-verbs","title":"OAI Verbs","text":"OAI-PMH features six main API methods (so-called \"OAI verbs\") that can be issued by harvesters. Some verbs can be combined with further arguments:
Identify
Returns information about the repository. Arguments: None.
GetRecord
Returns a single record. Arguments:
identifier (the unique identifier of the record, required)
metadataPrefix (the prefix identifying the metadata format, required)
ListRecords
Returns the records in the repository in batches (possibly filtered by a timestamp or a set). Arguments:
metadataPrefix (the prefix identifying the metadata format, required)
from (the earliest timestamp of the records, optional)
until (the latest timestamp of the records, optional)
set (a set for selective harvesting, optional)
resumptionToken (used for getting the next result batch if the number of records returned by the previous request exceeds the repository's maximum batch size, exclusive)
ListIdentifiers
Like ListRecords but returns only the record headers.
ListSets
Returns the list of sets supported by this repository. Arguments: resumptionToken (used for getting the next batch of sets, exclusive)
ListMetadataFormats
Returns the list of metadata formats supported by this repository. Arguments: identifier (restricts the list to the formats available for a single record, optional)
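On the wire, each verb and its arguments are sent as ordinary HTTP parameters. The sketch below shows how such GET URLs could be assembled with the standard library; `build_request_url` is a hypothetical helper for illustration, and the base URL is the Zenodo endpoint used elsewhere in this documentation:

```python
from urllib.parse import urlencode

BASE_URL = "https://zenodo.org/oai2d"  # endpoint used in this documentation's examples


def build_request_url(verb, **arguments):
    """Return the GET URL for an OAI verb and its arguments (illustrative helper)."""
    params = {"verb": verb}
    params.update(arguments)
    # urlencode percent-encodes reserved characters such as ":" in identifiers.
    return BASE_URL + "?" + urlencode(params)


identify_url = build_request_url("Identify")
get_record_url = build_request_url(
    "GetRecord", identifier="oai:zenodo.org:4574771", metadataPrefix="oai_dc"
)
print(identify_url)
print(get_record_url)
```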
"},{"location":"oaipmh/#metadata-formats","title":"Metadata Formats","text":"OAI interfaces may expose metadata records in multiple metadata formats. These formats are identified by so-called \"metadata prefixes\". For instance, the prefix oai_dc
refers to the OAI-DC format, which by definition has to be exposed by every valid OAI interface. OAI-DC is based on the 15 metadata elements specified in the Dublin Core Metadata Element Set.
Note
oaipmh-scythe only supports the OAI-DC format out of the box. See the section on customizing for information on how to extend oaipmh-scythe for retrieving metadata in other formats.
"},{"location":"tutorial/","title":"Tutorial","text":"This section gives a brief overview of how to use oaipmh-scythe for querying OAI interfaces.
"},{"location":"tutorial/#initialize-an-oai-interface","title":"Initialize an OAI Interface","text":"To make a connection to an OAI interface, you need to import the Scythe class:
from oaipmh_scythe import Scythe\n
Next, you can initialize the connection by passing it the base URL. In our example, we use the OAI interface of Zenodo:
scythe = Scythe(\"https://zenodo.org/oai2d\")\n
"},{"location":"tutorial/#issuing-requests","title":"Issuing Requests","text":"oaipmh-scythe provides methods for each of the six OAI verbs (ListRecords, GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers).
Start with a ListRecords request:
records = scythe.list_records(metadataPrefix=\"oai_dc\")\n
Note that all keyword arguments you provide to this function are passed to the OAI interface as HTTP parameters. Therefore, the example request would send the parameters verb=ListRecords&metadataPrefix=oai_dc
. We can add additional parameters, like, for example, an OAI set
:
records = scythe.list_records(metadataPrefix=\"oai_dc\", set=\"user-cfa\")\n
"},{"location":"tutorial/#consecutive-harvesting","title":"Consecutive Harvesting","text":"Since most OAI verbs yield more than one element, their respective Scythe methods return iterator objects which can be used to iterate over the records of a repository:
records = scythe.list_records(metadataPrefix=\"oai_dc\")\nrecords.next()\n# <Record oai:zenodo.org:4574771>\n
Note that this works with all verbs that return more than one element. These are: list_records(), list_identifiers(), list_sets(), and list_metadataformats().
The following example shows how to iterate over the headers returned by ListIdentifiers
:
headers = scythe.list_identifiers(metadataPrefix=\"oai_dc\")\nheaders.next()\n# <Header oai:zenodo.org:4574771>\n
Iterating over the sets returned by ListSets
works similarly:
sets = scythe.list_sets()\nsets.next()\n# <Set European Middleware Initiative>\n
"},{"location":"tutorial/#using-the-from-parameter","title":"Using the from
Parameter","text":"If you need to perform selective harvesting by date using the from
parameter, you may face the problem that from
is a reserved word in Python:
>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", from=\"2023-10-10\")\n File \"<stdin>\", line 1\n records = scythe.list_records(metadataPrefix=\"oai_dc\", from=\"2023-10-10\")\n ^^^^\nSyntaxError: invalid syntax\n
Fortunately, you can circumvent this problem by using a dictionary together with the **
operator:
>>> records = scythe.list_records(**{\"metadataPrefix\": \"oai_dc\", \"from\": \"2023-10-10\"})\n
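The workaround relies only on ordinary keyword unpacking, so it can be tried without a network connection. The sketch below uses a hypothetical stand-in, `fake_list_records`, which is not part of oaipmh-scythe; it collects keyword arguments and adds the verb in the same way the `list_records()` source shown in the API section does:

```python
def fake_list_records(**kwargs):
    """Hypothetical stand-in that collects keyword arguments and adds the verb."""
    params = dict(kwargs)
    params.update({"verb": "ListRecords"})
    return params


# "from" cannot be written as a keyword argument, but it is a valid dictionary key.
params = fake_list_records(**{"metadataPrefix": "oai_dc", "from": "2023-10-10"})
print(params)
```

The same pattern applies to any OAI argument whose name collides with a Python keyword.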
"},{"location":"tutorial/#getting-a-single-record","title":"Getting a Single Record","text":"OAI-PMH allows you to get a single record by using the GetRecord
verb:
>>> scythe.get_record(identifier=\"oai:zenodo.org:4574771\", metadataPrefix=\"oai_dc\")\n<Record oai:zenodo.org:4574771>\n
"},{"location":"tutorial/#harvesting-oai-items-vs-oai-responses","title":"Harvesting OAI Items vs. OAI Responses","text":"Scythe supports two harvesting modes that differ in the type of the returned objects. The default mode returns OAI-specific items (records, headers etc.) encoded as Python objects as seen earlier. If you want to save the whole XML response returned by the server, you have to pass the oaipmh_scythe.iterator.OAIResponseIterator during the instantiation of the Scythe object:
>>> from oaipmh_scythe.iterator import OAIResponseIterator\n>>> scythe = Scythe(\"https://zenodo.org/oai2d\", iterator=OAIResponseIterator)\n>>> responses = scythe.list_records(metadataPrefix=\"oai_dc\")\n>>> responses.next()\n<OAIResponse ListRecords>\n
You could then save the returned responses to disk:
>>> with open(\"response.xml\", \"wb\") as f:\n...     f.write(responses.next().raw.encode(\"utf8\"))\n
"},{"location":"tutorial/#ignoring-deleted-records","title":"Ignoring Deleted Records","text":"The list_records() and list_identifiers() methods accept an optional parameter ignore_deleted
. If set to True
, the returned OAIItemIterator will skip deleted records/headers:
>>> records = scythe.list_records(metadataPrefix=\"oai_dc\", ignore_deleted=True)\n
Note
This works only using the oaipmh_scythe.iterator.OAIItemIterator. If you use the oaipmh_scythe.iterator.OAIResponseIterator, the resulting OAI responses will still contain the deleted records.
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 257ea05..4a0bba4 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/tutorial/index.html b/tutorial/index.html index 27e8dff..a075380 100644 --- a/tutorial/index.html +++ b/tutorial/index.html @@ -531,68 +531,64 @@This section gives a brief overview on how to use oaipmh-scythe for querying OAI interfaces.
To make a connection to an OAI interface, you need to import the Scythe class:
->>> from oaipmh_scythe import Scythe
+
Next, you can initialize the connection by passing it the base URL. In
our example, we use the OAI interface of Zenodo:
->>> scythe = Scythe("https://zenodo.org/oai2d")
+
Issuing Requests
oaipmh-scythe provides methods for each of the six OAI verbs (ListRecords,
-GetRecord, Idenitfy, ListSets, ListMetadataFormats, ListIdentifiers).
-Start with a ListRecords request:
->>> records = scythe.ListRecords(metadataPrefix='oai_dc')
+GetRecord, Identify, ListSets, ListMetadataFormats, ListIdentifiers).
+Start with a ListRecords request:
+
Note that all keyword arguments you provide to this function are passed
-to the OAI interface as HTTP parameters. Therefore the example request
+to the OAI interface as HTTP parameters. Therefore, the example request
would send the parameters verb=ListRecords&metadataPrefix=oai_dc
. We
can add additional parameters, like, for example, an OAI set
:
->>> records = scythe.ListRecords(metadataPrefix="oai_dc", set="driver")
+
Consecutive Harvesting
Since most OAI verbs yield more than one element, their respective
Scythe methods return iterator objects which can be used to iterate over
the records of a repository:
->>> records = scythe.ListRecords(metadataPrefix="oai_dc")
->>> records.next()
-<Record oai:oai:zenodo.org:4574771>
+records = scythe.list_records(metadataPrefix="oai_dc")
+records.next()
+# <Record oai:zenodo.org:4574771>
Note that this works with all verbs that return more than one element.
-These are: [ListRecords()][oaipmh_scythe.app.Scythe.ListRecords],
-[ListIdentifiers()][oaipmh_scythe.app.Scythe.ListIdentifiers], [ListSets()][oaipmh_scythe.app.Scythe.ListSets],
-and [ListMetadataFormats()][oaipmh_scythe.app.Scythe.ListMetadataFormats].
+These are: list_records(),
+list_identifiers(), list_sets(),
+and list_metadataformats().
The following example shows how to iterate over the headers returned by
ListIdentifiers
:
->>> headers = scythe.ListIdentifiers(metadataPrefix="oai_dc")
->>> headers.next()
-<Header oai:eprints.rclis.org:4088>
+headers = scythe.list_identifiers(metadataPrefix="oai_dc")
+headers.next()
+# <Header oai:zenodo.org:4574771>
Iterating over the sets returned by ListSets
works similarly:
->>> sets = scythe.ListSets()
->>> sets.next()
-<Set Status = In Press>
+
Using the from
Parameter
If you need to perform selective harvesting by date using the from
parameter, you may face the problem that from
is a reserved word in
Python:
->>> records = scythe.ListRecords(metadataPrefix="oai_dc", from="2012-12-12")
+>>> records = scythe.list_records(metadataPrefix="oai_dc", from="2023-10-10")
File "<stdin>", line 1
- records = scythe.ListRecords(metadataPrefix="oai_dc", from="2012-12-12")
- ^
+ records = scythe.list_records(metadataPrefix="oai_dc", from="2023-10-10")
+ ^^^^
SyntaxError: invalid syntax
Fortunately, you can circumvent this problem by using a dictionary together with the **
operator:
->>> records = scythe.ListRecords(
-... **{'metadataPrefix': 'oai_dc',
-... 'from': '2012-12-12'
-... })
+
Getting a Single Record
OAI-PMH allows you to get a single record by using the GetRecord
verb:
->>> scythe.GetRecord(identifier='oai:eprints.rclis.org:4088',
-... metadataPrefix='oai_dc')
-<Record oai:eprints.rclis.org:4088>
+>>> scythe.get_record(identifier="oai:zenodo.org:4574771", metadataPrefix="oai_dc")
+<Record oai:zenodo.org:4574771>
Harvesting OAI Items vs. OAI Responses
Scythe supports two harvesting modes that differ in the type of the
@@ -601,20 +597,21 @@
Harvesting OAI Items vs. OAI Resp
you want to save the whole XML response returned by the server, you have
to pass the oaipmh_scythe.iterator.OAIResponseIterator during the instantiation of the
Scythe object:
->>> scythe = Scythe('http://elis.da.ulcc.ac.uk/cgi/oai2', iterator=OAIResponseIterator)
->>> responses = Scythe.ListRecords(metadataPrefix='oai_dc')
->>> responses.next()
-<OAIResponse ListRecords>
+>>> from oaipmh_scythe.iterator import OAIResponseIterator
+>>> scythe = Scythe("https://zenodo.org/oai2d", iterator=OAIResponseIterator)
+>>> responses = scythe.list_records(metadataPrefix="oai_dc")
+>>> responses.next()
+<OAIResponse ListRecords>
You could then save the returned responses to disk:
Ignoring Deleted Records
-The [ListRecords()][oaipmh_scythe.app.Scythe.ListRecords] and
+
The list_records() and
[ListIdentifiers()][oaipmh_scythe.app.Scythe.ListIdentifiers] methods accept an optional parameter ignore_deleted
.
If set to True
, the returned OAIItemIterator will skip deleted records/headers:
-