afuetterer · afuetterer · Jan 25, 2024 · May 17, 2024 · May 17, 2024 · May 17, 2024
@@ -72,9 +72,11 @@ repos:
   - id: mypy
     args: [--config-file=pyproject.toml]
     additional_dependencies:
-    - httpx==0.27.0
+    - httpx==0.27
     - lxml-stubs==0.5.1
-    exclude: tests
+    - pytest==8.3.2
+    - xsdata==24.7
+    exclude: tests # TODO: remove this exclusion
 
 - repo: https://github.com/scientific-python/cookie
   rev: 35368e874265d105e1ca3355df7ef51bbca8eba6  # frozen: 2024.08.19
@@ -86,7 +88,9 @@ repos:
   hooks:
   - id: typos
     args: [--force-exclude]
-    exclude: CHANGELOG.md # the commit hashes in changelog trigger the spell checker
+    # CHANGELOG.md: the commit hashes in changelog trigger the spell checker
+    # src/oaipmh_scythe/models: autogenerated python modules by xsdata
+    exclude: ^CHANGELOG.md|^src/oaipmh_scythe/models/.*
 
 - repo: https://github.com/FHPythonUtils/LicenseCheck/
   rev: b2b50f4d40c95b15478279a7a00553a1dc2925ef  # frozen: 2024.2
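
Review note on the new `typos` exclusion: as far as I know, pre-commit treats `exclude` as a Python regular expression and searches each file path with it. Under that assumption, the pattern from this hunk can be checked with a short sketch. The file paths below are made up for illustration and are not taken from the repository.

```python
import re

# Pattern copied from the typos hook in the hunk above.
EXCLUDE = re.compile(r"^CHANGELOG.md|^src/oaipmh_scythe/models/.*")


def is_excluded(path: str) -> bool:
    """Return True if the path would be skipped, assuming re.search semantics."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, for illustration only.
for path in (
    "CHANGELOG.md",
    "src/oaipmh_scythe/models/oai_dc.py",
    "src/oaipmh_scythe/client.py",
    "docs/CHANGELOG.md",
):
    print(f"{path}: excluded={is_excluded(path)}")
```

Because both alternatives are anchored with `^`, only the top-level changelog and the generated models package are skipped. The unescaped `.` in `CHANGELOG.md` matches any character, which is harmless here but could be written as `\.` for strictness.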

@@ -18,15 +18,15 @@ from oaipmh_scythe import Scythe
 with Scythe("https://zenodo.org/oai2d") as scythe:
     records = scythe.list_records()
     next(records)
-# <Record oai:zenodo.org:4574771>
+# Record(header=Header(identifier='oai:zenodo.org:10654826', datestamp='2024-02-13T15:38:50Z', set_spec=['software'], status=None), metadata=Metadata(other_element=Dc(title=[Title(value='Research Data Management Organiser (RDMO)', lang=None)], creator=[Creator(value='Klar, Jochen', lang=None), Creator(value='Michaelis, Olaf', lang=None), Creator(value='Wallace, David', lang=None), Creator(value='Schröder, Max', lang=None), Creator(value='Fütterer, Heinz-Alexander', lang=None), Creator(value='Lanza, Giacomo', lang=None), Creator(value='Martínez Muñoz, David', lang=None), Creator(value='Pilori, Dario', lang=None), Creator(value='Harry, Enke', lang=None)], subject=[], description=[Description(value='&lt;h2&gt;&lt;a href="https://github.com/rdmorganiser/rdmo/compare/2.1.2...2.1.3"&gt;RDMO 2.1.3&lt;/a&gt; (Feb 13, 2024)&lt;/h2&gt;\n&lt;ul&gt;\n&lt;li&gt;Fix the migration of options with additional_input (#912)&lt;/li&gt;\n&lt;li&gt;Fix export urls in management when using BASE_PATH (#915)&lt;/li&gt;\n&lt;/ul&gt;\n&lt;h2&gt;How to upgrade&lt;/h2&gt;\n&lt;p&gt;In case you are upgrading from an RDMO version below 2.0.0 please read these &lt;a href="https://rdmo.readthedocs.io/en/latest/upgrade/index.html#upgrade-to-version-2-0-0"&gt;upgrade instructions&lt;/a&gt; before you proceed.&lt;/p&gt;\n&lt;pre&gt;&lt;code&gt;pip install --upgrade rdmo\npython manage.py upgrade\n&lt;/code&gt;&lt;/pre&gt;', lang=None), Description(value='If you refer to this software in a publication, please cite it as below.', lang=None)], publisher=[Publisher(value='Zenodo', lang=None)], contributor=[], date=[Date(value='2024-02-13', lang=None)], type_value=[TypeType(value='info:eu-repo/semantics/other', lang=None)], format=[], identifier=[Identifier(value='https://doi.org/10.5281/zenodo.10654826', lang=None), Identifier(value='oai:zenodo.org:10654826', lang=None)], source=[], language=[], relation=[Relation(value='https://github.com/rdmorganiser/rdmo/tree/2.1.3', lang=None), 
Relation(value='https://doi.org/10.5281/zenodo.596581', lang=None)], coverage=[], rights=[Rights(value='info:eu-repo/semantics/openAccess', lang=None), Rights(value='Apache License 2.0', lang=None), Rights(value='http://www.apache.org/licenses/LICENSE-2.0', lang=None)])), about=[])
 ```
 
 ## Features
 
 - Easy harvesting of OAI-compliant interfaces
 - Support for all six OAI verbs
-- Convenient object representations of OAI items (records, headers, sets, ...)
-- Automatic de-serialization of Dublin Core-encoded metadata payloads to Python dictionaries
+- Convenient object representations of OAI items (records, headers, sets, ...) as dataclasses
+- Automatic de-serialization of metadata payloads to dataclasses for Dublin Core, DataCite, and MARCXML
 - Option for ignoring deleted items
 
 ## Requirements
@@ -36,14 +36,14 @@ with Scythe("https://zenodo.org/oai2d") as scythe:
 `oaipmh-scythe` is built with:
 
 - [httpx](https://github.com/encode/httpx) for issuing HTTP requests
-- [lxml](https://github.com/lxml/lxml) for parsing XML responses
+- [xsdata](https://github.com/tefra/xsdata) for parsing XML responses
 
 ## Installation
 
 You can install `oaipmh-scythe` via pip from [PyPI][pypi-url]:
 
-```console
-python -m pip install oaipmh-scythe
+```shell-session
+$ python -m pip install oaipmh-scythe
 ```
 
 ## Documentation

@@ -38,6 +38,7 @@ theme:
   - content.action.view
   - content.code.annotate
   - content.code.copy
+  - content.code.select
   - navigation.footer
   palette:
   - media: '(prefers-color-scheme: light)'
@@ -69,6 +70,7 @@ plugins:
 markdown_extensions:
 - admonition
 - pymdownx.highlight:
+    use_pygments: true
     anchor_linenums: true
     line_spans: __span
     pygments_lang_class: true
@@ -80,3 +82,6 @@ extra:
   version:
     provider: mike
     alias: true
+
+extra_css:
+- css/code_select.css
@@ -0,0 +1,8 @@
+.language-pycon .gp, .language-pycon .go {
+  user-select: none;
+}
+
+
+/* .highlight .gp {
+  user-select: none;
+} */
@@ -5,46 +5,46 @@ SPDX-FileCopyrightText: 2023 Heinz-Alexander Fütterer
 SPDX-License-Identifier: BSD-3-Clause
 -->
 
-# Harvesting other Metadata Formats than OAI-DC
+# Harvesting other Metadata Formats
 
-By default, oaipmh-scythe's mapping of the record XML into Python dictionaries is tailored to work only with
-Dublin-Core-encoded metadata payloads. Other formats most probably won't be mapped correctly, especially if they are
-more hierarchically structured than Dublin Core.
-
-In case you want to harvest these more complex formats, you have to write your own record model class by subclassing the
-default implementation that unpacks the metadata XML:
-
-```python
-from oaipmh_scythe.models import Record
+By default, `oaipmh-scythe`'s mapping of the record XML into Python dataclasses is tailored to work best with
+Dublin-Core-encoded metadata payloads (i.e. `metadata_prefix="oai_dc"`).
 
+```pycon
+>>> from oaipmh_scythe import Scythe
+>>> scythe = Scythe("https://export.arxiv.org/oai2")
+>>> record = scythe.get_record("oai:arXiv.org:2203.05794", metadata_prefix="oai_dc")
+>>> record.get_metadata()
+Dc(title=[Title(value='BERTopic: Neural topic modeling with a class-based TF-IDF procedure', lang=None)], creator=[Creator(value='Grootendorst, Maarten', lang=None)], subject=[Subject(value='Computer Science - Computation and Language', lang=None)], description=[Description(value='  Topic models can be useful tools to discover latent topics in collections of\ndocuments. Recent studies have shown the feasibility of approach topic modeling\nas a clustering task. We present BERTopic, a topic model that extends this\nprocess by extracting coherent topic representation through the development of\na class-based variation of TF-IDF. More specifically, BERTopic generates\ndocument embedding with pre-trained transformer-based language models, clusters\nthese embeddings, and finally, generates topic representations with the\nclass-based TF-IDF procedure. BERTopic generates coherent topics and remains\ncompetitive across a variety of benchmarks involving classical models and those\nthat follow the more recent clustering approach of topic modeling.\n', lang=None), Description(value='Comment: BERTopic has a python implementation, see\n  https://github.com/MaartenGr/BERTopic', lang=None)], publisher=[], contributor=[], date=[Date(value='2022-03-11', lang=None)], type_value=[TypeType(value='text', lang=None)], format=[], identifier=[Identifier(value='http://arxiv.org/abs/2203.05794', lang=None)], source=[], language=[], relation=[], coverage=[], rights=[])
+```
 
-class MyRecord(Record):
-    # Your XML unpacking implementation goes here.
-    pass
+```pycon
+>>> record = scythe.get_record("oai:arXiv.org:2203.05794", metadata_prefix="arXiv")
+>>> record
+# Record(header=Header(identifier='oai:arXiv.org:2203.05794', datestamp=XmlDate(2022, 3, 14), set_spec=['cs'], status=None), metadata=Metadata(other_element=AnyElement(qname='{http://arxiv.org/OAI/arXiv/}arXiv', text='', tail=None, children=[AnyElement(qname='{http://arxiv.org/OAI/arXiv/}id', text='2203.05794', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}created', text='2022-03-11', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}authors', text='', tail=None, children=[AnyElement(qname='{http://arxiv.org/OAI/arXiv/}author', text='', tail=None, children=[AnyElement(qname='{http://arxiv.org/OAI/arXiv/}keyname', text='Grootendorst', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}forenames', text='Maarten', tail=None, children=[], attributes={})], attributes={})], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}title', text='BERTopic: Neural topic modeling with a class-based TF-IDF procedure', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}categories', text='cs.CL', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}comments', text='BERTopic has a python implementation, see\n  https://github.com/MaartenGr/BERTopic', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}license', text='http://arxiv.org/licenses/nonexclusive-distrib/1.0/', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}abstract', text='  Topic models can be useful tools to discover latent topics in collections of\ndocuments. Recent studies have shown the feasibility of approach topic modeling\nas a clustering task. We present BERTopic, a topic model that extends this\nprocess by extracting coherent topic representation through the development of\na class-based variation of TF-IDF. 
More specifically, BERTopic generates\ndocument embedding with pre-trained transformer-based language models, clusters\nthese embeddings, and finally, generates topic representations with the\nclass-based TF-IDF procedure. BERTopic generates coherent topics and remains\ncompetitive across a variety of benchmarks involving classical models and those\nthat follow the more recent clustering approach of topic modeling.\n', tail=None, children=[], attributes={})], attributes={'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd'})), about=[])
 ```
 
 !!! note
-    Take a look at the implementation of [oaipmh_scythe.models.Record][] to get an idea of how to do this.
+    The response is still parsed into a `Record` dataclass, but without a matching model the metadata is kept as a
+    generic tree of `AnyElement` objects, e.g. `AnyElement(qname='{http://arxiv.org/OAI/arXiv/}arXiv', ...)`.
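
Review note: the docs could show how to read values out of such a generic tree. The sketch below is one possible way, not the library's API. It uses a hypothetical stand-in dataclass that only mirrors the field names visible in the repr above (`qname`, `text`, `tail`, `children`, `attributes`), and `find_text` is an illustrative helper. With a real record, the tree would presumably be reachable as `record.metadata.other_element`, judging by the repr.

```python
from __future__ import annotations

from dataclasses import dataclass, field

ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"


@dataclass
class AnyElement:
    """Hypothetical stand-in mirroring the field names shown in the repr above."""

    qname: str | None = None
    text: str | None = None
    tail: str | None = None
    children: list[AnyElement] = field(default_factory=list)
    attributes: dict[str, str] = field(default_factory=dict)


def find_text(element: AnyElement, local_name: str) -> str | None:
    """Return the text of the first direct child with the given local name."""
    for child in element.children:
        if child.qname == f"{ARXIV_NS}{local_name}":
            return child.text
    return None


# A tiny tree shaped like the arXiv payload in the example output.
metadata = AnyElement(
    qname=f"{ARXIV_NS}arXiv",
    children=[
        AnyElement(qname=f"{ARXIV_NS}id", text="2203.05794"),
        AnyElement(qname=f"{ARXIV_NS}categories", text="cs.CL"),
    ],
)

print(find_text(metadata, "id"))
print(find_text(metadata, "categories"))
```

Matching on the fully qualified name keeps the lookup unambiguous when a payload mixes namespaces. Nested elements such as `authors` would need the same lookup applied one level further down.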
 
-Next, associate your implementation with OAI verbs in the [oaipmh_scythe.client.Scythe][] object. In this case, we want
-the [oaipmh_scythe.client.Scythe][] object to use our implementation to represent items returned by ListRecords and
-GetRecord responses:
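+To get typed metadata instead, generate dataclasses from the format's XML schema with the
+[xsdata command-line tool](https://xsdata.readthedocs.io/en/latest/codegen/intro/#command-line-tool):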
 
-```python
-scythe = Scythe("http://...")
-scythe.class_mapping["ListRecords"] = MyRecord
-scythe.class_mapping["GetRecord"] = MyRecord
+```shell-session
+$ python -m pip install "xsdata[cli]>=24.5"
+$ xsdata generate --package=arxiv http://arxiv.org/OAI/arXiv.xsd
 ```
 
-If you need to rewrite *all* item implementations, you can also provide a complete mapping to the
-[oaipmh_scythe.client.Scythe][] object at instantiation:
-
 ```python
-my_mapping = {
-    "ListRecords": MyRecord,
-    "GetRecord": MyRecord,
-    # ...
-}
+from arxiv import ArXiv
 
-scythe = Scythe("https://...", class_mapping=my_mapping)
+record = scythe.get_record("oai:arXiv.org:2203.05794", metadata_prefix="arXiv")
+record
+# Record(header=Header(identifier='oai:arXiv.org:2203.05794', datestamp=XmlDate(2022, 3, 14), set_spec=['cs'], status=None), metadata=Metadata(other_element=ArXiv(id=['2203.05794'], created=['2022-03-11'], updated=[], authors=[AuthorsType(author=[AuthorType(keyname='Grootendorst', forenames='Maarten', suffix=None, affiliation=[])])], title=['BERTopic: Neural topic modeling with a class-based TF-IDF procedure'], msc_class=[], acm_class=[], report_no=[], journal_ref=[], comments=['BERTopic has a python implementation, see\n  https://github.com/MaartenGr/BERTopic'], abstract=['  Topic models can be useful tools to discover latent topics in collections of\ndocuments. Recent studies have shown the feasibility of approach topic modeling\nas a clustering task. We present BERTopic, a topic model that extends this\nprocess by extracting coherent topic representation through the development of\na class-based variation of TF-IDF. More specifically, BERTopic generates\ndocument embedding with pre-trained transformer-based language models, clusters\nthese embeddings, and finally, generates topic representations with the\nclass-based TF-IDF procedure. BERTopic generates coherent topics and remains\ncompetitive across a variety of benchmarks involving classical models and those\nthat follow the more recent clustering approach of topic modeling.\n'], categories=['cs.CL'], doi=[], proxy=[], license=['http://arxiv.org/licenses/nonexclusive-distrib/1.0/'])), about=[])
 ```
+
+!!! note
+    The response is parsed into a `Record` dataclass, and the metadata is of type `ArXiv`.
+
+!!! note
+    Take a look at the models