feat!: add xsdata models #251

Draft
wants to merge 16 commits into
base: main
10 changes: 7 additions & 3 deletions .pre-commit-config.yaml
@@ -72,9 +72,11 @@ repos:
- id: mypy
args: [--config-file=pyproject.toml]
additional_dependencies:
- httpx==0.27.0
- httpx==0.27
- lxml-stubs==0.5.1
exclude: tests
- pytest==8.3.2
- xsdata==24.7
exclude: tests # TODO: remove this exclusion

- repo: https://github.com/scientific-python/cookie
rev: 35368e874265d105e1ca3355df7ef51bbca8eba6 # frozen: 2024.08.19
@@ -86,7 +88,9 @@ repos:
hooks:
- id: typos
args: [--force-exclude]
exclude: CHANGELOG.md # the commit hashes in changelog trigger the spell checker
# CHANGELOG.md: the commit hashes in changelog trigger the spell checker
# src/oaipmh_scythe/models: autogenerated python modules by xsdata
exclude: ^CHANGELOG.md|^src/oaipmh_scythe/models/.*
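
pre-commit treats `exclude` as a Python regular expression and applies it with `re.search` to each repository-relative path. The sketch below only illustrates which paths the pattern above would skip. The helper `is_excluded` and the example file names under `models/` are hypothetical. Note that the unescaped `.` in `CHANGELOG.md` matches any character, which is harmless here but slightly broader than intended.

```python
import re

# Pattern copied verbatim from the `exclude` key above.
EXCLUDE = re.compile(r"^CHANGELOG.md|^src/oaipmh_scythe/models/.*")


def is_excluded(path: str) -> bool:
    """Return True if a hook using this pattern would skip the given path."""
    return EXCLUDE.search(path) is not None


# Example paths; the file name under models/ is hypothetical.
for path in [
    "CHANGELOG.md",
    "src/oaipmh_scythe/models/oai_dc.py",
    "src/oaipmh_scythe/client.py",
    "docs/CHANGELOG.md",
]:
    print(f"{path}: {'skipped' if is_excluded(path) else 'checked'}")
```

Because both alternatives are anchored with `^`, a changelog in a subdirectory such as `docs/CHANGELOG.md` would still be checked.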

- repo: https://github.com/FHPythonUtils/LicenseCheck/
rev: b2b50f4d40c95b15478279a7a00553a1dc2925ef # frozen: 2024.2
12 changes: 6 additions & 6 deletions README.md
@@ -18,15 +18,15 @@ from oaipmh_scythe import Scythe
with Scythe("https://zenodo.org/oai2d") as scythe:
records = scythe.list_records()
next(records)
# <Record oai:zenodo.org:4574771>
# Record(header=Header(identifier='oai:zenodo.org:10654826', datestamp='2024-02-13T15:38:50Z', set_spec=['software'], status=None), metadata=Metadata(other_element=Dc(title=[Title(value='Research Data Management Organiser (RDMO)', lang=None)], creator=[Creator(value='Klar, Jochen', lang=None), Creator(value='Michaelis, Olaf', lang=None), Creator(value='Wallace, David', lang=None), Creator(value='Schröder, Max', lang=None), Creator(value='Fütterer, Heinz-Alexander', lang=None), Creator(value='Lanza, Giacomo', lang=None), Creator(value='Martínez Muñoz, David', lang=None), Creator(value='Pilori, Dario', lang=None), Creator(value='Harry, Enke', lang=None)], subject=[], description=[Description(value='&lt;h2&gt;&lt;a href="https://github.com/rdmorganiser/rdmo/compare/2.1.2...2.1.3"&gt;RDMO 2.1.3&lt;/a&gt; (Feb 13, 2024)&lt;/h2&gt;\n&lt;ul&gt;\n&lt;li&gt;Fix the migration of options with additional_input (#912)&lt;/li&gt;\n&lt;li&gt;Fix export urls in management when using BASE_PATH (#915)&lt;/li&gt;\n&lt;/ul&gt;\n&lt;h2&gt;How to upgrade&lt;/h2&gt;\n&lt;p&gt;In case you are upgrading from an RDMO version below 2.0.0 please read these &lt;a href="https://rdmo.readthedocs.io/en/latest/upgrade/index.html#upgrade-to-version-2-0-0"&gt;upgrade instructions&lt;/a&gt; before you proceed.&lt;/p&gt;\n&lt;pre&gt;&lt;code&gt;pip install --upgrade rdmo\npython manage.py upgrade\n&lt;/code&gt;&lt;/pre&gt;', lang=None), Description(value='If you refer to this software in a publication, please cite it as below.', lang=None)], publisher=[Publisher(value='Zenodo', lang=None)], contributor=[], date=[Date(value='2024-02-13', lang=None)], type_value=[TypeType(value='info:eu-repo/semantics/other', lang=None)], format=[], identifier=[Identifier(value='https://doi.org/10.5281/zenodo.10654826', lang=None), Identifier(value='oai:zenodo.org:10654826', lang=None)], source=[], language=[], relation=[Relation(value='https://github.com/rdmorganiser/rdmo/tree/2.1.3', lang=None), 
Relation(value='https://doi.org/10.5281/zenodo.596581', lang=None)], coverage=[], rights=[Rights(value='info:eu-repo/semantics/openAccess', lang=None), Rights(value='Apache License 2.0', lang=None), Rights(value='http://www.apache.org/licenses/LICENSE-2.0', lang=None)])), about=[])
```

## Features

- Easy harvesting of OAI-compliant interfaces
- Support for all six OAI verbs
- Convenient object representations of OAI items (records, headers, sets, ...)
- Automatic de-serialization of Dublin Core-encoded metadata payloads to Python dictionaries
- Convenient object representations of OAI items (records, headers, sets, ...) as dataclasses
- Automatic de-serialization of metadata payloads to dataclasses for Dublin Core, DataCite, and MARCXML
- Option for ignoring deleted items
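
Since items are represented as nested dataclasses, their fields can be read with plain attribute access. The following is a minimal offline sketch, not the library's code: the classes are small stand-ins that mirror only the field names visible in the record repr of the usage example (`header.identifier`, `metadata.other_element`, `title[...].value`), and `summarize` is a hypothetical helper. The real classes are generated by xsdata and may carry more fields or different types.

```python
from dataclasses import dataclass, field
from typing import Optional


# Stand-ins mirroring field names from the repr; NOT the library's classes.
@dataclass
class Title:
    value: str
    lang: Optional[str] = None


@dataclass
class Creator:
    value: str
    lang: Optional[str] = None


@dataclass
class Dc:
    title: list = field(default_factory=list)
    creator: list = field(default_factory=list)


@dataclass
class Header:
    identifier: str
    datestamp: str
    set_spec: list = field(default_factory=list)
    status: Optional[str] = None


@dataclass
class Metadata:
    other_element: object = None


@dataclass
class Record:
    header: Header
    metadata: Metadata


def summarize(record: Record) -> dict:
    """Collect a few plain values from a Dublin Core record (hypothetical helper)."""
    dc = record.metadata.other_element
    return {
        "identifier": record.header.identifier,
        "titles": [t.value for t in dc.title],
        "creators": [c.value for c in dc.creator],
    }


# Values taken from the repr in the usage example (creator list shortened).
record = Record(
    header=Header(
        identifier="oai:zenodo.org:10654826",
        datestamp="2024-02-13T15:38:50Z",
        set_spec=["software"],
    ),
    metadata=Metadata(
        other_element=Dc(
            title=[Title("Research Data Management Organiser (RDMO)")],
            creator=[Creator("Klar, Jochen"), Creator("Michaelis, Olaf")],
        )
    ),
)
summary = summarize(record)
print(summary)
```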

## Requirements
@@ -36,14 +36,14 @@ with Scythe("https://zenodo.org/oai2d") as scythe:
`oaipmh-scythe` is built with:

- [httpx](https://github.com/encode/httpx) for issuing HTTP requests
- [lxml](https://github.com/lxml/lxml) for parsing XML responses
- [xsdata](https://github.com/tefra/xsdata) for parsing XML responses

## Installation

You can install `oaipmh-scythe` via pip from [PyPI][pypi-url]:

```console
python -m pip install oaipmh-scythe
```shell-session
$ python -m pip install oaipmh-scythe
```

## Documentation
5 changes: 5 additions & 0 deletions docs/mkdocs.yml
@@ -38,6 +38,7 @@ theme:
- content.action.view
- content.code.annotate
- content.code.copy
- content.code.select
- navigation.footer
palette:
- media: '(prefers-color-scheme: light)'
@@ -69,6 +70,7 @@ plugins:
markdown_extensions:
- admonition
- pymdownx.highlight:
use_pygments: true
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
@@ -80,3 +82,6 @@ extra:
version:
provider: mike
alias: true

extra_css:
- css/code_select.css
8 changes: 8 additions & 0 deletions docs/src/css/code_select.css
@@ -0,0 +1,8 @@
.language-pycon .gp, .language-pycon .go {
user-select: none;
}


/* .highlight .gp {
user-select: none;
} */
60 changes: 30 additions & 30 deletions docs/src/customizing.md
@@ -5,46 +5,46 @@ SPDX-FileCopyrightText: 2023 Heinz-Alexander Fütterer
SPDX-License-Identifier: BSD-3-Clause
-->

# Harvesting other Metadata Formats than OAI-DC
# Harvesting other Metadata Formats

By default, oaipmh-scythe's mapping of the record XML into Python dictionaries is tailored to work only with
Dublin-Core-encoded metadata payloads. Other formats most probably won't be mapped correctly, especially if they are
more hierarchically structured than Dublin Core.

In case you want to harvest these more complex formats, you have to write your own record model class by subclassing the
default implementation that unpacks the metadata XML:

```python
from oaipmh_scythe.models import Record
By default, `oaipmh-scythe`'s mapping of the record XML into Python dataclasses is tailored to work best with
Dublin-Core-encoded metadata payloads (i.e. `metadata_prefix="oai_dc"`).

```pycon
>>> from oaipmh_scythe import Scythe
>>> scythe = Scythe("https://export.arxiv.org/oai2")
>>> record = scythe.get_record("oai:arXiv.org:2203.05794", metadata_prefix="oai_dc")
>>> record.get_metadata()
Dc(title=[Title(value='BERTopic: Neural topic modeling with a class-based TF-IDF procedure', lang=None)], creator=[Creator(value='Grootendorst, Maarten', lang=None)], subject=[Subject(value='Computer Science - Computation and Language', lang=None)], description=[Description(value=' Topic models can be useful tools to discover latent topics in collections of\ndocuments. Recent studies have shown the feasibility of approach topic modeling\nas a clustering task. We present BERTopic, a topic model that extends this\nprocess by extracting coherent topic representation through the development of\na class-based variation of TF-IDF. More specifically, BERTopic generates\ndocument embedding with pre-trained transformer-based language models, clusters\nthese embeddings, and finally, generates topic representations with the\nclass-based TF-IDF procedure. BERTopic generates coherent topics and remains\ncompetitive across a variety of benchmarks involving classical models and those\nthat follow the more recent clustering approach of topic modeling.\n', lang=None), Description(value='Comment: BERTopic has a python implementation, see\n https://github.com/MaartenGr/BERTopic', lang=None)], publisher=[], contributor=[], date=[Date(value='2022-03-11', lang=None)], type_value=[TypeType(value='text', lang=None)], format=[], identifier=[Identifier(value='http://arxiv.org/abs/2203.05794', lang=None)], source=[], language=[], relation=[], coverage=[], rights=[])
```

class MyRecord(Record):
# Your XML unpacking implementation goes here.
pass
```pycon
>>> record = scythe.get_record("oai:arXiv.org:2203.05794", metadata_prefix="arXiv")
>>> record
# Record(header=Header(identifier='oai:arXiv.org:2203.05794', datestamp=XmlDate(2022, 3, 14), set_spec=['cs'], status=None), metadata=Metadata(other_element=AnyElement(qname='{http://arxiv.org/OAI/arXiv/}arXiv', text='', tail=None, children=[AnyElement(qname='{http://arxiv.org/OAI/arXiv/}id', text='2203.05794', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}created', text='2022-03-11', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}authors', text='', tail=None, children=[AnyElement(qname='{http://arxiv.org/OAI/arXiv/}author', text='', tail=None, children=[AnyElement(qname='{http://arxiv.org/OAI/arXiv/}keyname', text='Grootendorst', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}forenames', text='Maarten', tail=None, children=[], attributes={})], attributes={})], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}title', text='BERTopic: Neural topic modeling with a class-based TF-IDF procedure', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}categories', text='cs.CL', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}comments', text='BERTopic has a python implementation, see\n https://github.com/MaartenGr/BERTopic', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}license', text='http://arxiv.org/licenses/nonexclusive-distrib/1.0/', tail=None, children=[], attributes={}), AnyElement(qname='{http://arxiv.org/OAI/arXiv/}abstract', text=' Topic models can be useful tools to discover latent topics in collections of\ndocuments. Recent studies have shown the feasibility of approach topic modeling\nas a clustering task. We present BERTopic, a topic model that extends this\nprocess by extracting coherent topic representation through the development of\na class-based variation of TF-IDF. 
More specifically, BERTopic generates\ndocument embedding with pre-trained transformer-based language models, clusters\nthese embeddings, and finally, generates topic representations with the\nclass-based TF-IDF procedure. BERTopic generates coherent topics and remains\ncompetitive across a variety of benchmarks involving classical models and those\nthat follow the more recent clustering approach of topic modeling.\n', tail=None, children=[], attributes={})], attributes={'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd'})), about=[])
```

!!! note
Take a look at the implementation of [oaipmh_scythe.models.Record][] to get an idea of how to do this.
The response still gets parsed into a `Record` dataclass, but the metadata is represented by generic `AnyElement`
objects, e.g. `AnyElement(qname='{http://arxiv.org/OAI/arXiv/}arXiv', ...)`.
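
Even when the metadata arrives as a generic element tree, individual values can be pulled out by walking `children` and comparing `qname`. The sketch below is an illustration under assumptions: `AnyElement` is a local stand-in with the same fields that appear in the repr above (`qname`, `text`, `tail`, `children`, `attributes`), and `iter_elements` and `find_texts` are hypothetical helpers, not part of `oaipmh-scythe` or xsdata.

```python
from dataclasses import dataclass, field
from typing import Optional

ARXIV_NS = "http://arxiv.org/OAI/arXiv/"


@dataclass
class AnyElement:
    """Local stand-in with the fields shown in the repr above."""

    qname: Optional[str] = None
    text: Optional[str] = None
    tail: Optional[str] = None
    children: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


def iter_elements(element):
    """Yield the element and all of its descendants, depth first."""
    yield element
    for child in element.children:
        yield from iter_elements(child)


def find_texts(root, local_name, namespace=ARXIV_NS):
    """Return the text of every element whose qualified name matches."""
    wanted = f"{{{namespace}}}{local_name}"
    return [el.text for el in iter_elements(root) if el.qname == wanted]


def el(name, text="", children=None):
    """Build a stand-in element in the arXiv namespace."""
    return AnyElement(qname=f"{{{ARXIV_NS}}}{name}", text=text, children=children or [])


# A shortened version of the tree from the repr above.
root = el("arXiv", children=[
    el("id", "2203.05794"),
    el("authors", children=[
        el("author", children=[
            el("keyname", "Grootendorst"),
            el("forenames", "Maarten"),
        ]),
    ]),
    el("title", "BERTopic: Neural topic modeling with a class-based TF-IDF procedure"),
    el("categories", "cs.CL"),
])

print(find_texts(root, "title"))
print(find_texts(root, "keyname"))
```

This works for any format, but every lookup is by string name and nothing is type-checked, which is the motivation for generating a dedicated model.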

Next, associate your implementation with OAI verbs in the [oaipmh_scythe.client.Scythe][] object. In this case, we want
the [oaipmh_scythe.client.Scythe][] object to use our implementation to represent items returned by ListRecords and
GetRecord responses:
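To get typed metadata for such a format, you can generate dataclass models from its XML schema with the
[xsdata command-line tool](https://xsdata.readthedocs.io/en/latest/codegen/intro/#command-line-tool):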

```python
scythe = Scythe("http://...")
scythe.class_mapping["ListRecords"] = MyRecord
scythe.class_mapping["GetRecord"] = MyRecord
```bash
$ python -m pip install "xsdata[cli]>=24.5"
$ xsdata generate --package=arxiv http://arxiv.org/OAI/arXiv.xsd
```

If you need to rewrite *all* item implementations, you can also provide a complete mapping to the
[oaipmh_scythe.client.Scythe][] object at instantiation:

```python
my_mapping = {
"ListRecords": MyRecord,
"GetRecord": MyRecord,
# ...
}
from arxiv import ArXiv

scythe = Scythe("https://...", class_mapping=my_mapping)
record = scythe.get_record("oai:arXiv.org:2203.05794", metadata_prefix="arXiv")
record
# Record(header=Header(identifier='oai:arXiv.org:2203.05794', datestamp=XmlDate(2022, 3, 14), set_spec=['cs'], status=None), metadata=Metadata(other_element=ArXiv(id=['2203.05794'], created=['2022-03-11'], updated=[], authors=[AuthorsType(author=[AuthorType(keyname='Grootendorst', forenames='Maarten', suffix=None, affiliation=[])])], title=['BERTopic: Neural topic modeling with a class-based TF-IDF procedure'], msc_class=[], acm_class=[], report_no=[], journal_ref=[], comments=['BERTopic has a python implementation, see\n https://github.com/MaartenGr/BERTopic'], abstract=[' Topic models can be useful tools to discover latent topics in collections of\ndocuments. Recent studies have shown the feasibility of approach topic modeling\nas a clustering task. We present BERTopic, a topic model that extends this\nprocess by extracting coherent topic representation through the development of\na class-based variation of TF-IDF. More specifically, BERTopic generates\ndocument embedding with pre-trained transformer-based language models, clusters\nthese embeddings, and finally, generates topic representations with the\nclass-based TF-IDF procedure. BERTopic generates coherent topics and remains\ncompetitive across a variety of benchmarks involving classical models and those\nthat follow the more recent clustering approach of topic modeling.\n'], categories=['cs.CL'], doi=[], proxy=[], license=['http://arxiv.org/licenses/nonexclusive-distrib/1.0/'])), about=[])
```

!!! note
The response gets parsed into a Record dataclass, and the metadata is of type `ArXiv`.
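
With a typed model, nested values are reached through named attributes instead of `qname` lookups. The sketch below is only an approximation for illustration: `ArXiv`, `AuthorsType`, and `AuthorType` are hand-written stand-ins that copy a subset of the field names from the repr above (the generated module will contain more fields), and `author_names` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hand-written stand-ins for a subset of the generated model; NOT xsdata output.
@dataclass
class AuthorType:
    keyname: str
    forenames: Optional[str] = None
    suffix: Optional[str] = None
    affiliation: list = field(default_factory=list)


@dataclass
class AuthorsType:
    author: list = field(default_factory=list)


@dataclass
class ArXiv:
    id: list = field(default_factory=list)
    authors: list = field(default_factory=list)
    title: list = field(default_factory=list)
    categories: list = field(default_factory=list)


def author_names(meta: ArXiv) -> list:
    """Return display names such as 'Maarten Grootendorst' (hypothetical helper)."""
    names = []
    for group in meta.authors:
        for author in group.author:
            parts = [author.forenames, author.keyname, author.suffix]
            names.append(" ".join(p for p in parts if p))
    return names


# Values taken from the repr above.
meta = ArXiv(
    id=["2203.05794"],
    authors=[AuthorsType(author=[AuthorType(keyname="Grootendorst", forenames="Maarten")])],
    title=["BERTopic: Neural topic modeling with a class-based TF-IDF procedure"],
    categories=["cs.CL"],
)

print(author_names(meta))
print(meta.title[0])
```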

!!! note
Take a look at the models