
Annif: identify a subject vocabulary #15

Open · 3 tasks · Tracked by #14
hortongn opened this issue Mar 16, 2023 · 10 comments
hortongn (Member) commented Mar 16, 2023

  • Download, convert, and load the LCSH as a vocabulary in Annif (see the sketch below the links)
  • Explore the FAST vocabulary (see below)
  • Explore developing a custom vocabulary from ETD keywords

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats

https://github.com/NatLibFi/Annif-corpora

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
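
A rough sketch of what the LCSH step could look like, assuming the SKOS bulk export from id.loc.gov (the exact file name is a guess; the formats Annif accepts are listed on the Subject-vocabulary-formats wiki page above):

# download an LCSH SKOS dump; file name is illustrative, check https://id.loc.gov/download/ for the current exports
wget https://id.loc.gov/download/lcsh.skos.nt.gz
gunzip lcsh.skos.nt.gz
# load it into Annif under a vocabulary id of our choosing
annif load-vocab lcsh lcsh.skos.nt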

@scherztc

https://github.com/samvera/questioning_authority/wiki
https://github.com/samvera/questioning_authority

Questioning Authority is a gem developed by the Samvera Community that might help with subject vocabularies.

@hortongn hortongn moved this from Todo to In Progress in App Dev AI Project Apr 6, 2023
@scherztc scherztc self-assigned this Apr 13, 2023
haitzlm (Contributor) commented Apr 14, 2023

This might be interesting:

The arXiv academic paper dataset can be found on the arXiv website, an open-access repository of scientific papers in various fields.

To access the dataset, go to the arXiv website (https://arxiv.org/) and click on the "Bulk Data Access" link at the bottom of the page. From there, you can download a compressed file that contains the dataset in a format suitable for machine learning tasks. The dataset contains over 1.7 million papers in various fields, including computer science, physics, mathematics, and more. Each paper is labeled with one or more of 23 subject categories, which can be used for text classification tasks.

Note that the dataset is quite large, so you may need significant computing resources to work with it effectively. You may also need to preprocess and clean the data before training your AI model, as the dataset contains a wide variety of paper formats and styles.
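
If we did go this route, the labeled papers would still need to be reshaped into a training corpus Annif understands. A minimal sketch, assuming the data has first been flattened to a two-column TSV of abstract and category (that intermediate layout and the category URI prefix are invented for illustration, not the actual arXiv dump format); Annif's short-text document corpus format is one document per line, the text followed by a tab and its subject URI(s) in angle brackets:

# papers.tsv: abstract<TAB>category  (hypothetical intermediate file)
awk -F'\t' '{ printf "%s\t<http://example.org/arxiv/%s>\n", $1, $2 }' papers.tsv > arxiv-corpus.tsv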

crowesn commented Apr 20, 2023

ETDs from the University of Cincinnati are available in full text from ProQuest as well as the OhioLINK ETD Center. Authors are encouraged to supply keywords for their papers, most of which are uncontrolled.

Initially, our project is small scale. I'd propose we pull the full dataset of ETDs from the OhioLINK ETD Center via OAI-PMH.
From that harvest, we would extract all of the keywords and develop our own subject vocabulary from them. This allows us to use a subset of the ETDs as training data, with the remaining corpus reserved for validation/testing. Further, once we have a trained model, we could use documents from Scholar to see how well the model does on more general documents from the repository.

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
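
To make the proposal concrete, here is a rough sketch of that pipeline. The OAI-PMH endpoint URL is a placeholder (and a real harvest would have to follow resumptionTokens page by page), but dc:subject is where author keywords land in oai_dc records. The TSV vocabulary format (URI, tab, label) is the one described on the wiki page above; the URI namespace is invented just to give each keyword an identifier.

# 1. harvest a page of ETD records (placeholder endpoint; resumptionToken paging omitted)
curl -s "https://example.org/etdcenter/oai?verb=ListRecords&metadataPrefix=oai_dc" > page1.xml
# 2. pull out the uncontrolled author keywords
grep -o '<dc:subject>[^<]*' page1.xml | sed 's/<dc:subject>//' | sort -u > keywords.txt
# 3. turn them into a TSV vocabulary Annif can load (depending on the Annif version, load-vocab may also want a --language option for TSV input)
awk '{ printf "<http://example.org/etd-keyword/%d>\t%s\n", NR, $0 }' keywords.txt > etd-vocab.tsv
annif load-vocab etd etd-vocab.tsv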

crowesn commented Apr 20, 2023

Also looking at https://www.oclc.org/research/areas/data-science/fast/download.html as a possibility for a vocab.
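
The FAST downloads come in several serializations rather than as a ready-made Annif vocabulary, so whichever one we pick would likely need converting into one of the SKOS or TSV forms from the wiki page first; after that the load step should mirror the YSO example above (file name here is a placeholder):

annif load-vocab fast /path/to/fast-skos.ttl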

crowesn commented May 11, 2023

@hortongn hortongn moved this from In Progress to On Hold / Backlog in App Dev AI Project Jul 21, 2023