Annif: identify a subject vocabulary #15
Comments
Questioning Authority is a gem developed by the Samvera Community that might help with subject vocabularies: https://github.com/samvera/questioning_authority/wiki
Structure of subject vocabularies: https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
This might be interesting: the arXiv academic paper dataset can be found on the arXiv website, an open-access repository of scientific papers in various fields. To access the dataset, go to the arXiv website (https://arxiv.org/) and click on the "Bulk Data Access" link at the bottom of the page. From there, you can download a compressed file that contains the dataset in a format suitable for machine learning tasks. The dataset contains over 1.7 million papers in fields including computer science, physics, and mathematics. Each paper is labeled with one or more of 23 subject categories, which can be used for text classification tasks. Note that the dataset is quite large, so you may need significant computing resources to work with it effectively. You may also need to preprocess and clean the data before training a model, as the dataset contains a wide variety of paper formats and styles.
ETDs from the University of Cincinnati are available in full text from ProQuest as well as the OhioLINK ETD Center. Authors are encouraged to supply keywords for their papers, most of which are uncontrolled. Initially, our project is small scale. I'd propose we pull the full dataset of UC ETDs from the OhioLINK ETD Center via OAI-PMH. https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
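The OAI-PMH pull proposed above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch, not a tested harvester: the base URL is not given in this thread, so `base_url` is a placeholder argument, and the sketch assumes the endpoint serves the standard `oai_dc` metadata format with author keywords in `dc:subject`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Standard OAI-PMH and Dublin Core XML namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_text):
    """Parse one ListRecords response into (records, resumption_token)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        # Keep non-empty dc:subject values (assumed to hold the author keywords).
        subjects = [
            el.text.strip()
            for el in rec.iterfind(".//dc:subject", NS)
            if el.text and el.text.strip()
        ]
        records.append({"id": ident, "title": title, "subjects": subjects})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield records page by page; base_url is a placeholder for the real endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        with urlopen(f"{base_url}?{urlencode(params)}") as resp:
            records, token = parse_list_records(resp.read())
        yield from records
        if not token:
            break
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Keeping the parsing separate from the network loop means `parse_list_records` can be checked against a saved response before pointing `harvest` at the live endpoint.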
Also looking at FAST (https://www.oclc.org/research/areas/data-science/fast/download.html) as a possible vocabulary.
Scopus has a subject vocabulary thesaurus, too: https://service.elsevier.com/app/answers/detail/a_id/14882/supporthub/scopus/~/what-are-the-most-frequent-subject-area-categories-and-classifications-used-in/
OAI-PMH URL for UC ETDs at OhioLINK:
https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
https://github.com/NatLibFi/Annif-corpora
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
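If no existing vocabulary fits, one option is to turn the uncontrolled author keywords into a simple vocabulary file. The sketch below assumes the tab-separated layout described on the Annif subject vocabulary formats wiki page (a URI in angle brackets, a tab, then the label). The `http://example.org/etd-keywords/` namespace and the function name are hypothetical, chosen only for illustration.

```python
from urllib.parse import quote


def keywords_to_tsv(keywords, base_uri="http://example.org/etd-keywords/"):
    """Build TSV vocabulary text from free-text keywords.

    base_uri is a made-up namespace; replace it with one you control.
    """
    seen = {}
    for kw in keywords:
        label = " ".join(kw.split())  # collapse runs of whitespace
        key = label.casefold()        # case-insensitive de-duplication
        if label and key not in seen:
            seen[key] = label         # keep the first spelling encountered
    lines = [
        f"<{base_uri}{quote(key.replace(' ', '-'))}>\t{label}"
        for key, label in sorted(seen.items())
    ]
    return ("\n".join(lines) + "\n") if lines else ""
```

The resulting text could be written to a file and passed to `annif load-vocab` in place of the YSO file, assuming the installed Annif version accepts TSV vocabularies for that command.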