
Annif: identify a subject vocabulary #15

Open · 3 tasks · Tracked by #14
hortongn opened this issue Mar 16, 2023 · 10 comments
hortongn (Member) commented Mar 16, 2023

  • Download, convert, and load the LCSH as a vocabulary in Annif (see the sketch below the links)
  • Explore the FAST vocabulary (see below)
  • Explore developing a custom vocabulary from ETD keywords

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats

https://github.com/NatLibFi/Annif-corpora

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
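
A rough sketch of what the LCSH step could look like, assuming the SKOS bulk export from id.loc.gov (the exact file name is a guess; the formats Annif accepts are listed on the Subject-vocabulary-formats wiki page above):

# download an LCSH SKOS dump; file name is illustrative, check https://id.loc.gov/download/ for the current exports
wget https://id.loc.gov/download/lcsh.skos.nt.gz
gunzip lcsh.skos.nt.gz
# load it into Annif under a vocabulary id of our choosing
annif load-vocab lcsh lcsh.skos.nt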

@scherztc

https://github.com/samvera/questioning_authority/wiki
https://github.com/samvera/questioning_authority

Questioning Authority is a gem developed by the Samvera Community that might help with subject vocabularies.

@hortongn hortongn moved this from Todo to In Progress in App Dev AI Project Apr 6, 2023
@scherztc scherztc self-assigned this Apr 13, 2023
haitzlm (Contributor) commented Apr 14, 2023

This might be interesting:

The arXiv academic paper dataset can be found on the arXiv website, an open-access repository of scientific papers in various fields.

To access the dataset, go to the arXiv website (https://arxiv.org/) and click on the "Bulk Data Access" link at the bottom of the page. From there, you can download a compressed file that contains the dataset in a format suitable for machine learning tasks. The dataset contains over 1.7 million papers in various fields, including computer science, physics, mathematics, and more. Each paper is labeled with one or more of 23 subject categories, which can be used for text classification tasks.

Note that the dataset is quite large, so you may need significant computing resources to work with it effectively. You may also need to preprocess and clean the data before training your AI model, as the dataset contains a wide variety of paper formats and styles.
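
If we did go this route, the labeled papers would still need to be reshaped into a training corpus Annif understands. A minimal sketch, assuming the data has first been flattened to a two-column TSV of abstract and category (that intermediate layout and the category URI prefix are invented for illustration, not the actual arXiv dump format); Annif's short-text document corpus format is one document per line, the text followed by a tab and its subject URI(s) in angle brackets:

# papers.tsv: abstract<TAB>category  (hypothetical intermediate file)
awk -F'\t' '{ printf "%s\t<http://example.org/arxiv/%s>\n", $1, $2 }' papers.tsv > arxiv-corpus.tsv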

crowesn commented Apr 20, 2023

ETDs from the University of Cincinnati are available in full text from ProQuest as well as the OhioLINK ETD Center. Authors are encouraged to supply keywords for their papers, most of which are uncontrolled.

Initially, our project is small scale. I'd propose we pull the full dataset of ETDs from the OhioLINK ETD Center via OAI-PMH.
From that harvest, we would extract all of the keywords and develop our own subject vocabulary from them. This allows us to use a subset of the ETDs as training data, with the remaining corpus reserved for validation/testing. Further, once we have a trained model, we could use documents from Scholar to see how well the model does on more general documents from the repository.

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
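
To make the proposal concrete, here is a rough sketch of that pipeline. The OAI-PMH endpoint URL is a placeholder (and a real harvest would have to follow resumptionTokens page by page), but dc:subject is where author keywords land in oai_dc records. The TSV vocabulary format (URI, tab, label) is the one described on the wiki page above; the URI namespace is invented just to give each keyword an identifier.

# 1. harvest a page of ETD records (placeholder endpoint; resumptionToken paging omitted)
curl -s "https://example.org/etdcenter/oai?verb=ListRecords&metadataPrefix=oai_dc" > page1.xml
# 2. pull out the uncontrolled author keywords
grep -o '<dc:subject>[^<]*' page1.xml | sed 's/<dc:subject>//' | sort -u > keywords.txt
# 3. turn them into a TSV vocabulary Annif can load (depending on the Annif version, load-vocab may also want a --language option for TSV input)
awk '{ printf "<http://example.org/etd-keyword/%d>\t%s\n", NR, $0 }' keywords.txt > etd-vocab.tsv
annif load-vocab etd etd-vocab.tsv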

crowesn commented Apr 20, 2023

Also looking at https://www.oclc.org/research/areas/data-science/fast/download.html as a possibility for a vocab.
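
The FAST downloads come in several serializations rather than as a ready-made Annif vocabulary, so whichever one we pick would likely need converting into one of the SKOS or TSV forms from the wiki page first; after that the load step should mirror the YSO example above (file name here is a placeholder):

annif load-vocab fast /path/to/fast-skos.ttl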

crowesn commented May 11, 2023

@hortongn hortongn moved this from In Progress to On Hold / Backlog in App Dev AI Project Jul 21, 2023