This README contains only brief remarks. For more detail, see the preprint on SocArXiv.
The purpose of this project is to generate, through web scraping, a list of major academic publishers and their scholarly journals.
For the results, see the file Output\top-100-publishers.xlsx (last updated in July 2022).
There are three key documents for adding/scraping publishers:
- Data\04_publishers.xlsx: the (adaptable) list of publishers to be scraped, including the URL and the relevant CSS selectors (many of the newer additions in July 2022 were counted manually, although the most relevant CSS selectors containing the journal names or links have been added);
- Script\Function\function_getjournals.R: the scraping function;
- Script\Analysis_06_Extract-Journals.R: the script that runs the scraping function.
The compilation of publishers was generated by drawing from the following four sources:
- DOAJ (using the data dump in Dec. 2020)
- Publons (using webscraping on 11 Dec 2020)
- Scopus (using the csv-formatted source list from Oct. 2020)
- Sherpa Romeo (using webscraping on 11 Dec 2020)
The list of journals was scraped from each publisher's website, using the URLs listed in Data\04_publishers.csv.
The publisher data are extracted in files 01 to 04 in the Script folder, mainly using R's rvest package.
Each publisher has up to four, often differing, journal counts, one from each data source in which it appears. The script takes the highest of these counts for each publisher and then orders the list by that highest count. This is done in file 05 in the Script folder.
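The ranking step can be illustrated with a minimal sketch. The project itself does this in R (file 05); the Python below only mirrors the logic. The publisher names, the counts, the `counts` structure and the `highest_count` helper are all hypothetical and not taken from the project's data.

```python
# Hypothetical journal counts per publisher and source. None marks a source
# in which the publisher does not appear. All figures are made up.
counts = {
    "Publisher A": {"doaj": 12, "publons": 40, "scopus": 35, "sherpa_romeo": None},
    "Publisher B": {"doaj": None, "publons": 95, "scopus": 120, "sherpa_romeo": 110},
    "Publisher C": {"doaj": 7, "publons": None, "scopus": None, "sherpa_romeo": None},
}


def highest_count(by_source):
    """Return the largest available count, ignoring missing sources."""
    available = [n for n in by_source.values() if n is not None]
    return max(available, default=0)


# Pair each publisher with its highest count, then sort in descending order.
ranking = sorted(
    ((name, highest_count(by_source)) for name, by_source in counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, top in ranking:
    print(name, top)
```

Treating a publisher with no counts at all as zero (`default=0`) is a choice made for this sketch only.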
In a further step, the script harmonizes duplicated names of publishers (based on the data in Data\03_publishers_harmonization.txt).
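As a rough illustration of what such a harmonization could look like, the sketch below maps name variants to one canonical name. The format of the harmonization file is not described here, so the `CANONICAL` mapping, the publisher names and the rule of keeping the highest count among merged variants are all assumptions made for the example.

```python
# Hypothetical mapping from name variants to one canonical name. The real
# mapping comes from the project's harmonization file, which may be
# structured differently.
CANONICAL = {
    "Example Press Ltd.": "Example Press",
    "Example Press Limited": "Example Press",
}


def harmonize(records):
    """Collapse name variants, keeping the highest count per canonical name."""
    merged = {}
    for name, count in records:
        cleaned = name.strip()
        key = CANONICAL.get(cleaned, cleaned)  # unknown names pass through
        merged[key] = max(count, merged.get(key, 0))
    return merged


# Made-up records: (publisher name, journal count).
records = [
    ("Example Press Ltd.", 30),
    ("Example Press Limited", 42),
    ("Another House", 15),
]

harmonized = harmonize(records)
print(harmonized)
```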
The rest was done manually, e.g. finding the links to the journal catalogues and collecting the relevant CSS selectors for each publisher (in Data\04_publishers.xlsx).
Finally, the publishers' websites are accessed via a uniform web scraping function (with a different CSS selector for each publisher) to extract all of the publishers' journal names, including the URL of each journal. This is done in file 06 in the Script folder.
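The idea of one function driven by per-publisher selectors can be sketched as follows. This is not the project's `function_getjournals.R`, which is written in R with rvest. It is a simplified Python stand-in that supports only selectors of the form `a.classname` and parses inline HTML instead of fetching pages. The names `LinkExtractor`, `get_journals`, the example URLs and the HTML snippets are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (text, absolute URL) for <a> elements carrying a given class."""

    def __init__(self, css_class, base_url):
        super().__init__()
        self.css_class = css_class
        self.base_url = base_url
        self.journals = []
        self._href = None  # href of the matching <a> that is currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = " ".join("".join(self._text).split())  # tidy whitespace
            self.journals.append((name, urljoin(self.base_url, self._href)))
            self._href = None


def get_journals(html, selector, base_url):
    """Same function for every publisher; only the selector differs."""
    tag, _, css_class = selector.partition(".")
    if tag != "a" or not css_class:
        raise ValueError("this sketch only supports selectors like 'a.classname'")
    parser = LinkExtractor(css_class, base_url)
    parser.feed(html)
    parser.close()
    return parser.journals


# Two made-up publishers with different selectors (inline HTML, no network).
publishers = [
    {"name": "Example Press", "url": "https://example.org/journals/",
     "selector": "a.journal-title",
     "html": '<a class="journal-title" href="/j/alpha">Journal of Alpha</a>'
             '<a class="nav" href="/about">About</a>'
             '<a class="journal-title" href="beta">Beta  Review</a>'},
    {"name": "Another House", "url": "https://example.com/catalogue",
     "selector": "a.title-link",
     "html": '<a class="title-link big" href="https://example.com/x/gamma">Gamma</a>'},
]

results = {p["name"]: get_journals(p["html"], p["selector"], p["url"])
           for p in publishers}
print(results)
```

In a real run, the HTML would be downloaded from each publisher's catalogue URL and a full CSS selector engine would replace the simplified class match used here.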
The CSS selectors for each publisher are saved in Data\04_publishers.xlsx.
The full list of publishers and their journal counts is available in Output\top-100-publishers.xlsx.
The journal list is available in Output\Journals\alljournals-2022-03-02.csv. Note, however, that the list is incomplete: many publishers have not (yet) been scraped, and for these only the number of journals was counted, based on the CSS selectors.