diff --git a/documentation/APS/aps_fields_mapping.md b/documentation/APS/README.md
similarity index 99%
rename from documentation/APS/aps_fields_mapping.md
rename to documentation/APS/README.md
index 0126e131..9db980f4 100644
--- a/documentation/APS/aps_fields_mapping.md
+++ b/documentation/APS/README.md
@@ -1,6 +1,6 @@
 # APS data processing flow
 
-![aps_dags_process](./aps_diagram.png)
+![aps_dags_process](./aps_doc_diagram.png)
 
 # Final fields
 
diff --git a/documentation/APS/aps_diagram.png b/documentation/APS/aps_diagram.png
deleted file mode 100644
index 25e04d6a..00000000
Binary files a/documentation/APS/aps_diagram.png and /dev/null differ
diff --git a/documentation/APS/aps_doc_diagram.png b/documentation/APS/aps_doc_diagram.png
new file mode 100644
index 00000000..29b28cc7
Binary files /dev/null and b/documentation/APS/aps_doc_diagram.png differ
diff --git a/documentation/Elsevier/Elsevier_doc_diagram.png b/documentation/Elsevier/Elsevier_doc_diagram.png
new file mode 100644
index 00000000..d6f151ca
Binary files /dev/null and b/documentation/Elsevier/Elsevier_doc_diagram.png differ
diff --git a/documentation/Elsevier/README.md b/documentation/Elsevier/README.md
new file mode 100644
index 00000000..a904d114
--- /dev/null
+++ b/documentation/Elsevier/README.md
@@ -0,0 +1,206 @@
+
+![elsevier_dags_process](./Elsevier_doc_diagram.png)
+
+# Practical information
+Parsing Elsevier content involves a unique two-phase process, distinct from other publishers.
+
+Phase 1: The initial phase involves parsing the dataset.xml file, which contains all the records and some mandatory fields. This file serves as the foundation for gathering essential data across the dataset.
+
+Phase 2: In the second phase, the main.xml file is parsed for each individual record. The paths to the main.xml file, as well as any associated PDF and PDF/A files, are provided in the dataset.xml file parsed during Phase 1. This ensures that all necessary files and data are correctly linked and processed.
+
+# [Final fields](#final_fields)
+
+| Field | Processed | Subfield | Subsubfield |
+| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -------------- | ----------- |
+| dois | generic_parsing : [45] | value | |
+| arxiv_eprints | enricher : [62] | value | |
+| | | categories | |
+| page_nr | parsing : [36] | | |
+| authors | parsing : [37]<br/>generic_parsing : [33] | surname | |
+| | | given_names | |
+| | | full_name | |
+| | | affiliations | country |
+| | | | institution |
+| collaborations | generic_parsing : [35] | value | |
+| license | parsing : [38] | url | |
+| | | license | |
+| publication_info | generic_parsing : [52] | journal_title | |
+| | | journal_volume | |
+| | | year | |
+| | | artid | |
+| | | material | |
+| abstracts | enhancer : [53] | value | |
+| acquisition_source | enhancer : [54] | source | |
+| | | method | |
+| | | date | |
+| copyright | enhancer : [55] | year | |
+| | | statement | |
+| imprints | enhancer : [56] | date | |
+| | | publisher | |
+| record_creation_date | enhancer : [57] | | |
+| titles | enhancer : [58] | title | |
+| | | source | |
+| $schema | enricher : [60] | | | | | |
+
+
+
+# [Enhancer](#enhancer)
+
+| Reference | Field | Enhancer |
+| ------------------------------ | -------------------- | ---------------------------------------------------------------------------------- |
+| [53] | abstracts | \_\_construct_abstracts |
+| [54] | acquisition_source | \_\_construct_acquisition_source |
+| [55] | copyright | \_\_construct_copyright |
+| [56] | imprints | \_\_construct_imprints |
+| [57] | record_creation_date | \_\_construct_record_creation_date |
+| [58] | titles | \_\_construct_titles |
+| [59] | | \_\_remove_country |
+
+### [\_\_construct_abstracts](#__construct_abstracts)
+
+| Reference | Subfield | Value |
+| ------------------------------ | -------- | ------------------------------------------------------------------------------ |
+| [60] | value | Take value from parsing abstract [2] |
+| [61] | source | Constant: Elsevier |
+
+### [\_\_construct_acquisition_source](#__construct_acquisition_source)
+
+| Reference | Subfield | Value |
+| ------------------------------ | -------- | ------------------------------------------------ |
+| [62] | source | Constant: Elsevier |
+| [63] | method | Constant: Elsevier |
+| [64] | date | datetime.datetime.now().isoformat() |
+
+### [\_\_construct_copyright](#__construct_copyright)
+
+| Reference | Subfield | Value |
+| ------------------------------ | --------- | ----------------------------------------------------------------------------------------- |
+| [65] | year | Take value from parsing copyright_year [7] |
+| [66] | statement | Take value from parsing copyright_statement [8] |
+
+### [\_\_construct_imprints](#__construct_imprints)
+
+| Reference | Subfield | Value |
+| ------------------------------ | --------- | ---------------------------------------------------------------------------------------------------- |
+| [67] | date | Take value from generic_parsing date_published [41] |
+| [68] | publisher | Constant: Elsevier |
+
+### [\_\_construct_record_creation_date](#__construct_record_creation_date)
+
+| Reference | Subfield | Value |
+| ------------------------------ | -------- | ------------------------------------------------ |
+| [69] | | datetime.datetime.now().isoformat() |
+
+### [\_\_construct_titles](#__construct_titles)
+
+| Reference | Subfield | Value |
+| ------------------------------ | -------- | ---------------------------------------------------------------------------------------------------- |
+| [70] | title | Removed fn tags. `FN_REGEX = re.compile(r"")`<br/>`FN_REGEX.sub("", item.pop("title", "")).strip()` |
+| [71] | source | Constant: Elsevier |
+
+### [\_\_remove_country](#__remove_country)
+
+| Reference | Field | Value | Processing |
+| ------------------------------ | ---------------------------------------------------------------------------------------- | ----- | -------------------------------------------- |
+| [72] | From parsed JSON value: affiliations [17].country | | From parsed JSON value: affiliations.country |
+
+
+# [generic_parsing](#generic_parsing)
+
+| Reference | Field | Subfield | Processing | Default value |
+| ------------------------------ | ---------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
+| [33] | authors | full_name | Joins parsed surname [10] and parsed given_names [11]: "{0}, {1}".format(surname, given_names) | |
+| [34] | abstract | | Cleans blank space characters | |
+| [35] | collaborations | value | Puts the collaboration value into the array of dicts: [{"value": collaboration}] | |
+| [36] | collections | primary | Puts the collections value into the array of dicts: [{"value": collections}] | |
+| [37] | title | | Cleans blank space characters | |
+| [38] | subtitle | | Cleans blank space characters, takes the first value. PS. HAVE NEVER SEEN IT | |
+| [39] | journal_year | | Takes parsed journal_year [22] | |
+| [40] | preprint_date | | NO SUCH FIELD IN ELSEVIER | |
+| [41] | date_published | | Takes parsed date_published [21] | |
+| [42] | related_article_doi | | Puts the related_article_doi value into the array of dicts: [{"value": related_article_doi}]. PS. HAVE NEVER SEEN IT | |
+| [43] | free_keywords | | Cleans blank space characters. ELSEVIER DOESN'T HAVE THIS FIELD | |
+| [44] | classification_numbers | | NO SUCH FIELD IN ELSEVIER | |
+| [45] | dois | | Puts the doi value into the array of dicts: [{"value": doi}] | |
+| [46] | thesis_supervisor | | NO SUCH FIELD IN ELSEVIER!!! Should take parsed thesis_supervisor and parse with generic parser in the same way as authors [33] | |
+| [47] | thesis | | NO SUCH FIELD IN ELSEVIER!!! Should take parsed thesis value. | |
+| [48] | urls | | NO SUCH FIELD IN ELSEVIER!!! Should take parsed urls value. | |
+| [49] | local_files | | NO SUCH FIELD IN ELSEVIER!!! Should take parsed local_files value. | |
+| [50] | record_creation_date | | NO SUCH FIELD IN ELSEVIER!!! Should take parsed record_creation_date value. | |
+| [51] | control_field | | NO SUCH FIELD IN ELSEVIER!!! Should take parsed control_field value. | |
+| [52] | publication_info | | \_generic_parsing_publication_info | |
+| | journal_title | | REMOVED |
+| | journal_volume | | REMOVED |
+| | journal_year | | REMOVED |
+| | journal_issue | | REMOVED |
+| | journal_doctype | | REMOVED |
+
+### [\_generic_parsing_publication_info](#_generic_parsing_publication_info)
+
+| Reference | Subfield | Value | Default value |
+| ------------------------------ | ---------------- | ------------------------------------------------------------------------- | ------------- |
+| [27] | journal_title | Takes parsed journal_title [19] | "" |
+| [28] | journal_volume | Takes parsed journal_volume [26] | "" |
+| [29] | year | Takes parsed journal_year [22] | "" |
+| [30] | artid | Takes parsed journal_artid [20] | "" |
+| [31] | material | NO SUCH FIELD IN ELSEVIER | "" |
+| [32] | pubinfo_freetext | NO SUCH FIELD IN ELSEVIER | "" |
+
+
+# [metadata parsing](#metadataparsing)
+Parses the dataset.xml file
+
+| Reference | Field | Method | Source | Parsing |
+|-----------|----------------|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [18] | dois | find | journal-item-unique-ids/doi | [dois] |
+| [19] | journal_title | find | journal-item-unique-ids/jid-aid/jid | |
+| [20] | journal_artid | find | journal-item-unique-ids/jid-aid/aid | |
+| [21] | date_published | find | journal-item-properties/online-publication-date | |
+| [22] | journal_year | | | get year from date_published |
+| [23] | collections | | | same as journal_title |
+| [24] | license | | | const: CC-BY-3.0 |
+| [25] | files | | FOR PDF: files-info/web-pdf/pathname | `{ "pdf": pdf_file_path, "pdfa": os.path.join(os.path.split(pdf_file_path)[0], "main_a-2b.pdf"), "xml": os.path.join( files_dir_base, article.find("files-info/ml/pathname").text ), }` |
+| [26] | journal_volume | find | vol_first=journal-issue/journal-issue-properties/volume-issue-number/vol-first \n suppl=journal-issue/journal-issue-properties/volume-issue-number/suppl | `f"{vol_first} {suppl}"` |
+
+
+# [parsing](#parsing)
+
+| Reference | Field | Required | Method | Source | Parsing |
+|-----------|---------------------|----------|--------|----------------------------------------|--------------------------------------------------------------------------|
+| [1] | dois | True | find | item-info/doi | doi value is pushed to the array: [doi] |
+| [2] | abstract | True | find | head/abstract/abstract-sec/simple-para | |
+| [3] | title | True | find | head/title | |
+| [4] | authors | True | find | | _get_authors |
+| [5] | collaboration | False | find | author-group/collaboration/text/ | |
+| [6] | copyright_holder | True | find | item-info/copyright | |
+| [7] | copyright_year | False | find | item-info/copyright | get attribute value "year" |
+| [8] | copyright_statement | True | find | item-info/copyright | |
+| [9] | journal_doctype | False | find | . | get attribute value of "docsubtype" and map it with article_type_mapping |
+
+### [_get_authors](#_get_authors)
+
+| Reference | Field | Method | Source | Processing |
+|-----------|-------|--------|----------------------------------------------|------------------------------------|
+| | | find | head/author-group | used to parse _get_authors_details |
+| | | find | head/author-group/collaboration/author-group | used to parse _get_authors_details |
+
+
+### [_get_authors_details](#_get_authors_details)
+| Reference | Field | Method | Source | Processing |
+|-----------|--------------|--------------------------|------------|-------------------|
+| [10] | surname | iterated findall results | surname | |
+| [11] | given_names | iterated findall results | given-name | |
+| [12] | affiliations | iterated findall results | | _get_affiliations |
+| [13] | email | iterated findall results | e-address | |
+| [14] | orcid | iterated findall results | orcid | |
+
+
+### [_get_affiliations](#_get_affiliations)
+| Reference | Field | Method | Source | Processing |
+|------------------------|--------------|--------|----------------------------------------------------------------------------------------------------------------------|---------------------------------------------|
+| additional information | ref_id_value | find | affiliation/[@id='{ref_id}'], and ref_id comes from `[cross.get("refid") for cross in author.findall("cross-ref")]` | THIS FIELD IS NOT PRESENT IN THE FINAL JSON |
+| [15] | value | find | {ref_id_value}textfn | |
+| [16] | organization | find | {ref_id_value}affiliation/organization | |
+| [17] | country | find | {ref_id_value}affiliation/country | |
+
+
diff --git a/documentation/Hindawi/hindawi_fields_mapping.md b/documentation/Hindawi/README.md
similarity index 99%
rename from documentation/Hindawi/hindawi_fields_mapping.md
rename to documentation/Hindawi/README.md
index 1824eaaf..024a2f7d 100644
--- a/documentation/Hindawi/hindawi_fields_mapping.md
+++ b/documentation/Hindawi/README.md
@@ -1,3 +1,5 @@
+![hindawi_doc_diagram](./hindawi_doc_diagram.png)
+
 # [Final fields](#final_fields)
 
 | Field | Processed | Subfield | Subsubfield |
diff --git a/documentation/Hindawi/hindawi_doc_diagram.png b/documentation/Hindawi/hindawi_doc_diagram.png
new file mode 100644
index 00000000..14bb80b4
Binary files /dev/null and b/documentation/Hindawi/hindawi_doc_diagram.png differ
diff --git a/documentation/IOP/iop_fields_mapping.md b/documentation/IOP/README.md
similarity index 99%
rename from documentation/IOP/iop_fields_mapping.md
rename to documentation/IOP/README.md
index f713d35b..71ccfa7f 100644
--- a/documentation/IOP/iop_fields_mapping.md
+++ b/documentation/IOP/README.md
@@ -1,4 +1,4 @@
-![iop_dags_process](./iop_diagram.png)
+![iop_dags_process](./iop_doc_diagram.png)
 
 # [Final fields](#final_fields)
 
diff --git a/documentation/IOP/iop_diagram.png b/documentation/IOP/iop_diagram.png
deleted file mode 100644
index 8ef21482..00000000
Binary files a/documentation/IOP/iop_diagram.png and /dev/null differ
diff --git a/documentation/IOP/iop_doc_diagram.png b/documentation/IOP/iop_doc_diagram.png
new file mode 100644
index 00000000..200e7d39
Binary files /dev/null and b/documentation/IOP/iop_doc_diagram.png differ
diff --git a/documentation/OUP/oup_fields_mapping.md b/documentation/OUP/README.md
similarity index 99%
rename from documentation/OUP/oup_fields_mapping.md
rename to documentation/OUP/README.md
index 1f5a6e80..fd1155f8 100644
--- a/documentation/OUP/oup_fields_mapping.md
+++ b/documentation/OUP/README.md
@@ -1,4 +1,4 @@
-![oup_dags_process](./oup_diagram.png)
+![oup_dags_process](./oup_doc_diagram.png)
 
 # Final fields
 
diff --git a/documentation/OUP/oup_diagram.png b/documentation/OUP/oup_diagram.png
deleted file mode 100644
index fbea8de1..00000000
Binary files a/documentation/OUP/oup_diagram.png and /dev/null differ
diff --git a/documentation/OUP/oup_doc_diagram.png b/documentation/OUP/oup_doc_diagram.png
new file mode 100644
index 00000000..fba5155f
Binary files /dev/null and b/documentation/OUP/oup_doc_diagram.png differ
diff --git a/documentation/Springer/springer_fields_mapping.md b/documentation/Springer/README.md
similarity index 99%
rename from documentation/Springer/springer_fields_mapping.md
rename to documentation/Springer/README.md
index 574afc0b..08a3d9a5 100644
--- a/documentation/Springer/springer_fields_mapping.md
+++ b/documentation/Springer/README.md
@@ -1,4 +1,4 @@
-![springer_dags_process](./springer_diagram.png)
+![springer_dags_process](./springer_doc_diagram.png)
 
 # Final fields
 
diff --git a/documentation/Springer/springer_diagram.png b/documentation/Springer/springer_diagram.png
deleted file mode 100644
index c6ef313e..00000000
Binary files a/documentation/Springer/springer_diagram.png and /dev/null differ
diff --git a/documentation/Springer/springer_doc_diagram.png b/documentation/Springer/springer_doc_diagram.png
new file mode 100644
index 00000000..80a4aba1
Binary files /dev/null and b/documentation/Springer/springer_doc_diagram.png differ
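
The two-phase flow described in documentation/Elsevier/README.md can be sketched roughly as follows. This is a minimal illustration, not the repository's parser: the toy XML layout (un-namespaced, with `journal-item` directly under the root), the sample DOI, the base directory, and the helper names `parse_dataset`, `parse_main` and `load_main_xml` are all assumptions. Only the element paths (`journal-item-unique-ids/doi`, `files-info/web-pdf/pathname`, `files-info/ml/pathname`, `head/title`) and the `main_a-2b.pdf` file name come from the tables in that README.

```python
import os
import xml.etree.ElementTree as ET

# Toy stand-ins for dataset.xml and main.xml. Real Elsevier files are
# namespaced and much larger; this layout and the DOI are made up.
DATASET_XML = """
<dataset>
  <journal-item>
    <journal-item-unique-ids>
      <doi>10.1016/j.example.2024.000001</doi>
    </journal-item-unique-ids>
    <files-info>
      <ml><pathname>issue/article/main.xml</pathname></ml>
      <web-pdf><pathname>issue/article/main.pdf</pathname></web-pdf>
    </files-info>
  </journal-item>
</dataset>
"""

MAIN_XML = """
<article>
  <head>
    <title>  An example title  </title>
  </head>
</article>
"""


def parse_dataset(xml_text, files_dir_base):
    """Phase 1: collect the DOI and the file paths of every record."""
    records = []
    root = ET.fromstring(xml_text.strip())
    for item in root.findall("journal-item"):
        # Assumption: the PDF path is resolved against files_dir_base.
        pdf_file_path = os.path.join(
            files_dir_base, item.findtext("files-info/web-pdf/pathname")
        )
        records.append(
            {
                "dois": [item.findtext("journal-item-unique-ids/doi")],
                "files": {
                    "pdf": pdf_file_path,
                    # The PDF/A file sits next to the PDF under a fixed name.
                    "pdfa": os.path.join(
                        os.path.split(pdf_file_path)[0], "main_a-2b.pdf"
                    ),
                    "xml": os.path.join(
                        files_dir_base, item.findtext("files-info/ml/pathname")
                    ),
                },
            }
        )
    return records


def parse_main(xml_text):
    """Phase 2: read per-record fields from one main.xml document."""
    root = ET.fromstring(xml_text.strip())
    return {"title": (root.findtext("head/title") or "").strip()}


def load_main_xml(path):
    """Hypothetical loader: returns the in-memory sample instead of reading disk."""
    return MAIN_XML


# Phase 1 yields the records; Phase 2 enriches each one from its main.xml.
records = parse_dataset(DATASET_XML, "/data/elsevier")
for record in records:
    record.update(parse_main(load_main_xml(record["files"]["xml"])))
```

Keeping the phases as separate functions mirrors the README's description: Phase 1 is the only place that knows where each record's files live, and Phase 2 only needs the `xml` path it is handed.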