`read_refs()` doesn't load all RIS files properly #15

Matherion · 2021-04-08T16:03:46Z

For a systematic review (duh 😬) we're loading RIS files exported from:

ERIC (through its own interface, if I remember correctly)
PubMed (definitely through its own interface)
PsycINFO (through EBSCO)
Bielefeld Academic Search Engine (its own interface)
OpenGrey (through Exalead)
OAIster (through WorldCat)
Cinahl (through EBSCO)
Embase (through Ovid)
Reference search (through reference lists and Google Scholar)

These work fine for the first few, but the fields aren't imported properly from Cinahl and Embase. I can't figure out why - the Cinahl file, for example, seems pretty straightforward RIS (https://gitlab.com/extending-the-earcheck/living-review/-/blob/master/queries/literature_search_02/CINAHL_Ebsco_N236.ris), e.g.:

TY  - JOUR
ID  - 147887838
T1  - Hearing Problems Among the Members of the Defence Forces in Relation to Personal and Occupational Risk Factors.
AU  - Luha, Assar
AU  - Kaart, Tanel
AU  - Merisalu, Eda
AU  - Indermitte, Ene
AU  - Orru, Hans
Y1  - 2020/11//Nov/Dec2020
N1  - Accession Number: 147887838. Language: English. Entry Date: In Process. Revision Date: 20210107. Publication Type: journal article; research. Journal Subset: Biomedical; Expert Peer Reviewed; Peer Reviewed; USA. NLM UID: 2984771R. 
SP  - e2115
EP  - e2123
JO  - Military Medicine
JF  - Military Medicine
JA  - MILIT MED
VL  - 185
IS  - 11/12
PB  - Oxford University Press / USA
AB  - Introduction: The Defence Forces' members are exposed to high-level noise that increases their risk of hearing loss (HL). Besides military noise, the other risk factors include age and gender, ototoxic chemicals, vibration, and chronic stress. The current study was designed to study the effects of personal, work conditions-related risk factors, and other health-related traits on the presence of hearing problems.Materials and Methods: A cross-sectional study among active military service members was carried out. Altogether, 807 respondents completed a questionnaire about their health and personal and work-related risk factors in indoor and outdoor environments. The statistical analysis was performed using statistical package of social sciences (descriptive statistics) and R (correlation and regression analysis) software.Results: Almost half of the active service members reported HL during their service period. The most important risk factors predicting HL in the military appeared to be age, gender, and service duration. Also, working in a noisy environment with exposure to technological, vehicle, and impulse noise shows a statistically significant effect on hearing health. Moreover, we could identify the effect of stress on tinnitus and HL during the service period. Most importantly, active service members not using hearing protectors, tend to have more tinnitus than those who use it.Conclusions: The members of the Defence Forces experience noise from various sources, most of it resulting from outdoor activities. Personal and work conditions-related risk factors as well as stress increase the risk of hearing problems.
SN  - 0026-4075
AD  - Institute of Technology , Estonian University of Life Sciences, Kreutzwaldi 56/1, Tartu 51006, Estonia
AD  - Institute of Veterinary Medicine and Animal Sciences , Estonian University of Life Sciences , Kreutzwaldi 62, Tartu 51006, Estonia
AD  - Institute of Family Medicine and Public Health , University of Tartu , Ravila 19, Tartu 50411, Estonia
U2  - PMID: NLM32879984.
DO  - 10.1093/milmed/usaa224
UR  - http://login.ezproxy.ub.unimaas.nl/login?url=https://search.ebscohost.com/login.aspx?direct=true&db=cin20&AN=147887838&site=ehost-live&scope=site
DP  - EBSCOhost
DB  - cin20
ER  -

But if I then run:

refs <- synthesisr::read_refs(here::here("queries", "literature_search_02", "CINAHL_Ebsco_N236.ris"));
names(refs);

It shows:

 [1] "author"         "address"        "date_published" "issue"          "abstract"      
 [6] "DB"             "DO"             "EP"             "ID"             "JA"            
[11] "JF"             "JO"             "PB"             "SN"             "SP"            
[16] "TY"             "UR"             "VL"             "KW"             "CY"            
[21] "AV"

So it doesn't recognize the T1 field as the title - but it does drop it from the data frame for some reason. I hope to figure out how the import functions were designed exactly so I can debug this myself (and submit a pull request), but I'm not sure I'll manage, so I'm also posting this here in case others run into similar problems.


mjwestgate · 2021-05-24T23:23:43Z

Hi Gjalt-Jorn - sorry for the delay on getting to this - I think it's fixed now. Basically those files were getting incorrectly parsed using parse_pubmed rather than parse_ris because they contain the string "PMID" in them. Following this commit, when I import your file I get:

[1] "database" "source_type" "author" "address" "date_published" "title"
[7] "journal" "volume" "issue" "start_page" "end_page" "abstract"
[13] "doi" "issn" "url" "publisher" "chemicals" "notes"
[19] "ID" "U2" "Y1" "keywords" "AV" "M1"
[25] "A2"

There are still some weird things in there (e.g. "chemicals" is probably wrong), but the basic information looks correct. Let me know if there is anything else you need me to check.
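To illustrate how this kind of misdetection can arise, here is a minimal sketch. Both helper functions are hypothetical and written only for this example; they are not the detection code synthesisr actually uses. The first flags a file as PubMed format as soon as the string "PMID" appears anywhere, which is enough to misclassify the CINAHL record above because of its U2 line. The second only accepts a line that starts with the "PMID- " tag.

```r
# Hypothetical helpers for illustration only; these are NOT synthesisr's
# actual detection functions.

# Naive check: any occurrence of "PMID" anywhere flags the file as PubMed.
detect_naive <- function(lines) {
  if (any(grepl("PMID", lines, fixed = TRUE))) "pubmed" else "ris"
}

# Stricter check: only a line that starts with the "PMID- " tag counts.
detect_anchored <- function(lines) {
  if (any(grepl("^PMID- ", lines))) "pubmed" else "ris"
}

# A shortened version of the CINAHL record from the first post
cinahl_lines <- c(
  "TY  - JOUR",
  "T1  - Hearing Problems Among the Members of the Defence Forces",
  "U2  - PMID: NLM32879984.",
  "ER  -"
)

detect_naive(cinahl_lines)     # misclassifies the RIS record as PubMed
detect_anchored(cinahl_lines)  # treats it as RIS
```

Anchoring to the start of the line is one possible design; it assumes that PubMed-format exports always begin a record with a line-initial "PMID- " tag.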

Matherion · 2021-05-26T20:13:53Z

Well, for our sysrev at least this is perfect! Awesome, thank you so much!

One that still acts oddly is the Embase/OVID one (in our repo it's at https://gitlab.com/extending-the-earcheck/living-review/-/blob/master/queries/literature_search_02/Embase_Ovid_N953.ris; check https://gitlab.com/extending-the-earcheck/living-review/-/raw/master/queries/literature_search_02/Embase_Ovid_N953.ris for the raw version), but I think that's because it probably violates the RIS standard, no?

It includes lines that show query information in between the records. I'll 'manually' (i.e. with a script :-)) strip that for now. However, if this is how Ovid exports Embase results by default (it might also have been a setting applied by my collaborator who exported the hits), maybe it would be good to check for that? I'll send the link to the code once I've pushed it :-)

For now, thank you very much! I'll leave this open for a little bit to maybe elaborate if something else turns up. For now, however, again, thank you very much!!! 🙏

Matherion · 2021-05-27T20:41:55Z

Awesome! Fixed that last error, and it all seems to work perfectly now!

I added some preprocessing, where all lines outside of records are removed from the RIS files.*

This code is here (especially these lines), but these are the most important lines:

        ### Get all lines that start with TY and ER; depends on whether
        ### the file is a pubmed export
        if (any(grepl("PMID-", fileAsText))) {
          TY_regex <- "PMID-";
          ER_regex <- "SO  -"
          ### Actually, no further processing required; pubmed isn't
          ### where the problem is
        } else {
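          ### Anchor the tags to the start of the line, so that text
          ### elsewhere in a field cannot be mistaken for a record boundary
          TY_regex <- "^TY  -";
          ER_regex <- "^ER  -"
          TY_lines <- grep(TY_regex, fileAsText);
          ER_lines <- grep(ER_regex, fileAsText);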
          
          if (length(TY_lines) != length(ER_lines)) {
            stop("Something seems to have gone wrong; I found ",
                 length(TY_lines), " lines that start a RIS record ",
                 "but ", length(ER_lines), " lines that end one.");
          }
          
          ### Get all indices in between
          recordlines <-
            do.call(
              c,
              mapply(
                seq,
                TY_lines,
                ER_lines,
                SIMPLIFY = FALSE
              )
            );
          
          fileAsText <- fileAsText[recordlines];
          
        }

So basically, get all lines matching the TY tag and the ER tag, then all the lines in between each pair, then index the string to remove all other lines. That solves that Embase/Ovid problem, and should be safe to run in any case (which this code does, without apparent problems).
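As a self-contained illustration of that idea, here is a minimal sketch that can be run on its own. The function name `strip_non_record_lines` and the toy input are made up for this example, and the patterns assume that the TY and ER tags always sit at the very start of a line; this is not code from synthesisr.

```r
# Hypothetical helper (not part of synthesisr): keep only the lines that
# belong to RIS records, i.e. from each "TY  -" line to its "ER  -" line.
strip_non_record_lines <- function(lines) {
  ty <- grep("^TY  -", lines)
  er <- grep("^ER  -", lines)
  if (length(ty) != length(er) || any(ty >= er)) {
    stop("Found ", length(ty), " record starts but ", length(er),
         " record ends, or they are not properly paired.")
  }
  # Build the index range start:end for every record and concatenate them
  keep <- unlist(Map(seq, ty, er))
  lines[keep]
}

# Made-up input that mimics query information in between records
toy <- c(
  "<1. >",
  "TY  - JOUR",
  "T1  - First record",
  "ER  -",
  "",
  "<2. >",
  "TY  - JOUR",
  "T1  - Second record",
  "ER  -"
)

cleaned <- strip_non_record_lines(toy)
print(cleaned)
```

The cleaned character vector could then be written to a temporary file with `writeLines()` and passed to `read_refs()`.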

Would you like me to adapt this, integrate it in synthesisr, and submit a PR?

* Except PubMed files - apparently PubMed uses some other weird RIS format that doesn't follow the RIS standard?

mjwestgate · 2021-06-04T07:42:00Z

Hi Gjalt-Jorn,

I just had a look at this - the files have moved a bit which might mean I've got something wrong - but using the GitHub version of synthesisr I can import both Embase_290521_N974.RIS and Embase_290521_N974_preprocessed_for_synthesisr.RIS using read_refs() to get the same result in each case, and no errors or warnings. There are still a few imperfections around the TY field, but otherwise it looks ok and I can't spot any obvious bugs.

Am I missing something? I'm happy to look at this again if that would help!

Martin

LukasWallrich mentioned this issue Feb 23, 2023

Problem with import leading to failed deduplication ESHackathon/CiteSource#96

Closed
