Initial commit for paper on guidelines
Co-authored-by: Jonathan de Bruin <[email protected]>
Co-authored-by: Rens van de Schoot <[email protected]>
Co-authored-by: Sofie vd Brand <[email protected]>
Co-authored-by: ingeborgvandusseldorp <[email protected]>
Co-authored-by: wouterharmsen <[email protected]>
6 people committed Jun 24, 2021
0 parents commit 174aed0
Showing 298 changed files with 75,474 additions and 0 deletions.
195 changes: 195 additions & 0 deletions .gitignore
@@ -0,0 +1,195 @@
#raw_data/*.csv
jobs.sh
post-processing/*.R
post-processing/data/*
*.h5

# Created by https://www.toptal.com/developers/gitignore/api/python,r
# Edit at https://www.toptal.com/developers/gitignore?templates=python,r

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
pytestdebug.log

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/
doc/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pythonenv*

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# profiling data
.prof

### R ###
# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# User-specific files
.Ruserdata

# Example code in package build process
*-Ex.R

# Output files from R CMD build
/*.tar.gz

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf

# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth

# knitr and R markdown default cache directories
*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md

# R Environment Variables
.Renviron

### R.Bookdown Stack ###
# R package: bookdown caching files
/*_files/

# End of https://www.toptal.com/developers/gitignore/api/python,r
61 changes: 61 additions & 0 deletions .zenodo.json
@@ -0,0 +1,61 @@
{ "description":"This repository contains scripts to run simulations for 14 datasets openly published on the Dutch database for medical guidelines belonging to the paper Artificial intelligence supports literature screening in medical guideline development: towards up-to-date medical guidelines.",
"title":"Scripts for paper on Towards up-to-date medical guidelines",
"creators":[
{
"name":"Harmsen, Wouter",
"affiliation":"Knowledge Institute for Federation of Medical Specialists, Utrecht, The Netherlands",
"orcid":"0000-0003-1423-9445"
},
{
"name":"de Groot, Janke",
"affiliation":"Knowledge Institute for Federation of Medical Specialists, Utrecht, The Netherlands",
"orcid":"0000-0002-8545-2246"
},
{
"name":"Harkema, Albert",
"affiliation":"Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands",
"orcid":"0000-0002-7091-1147"
},
{
"name":"van Dusseldorp, Ingeborg",
"affiliation":"Knowledge Institute for Federation of Medical Specialists, Utrecht, The Netherlands",
"orcid":"0000-0002-6551-1413"
},
{
"name": "De Bruin, Jonathan",
"affiliation": "Department of Research and Data Management Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands",
"orcid": "0000-0002-4297-0502"
},
{
"name":"Van den Brand, Sofie",
"affiliation":"Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands",
"orcid":"0000-0002-2408-3336"
},
{
"name":"Van de Schoot, Rens",
"affiliation":"Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands",
"orcid":"0000-0001-7736-2091"
}

],
"keywords":[
"guideline development",
"medical guidelines",
"medical",
"artificial intelligence",
"active learning",
"machine learning",
"systematic reviewing",
"systematic review",
"text data",
"natural language processing",
],
"related_identifiers":[
{
"relation":"isSupplementTo",
"identifier":"https://www.nature.com/articles/s42256-020-00287-7"
}
],
"license":"MIT",
"upload_type":"software"
}
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 ASReview

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
107 changes: 107 additions & 0 deletions README.md
@@ -0,0 +1,107 @@
# Scripts for paper on "Towards up-to-date medical guidelines"

The purpose of this study was to evaluate the performance and feasibility of active learning to support the selection of relevant publications within the context of medical guideline development. This repository contains scripts to run and analyze simulations for 14 datasets openly published on the Dutch database for [medical guidelines](https://www.richtlijnendatabase.nl). The results are published in the paper "Artificial intelligence supports literature screening in medical guideline development: towards up-to-date medical guidelines".


## Installation

The scripts in this repository require Python 3.6+. Install the required dependencies from the command line:

```
pip install -r requirements.txt
```

## Datasets

The raw data can be obtained via the Open Science Framework ([OSF](https://osf.io/vt3n4/)) and comprise 14 published guidelines from the [Dutch Medical Guideline Database](https://richtlijnendatabase.nl/). Download the following files from OSF and place them in a folder named `raw_data`:

```
Distal_radius_fractures_approach.csv
Distal_radius_fractures_closed_reduction.csv
Hallux_valgus_prognostic.csv
Head_and_neck_cancer_bone.csv
Head_and_neck_cancer_imaging.csv
Obstetric_emergency_training.csv
Post_intensive_care_treatment.csv
Pregnancy_medication.csv
Shoulder_replacement_diagnostic.csv
Shoulder_replacement_surgery.csv
Shoulderdystocia_positioning.csv
Shoulderdystocia_recurrence.csv
Total_knee_replacement.csv
Vascular_access.csv
```

Each dataset contains

```
title
abstract
```

and three columns with labeling decisions titled:

```
noisy_inclusion
expert_inclusion
fulltext_inclusion
```

Each dataset in *raw_data* is split into three datasets, one per labeling column, yielding 42 datasets in total. The split is performed by executing `job_splitfiles.sh`, and the results are stored in the subfolder *data*.
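
As an illustration, a minimal sketch of what this split could look like in Python (the authoritative logic lives in `job_splitfiles.sh` and the scripts it calls; the output naming scheme and the `label_included` column name are assumptions here):

```
# Sketch only: split each raw dataset into one file per labeling column.
# The actual split is performed by job_splitfiles.sh; the output naming
# scheme used here is an illustrative assumption.
from pathlib import Path

import pandas as pd

LABEL_COLUMNS = ["noisy_inclusion", "expert_inclusion", "fulltext_inclusion"]

out_dir = Path("data")
out_dir.mkdir(exist_ok=True)

for csv_path in Path("raw_data").glob("*.csv"):
    df = pd.read_csv(csv_path)
    for label in LABEL_COLUMNS:
        # Keep title/abstract plus one labeling decision per output file.
        subset = df[["title", "abstract", label]].rename(columns={label: "label_included"})
        subset.to_csv(out_dir / f"{csv_path.stem}_{label}.csv", index=False)
```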


## Descriptive dataset statistics

To create descriptive statistics for each dataset run:

```
sh generate_dataset_characteristics.sh
```

The results are stored in `output/simulation/[NAME_DATASET]/descriptives/*.json`. Running `python scripts/merge_descriptives.py` merges them into one table (*csv* and *excel*), which is stored in `output/table/data_descriptives.*`.
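
For intuition, a minimal sketch of that merge step (assuming each `*.json` file holds a flat dictionary of statistics; `scripts/merge_descriptives.py` is the authoritative implementation):

```
# Sketch only: collect the per-dataset descriptive JSON files into one
# table. Assumes each JSON file contains a flat dict of statistics.
import json
from pathlib import Path

import pandas as pd

rows = {}
for path in Path("output/simulation").glob("*/descriptives/*.json"):
    dataset = path.parts[2]  # output/simulation/[NAME_DATASET]/...
    rows.setdefault(dataset, {}).update(json.loads(path.read_text()))

Path("output/table").mkdir(parents=True, exist_ok=True)
table = pd.DataFrame.from_dict(rows, orient="index")
table.to_csv("output/table/data_descriptives.csv")
table.to_excel("output/table/data_descriptives.xlsx")  # requires openpyxl
```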


## Create wordclouds

To create wordclouds for each dataset run:

```
sh wordcloud_jobs.sh
```

The results are stored in `output/simulation/[NAME_DATASET]/descriptives/wordcloud`.
Three versions of the wordcloud are available, based on the title/abstract words of the following subsets (see the sketch below the list):

- the entire set of records;
- the relevant records only;
- the irrelevant records only.
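
As an illustration, one such wordcloud could be generated with the `wordcloud` package roughly as follows (a sketch only; the input file name follows the split naming assumed above, and `wordcloud_jobs.sh` is the authoritative pipeline):

```
# Sketch only: wordcloud of title/abstract words for the relevant records.
# The input file name and label column are illustrative assumptions.
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("data/Vascular_access_expert_inclusion.csv")
relevant = df[df["label_included"] == 1]

text = " ".join(relevant["title"].fillna("") + " " + relevant["abstract"].fillna(""))
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
wc.to_file("wordcloud_relevant.png")
```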


## Simulation

For each dataset, the simulation was run as many times as there are relevant records: in each run, one relevant record served as a prior inclusion, together with 10 randomly chosen irrelevant records. The same 10 irrelevant records were used in every run of a given dataset. To extract information about the records used as prior knowledge, run `python scripts/get_prior_knowledge.py`; the result is stored in `output/tables`.
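
A minimal sketch of this prior-knowledge scheme (illustrative only; the input file name and seed are assumptions, and the records actually used can be extracted with `scripts/get_prior_knowledge.py`):

```
# Sketch only: one relevant record as prior inclusion per run, plus the
# same 10 seeded irrelevant records in every run of a given dataset.
import numpy as np
import pandas as pd

df = pd.read_csv("data/Vascular_access_expert_inclusion.csv")  # assumed file name
relevant_idx = df.index[df["label_included"] == 1]
irrelevant_idx = df.index[df["label_included"] == 0]

rng = np.random.default_rng(42)  # illustrative seed; real seeds are in run_simulation.sh
fixed_irrelevant = rng.choice(irrelevant_idx, size=10, replace=False)

# One run per relevant record: that record plus the fixed irrelevant ones.
prior_sets = [[rel, *fixed_irrelevant] for rel in relevant_idx]
```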


To obtain the result of the simulation, run:

```
sh run_simulation.sh
```

The results are stored in `output/simulation`. The dataset characteristics are obtained with `python scripts/merge_descriptives.py` and stored in `output/tables`. The per-run metrics of the simulation study can be obtained with `python scripts/merge_metrics.py` and are stored in `output/tables`.
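
For intuition on such per-run metrics, a sketch of a recall curve, a common measure in this kind of simulation study (illustrative only; which metrics are actually computed is defined in `scripts/merge_metrics.py`):

```
# Sketch only: recall after screening each record, given the labels in
# the order the simulation screened them. Metric choice is illustrative.
import numpy as np

def recall_curve(labels_in_screening_order):
    """Fraction of all relevant records found after each screened record."""
    labels = np.asarray(labels_in_screening_order)
    return np.cumsum(labels) / labels.sum()

# Toy example: 3 relevant records among 10, screened in this order.
print(recall_curve([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))
```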

The raw `h5` files total 28.4 GB and are available on request (see the contact details below). However, the results can be reproduced by running the simulation again with ASReview v0.16. Seed values are set in `run_simulation.sh`.

## Analyses

The Jupyter notebook [analyses/analyses_guidelines_KIFMS.ipynb](analyses/analyses_guidelines_KIFMS.ipynb)
contains a detailed, step-by-step analysis of the simulations performed in this project. For more information about the analysis, read the [README](analyses).

## License

The content in this repository is published under the MIT license.

## Contact

For any questions or remarks, please send an email to [email protected].
