NHANES DATA SCRIPTS

This set of Python scripts downloads, parses, and aggregates the public data from the National Health and Nutrition Examination Survey (NHANES). It outputs several files, among them a tsv table containing all the data aggregated into a single file, and xml files holding the variable metadata from the online codebooks. The data tsv file, together with the dictionary file and an xml file with the grouping structure, can be used as input for visualization with Mirador.

DEPENDENCIES

The scripts have the following dependencies:

  1. Python 3.7 or higher (not compatible with 2.x, tested with 3.7.5) and the additional packages listed in requirements.txt.
  2. R (tested with version 3.6.1), and the Hmisc package: https://cran.r-project.org/web/packages/Hmisc/index.html

A convenient way to install all of the software tools mentioned above is through the Anaconda Python/R distribution, or with the minimal version of Anaconda, called Miniconda. In the latter case, you still have to run pip install -r requirements.txt to install the additional Python dependencies (not included in Miniconda), and install R and Hmisc manually. This is easily done with the conda package management tool included with Miniconda, by running the following two commands:
    conda install r-core
    conda install r-hmisc

CREATING AND MERGING DATASETS

The sequence of steps to generate a Mirador-valid dataset is to first download the individual data files from the NHANES ftp server, and then run the scripts that parse and aggregate these files into a single table. These scripts use the following folder structure:

/ root
|
\---- sources
|        |
|        \--- xpt
|        |
|        \--- csv   
|
\---- data
        |
        \--- mirador
               |
               \---- 1999-2000
               |
               \---- 2001-2002
                     ...   

where root is the folder containing all the Python scripts and associated files. NHANES provides the raw data as SAS Transport Files (.xpt), which the download script stores in sources/xpt. These files are converted into Comma-Separated Values (.csv) files, which are created in the sources/csv folder. The dataset for each cycle is stored in the corresponding subfolder under data/mirador, as shown in the diagram. Consecutive cycles can also be aggregated into a single dataset. The aggregation scripts take care of properly merging the sample and subsample weights (see appendix) and of the equivalence between variable names across cycles.

1) Download the data for a given cycle:

python getdata.py 1999-2000

2) Create the Mirador dataset:

python makedataset.py 1999-2000

3) Finalize the dataset by deleting temporary files and adding a Mirador configuration file. Once finalized, the dataset can no longer be used for merging (see below), because the merging scripts rely on temporary files that this step removes. The contents of the dataset folder are then ready to load from Mirador:

python finaldataset.py 1999-2000

4) Once several consecutive cycles have been made, one can create an aggregated dataset, by merging all the cycles encompassed by the specified interval:

python mergedatasets.py 1999-2010

As mentioned above, this has to be done before finalizing the individual cycles. If the merging operation has to be redone several times, one can add the -keep parameter when finalizing the datasets:

python finaldataset.py 1999-2000 -keep

5) A convenience bash script is included to run all previous steps for a given year range:

makeall.sh 1999 2018

This will create the datasets for all the cycles between years 1999 and 2018, as well as the aggregated dataset 1999-2018. All datasets will be finalized after running this script.
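The order of operations described in steps 1 to 5 can be sketched in Python as follows. This is only an illustration of the sequence, not the contents of makeall.sh: the helper names cycles and build_commands are hypothetical, and the ordering (merge first, finalize afterwards) is inferred from the requirement that merging happens before finalizing.

```python
def cycles(first_year, last_year):
    """Return the two-year cycle labels covering first_year..last_year."""
    # Cycles start every second year (1999, 2001, ...) and span two years.
    return [f"{y}-{y + 1}" for y in range(first_year, last_year, 2)]


def build_commands(first_year, last_year):
    """List the commands of steps 1-4 in an order that keeps merging possible."""
    labels = cycles(first_year, last_year)
    cmds = []
    for label in labels:
        cmds.append(["python", "getdata.py", label])
        cmds.append(["python", "makedataset.py", label])
    # Merging needs the temporary files, so it runs before any finalization.
    cmds.append(["python", "mergedatasets.py", f"{first_year}-{last_year}"])
    for label in labels:
        cmds.append(["python", "finaldataset.py", label])
    return cmds
```

Each returned command could then be passed to subprocess.run(cmd, check=True) from the folder that contains the scripts.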

ADDING COMPOSITE VARIABLES

Composite variables are defined as functions of existing variables in the dataset. They can be added by running the composite script with a Python script that defines the functional relationship. This script must implement a series of functions to be properly executed by composite.py; a fully commented template is provided in composites/template.py. The result of the calculation can either overwrite the source dataset or be stored in a separate set of data, dictionary, and grouping files.

1) Adding a composite, overwriting the original dataset

python composite.py data/mirador/1999-2000 composites/obesity.py

2) Adding a composite, without overwriting the original dataset. The new files will be called data_obesity.tsv, dictionary_obesity.tsv, and groups_obesity.xml, and stored in the same dataset folder.

python composite.py data/mirador/1999-2000 composites/obesity.py _obesity

Note that there is no need to finalize the dataset after adding a composite variable. The composite script updates all required files in the dataset, so it can be used right away without further processing steps.
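To illustrate what "a function of existing variables" means, here is a minimal standalone sketch of an obesity indicator derived from body mass index. It does not follow the interface that composite.py expects (that interface is documented in composites/template.py and is not reproduced here), and it is not necessarily what composites/obesity.py computes. The column name OBESITY and the use of None for missing values are assumptions; BMXBMI is the NHANES body mass index variable, and 30 kg/m^2 is the usual adult obesity threshold.

```python
def obesity(bmi):
    """Return 1 if BMI is at or above 30 kg/m^2, 0 if below, None if missing."""
    if bmi is None:
        # Propagate missing values instead of guessing a category.
        return None
    return 1 if bmi >= 30.0 else 0


# Toy rows standing in for records of the aggregated data table.
rows = [{"BMXBMI": 22.4}, {"BMXBMI": 31.7}, {"BMXBMI": None}]
for row in rows:
    row["OBESITY"] = obesity(row["BMXBMI"])
```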

ADVANCED USE

STEP BY STEP EXECUTION

The getdata, makedataset, and mergedatasets scripts execute several intermediate steps. These steps can be run individually to isolate the source of a problem when an error occurs, or to have more control over where the files are stored.

1) Download data:

python download.py 1999-2000 data/sources/xpt/1999-2000

2) Convert to csv:

python xpt2csv.py data/sources/xpt/1999-2000 data/sources/csv/1999-2000

An alternative to the provided xpt2csv script, which internally calls R to read the xpt files and save them as csv, is to use the xport reader/writer for Python.
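As a sketch of a pure-Python conversion, the snippet below uses pandas.read_sas with format="xport" rather than the xport package mentioned above, so it is a swapped-in technique and not the method used by xpt2csv.py. The function name xpt_dir_to_csv is hypothetical, and the resulting csv files may differ in details (for example, how text columns or missing values are written) from those produced through R.

```python
from pathlib import Path


def xpt_dir_to_csv(xpt_dir, csv_dir):
    """Convert every .xpt file in xpt_dir into a .csv file in csv_dir."""
    src, dst = Path(xpt_dir), Path(csv_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        # Accept both .xpt and .XPT, since file name casing can vary.
        if not path.is_file() or path.suffix.lower() != ".xpt":
            continue
        # Imported here so the module still loads when pandas is missing.
        import pandas as pd

        frame = pd.read_sas(path, format="xport")
        out = dst / (path.stem + ".csv")
        frame.to_csv(out, index=False)
        written.append(out)
    return written
```

For example, xpt_dir_to_csv("data/sources/xpt/1999-2000", "data/sources/csv/1999-2000") would mirror the paths used in step 2.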

3) Make the metadata files. The additional argument -nodetails can be used to disable verbose output of messages:

python getweights.py 1999-2000 data/sources/csv/1999-2000 data/mirador/1999-2000/weights.xml
python makemeta.py 1999-2000 Demographics data/sources/csv/1999-2000 data/mirador/1999-2000/demo.xml -nodetails
python makemeta.py 1999-2000 Examination data/sources/csv/1999-2000 data/mirador/1999-2000/exam.xml -nodetails
python makemeta.py 1999-2000 Laboratory data/sources/csv/1999-2000 data/mirador/1999-2000/lab.xml -nodetails
python makemeta.py 1999-2000 Questionnaire data/sources/csv/1999-2000 data/mirador/1999-2000/question.xml -nodetails

Also, make sure to create the Mirador data folder, as these scripts will not create it if it is missing. In this case, the path would be data/mirador/1999-2000.
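A minimal way to create that folder from Python before running the metadata scripts could look like this. The helper name ensure_dataset_dir and its default root are illustrative assumptions.

```python
from pathlib import Path


def ensure_dataset_dir(cycle, root="data/mirador"):
    """Create root/cycle (and any missing parents) and return its path."""
    path = Path(root) / cycle
    # exist_ok avoids an error when the folder is already there.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling ensure_dataset_dir("1999-2000") from the scripts folder creates data/mirador/1999-2000 if needed.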

4) Validate metadata:

python checkmeta.py data/mirador/1999-2000/weights.xml
python checkmeta.py data/mirador/1999-2000/demo.xml
python checkmeta.py data/mirador/1999-2000/exam.xml
python checkmeta.py data/mirador/1999-2000/lab.xml
python checkmeta.py data/mirador/1999-2000/question.xml
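Since the same five metadata files appear in every validation step, the commands above can be generated rather than typed. The sketch below only builds the command lines; the names METADATA_FILES and checkmeta_commands are hypothetical, and whether checkmeta.py signals failure through its exit code is not stated here, so that should not be assumed.

```python
# The five metadata files produced in step 3, in the order validated above.
METADATA_FILES = ["weights.xml", "demo.xml", "exam.xml", "lab.xml", "question.xml"]


def checkmeta_commands(dataset_dir):
    """Return one checkmeta.py command per metadata file in dataset_dir."""
    return [
        ["python", "checkmeta.py", f"{dataset_dir}/{name}"]
        for name in METADATA_FILES
    ]
```

Each command can be run with subprocess.run(cmd), and the same helper works for the merged dataset folder by passing its path.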

5) Create aggregated data file:

python aggregate.py data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv

6) Create dictionary file:

python makedict.py data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv dictionary.tsv

7) Create groups file:

python makegroups.py data/mirador/1999-2000 demo.xml exam.xml lab.xml question.xml weights.xml groups.xml

8) Check the aggregated file against the original csv files:

python checkdata.py data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv

Up to here, the steps concern creating a single cycle dataset. Once several consecutive datasets have been generated, they can be aggregated with the following steps:

9) Merge metadata from different cycles (each step also updates weights.list):

python mergemeta.py demo.xml 1999-2010 Demographics data/mirador data/mirador/1999-2010 varequiv
python mergemeta.py exam.xml 1999-2010 Examination data/mirador data/mirador/1999-2010 varequiv
python mergemeta.py lab.xml 1999-2010 Laboratory data/mirador data/mirador/1999-2010 varequiv
python mergemeta.py question.xml 1999-2010 Questionnaire data/mirador data/mirador/1999-2010 varequiv

10) Calculate merged weights csv and weights.xml:

python makeweights.py data/mirador/1999-2010 weights.list weights.csv weights.xml
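For orientation, the general NHANES guidance for combining several two-year cycles is to divide each two-year sample weight by the number of cycles combined. The sketch below shows only that rule. It is not confirmed to be what makeweights.py implements, it ignores the special four-year weights that NHANES provides for 1999-2002, and the function name combined_weight is hypothetical.

```python
def combined_weight(two_year_weight, n_cycles):
    """Rescale a two-year sample weight for a dataset spanning n_cycles cycles."""
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    # Each cycle contributes an equal share of the combined period.
    return two_year_weight / n_cycles
```

For instance, a two-year weight of 12000 in a dataset that spans six cycles would become 2000 under this rule.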

11) Validate merged metadata:

python checkmeta.py data/mirador/1999-2010/weights.xml
python checkmeta.py data/mirador/1999-2010/demo.xml
python checkmeta.py data/mirador/1999-2010/exam.xml
python checkmeta.py data/mirador/1999-2010/lab.xml
python checkmeta.py data/mirador/1999-2010/question.xml

12) Create the merged data file, using the aggregate script again:

python aggregate.py data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv

13) Create dictionary file:

python makedict.py data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv dict.tsv

14) Create groups file:

python makegroups.py data/mirador/1999-2010 demo.xml exam.xml lab.xml question.xml weights.xml groups.xml

15) Check the aggregated merged data against the original csv files:

python checkdata.py data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv

CUSTOM HTML PARSERS

The getweights.py and makemeta.py scripts parse the online NHANES codebooks using the BeautifulSoup library. A custom HTML parser can be specified with the -parser option, chosen among the ones listed on this page. The default is html.parser; the other ones (html5lib, lxml) need to be installed separately.
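Before passing a parser name to the -parser option, it can help to check which parsers are installed in the current environment. The following sketch uses only the standard library; the function name available_parsers is hypothetical and is not part of the scripts.

```python
from importlib.util import find_spec


def available_parsers():
    """Return the BeautifulSoup parser names usable in this environment."""
    # html.parser ships with Python, so it is always available.
    parsers = ["html.parser"]
    for name in ("html5lib", "lxml"):
        # find_spec returns None when the package is not installed.
        if find_spec(name) is not None:
            parsers.append(name)
    return parsers
```

Any name returned here is also valid as the second argument of BeautifulSoup(markup, parser_name).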

ADDING/REMOVING COMPONENTS

The NHANES components to use in the parsing/aggregation can be set by editing the components file provided alongside the scripts.

APPENDIX

1) Relevant links on NHANES weighting: