Port to python3, stop storing DBs in git, and support downloading DBs from USDA #8

Open: wants to merge 37 commits into base: master

37 commits
a778e27
Initial import of parsing code.
May 8, 2013
b16110b
Added all table to parser and raw data files along with sqlite db.
May 8, 2013
31a5f00
Better messaging and only refreshes when data doesn't exist.
May 8, 2013
3c52ddb
Moved flat file data to sub folder.
May 8, 2013
cda581b
Adding conversion to document format for mongo.
May 8, 2013
8d3c769
Adding transforms to document for foods.
May 9, 2013
c6d7050
Finished initial conversion of nutrient data to document structure.
May 9, 2013
b46cb0f
Added mongo exporting.
May 9, 2013
ab2cb82
Adding script documentation.
May 9, 2013
db83e94
More documentation tweaks.
May 9, 2013
5ee3eab
Add a bit more clarity to readme.
May 9, 2013
af16215
Readme update.
May 9, 2013
b645ed6
Readme updates.
May 9, 2013
9fc1a99
readme updates again.
May 9, 2013
a52f573
Readme updates.
May 9, 2013
eba0138
Added more command line options and further info to readme. Fixed som…
May 10, 2013
64a69a8
Added code to langual info.
May 10, 2013
85aadcd
Adjusted document dump schema.
May 28, 2013
c7af387
Modified schema again.
May 28, 2013
db9c320
Convert port to int for pymongo plugin
gesinger Mar 5, 2014
e76606f
Updated README to use nutrientdb.py instead of nutrient.py
gesinger Mar 5, 2014
a1e112d
Merge pull request #2 from gesinger/bugfix/port_to_int
schirinos May 17, 2014
206bca8
Merge pull request #3 from gesinger/bugfix/update_readme
schirinos May 17, 2014
8976061
added numeric types for any fields with an "N" type in the sr28 file …
philtom Feb 15, 2016
6a6dfad
adding sr28 ascii data files and sqlite db
philtom Feb 15, 2016
49b4445
Set exec permissions bit on nutrientdb.py
asazernik Mar 30, 2016
b9bf03d
Mark data files as binary
asazernik Mar 30, 2016
d3fd91a
Order weights by ndb sequence number
asazernik Apr 5, 2016
8b2b920
Don't convert ndb_no to int
asazernik Apr 4, 2016
b7ae369
Set default DB release to sr28
asazernik Apr 16, 2016
82bd8a4
Merge pull request #7 from asazernik/for-pull
schirinos Apr 16, 2016
1649a5f
Merge pull request #5 from philtom/master
schirinos Apr 16, 2016
6a68baa
Ignoring data files
Jul 31, 2016
9cff388
Port to python3
Jul 31, 2016
2a1d804
Merge branch 'master' of github.com:lee-b/nutrient-db
Jul 31, 2016
b7dec99
Update docs
Jul 31, 2016
8db16fa
Another small docs update
Jul 31, 2016
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
data/sr*/* -diff
1 change: 1 addition & 0 deletions .gitignore
@@ -33,3 +33,4 @@ nosetests.xml
.mr.developer.cfg
.project
.pydevproject
nutrients.db
84 changes: 83 additions & 1 deletion README.md
@@ -1,4 +1,86 @@
nutrient-db
===========

A program to convert USDA nutrient database into various formats
Nutrient-db is a program to convert the USDA National Nutrient Database for Standard Reference (http://www.ars.usda.gov/ba/bhnrc/ndl) from the flat files the USDA provides into a relational database and, optionally, a collection of JSON documents.

Usage
-----------------

To generate an SQLite database file from the flat files, run:

<pre><code>python3 nutrientdb.py</code></pre>

By default it will look in the **data/sr28** directory for the required flat files and parse them into an SQLite database file named *nutrients.db* that will be stored in the current working directory.

If the *nutrients.db* file already exists and is a valid SQLite database containing nutrient data, even partial data, the script assumes the flat files have already been parsed and does not create a new database file. To re-parse the flat files, pass the *-f* option to force recreation of the database file.
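
The skip-unless-forced behaviour described above could be sketched roughly as follows. This is only an illustration, not the code in nutrientdb.py: the function names and the `food_des` table name are assumptions made for the example.

```python
import os
import sqlite3


def database_is_populated(db_path, table="food_des"):
    """Return True if db_path is an SQLite file with at least one row in `table`.

    The table name is a hypothetical placeholder for whichever table the
    real script inspects.
    """
    # Avoid sqlite3.connect() creating an empty file as a side effect.
    if not os.path.exists(db_path):
        return False
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(f'SELECT 1 FROM "{table}" LIMIT 1').fetchone()
        return row is not None
    except sqlite3.DatabaseError:
        # Raised for a missing table as well as for a file that is not SQLite.
        return False
    finally:
        conn.close()


def should_parse(db_path, force=False):
    """Parse when forced, or when no usable data is present yet."""
    return force or not database_is_populated(db_path)
```

Note that a check like this treats a partially filled database the same as a complete one, which is why the *-f* option is needed after an interrupted parse.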

Command line options are available to export the information as JSON and directly to a MongoDB database.

### Command line options

#### Path to flat files
##### -p, --path [default: data/sr28]

The path where the flat files to be parsed are located.

<pre><code>python3 nutrientdb.py -p data/sr28</code></pre>


#### Download
##### -d, --download

Downloads the required data files directly from the USDA site and extracts them automatically to the path given by --path (or its default; see above).

<pre><code>python3 nutrientdb.py -d</code></pre>
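
A download-and-extract step of this kind can be sketched with the standard library alone. The sketch below assumes the release is published as a single zip archive; the function names are hypothetical, and the URL is deliberately left as a parameter because the actual USDA location is not stated here.

```python
import io
import os
import urllib.request
import zipfile


def fetch_archive(url):
    """Download the archive at `url` and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_archive(archive_bytes, dest_dir):
    """Unpack an in-memory zip archive into dest_dir, creating it if needed.

    Returns the sorted list of member names found in the archive.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())
```

Usage would look something like `extract_archive(fetch_archive(release_url), "data/sr28")`, where `release_url` is whatever address the caller supplies.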

#### Force re-parse
##### -f, --force

Force recreation of the SQLite database from the flat files. Use this option to re-parse the data from the flat files and create a new database file. This is useful if the database gets corrupted, a previous parse failed to complete, or there are changes to the flat files you want to capture in the database.

<pre><code>python3 nutrientdb.py -f</code></pre>


#### Export data as json
##### -e, --export

Export the data as JSON by printing each document to the console. The JSON follows a custom schema in which each document represents a unique food item from the food descriptions table. All other information is attached to these individual documents.

Since the program prints to standard out by default, you can redirect the output to a file, for example:

<pre><code>python3 nutrientdb.py -e > nutrients.json</code></pre>
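
Printing one document per food item might look roughly like the sketch below. The one-document-per-line layout, the function name, and the field names in the usage note are assumptions for illustration; the actual schema is defined by the program.

```python
import json
import sys


def export_documents(documents, stream=None):
    """Write each food document as one JSON line; return how many were written."""
    stream = stream if stream is not None else sys.stdout
    count = 0
    for document in documents:
        stream.write(json.dumps(document, sort_keys=True) + "\n")
        count += 1
    return count
```

For example, `export_documents([{"NDB_No": "01001", "nutrients": []}])` writes a single line to standard out, which is what makes shell redirection to a file work.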

#### Export to mongo

To export the data to MongoDB, you must provide the following options. If any option is missing (except --mport, which defaults to 27017), the program will not attempt the export.

The program will always try an upsert based on the NDB_No of the food item. That means you can safely run the script multiple times to refresh existing info.

<pre><code>python3 nutrientdb.py --mhost localhost --mport 27017 --mdb mydatabase --mcoll mycollection</code></pre>
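
An upsert keyed on NDB_No could be sketched as below. `replace_one(filter, replacement, upsert=True)` is a real pymongo `Collection` method, but the helper name and the assumption that NDB_No is a top-level field of each document are illustrative only.

```python
def upsert_food(collection, document, key="NDB_No"):
    """Insert the document, or replace the existing one with the same key.

    `collection` is expected to behave like a pymongo Collection.
    The top-level position of `key` in the document is an assumption.
    """
    return collection.replace_one({key: document[key]}, document, upsert=True)
```

Because the filter matches on the key alone, running the export again replaces each food item in place instead of creating duplicates.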

##### --mhost [default: localhost]

The hostname of the mongo instance.

##### --mport [default: 27017]

The port of the mongo instance.

##### --mdb

Name of the mongo database to connect to.

##### --mcoll

Name of the collection to insert the documents into.
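
Taken together, the options above could be declared with `argparse` along these lines. This is a sketch based only on the option list in this README, not the parser in nutrientdb.py; the function names are hypothetical, and `type=int` on the port mirrors the "Convert port to int" commit.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented above."""
    parser = argparse.ArgumentParser(description="Convert USDA flat files")
    parser.add_argument("-p", "--path", default="data/sr28",
                        help="directory containing the flat files")
    parser.add_argument("-d", "--download", action="store_true",
                        help="download and extract the data files into --path")
    parser.add_argument("-f", "--force", action="store_true",
                        help="recreate the SQLite database")
    parser.add_argument("-e", "--export", action="store_true",
                        help="print each food document as JSON")
    parser.add_argument("--mhost", default="localhost")
    parser.add_argument("--mport", type=int, default=27017)
    parser.add_argument("--mdb")
    parser.add_argument("--mcoll")
    return parser


def wants_mongo_export(args):
    """Export only when host, database and collection are all set."""
    return bool(args.mhost and args.mdb and args.mcoll)
```

With these defaults, `--mdb` and `--mcoll` are the two options that actually decide whether a mongo export is attempted.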


Notes on Data
-----------------

The **data** directory stores the flat files to be parsed, in one subfolder per full release of the USDA data. If you want to parse a different data set, add it under a subfolder in this directory and specify the path to the files as a command line option. The program looks for a specific set of files as defined by the USDA schema. If any of these files are incorrectly named or missing, parsing will fail.
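
A quick pre-flight check for missing files might look like the sketch below. The names listed are real files from the USDA SR releases, but the exact set the program requires is an assumption here, as is the function name.

```python
import os

# Illustrative subset of SR flat files; the real required set may differ.
REQUIRED_FILES = (
    "FOOD_DES.txt",
    "FD_GROUP.txt",
    "NUT_DATA.txt",
    "NUTR_DEF.txt",
    "WEIGHT.txt",
)


def missing_flat_files(path, required=REQUIRED_FILES):
    """Return the required file names that are not present under `path`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(path, name))]
```

An empty list means every expected file was found; on case-sensitive filesystems the names must match exactly.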

The schema may change between releases. The program is designed for sr25 through sr28. Downloading or parsing earlier or future releases may require modifications to the program.

USDA National Nutrient Database for Standard Reference (http://www.ars.usda.gov/ba/bhnrc/ndl)

2 changes: 2 additions & 0 deletions data/.gitignore
@@ -0,0 +1,2 @@
*
!.gitignore