schirinos · lee-b · May 8, 2013
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1 @@
+data/sr*/* -diff
diff --git a/.gitignore b/.gitignore
@@ -33,3 +33,4 @@ nosetests.xml
 .mr.developer.cfg
 .project
 .pydevproject
+nutrients.db
diff --git a/README.md b/README.md
@@ -1,4 +1,86 @@
 nutrient-db
 ===========
 
-A program to convert USDA nutrient database into various formats
+Nutrient-db is a program that converts the USDA National Nutrient Database for Standard Reference (http://www.ars.usda.gov/ba/bhnrc/ndl) from the flat files the USDA provides into a relational database and, optionally, a collection of JSON documents.
+
+Usage
+-----------------
+
+To generate an SQLite database file from the flat files included in the repo, run:
+
+<pre><code>python3 nutrientdb.py</code></pre>
+
+By default it will look in the **data/sr28** directory for the required flat files and parse them into an SQLite database file named *nutrients.db* that will be stored in the current working directory. 
+
+If the *nutrients.db* file already exists and is a valid SQLite database with partial nutrient data in it, the script will assume you have already finished parsing the flat files and will not create a new database file. To re-parse the flat files, pass the *-f* option to force recreation of the database file.
+
+Command line options are available to export the information as JSON and directly to a MongoDB database.
+
+### Command line options
+
+#### Path to flat files
+##### -p, --path [default: data/sr28]
+
+The path where the flat files to be parsed are located.
+
+<pre><code>python3 nutrientdb.py -p data/sr28</code></pre>
+
+
+#### Download
+##### -d, --download
+
+Downloads the required data files directly from the USDA site and extracts them automatically to the path specified by --path (or its default; see above).
+
+<pre><code>python3 nutrientdb.py -d</code></pre>
+
+#### Force re-parse 
+##### -f, --force
+
+Force recreation of the SQLite database from the flat files. Use this option to re-parse the data from the flat files and create a new database file. Useful if the database gets corrupted, a previous parse failed to complete, or there are changes to the flat files you want to capture in the database.
+
+<pre><code>python3 nutrientdb.py -f</code></pre>
+
+
+#### Export data as json
+##### -e, --export 
+
+Export the data as JSON by printing each document to the console. The JSON uses a custom schema in which each document represents a unique food item from the food descriptions table. All other information is attached to these individual documents.
+
+Since the program prints to standard output by default, you can redirect the output to a file, for example:
+
+<pre><code>python3 nutrientdb.py -e > nutrients.json</code></pre>
+
+#### Export to mongo
+
+To export the data to MongoDB you must provide the following options. If any option is missing (except --mport, which defaults to 27017), the program will not try to export to MongoDB.
+
+The program will always try an upsert based on the NDB_No of the food item. That means you can safely run the script multiple times to refresh existing info.
+
+<pre><code>python3 nutrientdb.py --mhost localhost --mport 27017 --mdb mydatabase --mcoll mycollection</code></pre>
+
+##### --mhost [default: localhost]
+
+The hostname of the mongo instance.
+
+##### --mport [default: 27017]
+
+The port of the mongo instance.
+
+##### --mdb
+
+Name of the mongo database to connect to.
+
+##### --mcoll
+
+Name of the collection to insert the documents into.
+
+
+Notes on Data
+-----------------
+
+The **data** directory stores the flat files to be parsed in subfolders, one for each full release of the USDA data. If you want to parse a different data set, you can add it under a subfolder in this directory and specify the path to the files as a command line option. The program looks for a specific set of files as defined by the USDA schema. If any of these files are incorrectly named or missing, parsing will fail.
+
+The schema may change between releases. The program is designed for sr25 through sr28. Modifications to the program may be needed for downloading and/or parsing previous or future releases.
+
+USDA National Nutrient Database for Standard Reference (http://www.ars.usda.gov/ba/bhnrc/ndl)
+
diff --git a/data/.gitignore b/data/.gitignore
@@ -0,0 +1,2 @@
+*
+!.gitignore