diff --git a/README.md b/README.md
index 1645a1b..2004bfd 100644
--- a/README.md
+++ b/README.md
@@ -25,19 +25,35 @@ From the raw article text, we generate the following features:
 
 ## Pipeline
 
-The pipeline contains two sets of python code:
+The pipeline contains four sets of Python code:
 
-- `test_train.py`: this code takes a set of labeled cases with features and builds out test & train models. It can be used to run cross-validation on a selected model or to run the model loop.
-- `model_loop.py`: this code takes the output csvs from data_prep.py and runs them through various classification models. Results of those models performance on test sets is sent to an output csv for analysis.
+- `model.py`: a class for models
+- `model_loop.py`: a class for running a loop of classifiers; takes test-train data splits and various run parameters
+- `run.py`: the entry point; it invokes the `model_loop` class and also implements our re-sampling of the data
+- `transform_features.py`: this file executes all feature generation
 
-We implement these two files differently via shell code depending on end goal. Each implementation will be named with its end goal in mind.
+Run the pipeline through `run.py` with the following options:
+
+1. `--models`: models to run
+2. `--iterations`: number of tuning parameter iterations to run per model
+3. `--output_dir`: a directory name to store the output; the pipeline will create the directory
+4. `--dedupe`: whether to look for and remove duplicate content
+5. `--features`: which feature set to run with; options include:
+
+   - `both_only`: runs both PCFG and TFIDF
+   - `grammar_only`: runs only PCFG
+   - `tfidf_only`: runs only TFIDF
+   - `all`: runs all of the above
 
 ## Example Pipeline Run
 
-To execute the pipeline with Logistic Regression and Naive Bayes, navigate to the pipeline directory, run:
+To execute the pipeline with Logistic Regression and Stochastic Gradient Descent, navigate to the pipeline directory and run:
+
 ```
-python run.py ../articles_deduped.csv --models LR NB
+source activate amlpp
+python run.py /path/to/data --models LR SGD --iterations 50 --output_dir run_name --dedupe --reduce 500 --features both_only
 ```
+
+This is encapsulated in the `run.sh` file. The first argument is the path to the input data file. The pipeline assumes that the text of each article is unique; if your texts are not unique, use the `--dedupe` flag to automatically remove duplicated articles during preprocessing. To see a description of all arguments, run:
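
For reference, below is a minimal sketch of how `run.py` might wire up the flags documented in this hunk using `argparse`. It is an assumption for illustration, not the project's actual parser: the flag names follow the README, while the defaults and the `data_path` positional are guesses, and the undocumented `--reduce` flag from the example command is omitted.

```python
# Hypothetical sketch of run.py's CLI, inferred from the README.
# Flag names match the documented options; defaults are assumptions.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Run the classification pipeline over a set of articles."
    )
    # Positional path to the input data file (assumed from the example command).
    parser.add_argument("data_path", help="path to the input data file")
    parser.add_argument("--models", nargs="+", default=["LR"],
                        help="models to run, e.g. LR SGD")
    parser.add_argument("--iterations", type=int, default=50,
                        help="number of tuning parameter iterations per model")
    parser.add_argument("--output_dir", required=True,
                        help="output directory; created by the pipeline")
    parser.add_argument("--dedupe", action="store_true",
                        help="look for and remove duplicate content")
    parser.add_argument("--features", default="both_only",
                        choices=["both_only", "grammar_only", "tfidf_only", "all"],
                        help="which feature set to run with")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args)  # e.g. invoked as: python run.py /path/to/data --models LR SGD
```

With a parser shaped like this, `python run.py -h` would print the description of all arguments that the final paragraph of the hunk refers to.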