Update README.md
aldengolab authored Mar 16, 2017
1 parent 0395986 commit 9140e4e
Showing 1 changed file with 22 additions and 6 deletions.
From the raw article text, we generate the following features:

## Pipeline

The pipeline contains four sets of Python code:

- `model.py`: a class for models
- `model_loop.py`: a class for running a loop of classifiers; takes test-train data splits and various run parameters
- `run.py`: implements the `model_loop` class; it also implements our re-sampling of the data
- `transform_features.py`: executes all feature generation
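The actual `model_loop.py` is not shown here, but the loop-of-classifiers pattern it describes can be sketched as follows. This is a minimal illustration, assuming random hyper-parameter draws per iteration; the `MajorityClass` stand-in classifier and all names are hypothetical (the real pipeline would use scikit-learn models such as Logistic Regression).

```
import random

# Toy stand-in classifier so the sketch is self-contained; purely
# illustrative -- not part of the actual pipeline.
class MajorityClass:
    def __init__(self, smoothing=0.0):
        self.smoothing = smoothing  # illustrative hyper-parameter

    def fit(self, X, y):
        # Predict the most common training label for everything.
        self.label = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        # Accuracy of always predicting the majority label.
        return sum(1 for yi in y if yi == self.label) / len(y)

def model_loop(classifiers, param_grids, train, test, iterations=2, seed=0):
    """For each classifier, draw `iterations` random hyper-parameter
    settings, fit on the train split, and record test-set accuracy."""
    rng = random.Random(seed)
    results = []
    for name, factory in classifiers.items():
        for _ in range(iterations):
            params = {k: rng.choice(v) for k, v in param_grids[name].items()}
            clf = factory(**params).fit(*train)
            results.append({"model": name, "params": params,
                            "accuracy": clf.score(*test)})
    return results

train = ([[0], [1], [2]], [0, 0, 1])
test = ([[3], [4]], [0, 1])
results = model_loop({"MAJ": MajorityClass},
                     {"MAJ": {"smoothing": [0.0, 0.5]}},
                     train, test)
```

Each entry in `results` records the model name, the sampled parameters, and the test-set score, which is the shape of output that would be written to a results csv for analysis.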

Running the code is done through `run.py` with the following options:

1. `--models`: models to run
2. `--iterations`: number of tuning parameter iterations to run per model
3. `--output_dir`: a directory name to store the output; the system will create the directory
4. `--dedupe`: whether to look for and remove duplicate content
5. `--features`: which feature set to run with, options include:

- `both_only`: runs both PCFG and TFIDF
- `grammar_only`: runs only PCFG
- `tfidf_only`: runs only TFIDF
- `all`: runs all the above
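The options above map naturally onto a standard command-line parser. As a rough sketch of what `run.py`'s interface might look like, assuming `argparse` with the flags listed (the actual types and defaults live in `run.py` and may differ):

```
import argparse

def build_parser():
    # Hypothetical reconstruction of run.py's CLI from the option
    # list above; not the actual implementation.
    parser = argparse.ArgumentParser(
        description="Run the article classification pipeline.")
    parser.add_argument("data", help="path to the input datafile")
    parser.add_argument("--models", nargs="+", default=["LR"],
                        help="models to run, e.g. LR SGD")
    parser.add_argument("--iterations", type=int, default=50,
                        help="tuning parameter iterations per model")
    parser.add_argument("--output_dir", default="output",
                        help="directory to store the output (created by the system)")
    parser.add_argument("--dedupe", action="store_true",
                        help="look for and remove duplicate content")
    parser.add_argument("--features", default="all",
                        choices=["both_only", "grammar_only", "tfidf_only", "all"],
                        help="which feature set to run with")
    return parser

args = build_parser().parse_args(
    ["data.csv", "--models", "LR", "SGD", "--dedupe", "--features", "both_only"])
```

With `nargs="+"`, `--models LR SGD` arrives as a list, and `action="store_true"` makes `--dedupe` a simple boolean switch.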

## Example Pipeline Run

To execute the pipeline with Logistic Regression and Stochastic Gradient Descent, navigate to the pipeline directory and run:

```
source activate amlpp
python run.py /path/to/data --models LR SGD --iterations 50 --output_dir run_name --dedupe --reduce 500 --features both_only
```
This is encapsulated in the `run.sh` file.

The first argument is the path to the input datafile. The pipeline assumes that the text of each article is unique. If your texts are not unique, use the `--dedupe` flag to automatically remove duplicated articles during preprocessing. For a description of all arguments, consult the help output of `run.py`.
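The deduplication step can be sketched as keeping the first occurrence of each unique article text and dropping exact duplicates. This is an illustrative sketch, not the pipeline's actual code; the `text` field name is an assumption.

```
def dedupe_articles(articles, key="text"):
    # Keep the first row seen for each unique article text,
    # dropping later exact duplicates.
    seen = set()
    unique = []
    for row in articles:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

rows = [{"text": "article a"}, {"text": "article b"}, {"text": "article a"}]
deduped = dedupe_articles(rows)
```

An equivalent one-liner in pandas would be `df.drop_duplicates(subset="text")`.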

