Add testing (#14)
qubixes authored Mar 26, 2020
1 parent 2efd014 commit 5535fe5
Showing 16 changed files with 486 additions and 172 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/ci-workflow.yml
@@ -0,0 +1,42 @@
name: test-suite
on: [push, pull_request]
jobs:
  test-master:
    name: pytest
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          path: asr-hyper
      - uses: actions/setup-python@v1
        with:
          python-version: '3.6'  # Version range or exact version of a Python version to use, in SemVer range syntax.
          architecture: 'x64'    # (x64 or x86)
      - name: Install packages and run tests
        run: |
          pip install pytest
          pip install --upgrade "setuptools>=41.0.0"
          git clone https://github.com/asreview/asreview.git
          pip install ./asreview[all]
          pip install ./asr-hyper
          pytest asr-hyper/tests
  # test-older:
  #   name: pytest
  #   runs-on: ubuntu-latest
  #   strategy:
  #     matrix:
  #       asr_versions: ['0.7.2']
  #   steps:
  #     - uses: actions/checkout@v2
  #     - uses: actions/setup-python@v1
  #       with:
  #         python-version: '3.6'  # Version range or exact version of a Python version to use, in SemVer range syntax.
  #         architecture: 'x64'    # (x64 or x86)
  #     - name: Install packages and run tests
  #       run: |
  #         pip install pytest
  #         pip install --upgrade "setuptools>=41.0.0"
  #         pip install asreview[all]==${{ matrix.asr_versions }}
  #         pip install .
  #         pytest tests
64 changes: 33 additions & 31 deletions README.md
@@ -1,6 +1,6 @@
## ASReview-hyperopt

![Deploy and release](https://github.com/msdslab/asreview-hyperopt/workflows/Deploy%20and%20release/badge.svg)
![Deploy and release](https://github.com/asreview/asreview-hyperopt/workflows/Deploy%20and%20release/badge.svg)![Build status](https://github.com/asreview/asreview-hyperopt/workflows/test-suite/badge.svg)

Hyper parameter optimization extension for
[ASReview](https://github.com/asreview/asreview). It uses the
@@ -11,7 +11,7 @@ automatically used for hyper parameter optimization.

### Installation

The easiest way to install the visualization package is to use the command line:
The easiest way to install the hyper parameter optimization package is to use the command line:

``` bash
pip install asreview-hyperopt
@@ -45,15 +45,29 @@ asreview hyper-active --help
Which results in the following options:

```bash
usage: /Users/qubix/Library/Python/3.6/bin/asreview [-h] [-m MODEL]
[-q QUERY_STRATEGY]
[-b BALANCE_STRATEGY]
[-e FEATURE_EXTRACTION]
[-n N_ITER] [-d DATASETS]
[--mpi]
usage: hyper-active [-h] [-n N_ITER] [-r N_RUN] [-d DATASETS] [--mpi]
[--data_dir DATA_DIR] [--output_dir OUTPUT_DIR]
[--server_job] [-m MODEL] [-q QUERY_STRATEGY]
[-b BALANCE_STRATEGY] [-e FEATURE_EXTRACTION]

optional arguments:
-h, --help show this help message and exit
-n N_ITER, --n_iter N_ITER
Number of iterations of Bayesian Optimization.
-r N_RUN, --n_run N_RUN
Number of runs per dataset.
-d DATASETS, --datasets DATASETS
Datasets to use in the hyper parameter optimization
Separate by commas to use multiple at the same time
[default: all].
--mpi Use the mpi implementation.
--data_dir DATA_DIR Base directory with data files.
--output_dir OUTPUT_DIR
Output directory for trials.
--server_job Run job on the server. It will incur less overhead of
used CPUs, but more latency of workers waiting for the
server to finish its own job. Only makes sense in
combination with the flag --mpi.
-m MODEL, --model MODEL
Prediction model for active learning.
-q QUERY_STRATEGY, --query_strategy QUERY_STRATEGY
@@ -62,22 +76,16 @@ optional arguments:
Balance strategy for active learning.
-e FEATURE_EXTRACTION, --feature_extraction FEATURE_EXTRACTION
Feature extraction method.
-n N_ITER, --n_iter N_ITER
Number of iterations of Bayesian Optimization.
-d DATASETS, --datasets DATASETS
Datasets to use in the hyper parameter optimization
Separate by commas to use multiple at the same time
[default: all].
--mpi Use the mpi implementation.

```
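The comma-separated `--datasets` handling documented above can be exercised without the full ASReview stack; here is a minimal argparse sketch of the documented flags (defaults taken from the help text, wiring to the job runner omitted, and `ace,ptsd` is just an illustrative dataset list):

```python
import argparse

# Minimal sketch of the flags shown in the help text above; defaults follow
# the documented values, everything else (MPI wiring, job runner) is omitted.
parser = argparse.ArgumentParser(prog="hyper-active")
parser.add_argument("-n", "--n_iter", type=int, default=1,
                    help="Number of iterations of Bayesian Optimization.")
parser.add_argument("-r", "--n_run", type=int, default=8,
                    help="Number of runs per dataset.")
parser.add_argument("-d", "--datasets", type=str, default="all",
                    help="Datasets, separated by commas [default: all].")
parser.add_argument("--mpi", dest="use_mpi", action="store_true",
                    help="Use the mpi implementation.")

args = parser.parse_args(["-d", "ace,ptsd", "-r", "4"])
# Multiple datasets arrive as one comma-separated string:
dataset_names = args.datasets.split(",")
```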

### Data structure

The extension will search for datasets in the `data` directory, relative to the current
working directory, so put your datasets there.
The extension will by default search for datasets in the `data` directory, relative to the current
working directory. Either put your datasets there, or specify a data directory.

The output of the runs will be stored in the `output` directory, again relative to the current path.
The output of the runs will by default be stored in the `output` directory, relative to
the current path.

An example of a structure that has been created:

@@ -161,20 +169,14 @@ The hyperopt extension has built-in support for MPI. MPI is used for parallelization. On
a local PC with an MPI-implementation (like OpenMPI) installed, one could run with 4 cores:

```bash
mpirun -n 4 asreview hyper-active
mpirun -n 4 asreview hyper-active --mpi
```

On super computers one should sometimes replace `mpirun` with `srun`.


If you want to be slightly more efficient on a machine with a low number of cores, you can run
jobs on the MPI server as well:

```bash
mpirun -n 4 asreview hyper-active --mpi --server_job
```

### Time measurements:

#### inactive

nb, tfidf, double, max -> 53 seconds
svm, tfidf, double, max -> 1940 seconds
rf, tfidf, double, max -> 80 seconds
logistic, tfidf, double, max -> 250 seconds /4
dense_nn, tfidf, double, max -> ?
dense_nn, doc2vec, double, max -> 2750 seconds /1, /2
svm, doc2vec, ...
2 changes: 1 addition & 1 deletion asreviewcontrib/hyperopt/__init__.py
@@ -18,5 +18,5 @@
from asreviewcontrib.hyperopt.show_trials import ShowTrialsEntryPoint
from asreviewcontrib.hyperopt.create_config import CreateConfigEntryPoint

__version__ = "0.1.4"
__version__ = "0.2.0"
__extension_name__ = "asreview-hyperopt"
52 changes: 12 additions & 40 deletions asreviewcontrib/hyperopt/active.py
@@ -16,13 +16,13 @@
import argparse
import logging

from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
from asreview.entry_points import BaseEntryPoint

from asreviewcontrib.hyperopt.serial_executor import serial_executor
from asreviewcontrib.hyperopt.serial_executor import serial_hyper_optimize
from asreviewcontrib.hyperopt.job_utils import get_data_names
from asreviewcontrib.hyperopt.job_utils import get_data_names,\
_base_parse_arguments
from asreviewcontrib.hyperopt.active_job import ActiveJobRunner
from asreview.entry_points import BaseEntryPoint


class HyperActiveEntryPoint(BaseEntryPoint):
@@ -43,7 +43,7 @@ def execute(self, argv):


def _parse_arguments():
parser = argparse.ArgumentParser(prog=sys.argv[0])
parser = _base_parse_arguments(prog="hyper-active")
parser.add_argument(
"-m", "--model",
type=str,
@@ -67,39 +67,6 @@ def _parse_arguments():
type=str,
default="tfidf",
help="Feature extraction method.")
parser.add_argument(
"-n", "--n_iter",
type=int,
default=1,
help="Number of iterations of Bayesian Optimization."
)
parser.add_argument(
"-r", "--n_run",
type=int,
default=8,
help="Number of runs per dataset."
)
parser.add_argument(
"-d", "--datasets",
type=str,
default="all",
help="Datasets to use in the hyper parameter optimization "
"Separate by commas to use multiple at the same time [default: all].",
)
parser.add_argument(
"--mpi",
dest='use_mpi',
action='store_true',
help="Use the mpi implementation.",
)
parser.add_argument(
"--server_job",
dest='server_job',
action='store_true',
help='Run job on the server. It will incur less overhead of used CPUs,'
' but more latency of workers waiting for the server to finish its own'
' job. Only makes sense in combination with the flag --mpi.'
)
return parser
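The deleted flags above now live in a shared `_base_parse_arguments` helper that every entry point builds on instead of repeating the common options. A minimal sketch of that pattern (flag set abbreviated; the `-m` default of `nb` is an assumption, since the diff elides it):

```python
import argparse

def _base_parse_arguments(prog):
    """Parser with the options shared by all hyperopt entry points."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("-n", "--n_iter", type=int, default=1,
                        help="Number of iterations of Bayesian Optimization.")
    parser.add_argument("-d", "--datasets", type=str, default="all",
                        help="Datasets, separated by commas [default: all].")
    parser.add_argument("--mpi", dest="use_mpi", action="store_true",
                        help="Use the mpi implementation.")
    return parser

# Each subcommand then adds only its own flags on top of the shared base:
parser = _base_parse_arguments(prog="hyper-active")
parser.add_argument("-m", "--model", type=str, default="nb",  # default assumed
                    help="Prediction model for active learning.")

args = parser.parse_args(["--mpi", "-m", "svm"])
```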


@@ -115,19 +82,24 @@ def main(argv=sys.argv[1:]):
use_mpi = args["use_mpi"]
n_run = args["n_run"]
server_job = args["server_job"]
data_dir = args["data_dir"]
output_dir = args["output_dir"]

data_names = get_data_names(datasets)
data_names = get_data_names(datasets, data_dir=data_dir)
if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
executor = mpi_executor
else:
executor = serial_executor

job_runner = ActiveJobRunner(
data_names, model_name=model_name, query_name=query_name,
balance_name=balance_name, feature_name=feature_name,
executor=executor, n_run=n_run, server_job=server_job)
executor=executor, n_run=n_run, server_job=server_job,
data_dir=data_dir, output_dir=output_dir)

if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
mpi_hyper_optimize(job_runner, n_iter)
else:
serial_hyper_optimize(job_runner, n_iter)
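Note that the `mpi_executor` import moved from the top of the module into the `if use_mpi:` branch, so a plain serial run never imports mpi4py. A minimal sketch of that deferred-import pattern (the executor signature here is simplified relative to the real modules):

```python
def serial_executor(jobs, runner):
    """Fallback: run all jobs sequentially in the current process."""
    return [runner(job) for job in jobs]

def choose_executor(use_mpi):
    if use_mpi:
        # Deferred import, as in the diff above: the MPI executor (and thus
        # mpi4py) is only loaded when the user actually passes --mpi.
        from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
        return mpi_executor
    return serial_executor

executor = choose_executor(use_mpi=False)
results = executor([1, 2, 3], runner=lambda x: x * 2)
```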
7 changes: 4 additions & 3 deletions asreviewcontrib/hyperopt/active_job.py
@@ -40,12 +40,13 @@ class ActiveJobRunner():
def __init__(self, data_names, model_name, query_name, balance_name,
feature_name, executor=serial_executor,
n_run=8, n_papers=1502, n_instances=50, n_included=1,
n_excluded=1, server_job=False):
n_excluded=1, server_job=False, data_dir="data",
output_dir=None):

self.trials_dir, self.trials_fp = get_trial_fp(
data_names, model_name=model_name, balance_name=balance_name,
query_name=query_name, feature_name=feature_name,
hyper_type="active")
hyper_type="active", output_dir=output_dir)

self.feature_name = feature_name
self.balance_name = balance_name
@@ -61,7 +62,7 @@ def __init__(self, data_names, model_name, query_name, balance_name,
self.n_excluded = n_excluded

self.server_job = server_job
self.data_dir = "data"
self.data_dir = data_dir
self._cache = {data_name: {"priors": {}}
for data_name in data_names}
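The constructor now threads `data_dir` and `output_dir` through instead of hard-coding `"data"`. A small sketch of the default resolution (the `resolve_dirs` helper is hypothetical; the real code routes `output_dir` through `get_trial_fp`, and the `output` fallback follows the README's description):

```python
from pathlib import Path

def resolve_dirs(data_dir="data", output_dir=None):
    # Defaults mirror the diff above: read datasets from data/ relative to
    # the cwd, and fall back to output/ when no output_dir is given.
    data = Path(data_dir)
    output = Path(output_dir) if output_dir is not None else Path("output")
    return data, output

data, output = resolve_dirs()
custom_data, custom_out = resolve_dirs("my_data", "trials_out")
```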

53 changes: 12 additions & 41 deletions asreviewcontrib/hyperopt/cluster.py
@@ -18,11 +18,10 @@

from asreview.entry_points import BaseEntryPoint

from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
from asreviewcontrib.hyperopt.serial_executor import serial_executor
from asreviewcontrib.hyperopt.serial_executor import serial_hyper_optimize
from asreviewcontrib.hyperopt.job_utils import get_data_names
from asreviewcontrib.hyperopt.job_utils import get_data_names,\
_base_parse_arguments
from asreviewcontrib.hyperopt.cluster_job import ClusterJobRunner


@@ -43,46 +42,12 @@ def execute(self, argv):


def _parse_arguments():
parser = argparse.ArgumentParser(prog=sys.argv[0])
parser = _base_parse_arguments("hyper-cluster")
parser.add_argument(
"-e", "--feature_extraction",
type=str,
default="doc2vec",
help="Feature extraction method.")
parser.add_argument(
"-n", "--n_iter",
type=int,
default=1,
help="Number of iterations of Bayesian Optimization."
)
parser.add_argument(
"-d", "--datasets",
type=str,
default="all",
help="Datasets to use in the hyper parameter optimization "
"Separate by commas to use multiple at the same time [default: all].",
)
parser.add_argument(
"--mpi",
dest='use_mpi',
action='store_true',
help="Use the mpi implementation.",
)
parser.add_argument(
"-r", "--n_run",
type=int,
default=8,
help="Number of runs per dataset."
)
parser.add_argument(
"--server_job",
dest='server_job',
action='store_true',
help='Run job on the server. It will incur less overhead of used CPUs,'
' but more latency of workers waiting for the server to finish its own'
' job. Only makes sense in combination with the flag --mpi.'
)

return parser


@@ -95,17 +60,23 @@ def main(argv=sys.argv[1:]):
use_mpi = args["use_mpi"]
n_run = args["n_run"]
server_job = args["server_job"]
data_dir = args["data_dir"]
output_dir = args["output_dir"]

data_names = get_data_names(datasets)
data_names = get_data_names(datasets, data_dir=data_dir)
if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
executor = mpi_executor
else:
executor = serial_executor

job_runner = ClusterJobRunner(data_names, feature_name, executor=executor,
n_cluster_run=n_run, server_job=server_job)
job_runner = ClusterJobRunner(
data_names, feature_name, executor=executor,
n_cluster_run=n_run, server_job=server_job,
data_dir=data_dir, output_dir=output_dir)

if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
mpi_hyper_optimize(job_runner, n_iter)
else:
serial_hyper_optimize(job_runner, n_iter)
