Add testing (#14)
qubixes authored Mar 26, 2020
1 parent 2efd014 commit 5535fe5
Showing 16 changed files with 486 additions and 172 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/ci-workflow.yml
@@ -0,0 +1,42 @@
name: test-suite
on: [push, pull_request]
jobs:
  test-master:
    name: pytest
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          path: asr-hyper
      - uses: actions/setup-python@v1
        with:
          python-version: '3.6'  # Version range or exact version of a Python version to use, in SemVer range syntax.
          architecture: 'x64'    # (x64 or x86)
      - name: Install packages and run tests
        run: |
          pip install pytest
          pip install --upgrade "setuptools>=41.0.0"
          git clone https://github.com/asreview/asreview.git
          pip install ./asreview[all]
          pip install ./asr-hyper
          pytest asr-hyper/tests
  # test-older:
  #   name: pytest
  #   runs-on: ubuntu-latest
  #   strategy:
  #     matrix:
  #       asr_versions: ['0.7.2']
  #   steps:
  #     - uses: actions/checkout@v2
  #     - uses: actions/setup-python@v1
  #       with:
  #         python-version: '3.6'  # Version range or exact version of a Python version to use, in SemVer range syntax.
  #         architecture: 'x64'    # (x64 or x86)
  #     - name: Install packages and run tests
  #       run: |
  #         pip install pytest
  #         pip install --upgrade "setuptools>=41.0.0"
  #         pip install asreview[all]==${{ matrix.asr_versions }}
  #         pip install .
  #         pytest tests
64 changes: 33 additions & 31 deletions README.md
@@ -1,6 +1,6 @@
## ASReview-hyperopt

![Deploy and release](https://github.com/msdslab/asreview-hyperopt/workflows/Deploy%20and%20release/badge.svg)
![Deploy and release](https://github.com/asreview/asreview-hyperopt/workflows/Deploy%20and%20release/badge.svg)![Build status](https://github.com/asreview/asreview-hyperopt/workflows/test-suite/badge.svg)

Hyper parameter optimization extension for
[ASReview](https://github.com/asreview/asreview). It uses the
@@ -11,7 +11,7 @@ automatically used for hyper parameter optimization.

### Installation

The easiest way to install the visualization package is to use the command line:
The easiest way to install the hyper parameter optimization package is to use the command line:

``` bash
pip install asreview-hyperopt
@@ -45,15 +45,29 @@ asreview hyper-active --help
Which results in the following options:

```bash
usage: /Users/qubix/Library/Python/3.6/bin/asreview [-h] [-m MODEL]
[-q QUERY_STRATEGY]
[-b BALANCE_STRATEGY]
[-e FEATURE_EXTRACTION]
[-n N_ITER] [-d DATASETS]
[--mpi]
usage: hyper-active [-h] [-n N_ITER] [-r N_RUN] [-d DATASETS] [--mpi]
[--data_dir DATA_DIR] [--output_dir OUTPUT_DIR]
[--server_job] [-m MODEL] [-q QUERY_STRATEGY]
[-b BALANCE_STRATEGY] [-e FEATURE_EXTRACTION]

optional arguments:
-h, --help show this help message and exit
-n N_ITER, --n_iter N_ITER
Number of iterations of Bayesian Optimization.
-r N_RUN, --n_run N_RUN
Number of runs per dataset.
-d DATASETS, --datasets DATASETS
Datasets to use in the hyper parameter optimization
Separate by commas to use multiple at the same time
[default: all].
--mpi Use the mpi implementation.
--data_dir DATA_DIR Base directory with data files.
--output_dir OUTPUT_DIR
Output directory for trials.
--server_job Run job on the server. It will incur less overhead of
used CPUs, but more latency of workers waiting for the
server to finish its own job. Only makes sense in
combination with the flag --mpi.
-m MODEL, --model MODEL
Prediction model for active learning.
-q QUERY_STRATEGY, --query_strategy QUERY_STRATEGY
@@ -62,22 +76,16 @@ optional arguments:
Balance strategy for active learning.
-e FEATURE_EXTRACTION, --feature_extraction FEATURE_EXTRACTION
Feature extraction method.
-n N_ITER, --n_iter N_ITER
Number of iterations of Bayesian Optimization.
-d DATASETS, --datasets DATASETS
Datasets to use in the hyper parameter optimization
Separate by commas to use multiple at the same time
[default: all].
--mpi Use the mpi implementation.

```
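The comma-separated `--datasets` handling documented above can be exercised without the full ASReview stack; here is a minimal argparse sketch of the documented flags (defaults taken from the help text, wiring to the job runner omitted, and `ace,ptsd` is just an illustrative dataset list):

```python
import argparse

# Minimal sketch of the flags shown in the help text above; defaults follow
# the documented values, everything else (MPI wiring, job runner) is omitted.
parser = argparse.ArgumentParser(prog="hyper-active")
parser.add_argument("-n", "--n_iter", type=int, default=1,
                    help="Number of iterations of Bayesian Optimization.")
parser.add_argument("-r", "--n_run", type=int, default=8,
                    help="Number of runs per dataset.")
parser.add_argument("-d", "--datasets", type=str, default="all",
                    help="Datasets, separated by commas [default: all].")
parser.add_argument("--mpi", dest="use_mpi", action="store_true",
                    help="Use the mpi implementation.")

args = parser.parse_args(["-d", "ace,ptsd", "-r", "4"])
# Multiple datasets arrive as one comma-separated string:
dataset_names = args.datasets.split(",")
```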

### Data structure

The extension will search for datasets in the `data` directory, relative to the current
working directory, so put your datasets there.
The extension will by default search for datasets in the `data` directory, relative to the current
working directory. Either put your datasets there, or specify a data directory.

The output of the runs will be stored in the `output` directory, again relative to the current path.
The output of the runs will by default be stored in the `output` directory, relative to
the current path.

An example of a structure that has been created:

@@ -161,20 +169,14 @@ The hyperopt extension has built-in support for MPI. MPI is used for parallelization. On
a local PC with an MPI-implementation (like OpenMPI) installed, one could run with 4 cores:

```bash
mpirun -n 4 asreview hyper-active
mpirun -n 4 asreview hyper-active --mpi
```

On super computers one should sometimes replace `mpirun` with `srun`.


If you want to be slightly more efficient on a machine with a low number of cores, you can run
jobs on the MPI server as well:

```bash
mpirun -n 4 asreview hyper-active --mpi --server_job
```

### Time measurements:

#### inactive

nb, tfidf, double, max -> 53 seconds
svm, tfidf, double, max -> 1940 seconds
rf, tfidf, double, max -> 80 seconds
logistic, tfidf, double, max -> 250 seconds /4
dense_nn, tfidf, double, max -> ?
dense_nn, doc2vec, double, max -> 2750 seconds /1, /2
svm, doc2vec, ...
2 changes: 1 addition & 1 deletion asreviewcontrib/hyperopt/__init__.py
@@ -18,5 +18,5 @@
from asreviewcontrib.hyperopt.show_trials import ShowTrialsEntryPoint
from asreviewcontrib.hyperopt.create_config import CreateConfigEntryPoint

__version__ = "0.1.4"
__version__ = "0.2.0"
__extension_name__ = "asreview-hyperopt"
52 changes: 12 additions & 40 deletions asreviewcontrib/hyperopt/active.py
@@ -16,13 +16,13 @@
import argparse
import logging

from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
from asreview.entry_points import BaseEntryPoint

from asreviewcontrib.hyperopt.serial_executor import serial_executor
from asreviewcontrib.hyperopt.serial_executor import serial_hyper_optimize
from asreviewcontrib.hyperopt.job_utils import get_data_names
from asreviewcontrib.hyperopt.job_utils import get_data_names,\
_base_parse_arguments
from asreviewcontrib.hyperopt.active_job import ActiveJobRunner
from asreview.entry_points import BaseEntryPoint


class HyperActiveEntryPoint(BaseEntryPoint):
@@ -43,7 +43,7 @@ def execute(self, argv):


def _parse_arguments():
parser = argparse.ArgumentParser(prog=sys.argv[0])
parser = _base_parse_arguments(prog="hyper-active")
parser.add_argument(
"-m", "--model",
type=str,
@@ -67,39 +67,6 @@ def _parse_arguments():
type=str,
default="tfidf",
help="Feature extraction method.")
parser.add_argument(
"-n", "--n_iter",
type=int,
default=1,
help="Number of iterations of Bayesian Optimization."
)
parser.add_argument(
"-r", "--n_run",
type=int,
default=8,
help="Number of runs per dataset."
)
parser.add_argument(
"-d", "--datasets",
type=str,
default="all",
help="Datasets to use in the hyper parameter optimization "
"Separate by commas to use multiple at the same time [default: all].",
)
parser.add_argument(
"--mpi",
dest='use_mpi',
action='store_true',
help="Use the mpi implementation.",
)
parser.add_argument(
"--server_job",
dest='server_job',
action='store_true',
help='Run job on the server. It will incur less overhead of used CPUs,'
' but more latency of workers waiting for the server to finish its own'
' job. Only makes sense in combination with the flag --mpi.'
)
return parser
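The deleted flags above now live in a shared `_base_parse_arguments` helper that every entry point builds on instead of repeating the common options. A minimal sketch of that pattern (flag set abbreviated; the `-m` default of `nb` is an assumption, since the diff elides it):

```python
import argparse

def _base_parse_arguments(prog):
    """Parser with the options shared by all hyperopt entry points."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("-n", "--n_iter", type=int, default=1,
                        help="Number of iterations of Bayesian Optimization.")
    parser.add_argument("-d", "--datasets", type=str, default="all",
                        help="Datasets, separated by commas [default: all].")
    parser.add_argument("--mpi", dest="use_mpi", action="store_true",
                        help="Use the mpi implementation.")
    return parser

# Each subcommand then adds only its own flags on top of the shared base:
parser = _base_parse_arguments(prog="hyper-active")
parser.add_argument("-m", "--model", type=str, default="nb",  # default assumed
                    help="Prediction model for active learning.")

args = parser.parse_args(["--mpi", "-m", "svm"])
```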


@@ -115,19 +82,24 @@ def main(argv=sys.argv[1:]):
use_mpi = args["use_mpi"]
n_run = args["n_run"]
server_job = args["server_job"]
data_dir = args["data_dir"]
output_dir = args["output_dir"]

data_names = get_data_names(datasets)
data_names = get_data_names(datasets, data_dir=data_dir)
if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
executor = mpi_executor
else:
executor = serial_executor

job_runner = ActiveJobRunner(
data_names, model_name=model_name, query_name=query_name,
balance_name=balance_name, feature_name=feature_name,
executor=executor, n_run=n_run, server_job=server_job)
executor=executor, n_run=n_run, server_job=server_job,
data_dir=data_dir, output_dir=output_dir)

if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
mpi_hyper_optimize(job_runner, n_iter)
else:
serial_hyper_optimize(job_runner, n_iter)
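Note that the `mpi_executor` import moved from the top of the module into the `if use_mpi:` branch, so a plain serial run never imports mpi4py. A minimal sketch of that deferred-import pattern (the executor signature here is simplified relative to the real modules):

```python
def serial_executor(jobs, runner):
    """Fallback: run all jobs sequentially in the current process."""
    return [runner(job) for job in jobs]

def choose_executor(use_mpi):
    if use_mpi:
        # Deferred import, as in the diff above: the MPI executor (and thus
        # mpi4py) is only loaded when the user actually passes --mpi.
        from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
        return mpi_executor
    return serial_executor

executor = choose_executor(use_mpi=False)
results = executor([1, 2, 3], runner=lambda x: x * 2)
```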
7 changes: 4 additions & 3 deletions asreviewcontrib/hyperopt/active_job.py
@@ -40,12 +40,13 @@ class ActiveJobRunner():
def __init__(self, data_names, model_name, query_name, balance_name,
feature_name, executor=serial_executor,
n_run=8, n_papers=1502, n_instances=50, n_included=1,
n_excluded=1, server_job=False):
n_excluded=1, server_job=False, data_dir="data",
output_dir=None):

self.trials_dir, self.trials_fp = get_trial_fp(
data_names, model_name=model_name, balance_name=balance_name,
query_name=query_name, feature_name=feature_name,
hyper_type="active")
hyper_type="active", output_dir=output_dir)

self.feature_name = feature_name
self.balance_name = balance_name
@@ -61,7 +62,7 @@ def __init__(self, data_names, model_name, query_name, balance_name,
self.n_excluded = n_excluded

self.server_job = server_job
self.data_dir = "data"
self.data_dir = data_dir
self._cache = {data_name: {"priors": {}}
for data_name in data_names}
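The constructor now threads `data_dir` and `output_dir` through instead of hard-coding `"data"`. A small sketch of the default resolution (the `resolve_dirs` helper is hypothetical; the real code routes `output_dir` through `get_trial_fp`, and the `output` fallback follows the README's description):

```python
from pathlib import Path

def resolve_dirs(data_dir="data", output_dir=None):
    # Defaults mirror the diff above: read datasets from data/ relative to
    # the cwd, and fall back to output/ when no output_dir is given.
    data = Path(data_dir)
    output = Path(output_dir) if output_dir is not None else Path("output")
    return data, output

data, output = resolve_dirs()
custom_data, custom_out = resolve_dirs("my_data", "trials_out")
```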

53 changes: 12 additions & 41 deletions asreviewcontrib/hyperopt/cluster.py
@@ -18,11 +18,10 @@

from asreview.entry_points import BaseEntryPoint

from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
from asreviewcontrib.hyperopt.serial_executor import serial_executor
from asreviewcontrib.hyperopt.serial_executor import serial_hyper_optimize
from asreviewcontrib.hyperopt.job_utils import get_data_names
from asreviewcontrib.hyperopt.job_utils import get_data_names,\
_base_parse_arguments
from asreviewcontrib.hyperopt.cluster_job import ClusterJobRunner


@@ -43,46 +42,12 @@ def execute(self, argv):


def _parse_arguments():
parser = argparse.ArgumentParser(prog=sys.argv[0])
parser = _base_parse_arguments("hyper-cluster")
parser.add_argument(
"-e", "--feature_extraction",
type=str,
default="doc2vec",
help="Feature extraction method.")
parser.add_argument(
"-n", "--n_iter",
type=int,
default=1,
help="Number of iterations of Bayesian Optimization."
)
parser.add_argument(
"-d", "--datasets",
type=str,
default="all",
help="Datasets to use in the hyper parameter optimization "
"Separate by commas to use multiple at the same time [default: all].",
)
parser.add_argument(
"--mpi",
dest='use_mpi',
action='store_true',
help="Use the mpi implementation.",
)
parser.add_argument(
"-r", "--n_run",
type=int,
default=8,
help="Number of runs per dataset."
)
parser.add_argument(
"--server_job",
dest='server_job',
action='store_true',
help='Run job on the server. It will incur less overhead of used CPUs,'
' but more latency of workers waiting for the server to finish its own'
' job. Only makes sense in combination with the flag --mpi.'
)

return parser


@@ -95,17 +60,23 @@ def main(argv=sys.argv[1:]):
use_mpi = args["use_mpi"]
n_run = args["n_run"]
server_job = args["server_job"]
data_dir = args["data_dir"]
output_dir = args["output_dir"]

data_names = get_data_names(datasets)
data_names = get_data_names(datasets, data_dir=data_dir)
if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_executor
executor = mpi_executor
else:
executor = serial_executor

job_runner = ClusterJobRunner(data_names, feature_name, executor=executor,
n_cluster_run=n_run, server_job=server_job)
job_runner = ClusterJobRunner(
data_names, feature_name, executor=executor,
n_cluster_run=n_run, server_job=server_job,
data_dir=data_dir, output_dir=output_dir)

if use_mpi:
from asreviewcontrib.hyperopt.mpi_executor import mpi_hyper_optimize
mpi_hyper_optimize(job_runner, n_iter)
else:
serial_hyper_optimize(job_runner, n_iter)
