Modev: Model Development for Data Science Projects

Introduction

Most data science projects involve similar ingredients (loading data, defining some evaluation metrics, splitting data into different train/validation/test sets, etc.). Modev's goal is to ease these repetitive steps, without constraining the freedom data scientists need to develop models.

Installation

The easiest way is to install with pip:

pip install modev

Otherwise you can clone the latest release and install it manually:

git clone git@github.com:pabloarosado/modev.git
cd modev
python setup.py install

Alternatively, you can install it with conda:

conda install -c pablorosado modev

Quick guide

The quickest way to get started with modev is to run a pipeline with the default settings:

import modev
pipe = modev.Pipeline()
pipe.run()

This runs a pipeline on some example data, and returns a dataframe with a ranking of approaches that perform best (given some metrics) on the data.

To get the data used in the pipeline:

pipe.get_data()

By default, modev splits the data into a playground and a test set. The test set is ignored unless validation_inputs['test_mode'] is set to True, and the playground is split into k train/dev folds for k-fold cross-validation. To get the indexes of the train/dev/test sets:

pipe.get_indexes()
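
The split described above can be sketched in plain Python. This is only an illustration of the idea, not modev's implementation: the helper name sketch_split is hypothetical, and its defaults simply mirror the documented defaults of the validation inputs.

```python
import random


def sketch_split(indexes, playground_n_folds=4, test_fraction=0.2,
                 test_n_sets=2, random_state=0):
    """Hypothetical helper (not modev's code): split indexes into a
    playground with k train/dev folds plus n test sets."""
    shuffled = list(indexes)
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_part, playground = shuffled[:n_test], shuffled[n_test:]
    # n test sets that never overlap with the playground.
    test_sets = {i: test_part[i::test_n_sets] for i in range(test_n_sets)}
    # k non-overlapping dev sets taken from the playground.
    dev_sets = {k: playground[k::playground_n_folds]
                for k in range(playground_n_folds)}
    # Each train set is the rest of the playground for that fold.
    train_sets = {}
    for fold, dev in dev_sets.items():
        dev_lookup = set(dev)
        train_sets[fold] = [i for i in playground if i not in dev_lookup]
    return train_sets, dev_sets, test_sets


train_sets, dev_sets, test_sets = sketch_split(range(100))
```

With 100 rows and the defaults above, 20 rows go to the two test sets and the remaining 80 form the playground, so every fold has 60 train rows and 20 dev rows.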

The pipeline loads two dummy approaches (available in pipe.approaches_function) with some parameters (available in pipe.approaches_pars). For each fold, these approaches are fitted to the train set and then used to predict the 'color' of the examples in the dev set.

The metrics used to evaluate the performance of the approaches are listed in pipe.evaluation_pars['metrics'].

An exhaustive grid search is performed over all possible combinations of the parameters of each approach. The performance of each combination on each fold can be accessed with:

pipe.get_results()
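
To make the size of an exhaustive grid concrete, the enumeration can be sketched with itertools.product. The approach names, parameters and the helper sketch_grid below are hypothetical and only illustrate the idea; they are not modev's GridSearch.

```python
from itertools import product

# Hypothetical approaches and parameter values, for illustration only.
approaches_pars = {
    "approach_a": {"depth": [1, 2, 3], "rate": [0.1, 0.5]},
    "approach_b": {"alpha": [0.0, 1.0]},
}
folds = [0, 1, 2, 3]


def sketch_grid(approaches_pars, folds):
    """List every (approach, parameter combination, fold) to execute."""
    grid = []
    for name, pars in approaches_pars.items():
        keys = list(pars)
        # Cartesian product of the value lists of this approach.
        for values in product(*(pars[key] for key in keys)):
            for fold in folds:
                grid.append({"approach": name, "fold": fold,
                             **dict(zip(keys, values))})
    return grid


grid = sketch_grid(approaches_pars, folds)
print(len(grid))  # (3 * 2 + 2) combinations * 4 folds = 32
```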

To plot these results per fold for each of the metrics:

pipe.plot_results()

To plot only some of the metrics, pass them as a list to this function.

To get the final ranking of best approaches (after combining the results of different folds):

pipe.get_selected_models()
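
Combining the results of different folds amounts to aggregating each approach's metric over its folds and sorting. A minimal sketch of that step, assuming a 'mean' aggregation and made-up per-fold values (the data and the helper sketch_ranking are hypothetical, and modev itself returns a dataframe rather than a list):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-fold results: (approach id, fold, value of the main metric).
results = [
    ("a", 0, 0.5), ("a", 1, 0.75),
    ("b", 0, 0.75), ("b", 1, 0.25),
]


def sketch_ranking(results):
    """Average the main metric over folds and rank approaches, best first."""
    per_approach = defaultdict(list)
    for approach, _fold, value in results:
        per_approach[approach].append(value)
    combined = {approach: mean(values)
                for approach, values in per_approach.items()}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranking = sketch_ranking(results)
print(ranking)  # [('a', 0.625), ('b', 0.5)]
```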

Guide

The inputs accepted by modev.Pipeline refer to the usual ingredients in a data science project (data loading, evaluation metrics, model selection method...). We define an experiment as a combination of all these ingredients. An experiment is defined by a dictionary with the following keys:

  1. load_inputs: Dictionary of inputs related to data loading:

    • Using the default function.

      If function is not given, modev.etl.load_local_file will be used.
      This function loads a local (.csv) file. It uses pandas.read_csv function and accepts all its arguments, and also some additional arguments.

      • Arguments that must be defined in load_inputs:
        • data_file : str
          Path to local (.csv) file.
      • Arguments that can optionally be defined in load_inputs:
        • selection : str or None
          Selection to perform on the data. For example, if selection is "(data['height'] > 3) & (data['width'] < 2)", that selection will be evaluated and applied to the data; None to apply no selection.
          Default: None
        • sample_nrows : int or None
          Number of random rows to sample from the data (without repeating rows); None to load all rows.
          Default: None
        • random_state : int or None
          Random state (relevant only when sampling from data, i.e. when sample_nrows is not None).
          Default: None
    • Using a custom function.

      If the function key is contained in the load_inputs dictionary, its value must be a valid function.

      • Arguments that this custom function must accept:
        This function can have an arbitrary number of mandatory arguments (or none), to be specified in load_inputs.
      • Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in load_inputs dictionary.
      • Outputs this custom function must return:
        • data : pd.DataFrame
          Relevant data.
  2. validation_inputs: Dictionary of inputs related to validation method (e.g. k-fold or temporal-fold cross-validation).

    • Using the default function.

      If function is not given, modev.validation.k_fold_playground_n_tests_split will be used.
      This function generates indexes that split data into a playground (with k folds) and n test sets. There is only one playground, which contains train and dev sets, and has no overlap with test sets. Playground is split into k folds, namely k non-overlapping dev sets, and k overlapping train sets. Each of the folds contains all data in the playground (part of it in train, and the rest in dev); hence train and dev sets of the same fold do not overlap.

      • Arguments that must be defined in validation_inputs:
        None (all arguments will be taken from default if not explicitly given).
      • Arguments that can optionally be defined in validation_inputs:
        • playground_n_folds : int
          Number of folds to split playground into (also called k), so that there will be k train sets and k dev sets.
          Default: 4
        • test_fraction : float
          Fraction of data to use for test sets.
          Default: 0.2
        • test_n_sets : int
          Number of test sets.
          Default: 2
        • labels : list or None
          Labels to stratify data according to their distribution; None to not stratify data.
          Default: None
        • shuffle : bool
          True to shuffle data before splitting; False to keep them sorted as they are before splitting.
          Default: True
        • random_state : int or None
          Random state for shuffling; ignored if 'shuffle' is False (in which case 'random_state' can be set to None).
          Default: None
        • test_mode : bool
          True to return indexes of the test set; False to return indexes of the dev set.
          Default: False
    • Using a custom function.

      If the function key is contained in the validation_inputs dictionary, its value must be a valid function.

      • Arguments that this custom function must accept:
        • data : pd.DataFrame
          Indexed data (e.g. a dataframe whose index can be accessed with data.index).
      • Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in validation_inputs dictionary.
      • Outputs this custom function must return:
        • train_indexes : dict
          Indexes to use for training on the different k folds, e.g. for 10 folds:
          {0: np.array([...]), 1: np.array([...]), ..., 9: np.array([...])}
        • test_indexes : dict
          Indexes to use for evaluating (either dev or test) on the different k folds, e.g. for 10 folds and if test_mode is False:
          {0: np.array([...]), 1: np.array([...]), ..., 9: np.array([...])}
  3. execution_inputs: Dictionary of inputs related to the execution of approaches.

    • Using the default function.

      If function is not given, modev.execution.execute_model will be used. This function defines the execution method (including training and prediction, and any possible preprocessing) for an approach. This function takes an approach approach_function with parameters approach_pars, a train set (with predictors train_x and targets train_y) and the predictors of a test set test_x, and returns the predicted targets of the test set.
      Note: Here, 'test' refers to either a dev set or a test set.

      • Arguments that must be defined in execution_inputs:
        • target : str
          Name of target column in both train_set and test_set.
      • Arguments that can optionally be defined in execution_inputs:
        None (this function does not accept any other optional arguments).
    • Using a custom function.

      If the function key is contained in the execution_inputs dictionary, its value must be a valid function.

      • Arguments that this custom function must accept:
        • model : model object
          Instantiated approach.
        • data : pd.DataFrame
          Data, as returned by the function defined in load_inputs.
        • fold_train_indexes : np.array
          Indexes of train set (or playground set) for current fold.
        • fold_test_indexes : np.array
          Indexes of dev set (or test set) for current fold.
        • target : str
          Name of target column in both train_set and test_set.
      • Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in execution_inputs dictionary.
      • Outputs this custom function must return:
        • execution_results : dict
          Execution results. It contains:
          • truth: np.array of true values of the target in the dev (or test) set.
          • prediction: np.array of predicted values of the target in the dev (or test) set.
  4. evaluation_inputs: Dictionary of inputs related to evaluation metrics.

    • Using the default function.

      If function is not given, modev.evaluation.evaluate_predictions will be used.
      This function evaluates predictions, given a ground truth, using a list of metrics.

      • Arguments that must be defined in evaluation_inputs:
        • metrics : list
          Metrics to use for evaluation. Implemented metrics include:
          • precision: usual precision in classification problems.
          • recall: usual recall in classification problems.
          • f1: usual f1-score in classification problems.
          • accuracy: usual accuracy in classification problems.
          • precision_at_*: precision at k (e.g. 'precision_at_10') or at k percent (e.g. 'precision_at_5_pct').
          • recall_at_*: recall at k (e.g. 'recall_at_10') or at k percent (e.g. 'recall_at_5_pct').
          • threshold_at_*: threshold at k (e.g. 'threshold_at_10') or at k percent (e.g. 'threshold_at_5_pct').
          Note: For the time being, every metric has to return a single number; in the case of multi-class classification, a micro-averaged precision is returned.
    • Using a custom function.

      If the function key is contained in the evaluation_inputs dictionary, its value must be a valid function.

      • Arguments that this custom function must accept:
        • execution_results : dict
          Execution results, as returned by the function defined in execution_inputs. It must contain the keys 'truth' and 'prediction'.
      • Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in evaluation_inputs dictionary.
      • Outputs this custom function must return:
        • results : dict
          Results of evaluation. Each element in the dictionary corresponds to one of the metrics.
  5. exploration_inputs: Dictionary of inputs related to the method to explore the parameter space (e.g. grid search or random search).

    • Using the default class.

      If function is not given, modev.exploration.GridSearch will be used.
      This class allows for a grid-search exploration of the parameter space.

    • Using a custom class.

      If the function key is contained in the exploration_inputs dictionary, its value must be a valid class.

      • Arguments that this custom class must accept:
        • approaches_pars : dict
          Dictionaries of approaches. Each key corresponds to one approach name, and the value is a dictionary. This inner dictionary of an individual approach has one key per parameter, and the value is a list of parameter values to explore.
        • folds : list
          List of folds (e.g. [0, 1, 2, 3]).
        • results : pd.DataFrame or None
          Existing results to load; None to initialise results from scratch.
      • Additionally, this class can have an arbitrary number of optional arguments (or none), to be specified in the exploration_inputs dictionary.
      • Methods this custom class must implement:
        • initialise_results : function
          Initialise results dataframe and return it.
        • select_executions_left : function
          Select rows of results left to be executed and return the number of rows.
        • get_next_point : function
          Return next point of parameter space to be explored.
  6. selection_inputs: Dictionary of inputs related to the model selection method.

    • Using the default function.

      If function is not given, modev.selection.model_selection will be used.
      This function takes the evaluation of approaches on some folds, and selects the best model.

      • Arguments that must be defined in selection_inputs:
        • main_metric : str
          Name of the main metric (the one that has to be maximized).
      • Arguments that can optionally be defined in selection_inputs:
        • aggregation_method : str
          Aggregation method to use to combine evaluations of different folds (e.g. 'mean').
          Default: 'mean'
        • results_condition : str or None
          Condition to be applied to the results dataframe before combining results from different folds.
          Default: None
        • combined_results_condition : str or None
          Condition to be applied to the results dataframe after combining results from different folds.
          Default: None
    • Using a custom function.

      If the function key is contained in the selection_inputs dictionary, its value must be a valid function.

      • Arguments that this custom function must accept:
        • results : pd.DataFrame
          Evaluations of the performance of approaches on different data folds (output of function used in evaluation_inputs).
      • Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in the selection_inputs dictionary.
      • Outputs this custom function must return:
        • combine_results_sorted : pd.DataFrame
          Ranking of results (sorted in descending value of 'main_metric') of approaches that fulfil the imposed conditions.
  7. approaches_inputs: List of dictionaries, one per approach to be used.

    • Definition of an approach.

      Each dictionary in the list has at least two keys:

      • approach_name: Name of the approach.
      • function: Actual approach (usually, a class with 'fit' and 'predict' methods).
      • Any other key in the dictionary of an approach will be assumed to be an argument of that approach.
        To see some examples of simple approaches, see modev.approaches.DummyPredictor and modev.approaches.RandomChoicePredictor.
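
As an illustration of the approach definition above, a very simple custom approach could look like the sketch below. The class MostFrequentPredictor and its fallback argument are hypothetical (they are not part of modev), and the entry assumes that, as in approaches_pars, every extra key maps to a list of values to explore.

```python
from collections import Counter


class MostFrequentPredictor:
    """Hypothetical approach: always predicts the most frequent train target."""

    def __init__(self, fallback=None):
        # Value predicted if fit() was given no targets at all.
        self.fallback = fallback
        self.most_frequent_ = fallback

    def fit(self, train_x, train_y):
        counts = Counter(train_y)
        if counts:
            self.most_frequent_ = counts.most_common(1)[0][0]
        return self

    def predict(self, test_x):
        # One prediction per row of the test predictors.
        return [self.most_frequent_] * len(test_x)


# Assumption: extra keys hold lists of argument values to explore.
approaches_inputs = [
    {"approach_name": "most_frequent",
     "function": MostFrequentPredictor,
     "fallback": ["unknown"]},
]

model = MostFrequentPredictor().fit([[1], [2], [3]], ["red", "blue", "red"])
print(model.predict([[4], [5]]))  # ['red', 'red']
```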

An experiment can be contained in a Python module. As an example, there is a template experiment in modev.templates, which is a small variation on the default experiment. To start a pipeline on this experiment:

from modev import Pipeline
from modev.templates import experiment_01
experiment = experiment_01.experiment
pipe = Pipeline(**experiment)

To run it, call pipe.run(), as in the quick guide.
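
A custom experiment module could follow the same pattern. The fragment below is only a sketch under stated assumptions: evaluate_predictions_sketch is a hypothetical custom evaluation function written against the signature described in evaluation_inputs, and the dictionary shows just two of the seven keys; the remaining ones would be filled in as described in the guide.

```python
def evaluate_predictions_sketch(execution_results, metrics=("accuracy",)):
    """Hypothetical custom evaluation function: one number per metric."""
    truth = list(execution_results["truth"])
    prediction = list(execution_results["prediction"])
    n_correct = sum(t == p for t, p in zip(truth, prediction))
    accuracy = n_correct / len(truth)
    available = {"accuracy": accuracy, "error_rate": 1 - accuracy}
    return {metric: available[metric] for metric in metrics}


# Partial experiment: only the evaluation and selection ingredients are shown.
experiment = {
    "evaluation_inputs": {"function": evaluate_predictions_sketch,
                          "metrics": ["accuracy"]},
    "selection_inputs": {"main_metric": "accuracy"},
}

example = evaluate_predictions_sketch(
    {"truth": ["a", "b", "a", "a"], "prediction": ["a", "a", "a", "b"]}
)
print(example)  # {'accuracy': 0.5}
```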