From ff4fc48b65341f8f762d59fe35f8cf836d704cc1 Mon Sep 17 00:00:00 2001
From: Henry
Date: Mon, 1 Jul 2024 21:01:19 +0200
Subject: [PATCH 01/13] :memo: improve documentation

- formulations, typos
- links to other papers of comparison methods
---
 README.md | 115 ++++++++++++++-----------
 project/README.md | 7 +-
 project/data/Alzheimer_study/README.md | 7 ++
 project/src/R_NAGuideR/Imput_funcs.r | 2 +-
 4 files changed, 75 insertions(+), 56 deletions(-)
 create mode 100644 project/data/Alzheimer_study/README.md

diff --git a/README.md b/README.md
index 7d4cee267..86ec184b6 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ We published the [work](https://www.nature.com/articles/s41467-024-48711-5) in N
 > Nat Commun 15, 5405 (2024).
 > https://doi.org/10.1038/s41467-024-48711-5
 
-We provide functionality as a python package, an excutable workflow or simply in notebooks.
+We provide new functionality as a python package for simple use (in notebooks) and a workflow for comparison with other methods.
 
 For any questions, please [open an issue](https://github.com/RasmussenLab/pimms/issues) or contact me directly.
 
@@ -43,22 +43,6 @@ Then you can use the models on a pandas DataFrame with missing values. You can t
 > `PIMMS` was called `vaep` during development.
 > Before entire refactoring has been completed the imported package will be `vaep`.
 
-## Notebooks as scripts using papermill
-
-If you want to run a model on your prepared data, you can run notebooks prefixed with
-`01_`, i.e. [`project/01_*.ipynb`](https://github.com/RasmussenLab/pimms/tree/HEAD/project) after cloning the repository. Using jupytext also python percentage script versions
-are saved.
-
-```bash
-# navigat to your desired folder
-git clone https://github.com/RasmussenLab/pimms.git # get all notebooks
-cd project # project folder as pwd
-# pip install pimms-learn papermill # if not already installed
-papermill 01_0_split_data.ipynb --help-notebook
-papermill 01_1_train_vae.ipynb --help-notebook
-```
-> :warning: Mistyped argument names won't throw an error when using papermill, but a warning is printed on the console thanks to my contributions:)
-
 ## PIMMS comparison workflow and differential analysis workflow
 
 The PIMMS comparison workflow is a snakemake workflow that runs all selected PIMMS models and R-models on
@@ -88,7 +72,8 @@ To re-execute the entire workflow locally, have a look at the [configuration fil
 - [`config/alzheimer_study/config.yaml`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/config/alzheimer_study/config.yaml)
 - [`config/alzheimer_study/comparison.yaml`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/config/alzheimer_study/comparison.yaml)
 
-To execute that workflow, follow the Setup instructions below and run the following command in the project folder:
+To execute that workflow, follow the Setup instructions below and run the following commands
+in the project folder:
 
 ```bash
 # being in the project folder
@@ -105,9 +90,31 @@ sphinx-build -n --keep-going -b html ./ ./_build/
 # open ./_build/index.html
 ```
 
+## Notebooks as scripts using papermill
+
+The above workflow is based on notebooks as scripts, which can then be rendered as html files. Using jupytext, python percentage script versions are saved as well.
+
+If you want to run a specific model on your data, you can run notebooks prefixed with
+`01_`, i.e. [`project/01_*.ipynb`](https://github.com/RasmussenLab/pimms/tree/HEAD/project) after
+creating the appropriate data split. Start by cloning the repository.
+
+```bash
+# navigate to your desired folder
+git clone https://github.com/RasmussenLab/pimms.git # get all notebooks
+cd project # project folder as pwd
+# pip install pimms-learn papermill # if not already installed
+papermill 01_0_split_data.ipynb --help-notebook
+papermill 01_1_train_vae.ipynb --help-notebook
+```
+> :warning: Mistyped argument names won't throw an error when using papermill, but a warning is printed on the console thanks to my contributions :)
 
 ## Setup workflow and development environment
 
+Either (1) install one big conda environment based on an environment file,
+or (2) install packages using a mix of conda and pip,
+or (3) use snakemake separately with rule-specific conda environments.
+
-### Setup comparison workflow
+### Setup comparison workflow (1)
 
 The core functionality is available as a standalone software on PyPI under the name `pimms-learn`. However, running the entire snakemake workflow is enabled using conda (or mamba) and pip to set up an analysis environment. For a detailed description of setting up
 
 mamba env create -n pimms -f environment.yml # faster, less than 5 mins
 ```
 
 If on Mac M1, M2 or otherwise having issues using your accelerator (e.g. GPUs): install the pytorch dependencies first, then the rest of the environment:
 
-### Install pytorch first
+### Install pytorch first (2)
 
 > :warning: We currently see issues with some installations on M1 chips. A dependency
 > for one workflow is polars, which causes the issue. This should be [fixed now](https://github.com/RasmussenLab/njab/pull/13)
 
 papermill 04_1_train_pimms_models.ipynb 04_1_train_pimms_models_test.ipynb # sec
 python 04_1_train_pimms_models.py # just execute the code
 
-### Let Snakemake handle installation
+### Let Snakemake handle installation (3)
 
 If you only want to execute the workflow, you can use snakemake to build the environments for you:
 
 Troubleshoot your R installation by opening jupyter lab
 
 ```bash
 jupyter lab # open 01_1_train_NAGuideR.ipynb
 ```
 
-## Run example
+## Run example on HeLa data
 
 Change to the [`project` folder](./project) and see its [README](project/README.md)
 You can subselect models by editing the config file: [`config.yaml`](https://github.com/RasmussenLab/pimms/tree/HEAD/project/config/single_dev_dataset/proteinGroups_N50).
 
 assert df_imputed.isna().sum().sum() == 0
 df_imputed
 ```
 
+> [!NOTE]
+> The imputation is simpler if you use the provided scikit-learn Transformer
+> interface (see [Tutorial](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)).
+
 ## Available imputation methods
 
-Packages either are based on this repository, or were referenced by NAGuideR (Table S1).
-From the brief description in the table the exact procedure is not always clear.
+Packages either are based on this repository, were referenced by NAGuideR, or were released recently.
+From the brief description in this table the exact procedure is not always clear.
-| Method | Package | source | status | name | +| Method | Package | source | links | name | | ------------- | ----------------- | ------ | ------ |------------------ | -| CF | pimms | pip | | Collaborative Filtering | -| DAE | pimms | pip | | Denoising Autoencoder | -| VAE | pimms | pip | | Variational Autoencoder | +| CF | pimms | pip | [paper](https://doi.org/10.1038/s41467-024-48711-5) | Collaborative Filtering | +| DAE | pimms | pip | [paper](https://doi.org/10.1038/s41467-024-48711-5) | Denoising Autoencoder | +| VAE | pimms | pip | [paper](https://doi.org/10.1038/s41467-024-48711-5) | Variational Autoencoder | | | | | | -| ZERO | - | - | | replace NA with 0 | -| MINIMUM | - | - | | replace NA with global minimum | -| COLMEDIAN | e1071 | CRAN | | replace NA with column median | -| ROWMEDIAN | e1071 | CRAN | | replace NA with row median | -| KNN_IMPUTE | impute | BIOCONDUCTOR | | k nearest neighbor imputation | -| SEQKNN | SeqKnn | tar file | | Sequential k- nearest neighbor imputation
starts with feature with least missing values and re-use imputed values for not yet imputed features
-| BPCA | pcaMethods | BIOCONDUCTOR | | Bayesian PCA missing value imputation
-| SVDMETHOD | pcaMethods | BIOCONDUCTOR | | replace NA initially with zero, use k most significant eigenvalues using Singular Value Decomposition for imputation until convergence
-| LLS | pcaMethods | BIOCONDUCTOR | | Local least squares imputation of a feature based on k most correlated features
+| ZERO | - | - | - | replace NA with 0 |
+| MINIMUM | - | - | - | replace NA with global minimum |
+| COLMEDIAN | e1071 | CRAN | - | replace NA with column median |
+| ROWMEDIAN | e1071 | CRAN | - | replace NA with row median |
+| KNN_IMPUTE | impute | BIOCONDUCTOR | [docs](https://bioconductor.org/packages/release/bioc/html/impute.html) | k-nearest neighbor imputation |
+| SEQKNN | SeqKnn | tar file | [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-160) | Sequential k-nearest neighbor imputation: starts with the feature with the fewest missing values and re-uses imputed values for features not yet imputed
+| BPCA | pcaMethods | BIOCONDUCTOR | [paper](https://doi.org/10.1093/bioinformatics/btm069) | Bayesian PCA missing value imputation
+| SVDMETHOD | pcaMethods | BIOCONDUCTOR | [paper](https://doi.org/10.1093/bioinformatics/btm069) | replace NA initially with zero, use k most significant eigenvalues using Singular Value Decomposition for imputation until convergence
+| LLS | pcaMethods | BIOCONDUCTOR | [paper](https://doi.org/10.1093/bioinformatics/btm069) | Local least squares imputation of a feature based on k most correlated features
 | MLE | norm | CRAN | | Maximum likelihood estimation
-| QRILC | imputeLCMD | CRAN | | quantile regression imputation of left-censored data, i.e. by random draws from a truncated distribution which parameters were estimated by quantile regression
-| MINDET | imputeLCMD | CRAN | | replace NA with q-quantile minimum in a sample
-| MINPROB | imputeLCMD | CRAN | | replace NA by random draws from q-quantile minimum centered distribution
-| IRM | VIM | CRAN | | iterativ robust model-based imputation (one feature at at time)
-| IMPSEQ | rrcovNA | CRAN | | Sequential imputation of missing values by minimizing the determinant of the covariance matrix with imputed values
-| IMPSEQROB | rrcovNA | CRAN | | Sequential imputation of missing values using robust estimators
-| MICE-NORM | mice | CRAN | | Multivariate Imputation by Chained Equations (MICE) using Bayesian linear regression
-| MICE-CART | mice | CRAN | | Multivariate Imputation by Chained Equations (MICE) using regression trees
-| TRKNN | - | script | | truncation k-nearest neighbor imputation
-| RF | missForest | CRAN | | Random Forest imputation (one feature at a time)
 | PI | - | - | | Downshifted normal distribution (per sample)
-| GSIMP | - | script | | QRILC initialization and iterative Gibbs sampling with generalized linear models (glmnet)
-| MSIMPUTE | msImpute | BIOCONDUCTOR | | Missing at random algorithm using low rank approximation
-| MSIMPUTE_MNAR | msImpute | BIOCONDUCTOR | | Missing not at random algorithm using low rank approximation
-| ~~grr~~ | DreamAI | - | Fails to install | Rigde regression
-| ~~GMS~~ | GMSimpute | tar file | Fails on Windows | Lasso model
+| QRILC | imputeLCMD | CRAN | [paper](https://doi.org/10.1021/acs.jproteome.5b00981) | quantile regression imputation of left-censored data, i.e. by random draws from a truncated distribution whose parameters were estimated by quantile regression
+| MINDET | imputeLCMD | CRAN | [paper](https://doi.org/10.1021/acs.jproteome.5b00981) | replace NA with q-quantile minimum in a sample
+| MINPROB | imputeLCMD | CRAN | [paper](https://doi.org/10.1021/acs.jproteome.5b00981) | replace NA by random draws from q-quantile minimum centered distribution
+| IRM | VIM | CRAN | [paper](https://doi.org/10.18637/jss.v074.i07) | iterative robust model-based imputation (one feature at a time)
+| IMPSEQ | rrcovNA | CRAN | [paper](https://doi.org/10.1007/s11634-010-0075-2) | Sequential imputation of missing values by minimizing the determinant of the covariance matrix with imputed values
+| IMPSEQROB | rrcovNA | CRAN | [paper](https://doi.org/10.1007/s11634-010-0075-2) | Sequential imputation of missing values using robust estimators
+| MICE-NORM | mice | CRAN | [paper](https://doi.org/10.1002%2Fmpr.329) | Multivariate Imputation by Chained Equations (MICE) using Bayesian linear regression
+| MICE-CART | mice | CRAN | [paper](https://doi.org/10.1002%2Fmpr.329) | Multivariate Imputation by Chained Equations (MICE) using regression trees
+| TRKNN | - | script | [paper](https://doi.org/10.1186/s12859-017-1547-6) | truncation k-nearest neighbor imputation
+| RF | missForest | CRAN | [paper](https://doi.org/10.1093/bioinformatics/btr597) | Random Forest imputation (one feature at a time)
 | PI | - | - | | Downshifted normal distribution (per sample)
+| GSIMP | - | script | [paper](https://doi.org/10.1371/journal.pcbi.1005973) | QRILC initialization and iterative Gibbs sampling with generalized linear models (glmnet) - slow
+| MSIMPUTE | msImpute | BIOCONDUCTOR | [paper](https://doi.org/10.1016/j.mcpro.2023.100558) | Missing at random algorithm using low rank approximation
+| MSIMPUTE_MNAR | msImpute | BIOCONDUCTOR | [paper](https://doi.org/10.1016/j.mcpro.2023.100558) | Missing not at random algorithm using low rank approximation
+
+
+DreamAI and GMSimpute are not available for installation on Windows or failed to install.
diff --git a/project/README.md b/project/README.md
index ba3cc6b25..44ca9cac1 100644
--- a/project/README.md
+++ b/project/README.md
@@ -1,7 +1,8 @@
-# Paper project
+# Project folder README (workflows)
 
 The PIMMS comparison workflow is a snakemake workflow that runs all selected PIMMS models and R-models on
-a user-provided dataset and compares the results. An example for the smaller HeLa development dataset on the
-protein groups level is re-built regularly and available at: [rasmussenlab.org/pimms](https://www.rasmussenlab.org/pimms/)
+a user-provided dataset and compares the results.
+An example for a
+[public Alzheimer dataset](https://github.com/RasmussenLab/njab/tree/main/docs/tutorial/data)
+on the protein groups level is re-built regularly and available at: [rasmussenlab.org/pimms](https://www.rasmussenlab.org/pimms/)
 
 ## Data requirements
 
diff --git a/project/data/Alzheimer_study/README.md b/project/data/Alzheimer_study/README.md
new file mode 100644
index 000000000..dcaf706c9
--- /dev/null
+++ b/project/data/Alzheimer_study/README.md
@@ -0,0 +1,7 @@
+# PXD016278
+
+Proteome Profiling in Cerebrospinal Fluid Reveals Novel Biomarkers of Alzheimer's Disease
+
+- [PXD016278](https://www.ebi.ac.uk/pride/archive/projects/PXD016278)
+- [publication](https://www.embopress.org/doi/full/10.15252/msb.20199356)
+- [curated data version from omiclearn](https://github.com/MannLabs/OmicLearn/tree/master/omiclearn/data)
diff --git a/project/src/R_NAGuideR/Imput_funcs.r b/project/src/R_NAGuideR/Imput_funcs.r
index 63c34b9f7..c90b86baa 100644
--- a/project/src/R_NAGuideR/Imput_funcs.r
+++ b/project/src/R_NAGuideR/Imput_funcs.r
@@ -1,9 +1,9 @@
+# from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1547-6
 ##################################################################################
 #### MLE for the Truncated Normal
 #### Creating a Function that Returns the Log Likelihood, Gradient and
 #### Hessian Functions
 ##################################################################################
-
 ## data = numeric vector
 ## t = truncation limits
 mklhood <- function(data, t, ...) {

From a22ae43bdd65b8a930c8529de9cc3fe4fc9f5446 Mon Sep 17 00:00:00 2001
From: Henry
Date: Mon, 1 Jul 2024 21:40:02 +0200
Subject: [PATCH 02/13] :art: do not use math for loss in vae

---
 vaep/models/vae.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/vaep/models/vae.py b/vaep/models/vae.py
index d56704f13..61240137b 100644
--- a/vaep/models/vae.py
+++ b/vaep/models/vae.py
@@ -6,7 +6,6 @@
 - loss is adapted to Dataset and FastAI adaptions
 - batchnorm1D for now (not weight norm)
 """
-import math
 from typing import List
 
 import torch
@@ -15,6 +14,9 @@
 leaky_relu_default = nn.LeakyReLU(.1)
 
+PI = torch.tensor(torch.pi)
+log_of_2 = torch.log(torch.tensor(2.))
+
 
 class VAE(nn.Module):
     def __init__(self,
@@ -100,7 +102,7 @@ def compute_kld(z_mu, z_logvar):
 
 
 def gaussian_log_prob(z, mu, logvar):
-    return -0.5 * (math.log(2 * math.pi) + logvar + (z - mu)**2 / torch.exp(logvar))
+    return -0.5 * (torch.log(2. * PI) + logvar + (z - mu)**2 / torch.exp(logvar))
 
 
 def loss_fct(pred, y, reduction='sum', results: List = None, freebits=0.1):
@@ -109,8 +111,8 @@ def loss_fct(pred, y, reduction='sum', results: List = None, freebits=0.1):
 
     l_rec = -torch.sum(gaussian_log_prob(batch, x_mu, x_logvar))
     l_reg = torch.sum((F.relu(compute_kld(z_mu, z_logvar)
-                              - freebits * math.log(2))
-                      + freebits * math.log(2)),
+                              - freebits * log_of_2)
+                      + freebits * log_of_2),
                       1)
 
     if results is not None:

From 6fbbc6f7ca16c4473eb445559bd1773f62d36636 Mon Sep 17 00:00:00 2001
From: Henry
Date: Mon, 1 Jul 2024 21:41:22 +0200
Subject: [PATCH 03/13] :construction: add two normalization functions (tbc)

---
 vaep/normalization.py | 82 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100644 vaep/normalization.py

diff --git a/vaep/normalization.py b/vaep/normalization.py
new file mode 100644
index 000000000..430d8b118
--- /dev/null
+++ b/vaep/normalization.py
@@ -0,0 +1,82 @@
+import logging
+
+import pandas as pd
+
+logger = logging.getLogger(__name__)
+
+
+def normalize_by_median(df_wide: pd.DataFrame, axis: int = 1) -> pd.DataFrame:
+    """Normalize by median and level using the global median of medians.
+
+    Parameters
+    ----------
+    df_wide : pd.DataFrame
+        DataFrame with samples as rows and features as columns
+    axis : int, optional
+        Axis to normalize, by default 1 (i.e. by row/sample)
+
+    Returns
+    -------
+    pd.DataFrame
+        Normalized DataFrame
+    """
+    medians = df_wide.median(axis=axis)
+    global_median = medians.median()
+    df_wide = df_wide.subtract(medians, axis=1 - axis) + global_median
+    return df_wide
+
+
+def normalize_sceptre(quant: pd.DataFrame,
+                      iter_thresh: float = 1.1,
+                      iter_max: int = 10,
+                      check_convex: bool = True) -> pd.DataFrame:
+    """Normalize by sample and channel as in the SCeptre paper. Code adapted to work
+    with current pandas versions.
+
+    Parameters
+    ----------
+    quant : pd.DataFrame
+        DataFrame with a MultiIndex of two levels on the rows: File ID and Channel.
+        Not log transformed.
+    iter_thresh : float, optional
+        threshold on the median absolute deviation between iterations, by default 1.1
+    iter_max : int, optional
+        maximum number of iterations to check for convergence, by default 10
+    check_convex : bool, optional
+        raise an error if the maximum absolute deviation increases between iterations,
+        by default True
+
+    Returns
+    -------
+    pd.DataFrame
+        Normalized DataFrame with same index and columns as input
+    """
+    max_dev_old = None
+    for i in range(iter_max):  # iterate to converge to normalized channel and file
+        quant_0 = quant.copy()
+
+        # file bias normalization
+        # calculate median for each protein in each sample
+        med = quant.groupby(level=0).median()
+        # calculate the factors needed for a median shift
+        med_tot = med.median(axis=0)
+        factors = med.divide(med_tot, axis=1)
+        quant = quant.divide(factors)
+
+        # channel bias normalization
+        # calculate median for each protein in each channel
+        med = quant.groupby(level=1).median()
+        # calculate the factors needed for a median shift
+        med_tot = med.median(axis=1)
+        factors = med.divide(med_tot, axis=0)
+        quant = quant.divide(factors)
+        # stop iterating when the change in quant compared to the previous iteration is below iter_thresh
+        max_dev = abs(quant - quant_0).max().max()
+        median_dev = abs(quant - quant_0).median().median()
+        print(f"Max deviation: {max_dev:.2f}, median deviation: {median_dev:.2f}")
+        if median_dev <= iter_thresh:
+            print(f"Max deviation: {max_dev:.2f}")
+            print(f"Median deviation: {median_dev:.2f}")
+            break
+        if i > 0 and check_convex and max_dev_old < max_dev:
+            raise ValueError("Non-convex behaviour: "
+                             "old max deviation smaller than current.")
+        print("performed {} iterations, max-dev: {:.2f}".format(i + 1, max_dev))
+        max_dev_old = max_dev
+    return quant

From eacef5ad6cc26b0d22f4332fa571cf7231e5ca1d Mon Sep 17 00:00:00 2001
From: Henry
Date: Tue, 2 Jul 2024 08:44:06 +0200
Subject: [PATCH 04/13] :construction: add missing values functionality

---
 vaep/pandas/missing_data.py | 71 +++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 2 deletions(-)

diff --git a/vaep/pandas/missing_data.py b/vaep/pandas/missing_data.py
index 7bd62e0ae..b4a3b97ec 100644
--- a/vaep/pandas/missing_data.py
+++ b/vaep/pandas/missing_data.py
@@ -1,11 +1,26 @@
+"""Functionality related to analyzing missing values in a pandas DataFrame."""
 from __future__ import annotations
-from pathlib import Path
+
 import math
+from pathlib import Path
+from typing import Union
 
 import pandas as pd
 
 
-def percent_missing(df: pd.DataFrame) -> float:
+def percent_missing(df: pd.DataFrame) -> float:
+    """Proportion of missing values in a DataFrame.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        DataFrame with data.
+
+    Returns
+    -------
+    float
+        Proportion of missing values in the DataFrame.
+    """
     return df.isna().sum().sum() / math.prod(df.shape)
 
 
@@ -32,3 +47,55 @@ def get_record(data: pd.DataFrame, columns_sample=False) -> dict:
                   N_mis=int(N_mis), missing=float(missing),
                   )
     return record
+
+
+def decompose_NAs(data: pd.DataFrame,
+                  level: Union[int, str],
+                  label: str = 'summary') -> pd.DataFrame:
+    """Decompose missing values by a level into real and indirectly imputed missing values.
+    Real missing values are missing for all samples in a group. Indirectly imputed missing
+    values are those that, in MS-based proteomics data, would be filled by the mean (or median)
+    of the observed values in a group if the mean (or median) were used for imputation.
+
+    Parameters
+    ----------
+    data : pd.DataFrame
+        DataFrame with samples in columns and features in rows.
+    level : Union[int, str]
+        Index level to group by. Examples: Protein groups, peptides or precursors in MS data.
+    label : str, optional
+        Index name of the single-row DataFrame returned, by default 'summary'
+
+    Returns
+    -------
+    pd.DataFrame
+        One-row DataFrame with summary information about missing values.
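+
+    Examples
+    --------
+    A small, hypothetical toy example (rows: (protein, precursor) MultiIndex,
+    columns: two samples; all names are made up for illustration):
+
+    >>> idx = pd.MultiIndex.from_tuples(
+    ...     [('A', 'a1'), ('A', 'a2'), ('B', 'b1')], names=['protein', 'precursor'])
+    >>> data = pd.DataFrame({'S1': [1.0, None, None], 'S2': [None, 2.0, 3.0]}, index=idx)
+    >>> summary = decompose_NAs(data, level='protein')
+    >>> int(summary.loc['summary', 'real_MVs']), int(summary.loc['summary', 'indirectly_imputed_MVs'])
+    (1, 2)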
+    """
+
+    real_mvs = 0
+    ii_mvs = 0
+
+    grouped = data.groupby(level=level)
+    for _, _df in grouped:
+        if len(_df) == 1:
+            # single precursors -> all real MVs (RMVs)
+            real_mvs += _df.isna().sum().sum()
+        elif len(_df) > 1:
+            # calculate the number of missing values for samples where at least one precursor was observed
+            total_NAs = _df.isna().sum().sum()
+            M = len(_df)  # normally 2 or 3
+            _real_mvs = _df.isna().all(axis=0).sum() * M
+            real_mvs += _real_mvs
+            ii_mvs += (total_NAs - _real_mvs)
+        else:
+            raise ValueError("Something went wrong")
+    assert data.isna().sum().sum() == real_mvs + ii_mvs
+    return pd.Series(
+        {'total_obs': data.notna().sum().sum(),
+         'total_MVs': data.isna().sum().sum(),
+         'real_MVs': real_mvs,
+         'indirectly_imputed_MVs': ii_mvs,
+         'real_MVs_ratio': real_mvs / data.isna().sum().sum(),
+         'indirectly_imputed_MVs_ratio': ii_mvs / data.isna().sum().sum(),
+         'total_MVs_ratio': data.isna().sum().sum() / data.size
+         }).to_frame(name=label).T.convert_dtypes()

From 266ad9434094c3fea96b1da2ea470ea79e32455c Mon Sep 17 00:00:00 2001
From: Henry
Date: Tue, 2 Jul 2024 09:41:58 +0200
Subject: [PATCH 05/13] :truck: vaep to pimmslearn for pkg

- start with running unittests
---
 {vaep => pimmslearn}/README.md | 4 +-
 {vaep => pimmslearn}/__init__.py | 23 ++++++++----
 {vaep => pimmslearn}/analyzers/__init__.py | 2 +-
 {vaep => pimmslearn}/analyzers/analyzers.py | 12 +++---
 .../analyzers/compare_predictions.py | 0
 .../analyzers/diff_analysis.py | 0
 .../cmd_interface/__init__.py | 0
 .../setup_diff_analysis_website.py | 0
 .../cmd_interface/setup_imp_cp_website.py | 0
 {vaep => pimmslearn}/data_handling.py | 0
 {vaep => pimmslearn}/databases/__init__.py | 0
 {vaep => pimmslearn}/databases/diseases.py | 0
 {vaep => pimmslearn}/filter.py | 0
 {vaep => pimmslearn}/imputation.py | 0
 {vaep => pimmslearn}/io/__init__.py | 4 +-
 {vaep => pimmslearn}/io/dataloaders.py | 12 +++---
 {vaep => pimmslearn}/io/datasets.py | 0
 {vaep => pimmslearn}/io/datasplits.py | 4 +-
 {vaep => pimmslearn}/io/format.py | 0
 {vaep => pimmslearn}/io/load.py | 0
 {vaep => pimmslearn}/io/types.py | 0
 {vaep => pimmslearn}/logging.py | 2 +-
 {vaep => pimmslearn}/model.py | 0
 {vaep => pimmslearn}/models/__init__.py | 10 ++---
 {vaep => pimmslearn}/models/ae.py | 37 +++++++++----------
 {vaep => pimmslearn}/models/analysis.py | 6 +--
 {vaep => pimmslearn}/models/collab.py | 8 ++--
 {vaep => pimmslearn}/models/collect_dumps.py | 4 +-
 {vaep => pimmslearn}/models/vae.py | 0
 {vaep => pimmslearn}/nb.py | 4 +-
 {vaep => pimmslearn}/normalization.py | 0
 {vaep => pimmslearn}/pandas/__init__.py | 2 +-
 {vaep => pimmslearn}/pandas/calc_errors.py | 0
 {vaep => pimmslearn}/pandas/missing_data.py | 0
 {vaep => pimmslearn}/plotting/__init__.py | 10 ++---
 {vaep => pimmslearn}/plotting/data.py | 0
 {vaep => pimmslearn}/plotting/defaults.py | 0
 {vaep => pimmslearn}/plotting/errors.py | 6 +--
 {vaep => pimmslearn}/plotting/plotly.py | 0
 {vaep => pimmslearn}/sampling.py | 2 +-
 {vaep => pimmslearn}/sklearn/__init__.py | 2 +-
 .../sklearn/ae_transformer.py | 4 +-
 .../sklearn/cf_transformer.py | 10 ++---
 {vaep => pimmslearn}/stats/__init__.py | 0
 {vaep => pimmslearn}/transform.py | 0
 {vaep => pimmslearn}/utils.py | 2 +-
 pyproject.toml | 6 +--
 setup.cfg | 3 +-
 tests/io/test_data_objects.py | 2 +-
 tests/io/test_dataloaders.py | 6 +--
 tests/io/test_dataset.py | 4 +-
 tests/io/test_datasplits.py | 2 +-
 tests/models/{__pycache__ => }/test_collect_dumps.py | 2 +-
 tests/pandas/test_calc_errors.py | 2 +-
 tests/plotting/test_defaults.py | 2 +-
 tests/test_ae.py | 4 +-
 tests/test_collab.py | 6 +--
tests/test_helpers.py | 2 +- tests/test_imports.py | 8 ++-- tests/test_imputation.py | 2 +- tests/test_io.py | 6 +-- tests/test_nb.py | 2 +- tests/test_pandas.py | 10 ++--- tests/test_sampling.py | 6 +-- tests/test_transfrom.py | 4 +- tests/test_utils.py | 2 +- 66 files changed, 128 insertions(+), 123 deletions(-) rename {vaep => pimmslearn}/README.md (57%) rename {vaep => pimmslearn}/__init__.py (52%) rename {vaep => pimmslearn}/analyzers/__init__.py (73%) rename {vaep => pimmslearn}/analyzers/analyzers.py (98%) rename {vaep => pimmslearn}/analyzers/compare_predictions.py (100%) rename {vaep => pimmslearn}/analyzers/diff_analysis.py (100%) rename {vaep => pimmslearn}/cmd_interface/__init__.py (100%) rename {vaep => pimmslearn}/cmd_interface/setup_diff_analysis_website.py (100%) rename {vaep => pimmslearn}/cmd_interface/setup_imp_cp_website.py (100%) rename {vaep => pimmslearn}/data_handling.py (100%) rename {vaep => pimmslearn}/databases/__init__.py (100%) rename {vaep => pimmslearn}/databases/diseases.py (100%) rename {vaep => pimmslearn}/filter.py (100%) rename {vaep => pimmslearn}/imputation.py (100%) rename {vaep => pimmslearn}/io/__init__.py (98%) rename {vaep => pimmslearn}/io/dataloaders.py (92%) rename {vaep => pimmslearn}/io/datasets.py (100%) rename {vaep => pimmslearn}/io/datasplits.py (98%) rename {vaep => pimmslearn}/io/format.py (100%) rename {vaep => pimmslearn}/io/load.py (100%) rename {vaep => pimmslearn}/io/types.py (100%) rename {vaep => pimmslearn}/logging.py (97%) rename {vaep => pimmslearn}/model.py (100%) rename {vaep => pimmslearn}/models/__init__.py (97%) rename {vaep => pimmslearn}/models/ae.py (90%) rename {vaep => pimmslearn}/models/analysis.py (71%) rename {vaep => pimmslearn}/models/collab.py (95%) rename {vaep => pimmslearn}/models/collect_dumps.py (95%) rename {vaep => pimmslearn}/models/vae.py (100%) rename {vaep => pimmslearn}/nb.py (97%) rename {vaep => pimmslearn}/normalization.py (100%) rename {vaep => pimmslearn}/pandas/__init__.py (99%) rename {vaep => pimmslearn}/pandas/calc_errors.py (100%) rename {vaep => pimmslearn}/pandas/missing_data.py (100%) rename {vaep => pimmslearn}/plotting/__init__.py (97%) rename {vaep => pimmslearn}/plotting/data.py (100%) rename {vaep => pimmslearn}/plotting/defaults.py (100%) rename {vaep => pimmslearn}/plotting/errors.py (96%) rename {vaep => pimmslearn}/plotting/plotly.py (100%) rename {vaep => pimmslearn}/sampling.py (99%) rename {vaep => pimmslearn}/sklearn/__init__.py (92%) rename {vaep => pimmslearn}/sklearn/ae_transformer.py (98%) rename {vaep => pimmslearn}/sklearn/cf_transformer.py (97%) rename {vaep => pimmslearn}/stats/__init__.py (100%) rename {vaep => pimmslearn}/transform.py (100%) rename {vaep => pimmslearn}/utils.py (97%) rename tests/models/{__pycache__ => }/test_collect_dumps.py (83%) diff --git a/vaep/README.md b/pimmslearn/README.md similarity index 57% rename from vaep/README.md rename to pimmslearn/README.md index ae9be6d48..31cdac17d 100644 --- a/vaep/README.md +++ b/pimmslearn/README.md @@ -2,10 +2,10 @@ ## Imputation - imputation of data is done based on the standard variation or KNN imputation -- adapted scripts from Annelaura are under `vaep/imputation.py` +- adapted scripts from Annelaura are under `pimmslearn/imputation.py` ## Transform -- transformation of intensity data is in `vaep/transfrom.py` +- transformation of intensity data is in `pimmslearn/transfrom.py` ## Utils diff --git a/vaep/__init__.py b/pimmslearn/__init__.py similarity index 52% rename from vaep/__init__.py rename to 
pimmslearn/__init__.py index 059ccf970..95021ccad 100644 --- a/vaep/__init__.py +++ b/pimmslearn/__init__.py @@ -1,6 +1,13 @@ """ -VAEP -Variatonal autoencoder for proteomics +pimmslearn: a package for imputation using self-supervised deep learning models: + +1. Collaborative Filtering +2. Denoising Autoencoder +3. Variational Autoencoder + +The package offers Imputation transformers in the style of scikit-learn. + +PyPI package is called pimms-learn (with a hyphen). """ from __future__ import annotations @@ -10,10 +17,10 @@ import njab -import vaep.logging -import vaep.nb -import vaep.pandas -import vaep.plotting +import pimmslearn.logging +import pimmslearn.nb +import pimmslearn.pandas +import pimmslearn.plotting _logging.getLogger(__name__).addHandler(_logging.NullHandler()) @@ -21,7 +28,7 @@ # put into some pandas_cfg.py file and import all -savefig = vaep.plotting.savefig +savefig = pimmslearn.plotting.savefig __license__ = 'GPLv3' __version__ = metadata.version("pimms-learn") @@ -33,4 +40,4 @@ njab.pandas.set_pandas_number_formatting(float_format='{:,.3f}') -vaep.plotting.make_large_descriptors('x-large') +pimmslearn.plotting.make_large_descriptors('x-large') diff --git a/vaep/analyzers/__init__.py b/pimmslearn/analyzers/__init__.py similarity index 73% rename from vaep/analyzers/__init__.py rename to pimmslearn/analyzers/__init__.py index 6d16805a6..1c1944993 100644 --- a/vaep/analyzers/__init__.py +++ b/pimmslearn/analyzers/__init__.py @@ -2,7 +2,7 @@ """ from types import SimpleNamespace -from vaep.analyzers import compare_predictions, diff_analysis +from pimmslearn.analyzers import compare_predictions, diff_analysis __all__ = ['diff_analysis', 'compare_predictions', 'Analysis'] diff --git a/vaep/analyzers/analyzers.py b/pimmslearn/analyzers/analyzers.py similarity index 98% rename from vaep/analyzers/analyzers.py rename to pimmslearn/analyzers/analyzers.py index 7bd8c1e3c..29c69b283 100644 --- a/vaep/analyzers/analyzers.py +++ b/pimmslearn/analyzers/analyzers.py @@ -13,11 +13,11 @@ from njab.sklearn import run_pca from sklearn.impute import SimpleImputer -import vaep -from vaep.analyzers import Analysis -from vaep.io.datasplits import long_format, wide_format -from vaep.io.load import verify_df -from vaep.pandas import _add_indices +import pimmslearn +from pimmslearn.analyzers import Analysis +from pimmslearn.io.datasplits import long_format, wide_format +from pimmslearn.io.load import verify_df +from pimmslearn.pandas import _add_indices logger = logging.getLogger(__name__) @@ -379,7 +379,7 @@ def _plot(self, fct, meta_key: str, save: bool = True): meta=meta_data.loc[self.latent_reduced.index], title=f'{self.model_name} latent space PCA of {self.latent_dim} dimensions by {meta_key}') if save: - vaep.plotting._savefig(fig, name=f'{self.model_name}_latent_by_{meta_key}', + pimmslearn.plotting._savefig(fig, name=f'{self.model_name}_latent_by_{meta_key}', folder=self.folder) return fig, ax diff --git a/vaep/analyzers/compare_predictions.py b/pimmslearn/analyzers/compare_predictions.py similarity index 100% rename from vaep/analyzers/compare_predictions.py rename to pimmslearn/analyzers/compare_predictions.py diff --git a/vaep/analyzers/diff_analysis.py b/pimmslearn/analyzers/diff_analysis.py similarity index 100% rename from vaep/analyzers/diff_analysis.py rename to pimmslearn/analyzers/diff_analysis.py diff --git a/vaep/cmd_interface/__init__.py b/pimmslearn/cmd_interface/__init__.py similarity index 100% rename from vaep/cmd_interface/__init__.py rename to 
pimmslearn/cmd_interface/__init__.py diff --git a/vaep/cmd_interface/setup_diff_analysis_website.py b/pimmslearn/cmd_interface/setup_diff_analysis_website.py similarity index 100% rename from vaep/cmd_interface/setup_diff_analysis_website.py rename to pimmslearn/cmd_interface/setup_diff_analysis_website.py diff --git a/vaep/cmd_interface/setup_imp_cp_website.py b/pimmslearn/cmd_interface/setup_imp_cp_website.py similarity index 100% rename from vaep/cmd_interface/setup_imp_cp_website.py rename to pimmslearn/cmd_interface/setup_imp_cp_website.py diff --git a/vaep/data_handling.py b/pimmslearn/data_handling.py similarity index 100% rename from vaep/data_handling.py rename to pimmslearn/data_handling.py diff --git a/vaep/databases/__init__.py b/pimmslearn/databases/__init__.py similarity index 100% rename from vaep/databases/__init__.py rename to pimmslearn/databases/__init__.py diff --git a/vaep/databases/diseases.py b/pimmslearn/databases/diseases.py similarity index 100% rename from vaep/databases/diseases.py rename to pimmslearn/databases/diseases.py diff --git a/vaep/filter.py b/pimmslearn/filter.py similarity index 100% rename from vaep/filter.py rename to pimmslearn/filter.py diff --git a/vaep/imputation.py b/pimmslearn/imputation.py similarity index 100% rename from vaep/imputation.py rename to pimmslearn/imputation.py diff --git a/vaep/io/__init__.py b/pimmslearn/io/__init__.py similarity index 98% rename from vaep/io/__init__.py rename to pimmslearn/io/__init__.py index 33613f332..f86ceed99 100644 --- a/vaep/io/__init__.py +++ b/pimmslearn/io/__init__.py @@ -8,7 +8,7 @@ import numpy as np import pandas as pd -import vaep.pandas +import pimmslearn.pandas PathsList = namedtuple('PathsList', ['files', 'folder']) @@ -86,7 +86,7 @@ def get_fname_from_keys(keys, folder='.', file_ext='.pkl', remove_duplicates=Tru keys = list(dict.fromkeys(keys)) folder = Path(folder) folder.mkdir(exist_ok=True, parents=True) - fname_dataset = folder / '{}{}'.format(vaep.pandas.replace_with( + fname_dataset = folder / '{}{}'.format(pimmslearn.pandas.replace_with( ' '.join(keys), replace='- ', replace_with='_'), file_ext) return fname_dataset diff --git a/vaep/io/dataloaders.py b/pimmslearn/io/dataloaders.py similarity index 92% rename from vaep/io/dataloaders.py rename to pimmslearn/io/dataloaders.py index 57c373dba..a49776a35 100644 --- a/vaep/io/dataloaders.py +++ b/pimmslearn/io/dataloaders.py @@ -6,9 +6,9 @@ from fastai.data.load import DataLoader from torch.utils.data import Dataset -from vaep.io import datasets -from vaep.io.datasets import DatasetWithTarget -from vaep.transform import VaepPipeline +from pimmslearn.io import datasets +from pimmslearn.io.datasets import DatasetWithTarget +from pimmslearn.transform import VaepPipeline def get_dls(train_X: pandas.DataFrame, @@ -42,8 +42,8 @@ def get_dls(train_X: pandas.DataFrame, from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler - from vaep.dataloader import get_dls - from vaep.transform import VaepPipeline + from pimmslearn.dataloader import get_dls + from pimmslearn.transform import VaepPipeline dae_default_pipeline = sklearn.pipeline.Pipeline( [('normalize', StandardScaler()), @@ -86,7 +86,7 @@ def get_test_dl(df: pandas.DataFrame, ---------- df : pandas.DataFrame Test data in a DataFrame - transformer : vaep.transform.VaepPipeline + transformer : pimmslearn.transform.VaepPipeline Pipeline with separate encode and decode dataset : torch.utils.data.Dataset, optional torch Dataset to yield encoded samples, by 
default DatasetWithTarget diff --git a/vaep/io/datasets.py b/pimmslearn/io/datasets.py similarity index 100% rename from vaep/io/datasets.py rename to pimmslearn/io/datasets.py diff --git a/vaep/io/datasplits.py b/pimmslearn/io/datasplits.py similarity index 98% rename from vaep/io/datasplits.py rename to pimmslearn/io/datasplits.py index 25be59183..e3e886724 100644 --- a/vaep/io/datasplits.py +++ b/pimmslearn/io/datasplits.py @@ -8,8 +8,8 @@ import pandas as pd -from vaep.io.format import class_full_module, classname -from vaep.pandas import interpolate +from pimmslearn.io.format import class_full_module, classname +from pimmslearn.pandas import interpolate logger = logging.getLogger(__name__) diff --git a/vaep/io/format.py b/pimmslearn/io/format.py similarity index 100% rename from vaep/io/format.py rename to pimmslearn/io/format.py diff --git a/vaep/io/load.py b/pimmslearn/io/load.py similarity index 100% rename from vaep/io/load.py rename to pimmslearn/io/load.py diff --git a/vaep/io/types.py b/pimmslearn/io/types.py similarity index 100% rename from vaep/io/types.py rename to pimmslearn/io/types.py diff --git a/vaep/logging.py b/pimmslearn/logging.py similarity index 97% rename from vaep/logging.py rename to pimmslearn/logging.py index 6158b5ba8..d05f14474 100644 --- a/vaep/logging.py +++ b/pimmslearn/logging.py @@ -42,7 +42,7 @@ def setup_logger_w_file(logger, level=logging.INFO, fname_base=None): Examples -------- >>> import logging - >>> logger = logging.getLogger('vaep') + >>> logger = logging.getLogger('pimmslearn') >>> _ = setup_logger_w_file(logger) # no logging to file >>> logger.handlers = [] # reset logger >>> _ = setup_logger_w_file() # diff --git a/vaep/model.py b/pimmslearn/model.py similarity index 100% rename from vaep/model.py rename to pimmslearn/model.py diff --git a/vaep/models/__init__.py b/pimmslearn/models/__init__.py similarity index 97% rename from vaep/models/__init__.py rename to pimmslearn/models/__init__.py index 2ae111d44..5a2bf2306 100644 --- a/vaep/models/__init__.py +++ b/pimmslearn/models/__init__.py @@ -15,8 +15,8 @@ from fastai import learner from fastcore.foundation import L -import vaep -from vaep.models import ae, analysis, collab, vae +import pimmslearn +from pimmslearn.models import ae, analysis, collab, vae logger = logging.getLogger(__name__) @@ -91,7 +91,7 @@ def plot_training_losses(learner: learner.Learner, norm_train=norm_train, norm_val=norm_val) name = name.lower() _ = RecorderDump(learner.recorder, name).save(folder) - vaep.savefig(fig, name=f'{name}_training', + pimmslearn.savefig(fig, name=f'{name}_training', folder=folder) return fig @@ -218,7 +218,7 @@ def collect_metrics(metrics_jsons: List, key_fct: Callable) -> dict: logger.debug(f"{key = }") with open(fname) as f: loaded = json.load(f) - loaded = vaep.pandas.flatten_dict_of_dicts(loaded) + loaded = pimmslearn.pandas.flatten_dict_of_dicts(loaded) if key not in all_metrics: all_metrics[key] = loaded @@ -320,7 +320,7 @@ def get_df_from_nested_dict(nested_dict, row_name='subset'): metrics = {} for k, run_metrics in nested_dict.items(): - metrics[k] = vaep.pandas.flatten_dict_of_dicts(run_metrics) + metrics[k] = pimmslearn.pandas.flatten_dict_of_dicts(run_metrics) metrics = pd.DataFrame.from_dict(metrics, orient='index') metrics.columns.names = column_levels diff --git a/vaep/models/ae.py b/pimmslearn/models/ae.py similarity index 90% rename from vaep/models/ae.py rename to pimmslearn/models/ae.py index 8295c8560..c1a041591 100644 --- a/vaep/models/ae.py +++ b/pimmslearn/models/ae.py 
@@ -1,6 +1,6 @@
 """Autoencoder model trained using denoising procedure.
 
-Variational Autencoder model adapter should be moved to vaep.models.vae.
+Variational Autoencoder model adapter should be moved to pimmslearn.models.vae.
 Or model class could be put somewhere else.
 """
 import logging
@@ -15,22 +15,21 @@
 from fastai.callback.core import Callback
 from torch import nn
 
-import vaep.io.dataloaders
-import vaep.io.datasets
-import vaep.io.datasplits
-import vaep.models
-import vaep.transform
-
-from vaep.models import analysis
+import pimmslearn.io.dataloaders
+import pimmslearn.io.datasets
+import pimmslearn.io.datasplits
+import pimmslearn.models
+import pimmslearn.transform
+from pimmslearn.models import analysis
 
 logger = logging.getLogger(__name__)
 
 
 def get_preds_from_df(df: pd.DataFrame,
                       learn: fastai.learner.Learner,
-                      transformer: vaep.transform.VaepPipeline,
+                      transformer: pimmslearn.transform.VaepPipeline,
                       position_pred_tuple: int = None,
-                      dataset: torch.utils.data.Dataset = vaep.io.datasets.DatasetWithTarget):
+                      dataset: torch.utils.data.Dataset = pimmslearn.io.datasets.DatasetWithTarget):
     """Get predictions for specified DataFrame, using a fastai learner
     and a custom sklearn Pipeline.
 
@@ -40,22 +39,22 @@ def get_preds_from_df(df: pd.DataFrame,
     DataFrame to create predictions from.
     learn : fastai.learner.Learner
         fastai Learner with trained model
-    transformer : vaep.transform.VaepPipeline
+    transformer : pimmslearn.transform.VaepPipeline
        Pipeline with separate encode and decode
    position_pred_tuple : int, optional
       In case the model returns multiple outputs, select the one
       which contains the predictions matching the target variable (VAE case),
      by default None
   dataset : torch.utils.data.Dataset, optional
-        Dataset to build batches from, by default vaep.io.datasets.DatasetWithTarget
+        Dataset to build batches from, by default pimmslearn.io.datasets.DatasetWithTarget
 
     Returns
    -------
    tuple
        tuple of pandas DataFrames (prediction and target) based on learn.get_preds
    """
-    dl = vaep.io.dataloaders.get_test_dl(df=df,
-                                         transformer=transformer,
-                                         dataset=dataset)
+    dl = pimmslearn.io.dataloaders.get_test_dl(df=df,
+                                               transformer=transformer,
+                                               dataset=dataset)
     res = learn.get_preds(dl=dl)  # -> dl could be int
     if position_pred_tuple is not None and issubclass(type(res[0]), tuple):
         res = (res[0][position_pred_tuple], *res[1:])
@@ -272,11 +271,11 @@ def __init__(self,
                  decode: List[str],
                  bs=64
                  ):
-        self.transform = vaep.transform.VaepPipeline(
+        self.transform = pimmslearn.transform.VaepPipeline(
             df_train=train_df,
             encode=transform,
             decode=decode)
-        self.dls = vaep.io.dataloaders.get_dls(
+        self.dls = pimmslearn.io.dataloaders.get_dls(
             train_X=train_df,
             valid_X=val_df,
             transformer=self.transform,
             bs=bs)
@@ -286,7 +285,7 @@ def __init__(self,
         self.params = dict(self.kwargs_model)
         self.model = model(**self.kwargs_model)
 
-        self.n_params_ae = vaep.models.calc_net_weight_count(self.model)
+        self.n_params_ae = pimmslearn.models.calc_net_weight_count(self.model)
         self.params['n_parameters'] = self.n_params_ae
         self.learn = None
 
@@ -296,4 +295,4 @@ def get_preds_from_df(self, df_wide: pd.DataFrame) -> pd.DataFrame:
         return get_preds_from_df(df=df_wide, learn=self.learn, transformer=self.transform)
 
     def get_test_dl(self, df_wide: pd.DataFrame, bs: int = 64) -> pd.DataFrame:
-        return vaep.io.dataloaders.get_test_dl(df=df_wide, transformer=self.transform, bs=bs)
+        return pimmslearn.io.dataloaders.get_test_dl(df=df_wide, transformer=self.transform, bs=bs)
diff --git a/vaep/models/analysis.py
b/pimmslearn/models/analysis.py similarity index 71% rename from vaep/models/analysis.py rename to pimmslearn/models/analysis.py index 93d8a2aaa..570a54b70 100644 --- a/vaep/models/analysis.py +++ b/pimmslearn/models/analysis.py @@ -1,9 +1,9 @@ -import vaep.transform +import pimmslearn.transform import torch.nn import fastai.data.core import fastai.learner -from vaep.analyzers import Analysis +from pimmslearn.analyzers import Analysis class ModelAnalysis(Analysis): @@ -13,4 +13,4 @@ class ModelAnalysis(Analysis): dls: fastai.data.core.DataLoaders learn: fastai.learner.Learner params: dict - transform: vaep.transform.VaepPipeline + transform: pimmslearn.transform.VaepPipeline diff --git a/vaep/models/collab.py b/pimmslearn/models/collab.py similarity index 95% rename from vaep/models/collab.py rename to pimmslearn/models/collab.py index f54ab6df2..812495bfc 100644 --- a/vaep/models/collab.py +++ b/pimmslearn/models/collab.py @@ -9,9 +9,9 @@ TransformBlock) from fastai.tabular.all import * -import vaep.io.dataloaders -import vaep.io.datasplits -from vaep.models import analysis +import pimmslearn.io.dataloaders +import pimmslearn.io.datasplits +from pimmslearn.models import analysis logger = logging.getLogger(__name__) @@ -43,7 +43,7 @@ def combine_data(train_df: pd.DataFrame, val_df: pd.DataFrame) -> Tuple[pd.DataF class CollabAnalysis(analysis.ModelAnalysis): def __init__(self, - datasplits: vaep.io.datasplits.DataSplits, + datasplits: pimmslearn.io.datasplits.DataSplits, sample_column: str = 'Sample ID', item_column: str = 'peptide', target_column: str = 'intensity', diff --git a/vaep/models/collect_dumps.py b/pimmslearn/models/collect_dumps.py similarity index 95% rename from vaep/models/collect_dumps.py rename to pimmslearn/models/collect_dumps.py index ca2edb28a..f5e0f9440 100644 --- a/vaep/models/collect_dumps.py +++ b/pimmslearn/models/collect_dumps.py @@ -5,7 +5,7 @@ import json import yaml from typing import Iterable, Callable -import vaep.pandas +import pimmslearn.pandas logger = logging.getLogger(__name__) @@ -29,7 +29,7 @@ def load_config_file(fname: Path, first_split='config_') -> dict: def load_metric_file(fname: Path, first_split='metrics_') -> dict: with open(fname) as f: loaded = json.load(f) - loaded = vaep.pandas.flatten_dict_of_dicts(loaded) + loaded = pimmslearn.pandas.flatten_dict_of_dicts(loaded) key = f"{fname.parent.name}_{select_content(fname.stem, first_split=first_split)}" return key, loaded diff --git a/vaep/models/vae.py b/pimmslearn/models/vae.py similarity index 100% rename from vaep/models/vae.py rename to pimmslearn/models/vae.py diff --git a/vaep/nb.py b/pimmslearn/nb.py similarity index 97% rename from vaep/nb.py rename to pimmslearn/nb.py index 0d13104b7..59576086a 100644 --- a/vaep/nb.py +++ b/pimmslearn/nb.py @@ -2,7 +2,7 @@ from pprint import pformat import yaml -import vaep.io +import pimmslearn.io import logging logger = logging.getLogger() @@ -40,7 +40,7 @@ def dump(self, fname=None): except AttributeError: raise AttributeError( 'Specify fname or set "out_folder" attribute.') - d = vaep.io.parse_dict(input_dict=self.__dict__) + d = pimmslearn.io.parse_dict(input_dict=self.__dict__) with open(fname, 'w') as f: yaml.dump(d, f) logger.info(f"Dumped config to: {fname}") diff --git a/vaep/normalization.py b/pimmslearn/normalization.py similarity index 100% rename from vaep/normalization.py rename to pimmslearn/normalization.py diff --git a/vaep/pandas/__init__.py b/pimmslearn/pandas/__init__.py similarity index 99% rename from vaep/pandas/__init__.py 
rename to pimmslearn/pandas/__init__.py index 5f82204b1..4be42b68d 100644 --- a/vaep/pandas/__init__.py +++ b/pimmslearn/pandas/__init__.py @@ -7,7 +7,7 @@ import omegaconf import pandas as pd -from vaep.pandas.calc_errors import calc_errors_per_feat, get_absolute_error +from pimmslearn.pandas.calc_errors import calc_errors_per_feat, get_absolute_error __all__ = [ 'calc_errors_per_feat', diff --git a/vaep/pandas/calc_errors.py b/pimmslearn/pandas/calc_errors.py similarity index 100% rename from vaep/pandas/calc_errors.py rename to pimmslearn/pandas/calc_errors.py diff --git a/vaep/pandas/missing_data.py b/pimmslearn/pandas/missing_data.py similarity index 100% rename from vaep/pandas/missing_data.py rename to pimmslearn/pandas/missing_data.py diff --git a/vaep/plotting/__init__.py b/pimmslearn/plotting/__init__.py similarity index 97% rename from vaep/plotting/__init__.py rename to pimmslearn/plotting/__init__.py index 105c183f8..81136dc63 100644 --- a/vaep/plotting/__init__.py +++ b/pimmslearn/plotting/__init__.py @@ -10,9 +10,9 @@ import pandas as pd import seaborn -import vaep.pandas -from vaep.plotting import data, defaults, errors, plotly -from vaep.plotting.errors import plot_rolling_error +import pimmslearn.pandas +from pimmslearn.plotting import data, defaults, errors, plotly +from pimmslearn.plotting.errors import plot_rolling_error seaborn.set_style("whitegrid") # seaborn.set_theme() @@ -270,11 +270,11 @@ def plot_counts(df_counts: pd.DataFrame, n_samples, count_col=feat_col_name, ax=ax, **kwargs) df_counts['prop'] = df_counts[feat_col_name] / n_samples - n_feat_cutoff = vaep.pandas.get_last_index_matching_proportion( + n_feat_cutoff = pimmslearn.pandas.get_last_index_matching_proportion( df_counts=df_counts, prop=prop_feat, prop_col='prop') n_samples_cutoff = df_counts.loc[n_feat_cutoff, feat_col_name] logger.info(f'{n_feat_cutoff = }, {n_samples_cutoff = }') - x_lim_max = vaep.pandas.get_last_index_matching_proportion( + x_lim_max = pimmslearn.pandas.get_last_index_matching_proportion( df_counts, min_feat_prop, prop_col='prop') logger.info(f'{x_lim_max = }') ax.set_xlim(-1, x_lim_max) diff --git a/vaep/plotting/data.py b/pimmslearn/plotting/data.py similarity index 100% rename from vaep/plotting/data.py rename to pimmslearn/plotting/data.py diff --git a/vaep/plotting/defaults.py b/pimmslearn/plotting/defaults.py similarity index 100% rename from vaep/plotting/defaults.py rename to pimmslearn/plotting/defaults.py diff --git a/vaep/plotting/errors.py b/pimmslearn/plotting/errors.py similarity index 96% rename from vaep/plotting/errors.py rename to pimmslearn/plotting/errors.py index cdfeed140..a9edfdeb3 100644 --- a/vaep/plotting/errors.py +++ b/pimmslearn/plotting/errors.py @@ -11,7 +11,7 @@ from matplotlib.axes import Axes from seaborn.categorical import _BarPlotter -import vaep.pandas.calc_errors +import pimmslearn.pandas.calc_errors def plot_errors_binned(pred: pd.DataFrame, target_col='observed', @@ -22,7 +22,7 @@ def plot_errors_binned(pred: pd.DataFrame, target_col='observed', assert target_col in pred.columns, f'Specify `target_col` parameter, `pred` do no contain: {target_col}' models_order = pred.columns.to_list() models_order.remove(target_col) - errors_binned = vaep.pandas.calc_errors.calc_errors_per_bin( + errors_binned = pimmslearn.pandas.calc_errors.calc_errors_per_bin( pred=pred, target_col=target_col) meta_cols = ['bin', 'n_obs'] # calculated along binned error @@ -61,7 +61,7 @@ def plot_errors_by_median(pred: pd.DataFrame, metric_name: Optional[str] = None, 
errwidth: float = 1.2) -> tuple[Axes, pd.DataFrame]: # calculate absolute errors - errors = vaep.pandas.get_absolute_error(pred, y_true=target_col) + errors = pimmslearn.pandas.get_absolute_error(pred, y_true=target_col) errors.columns.name = 'model' # define bins by integer value of median feature intensity diff --git a/vaep/plotting/plotly.py b/pimmslearn/plotting/plotly.py similarity index 100% rename from vaep/plotting/plotly.py rename to pimmslearn/plotting/plotly.py diff --git a/vaep/sampling.py b/pimmslearn/sampling.py similarity index 99% rename from vaep/sampling.py rename to pimmslearn/sampling.py index ae1aaaa1a..52dc1f8c8 100644 --- a/vaep/sampling.py +++ b/pimmslearn/sampling.py @@ -4,7 +4,7 @@ import numpy as np import pandas as pd -from vaep.io.datasplits import DataSplits +from pimmslearn.io.datasplits import DataSplits logger = logging.getLogger(__name__) diff --git a/vaep/sklearn/__init__.py b/pimmslearn/sklearn/__init__.py similarity index 92% rename from vaep/sklearn/__init__.py rename to pimmslearn/sklearn/__init__.py index 76b32d5f3..df1bac5e0 100644 --- a/vaep/sklearn/__init__.py +++ b/pimmslearn/sklearn/__init__.py @@ -7,7 +7,7 @@ from njab.sklearn import run_pca from sklearn.impute import SimpleImputer -from vaep.io import add_indices +from pimmslearn.io import add_indices logger = logging.getLogger(__name__) diff --git a/vaep/sklearn/ae_transformer.py b/pimmslearn/sklearn/ae_transformer.py similarity index 98% rename from vaep/sklearn/ae_transformer.py rename to pimmslearn/sklearn/ae_transformer.py index a64af1a31..1fa1c4124 100644 --- a/vaep/sklearn/ae_transformer.py +++ b/pimmslearn/sklearn/ae_transformer.py @@ -18,9 +18,9 @@ from sklearn.preprocessing import StandardScaler from sklearn.utils.validation import check_is_fitted -import vaep.models as models +import pimmslearn.models as models # patch plotting function -from vaep.models import ae, plot_loss +from pimmslearn.models import ae, plot_loss learner.Recorder.plot_loss = plot_loss diff --git a/vaep/sklearn/cf_transformer.py b/pimmslearn/sklearn/cf_transformer.py similarity index 97% rename from vaep/sklearn/cf_transformer.py rename to pimmslearn/sklearn/cf_transformer.py index 1ffca9a6c..0bdc93bf7 100644 --- a/vaep/sklearn/cf_transformer.py +++ b/pimmslearn/sklearn/cf_transformer.py @@ -20,10 +20,10 @@ from sklearn.base import BaseEstimator, TransformerMixin from sklearn.utils.validation import check_is_fitted -import vaep -import vaep.models as models +import pimmslearn +import pimmslearn.models as models # patch plotting function -from vaep.models import collab, plot_loss +from pimmslearn.models import collab, plot_loss learner.Recorder.plot_loss = plot_loss @@ -196,9 +196,9 @@ def plot_loss(self, y, figsize=(8, 4)): # -> Axes: ax.set_title('CF loss: Reconstruction loss') self.learn.recorder.plot_loss(skip_start=5, ax=ax, with_valid=True if y is not None else False) - vaep.savefig(fig, name='collab_training', + pimmslearn.savefig(fig, name='collab_training', folder=self.out_folder) self.model_kwargs['batch_size'] = self.batch_size - vaep.io.dump_json(self.model_kwargs, + pimmslearn.io.dump_json(self.model_kwargs, self.out_folder / 'model_params_{}.json'.format('CF')) return ax diff --git a/vaep/stats/__init__.py b/pimmslearn/stats/__init__.py similarity index 100% rename from vaep/stats/__init__.py rename to pimmslearn/stats/__init__.py diff --git a/vaep/transform.py b/pimmslearn/transform.py similarity index 100% rename from vaep/transform.py rename to pimmslearn/transform.py diff --git a/vaep/utils.py 
b/pimmslearn/utils.py similarity index 97% rename from vaep/utils.py rename to pimmslearn/utils.py index 595f816b7..e9c466000 100644 --- a/vaep/utils.py +++ b/pimmslearn/utils.py @@ -3,7 +3,7 @@ import numpy as np import pandas as pd -from vaep.io.datasplits import long_format +from pimmslearn.io.datasplits import long_format def append_to_filepath(filepath: Union[pathlib.Path, str], diff --git a/pyproject.toml b/pyproject.toml index 2f9d1ca6c..f82b852d3 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -35,8 +35,8 @@ dependencies = [ [project.scripts] # pimms-report-imputation-comparison # pimms-report-diff-analysis -pimms-setup-imputation-comparison = "vaep.cmd_interface.setup_imp_cp_website:main" -pimms-add-diff-comp = "vaep.cmd_interface.setup_diff_analysis_website:main" +pimms-setup-imputation-comparison = "pimmslearn.cmd_interface.setup_imp_cp_website:main" +pimms-add-diff-comp = "pimmslearn.cmd_interface.setup_diff_analysis_website:main" [project.urls] "Bug Tracker" = "https://github.com/RasmussenLab/pimms/issues" @@ -65,4 +65,4 @@ requires = ["setuptools>=64", "setuptools_scm>=8", "wheel"] # used to pick up the version from the git tags or the latest commit. [tool.setuptools.packages.find] -include = ["vaep"] +include = ["pimmslearn"] diff --git a/setup.cfg b/setup.cfg index 11106afaa..a5bae358b 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,6 +1,5 @@ [options.packages.find] -# where = vaep exclude = test* ###################### @@ -18,7 +17,7 @@ addopts = --cov --strict-markers xfail_strict = True [coverage:run] -source = vaep +source = pimmslearn branch = True [coverage:report] diff --git a/tests/io/test_data_objects.py b/tests/io/test_data_objects.py index 78ffa67cf..d4da76db8 100644 --- a/tests/io/test_data_objects.py +++ b/tests/io/test_data_objects.py @@ -1,7 +1,7 @@ import io from tokenize import group import pandas as pd -from vaep.pandas import select_max_by +from pimmslearn.pandas import select_max_by # m/z Protein group IDs Intensity Score # Sequence Charge diff --git a/tests/io/test_dataloaders.py b/tests/io/test_dataloaders.py index a801beeba..195fe5f39 100644 --- a/tests/io/test_dataloaders.py +++ b/tests/io/test_dataloaders.py @@ -2,9 +2,9 @@ from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler -from vaep.transform import VaepPipeline -from vaep.io.dataloaders import get_dls -from vaep.utils import create_random_df +from pimmslearn.transform import VaepPipeline +from pimmslearn.io.dataloaders import get_dls +from pimmslearn.utils import create_random_df def test_get_dls(): diff --git a/tests/io/test_dataset.py b/tests/io/test_dataset.py index cea05e853..637f5d845 100644 --- a/tests/io/test_dataset.py +++ b/tests/io/test_dataset.py @@ -6,8 +6,8 @@ import torch from numpy import nan -from vaep.io import datasets -from vaep.io.datasets import DatasetWithMaskAndNoTarget, DatasetWithTarget +from pimmslearn.io import datasets +from pimmslearn.io.datasets import DatasetWithMaskAndNoTarget, DatasetWithTarget data = np.random.random(size=(10, 5)) threshold = max(0.15, data.min() + 0.02) diff --git a/tests/io/test_datasplits.py b/tests/io/test_datasplits.py index d9a4ebde0..afccec99c 100644 --- a/tests/io/test_datasplits.py +++ b/tests/io/test_datasplits.py @@ -1,6 +1,6 @@ import numpy as np import pandas as pd -from vaep.io.datasplits import DataSplits, wide_format +from pimmslearn.io.datasplits import DataSplits, wide_format import pytest import numpy.testing as npt diff --git a/tests/models/__pycache__/test_collect_dumps.py 
b/tests/models/test_collect_dumps.py similarity index 83% rename from tests/models/__pycache__/test_collect_dumps.py rename to tests/models/test_collect_dumps.py index 5dbc9b986..3ec95d13d 100644 --- a/tests/models/__pycache__/test_collect_dumps.py +++ b/tests/models/test_collect_dumps.py @@ -1,4 +1,4 @@ -from vaep.models.collect_dumps import select_content +from pimmslearn.models.collect_dumps import select_content def test_select_content(): diff --git a/tests/pandas/test_calc_errors.py b/tests/pandas/test_calc_errors.py index 0749cd2d7..2985b4ada 100644 --- a/tests/pandas/test_calc_errors.py +++ b/tests/pandas/test_calc_errors.py @@ -1,7 +1,7 @@ import pandas as pd from pytest import fixture -from vaep.pandas import calc_errors +from pimmslearn.pandas import calc_errors @fixture diff --git a/tests/plotting/test_defaults.py b/tests/plotting/test_defaults.py index 086a1df63..2fbec8075 100644 --- a/tests/plotting/test_defaults.py +++ b/tests/plotting/test_defaults.py @@ -1,4 +1,4 @@ -from vaep.plotting.defaults import assign_colors +from pimmslearn.plotting.defaults import assign_colors def test_assign_colors(): diff --git a/tests/test_ae.py b/tests/test_ae.py index 319f91e81..6b3988ebc 100644 --- a/tests/test_ae.py +++ b/tests/test_ae.py @@ -1,5 +1,5 @@ -import vaep -from vaep.models import ae +import pimmslearn +from pimmslearn.models import ae expected_repr = """Autoencoder( diff --git a/tests/test_collab.py b/tests/test_collab.py index 1e770d911..42a728715 100644 --- a/tests/test_collab.py +++ b/tests/test_collab.py @@ -2,9 +2,9 @@ import numpy.testing as npt import pandas as pd -import vaep -from vaep.io.datasplits import DataSplits -from vaep.models import collab +import pimmslearn +from pimmslearn.io.datasplits import DataSplits +from pimmslearn.models import collab N, M = 10, 4 diff --git a/tests/test_helpers.py b/tests/test_helpers.py index 3e199940d..62784b801 100644 --- a/tests/test_helpers.py +++ b/tests/test_helpers.py @@ -1,6 +1,6 @@ import numpy as np -from vaep.utils import create_random_missing_data +from pimmslearn.utils import create_random_missing_data def test_create_random_missing_data(): diff --git a/tests/test_imports.py b/tests/test_imports.py index 72c899208..3ce8e383e 100644 --- a/tests/test_imports.py +++ b/tests/test_imports.py @@ -1,6 +1,6 @@ def test_imports(): - import vaep.analyzers - import vaep.sklearn - print(vaep.analyzers.__doc__) - print(vaep.sklearn.__doc__) + import pimmslearn.analyzers + import pimmslearn.sklearn + print(pimmslearn.analyzers.__doc__) + print(pimmslearn.sklearn.__doc__) diff --git a/tests/test_imputation.py b/tests/test_imputation.py index 747b71c80..0c98f77bd 100644 --- a/tests/test_imputation.py +++ b/tests/test_imputation.py @@ -3,7 +3,7 @@ import pandas as pd import pytest -from vaep.imputation import imputation_KNN, imputation_normal_distribution, impute_shifted_normal +from pimmslearn.imputation import imputation_KNN, imputation_normal_distribution, impute_shifted_normal """ # Test Data set was created from a sample by shuffling: diff --git a/tests/test_io.py b/tests/test_io.py index 76e70fb80..05715c103 100644 --- a/tests/test_io.py +++ b/tests/test_io.py @@ -1,13 +1,13 @@ from pathlib import Path -import vaep.io +import pimmslearn.io def test_relative_to(): fpath = Path('project/runs/experiment_name/run') pwd = 'project/runs/' # per defaut '.' 
(the current working directory) expected = Path('experiment_name/run') - acutal = vaep.io.resolve_path(fpath, pwd) + acutal = pimmslearn.io.resolve_path(fpath, pwd) assert expected == acutal # # no solution yet, expect chaning notebook pwd @@ -15,5 +15,5 @@ def test_relative_to(): # # pwd is different subfolder # pwd = 'root/home/project/runs/' # per defaut '.' (the current working directory) # expected = Path('root/home/project/data/file') - # acutal = vaep.io.resolve_path(fpath, pwd) + # acutal = pimmslearn.io.resolve_path(fpath, pwd) # assert expected == acutal diff --git a/tests/test_nb.py b/tests/test_nb.py index a6dddb8b5..5d4b28ad4 100644 --- a/tests/test_nb.py +++ b/tests/test_nb.py @@ -1,5 +1,5 @@ import pytest -from vaep.nb import Config +from pimmslearn.nb import Config def test_Config(): diff --git a/tests/test_pandas.py b/tests/test_pandas.py index 4d46cc2f7..7fa645cca 100644 --- a/tests/test_pandas.py +++ b/tests/test_pandas.py @@ -1,6 +1,6 @@ from numpy import nan import pandas as pd -import vaep.pandas +import pimmslearn.pandas def test_interpolate(): @@ -33,7 +33,7 @@ def test_interpolate(): # all peptides from pep4 dropped as expected } - actual = vaep.pandas.interpolate(df_test_data).to_dict() + actual = pimmslearn.pandas.interpolate(df_test_data).to_dict() assert actual == expected assert df_test_data.equals(pd.DataFrame(test_data)) @@ -48,7 +48,7 @@ def test_flatten_dict_of_dicts(): "a": {'a1': {'a2': 1, 'a3': 2}}, "b": {'b1': {'b2': 3, 'b3': 4}} } - actual = vaep.pandas.flatten_dict_of_dicts(data) + actual = pimmslearn.pandas.flatten_dict_of_dicts(data) assert expected == actual @@ -70,7 +70,7 @@ def test_key_map(): 'beta': ('a', 'b'), 'gamma': ('a', 'b'), 'delta': ('a', 'b')}} - actual = vaep.pandas.key_map(d) + actual = pimmslearn.pandas.key_map(d) assert expected == actual d = {'one': {'alpha': {'a': 0.5, 'b': 0.3}}, @@ -88,5 +88,5 @@ def test_key_map(): 'beta': ('a', 'b'), 'gamma': ('a', 'b'), 'delta': None}} - actual = vaep.pandas.key_map(d) + actual = pimmslearn.pandas.key_map(d) assert expected == actual diff --git a/tests/test_sampling.py b/tests/test_sampling.py index 813c79331..db7d887f2 100644 --- a/tests/test_sampling.py +++ b/tests/test_sampling.py @@ -3,10 +3,10 @@ import pandas as pd import pytest -from vaep.io.datasplits import to_long_format -from vaep.sampling import feature_frequency, frequency_by_index, sample_data +from pimmslearn.io.datasplits import to_long_format +from pimmslearn.sampling import feature_frequency, frequency_by_index, sample_data -from vaep.utils import create_random_df +from pimmslearn.utils import create_random_df @pytest.fixture diff --git a/tests/test_transfrom.py b/tests/test_transfrom.py index 8115a13c0..84e9331b2 100644 --- a/tests/test_transfrom.py +++ b/tests/test_transfrom.py @@ -4,8 +4,8 @@ import sklearn from sklearn import impute, preprocessing -from vaep.io.datasets import to_tensor -from vaep.transform import VaepPipeline +from pimmslearn.io.datasets import to_tensor +from pimmslearn.transform import VaepPipeline def test_Vaep_Pipeline(): diff --git a/tests/test_utils.py b/tests/test_utils.py index c6bad659f..290cb66e7 100644 --- a/tests/test_utils.py +++ b/tests/test_utils.py @@ -1,5 +1,5 @@ import pathlib -from vaep.utils import append_to_filepath +from pimmslearn.utils import append_to_filepath def test_append_to_filepath(): From 9d6858b7df537d53cfc7aa304abec928da78731f Mon Sep 17 00:00:00 2001 From: Henry Date: Tue, 2 Jul 2024 09:58:00 +0200 Subject: [PATCH 06/13] :bug: fix unit-test - wasn't run before,
apparently --- tests/models/test_collect_dumps.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tests/models/test_collect_dumps.py b/tests/models/test_collect_dumps.py index 5dbc9b986..3ec95d13d 100644 --- a/tests/models/test_collect_dumps.py +++ b/tests/models/test_collect_dumps.py @@ -5,5 +5,8 @@ def test_select_content(): test_cases = ['model_metrics_HL_1024_512_256_dae', 'model_metrics_HL_1024_512_vae', 'model_metrics_collab'] - for test_case in test_cases: - assert select_content(test_case, first_split='metrics_') == test_case.split('metrics_')[1] + expected = ['HL_1024_512_256', + 'HL_1024_512', + 'collab'] + for test_case, v in zip(test_cases, expected): + assert select_content(test_case, first_split='metrics_') == v From 563655e019cab296a0a47f6550b6da8ccb11bc33 Mon Sep 17 00:00:00 2001 From: Henry Date: Tue, 2 Jul 2024 11:38:27 +0200 Subject: [PATCH 07/13] :construction: change import in all notebooks (scripts) --- project/00_5_training_data_exploration.py | 86 +++++++------- project/00_6_0_permute_data.ipynb | 18 +-- project/00_6_0_permute_data.py | 18 +-- project/00_8_add_random_missing_values.py | 6 +- project/01_0_split_data.ipynb | 86 +++++++------- project/01_0_split_data.py | 88 +++++++------- .../01_0_transform_data_to_wide_format.ipynb | 10 +- project/01_0_transform_data_to_wide_format.py | 12 +- project/01_1_train_CF.ipynb | 32 ++--- project/01_1_train_CF.py | 34 +++--- project/01_1_train_DAE.ipynb | 32 ++--- project/01_1_train_DAE.py | 34 +++--- project/01_1_train_KNN.ipynb | 22 ++-- project/01_1_train_KNN.py | 24 ++-- project/01_1_train_KNN_unique_samples.py | 24 ++-- project/01_1_train_Median.ipynb | 22 ++-- project/01_1_train_Median.py | 24 ++-- project/01_1_train_RSN.ipynb | 30 ++--- project/01_1_train_RSN.py | 32 ++--- project/01_1_train_VAE.ipynb | 38 +++--- project/01_1_train_VAE.py | 40 +++---- project/01_1_transfer_NAGuideR_pred.ipynb | 24 ++-- project/01_1_transfer_NAGuideR_pred.py | 26 ++--- project/01_2_performance_plots.ipynb | 108 ++++++++--------- project/01_2_performance_plots.py | 110 +++++++++--------- project/01_3_revision3.py | 24 ++-- project/02_1_aggregate_metrics.py.ipynb | 2 +- project/02_1_aggregate_metrics.py.py | 4 +- project/02_1_join_metrics.py.py | 2 +- project/02_2_aggregate_configs.py.ipynb | 4 +- project/02_2_aggregate_configs.py.py | 6 +- project/02_2_join_configs.py.py | 2 +- project/02_3_grid_search_analysis.ipynb | 74 ++++++------ project/02_3_grid_search_analysis.py | 76 ++++++------ project/02_4_best_models_over_all_data.ipynb | 28 ++--- project/02_4_best_models_over_all_data.py | 30 ++--- project/03_1_best_models_comparison.ipynb | 18 +-- project/03_1_best_models_comparison.py | 20 ++-- .../03_2_best_models_comparison_fig2.ipynb | 16 +-- project/03_2_best_models_comparison_fig2.py | 18 +-- .../03_3_combine_experiment_result_tables.py | 2 +- project/03_6_setup_comparison_rev3.py | 17 +-- project/04_1_train_pimms_models.ipynb | 44 +++---- project/04_1_train_pimms_models.py | 44 +++---- project/10_0_ald_data.ipynb | 24 ++-- project/10_0_ald_data.py | 26 ++--- project/10_1_ald_diff_analysis.ipynb | 52 ++++----- project/10_1_ald_diff_analysis.py | 54 ++++----- project/10_2_ald_compare_methods.ipynb | 24 ++-- project/10_2_ald_compare_methods.py | 26 ++--- project/10_3_ald_ml_new_feat.ipynb | 38 +++--- project/10_3_ald_ml_new_feat.py | 40 +++---- project/10_4_ald_compare_single_pg.ipynb | 26 ++--- project/10_4_ald_compare_single_pg.py | 28 ++--- .../10_5_comp_diff_analysis_repetitions.ipynb | 2 +-
.../10_5_comp_diff_analysis_repetitions.py | 4 +- project/10_6_interpret_repeated_ald_da.py | 10 +- project/10_7_ald_reduced_dataset_plots.ipynb | 14 +-- project/10_7_ald_reduced_dataset_plots.py | 14 +-- project/misc_embeddings.py | 2 +- project/misc_illustrations.py | 2 +- project/misc_json_formats.ipynb | 4 +- project/misc_json_formats.py | 6 +- project/misc_pytorch_fastai_dataloaders.ipynb | 18 +-- project/misc_pytorch_fastai_dataloaders.py | 20 ++-- project/misc_pytorch_fastai_dataset.ipynb | 109 ++++------------- project/misc_pytorch_fastai_dataset.py | 8 +- project/misc_sampling_in_pandas.ipynb | 2 +- project/misc_sampling_in_pandas.py | 4 +- .../best_repeated_split_collect_metrics.py | 6 +- .../best_repeated_train_collect_metrics.py | 6 +- 71 files changed, 957 insertions(+), 1023 deletions(-) diff --git a/project/00_5_training_data_exploration.py b/project/00_5_training_data_exploration.py index 92735c858..f5033f015 100644 --- a/project/00_5_training_data_exploration.py +++ b/project/00_5_training_data_exploration.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: vaep # language: python @@ -37,14 +37,14 @@ import pandas as pd import seaborn as sns -import vaep -import vaep.data_handling -from vaep import plotting -from vaep.analyzers import analyzers -from vaep.pandas import missing_data -from vaep.utils import create_random_df +import pimmslearn +import pimmslearn.data_handling +from pimmslearn import plotting +from pimmslearn.analyzers import analyzers +from pimmslearn.pandas import missing_data +from pimmslearn.utils import create_random_df -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.WARNING) matplotlib.rcParams.update({'font.size': 6, @@ -58,7 +58,7 @@ def get_clustermap(data, **kwargs): from sklearn.impute import SimpleImputer - from vaep.pandas import _add_indices + from pimmslearn.pandas import _add_indices X = SimpleImputer().fit_transform(data) X = _add_indices(X, data) cg = sns.clustermap(X, @@ -226,13 +226,13 @@ def get_dynamic_range(min_max): min_samples_per_feat=min_samples_per_feat) fname = FIGUREFOLDER / 'dist_all_lineplot_w_cutoffs.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% fig = plotting.data.plot_missing_dist_highdim(data) fname = FIGUREFOLDER / 'dist_all_lineplot_wo_cutoffs.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% fig = plotting.data.plot_missing_pattern_histogram(data, @@ -240,13 +240,13 @@ def get_dynamic_range(min_max): min_samples_per_feat=min_samples_per_feat) fname = FIGUREFOLDER / 'dist_all_histogram_w_cutoffs.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% fig = plotting.data.plot_missing_pattern_histogram(data) fname = FIGUREFOLDER / 'dist_all_histogram_wo_cutoffs.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # ### Boxplots @@ -255,7 +255,7 @@ def get_dynamic_range(min_max): fig = plotting.data.plot_missing_dist_boxplots(data) fname = FIGUREFOLDER / 'dist_all_boxplots.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # ### Violinplots @@ -265,7 +265,7 @@ def get_dynamic_range(min_max): data, min_feat_per_sample, 
min_samples_per_feat) fname = FIGUREFOLDER / 'dist_all_violin_plot.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # ## Feature medians over prop. of missing of feature @@ -274,14 +274,14 @@ def get_dynamic_range(min_max): data=data, type='scatter', s=1) fname = FIGUREFOLDER / 'intensity_median_vs_prop_missing_scatter' files_out[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% ax = plotting.data.plot_feat_median_over_prop_missing( data=data, type='boxplot', s=.8) fname = FIGUREFOLDER / 'intensity_median_vs_prop_missing_boxplot' files_out[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% [markdown] @@ -305,7 +305,7 @@ def get_dynamic_range(min_max): fig.suptitle(f'Histogram of correlations based on {FEATURES_CUTOFF_TEXT}') fname = FIGUREFOLDER / 'corr_histogram_feat.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] @@ -318,7 +318,7 @@ def get_dynamic_range(min_max): ax.set_title(f'Histogram of coefficient of variation (CV) of {FEATURES_CUTOFF_TEXT}') fname = FIGUREFOLDER / 'CV_histogram_features.pdf' files_out[fname.name] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) # %% [markdown] # ## Clustermap and heatmaps of missing values @@ -327,7 +327,7 @@ def get_dynamic_range(min_max): # needs to deal with duplicates # notna = data.notna().T.drop_duplicates().T # get index and column names -vaep.plotting.make_large_descriptors(5) +pimmslearn.plotting.make_large_descriptors(5) cg = sns.clustermap(selected.notna(), cbar_pos=None, @@ -345,10 +345,10 @@ def get_dynamic_range(min_max): cg.figure.tight_layout() fname = FIGUREFOLDER / 'clustermap_present_absent_pattern.png' files_out[fname.name] = fname -vaep.savefig(cg.figure, - name=fname, - pdf=False, - dpi=600) +pimmslearn.savefig(cg.figure, + name=fname, + pdf=False, + dpi=600) # %% [markdown] # based on cluster, plot heatmaps of features and samples @@ -358,7 +358,7 @@ def get_dynamic_range(min_max): cg.dendrogram_col.reordered_ind)) == selected.shape # %% -vaep.plotting.make_large_descriptors(5) +pimmslearn.plotting.make_large_descriptors(5) fig, ax = plt.subplots(figsize=(7.5, 3.5)) ax = sns.heatmap( selected.iloc[cg.dendrogram_row.reordered_ind, @@ -370,8 +370,8 @@ def get_dynamic_range(min_max): ) ax.set_title(f'Heatmap of intensities clustered by missing pattern of {FEATURES_CUTOFF_TEXT}', fontsize=8) -vaep.plotting.only_every_x_ticks(ax, x=2) -vaep.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) +pimmslearn.plotting.only_every_x_ticks(ax, x=2) +pimmslearn.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) if PG_SEPARATOR is not None: _new_labels = [_l.get_text().split(PG_SEPARATOR)[0] for _l in ax.get_xticklabels()] @@ -381,7 +381,7 @@ def get_dynamic_range(min_max): ax.set_yticks([]) fname = FIGUREFOLDER / 'heatmap_intensities_ordered_by_missing_pattern.png' files_out[fname.name] = fname -vaep.savefig(fig, name=fname, pdf=False, dpi=600) +pimmslearn.savefig(fig, name=fname, pdf=False, dpi=600) # ax.get_figure().savefig(fname, dpi=300) # %% [markdown] @@ -400,8 +400,8 @@ def get_dynamic_range(min_max): ) ax.set_title(f'Heatmap of feature correlation of {FEATURES_CUTOFF_TEXT}', fontsize=8) -_ = vaep.plotting.only_every_x_ticks(ax, x=2) -_ = vaep.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) 
+_ = pimmslearn.plotting.only_every_x_ticks(ax, x=2) +_ = pimmslearn.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) if PG_SEPARATOR is not None: _new_labels = [_l.get_text().split(PG_SEPARATOR)[0] for _l in ax.get_xticklabels()] @@ -411,7 +411,7 @@ def get_dynamic_range(min_max): ax.set_yticks([]) fname = FIGUREFOLDER / 'heatmap_feature_correlation.png' files_out[fname.name] = fname -vaep.savefig(fig, name=fname, pdf=False, dpi=600) +pimmslearn.savefig(fig, name=fname, pdf=False, dpi=600) # %% lower_corr = analyzers.corr_lower_triangle( @@ -427,18 +427,18 @@ def get_dynamic_range(min_max): cbar_kws={'shrink': 0.75}, square=True, ) -_ = vaep.plotting.only_every_x_ticks(ax, x=2) -_ = vaep.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) +_ = pimmslearn.plotting.only_every_x_ticks(ax, x=2) +_ = pimmslearn.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) if NO_TICK_LABELS_ON_HEATMAP: ax.set_xticks([]) ax.set_yticks([]) ax.set_title(f'Heatmap of sample correlation based on {FEATURES_CUTOFF_TEXT}', fontsize=7) fname = FIGUREFOLDER / 'heatmap_sample_correlation.png' files_out[fname.name] = fname -vaep.savefig(fig, name=fname, pdf=False, dpi=600) +pimmslearn.savefig(fig, name=fname, pdf=False, dpi=600) # %% -vaep.plotting.make_large_descriptors(6) +pimmslearn.plotting.make_large_descriptors(6) kwargs = dict() if NO_TICK_LABELS_ON_HEATMAP: kwargs['xticklabels'] = False @@ -449,15 +449,15 @@ def get_dynamic_range(min_max): _new_labels = [_l.get_text().split(PG_SEPARATOR)[0] for _l in ax.get_xticklabels()] _ = ax.set_xticklabels(_new_labels) -_ = vaep.plotting.only_every_x_ticks(ax, x=2, axis=0) -_ = vaep.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) +_ = pimmslearn.plotting.only_every_x_ticks(ax, x=2, axis=0) +_ = pimmslearn.plotting.use_first_n_chars_in_labels(ax, x=SAMPLE_FIRST_N_CHARS) # ax.set_title(f'Clustermap of intensities based on {FEATURES_CUTOFF_TEXT}', fontsize=7) # cg.fig.tight_layout() # tight_layout makes the cbar a bit ugly cg.fig.suptitle(f'Clustermap of intensities based on {FEATURES_CUTOFF_TEXT}', fontsize=7) fname = FIGUREFOLDER / 'clustermap_intensities_normalized.png' files_out[fname.name] = fname cg.fig.savefig(fname, dpi=300) # avoid tight_layout -# vaep.savefig(cg.fig, +# pimmslearn.savefig(cg.fig, # name=fname, # pdf=False) @@ -469,17 +469,17 @@ def get_dynamic_range(min_max): COL_NO_MISSING, COL_NO_IDENTIFIED = f'no_missing_{TYPE}', f'no_identified_{TYPE}' COL_PROP_SAMPLES = 'prop_samples' -sample_stats = vaep.data_handling.compute_stats_missing( +sample_stats = pimmslearn.data_handling.compute_stats_missing( data.notna(), COL_NO_MISSING, COL_NO_IDENTIFIED) sample_stats # %% -vaep.plotting.make_large_descriptors(8) +pimmslearn.plotting.make_large_descriptors(8) fig_ident = sns.relplot( x='SampleID_int', y=COL_NO_IDENTIFIED, data=sample_stats) fig_ident.set_axis_labels('Sample ID', f'Frequency of identified {TYPE}') fig_ident.fig.suptitle(f'Frequency of identified {TYPE} by sample id', y=1.03) -vaep.savefig(fig_ident, f'identified_{TYPE}_by_sample', folder=FIGUREFOLDER) +pimmslearn.savefig(fig_ident, f'identified_{TYPE}_by_sample', folder=FIGUREFOLDER) fig_ident_dist = sns.relplot( x=COL_PROP_SAMPLES, y=COL_NO_IDENTIFIED, data=sample_stats) @@ -489,7 +489,7 @@ def get_dynamic_range(min_max): f'Frequency of identified {TYPE} groups by sample id', y=1.03) fname = FIGUREFOLDER / f'identified_{TYPE}_ordered.pdf' files_out[fname.name] = fname -vaep.savefig(fig_ident_dist, fname) 
+pimmslearn.savefig(fig_ident_dist, fname) # %% COL_NO_MISSING_PROP = COL_NO_MISSING + '_PROP' @@ -505,7 +505,7 @@ def get_dynamic_range(min_max): fname = FIGUREFOLDER / 'proportion_feat_missing.pdf' files_out[fname.name] = fname -vaep.savefig(g, fname) +pimmslearn.savefig(g, fname) # %% [markdown] # ### Reference table intensities (log2) diff --git a/project/00_6_0_permute_data.ipynb b/project/00_6_0_permute_data.ipynb index c16637bc1..1c208b34f 100644 --- a/project/00_6_0_permute_data.ipynb +++ b/project/00_6_0_permute_data.ipynb @@ -21,11 +21,11 @@ "from typing import Union, List\n", "\n", "import numpy as np\n", - "import vaep\n", - "import vaep.analyzers.analyzers\n", - "from vaep.utils import create_random_df\n", + "import pimmslearn\n", + "import pimmslearn.analyzers.analyzers\n", + "from pimmslearn.utils import create_random_df\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "logger.info(\"Split data and make diagnostic plots\")" ] }, @@ -92,7 +92,7 @@ "metadata": {}, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -105,7 +105,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.Config().from_dict(args)\n", + "args = pimmslearn.nb.Config().from_dict(args)\n", "args" ] }, @@ -166,7 +166,7 @@ "outputs": [], "source": [ "constructor = getattr(\n", - " vaep.analyzers.analyzers.AnalyzePeptides,\n", + " pimmslearn.analyzers.analyzers.AnalyzePeptides,\n", " FILE_FORMAT_TO_CONSTRUCTOR_IN[FILE_EXT]) # AnalyzePeptides.from_csv\n", "analysis = constructor(fname=args.FN_INTENSITIES,\n", " index_col=args.index_col,\n", @@ -214,7 +214,7 @@ "\n", "method = getattr(df, FILE_FORMAT_TO_CONSTRUCTOR.get(FILE_EXT))\n", "\n", - "fname = vaep.utils.append_to_filepath(args.FN_INTENSITIES, 'permuted')\n", + "fname = pimmslearn.utils.append_to_filepath(args.FN_INTENSITIES, 'permuted')\n", "method(fname)" ] }, @@ -226,7 +226,7 @@ "outputs": [], "source": [ "constructor = getattr(\n", - " vaep.analyzers.analyzers.AnalyzePeptides,\n", + " pimmslearn.analyzers.analyzers.AnalyzePeptides,\n", " FILE_FORMAT_TO_CONSTRUCTOR_IN[FILE_EXT]) # AnalyzePeptides.from_csv\n", "analysis = constructor(fname=args.FN_INTENSITIES,\n", " index_col=args.index_col,\n", diff --git a/project/00_6_0_permute_data.py b/project/00_6_0_permute_data.py index 9c6612d2e..507730770 100644 --- a/project/00_6_0_permute_data.py +++ b/project/00_6_0_permute_data.py @@ -7,11 +7,11 @@ from typing import Union, List import numpy as np -import vaep -import vaep.analyzers.analyzers -from vaep.utils import create_random_df +import pimmslearn +import pimmslearn.analyzers.analyzers +from pimmslearn.utils import create_random_df -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logger.info("Split data and make diagnostic plots") # %% @@ -38,11 +38,11 @@ file_format: str = 'pkl' # %% -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% -args = vaep.nb.Config().from_dict(args) +args = pimmslearn.nb.Config().from_dict(args) args @@ -71,7 +71,7 @@ # %% constructor = getattr( - vaep.analyzers.analyzers.AnalyzePeptides, + pimmslearn.analyzers.analyzers.AnalyzePeptides, FILE_FORMAT_TO_CONSTRUCTOR_IN[FILE_EXT]) # AnalyzePeptides.from_csv analysis = constructor(fname=args.FN_INTENSITIES, index_col=args.index_col, @@ -94,11 +94,11 @@ method = getattr(df, 
FILE_FORMAT_TO_CONSTRUCTOR.get(FILE_EXT)) -fname = vaep.utils.append_to_filepath(args.FN_INTENSITIES, 'permuted') +fname = pimmslearn.utils.append_to_filepath(args.FN_INTENSITIES, 'permuted') method(fname) # %% constructor = getattr( - vaep.analyzers.analyzers.AnalyzePeptides, + pimmslearn.analyzers.analyzers.AnalyzePeptides, FILE_FORMAT_TO_CONSTRUCTOR_IN[FILE_EXT]) # AnalyzePeptides.from_csv analysis = constructor(fname=args.FN_INTENSITIES, index_col=args.index_col, diff --git a/project/00_8_add_random_missing_values.py b/project/00_8_add_random_missing_values.py index b4a9e6b65..14a00b227 100644 --- a/project/00_8_add_random_missing_values.py +++ b/project/00_8_add_random_missing_values.py @@ -6,7 +6,7 @@ from pathlib import Path from typing import Optional, Union import pandas as pd -import vaep.nb +import pimmslearn.nb # %% # catch passed parameters @@ -33,8 +33,8 @@ fn_intensities = Path(fn_intensities) if not out_root: out_root = fn_intensities.parent -args = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.args_from_dict(args) args # %% diff --git a/project/01_0_split_data.ipynb b/project/01_0_split_data.ipynb index d73788af0..a1f034d14 100644 --- a/project/01_0_split_data.ipynb +++ b/project/01_0_split_data.ipynb @@ -31,14 +31,14 @@ "from IPython.display import display\n", "from sklearn.model_selection import train_test_split\n", "\n", - "import vaep\n", - "import vaep.io.load\n", - "from vaep.analyzers import analyzers\n", - "from vaep.io.datasplits import DataSplits\n", - "from vaep.sampling import feature_frequency\n", - "from vaep.sklearn import get_PCA\n", + "import pimmslearn\n", + "import pimmslearn.io.load\n", + "from pimmslearn.analyzers import analyzers\n", + "from pimmslearn.io.datasplits import DataSplits\n", + "from pimmslearn.sampling import feature_frequency\n", + "from pimmslearn.sklearn import get_PCA\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "logger.info(\"Split data and make diagnostic plots\")\n", "logging.getLogger('fontTools').setLevel(logging.WARNING)\n", "\n", @@ -57,7 +57,7 @@ "pd.options.display.max_columns = 32\n", "plt.rcParams['figure.figsize'] = [4, 2]\n", "\n", - "vaep.plotting.make_large_descriptors(7)\n", + "pimmslearn.plotting.make_large_descriptors(7)\n", "\n", "figures = {} # collection of ax or figures\n", "dumps = {} # collection of data dumps" @@ -130,7 +130,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -144,7 +144,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "args" ] }, @@ -216,7 +216,7 @@ "source": [ "# ! 
factor out file reading to a separate module, not class\n", "# AnalyzePeptides.from_csv\n", - "constructor = getattr(vaep.io.load, FILE_FORMAT_TO_CONSTRUCTOR[FILE_EXT])\n", + "constructor = getattr(pimmslearn.io.load, FILE_FORMAT_TO_CONSTRUCTOR[FILE_EXT])\n", "df = constructor(fname=args.FN_INTENSITIES,\n", " index_col=args.index_col,\n", " )\n", @@ -662,7 +662,7 @@ "ax.set_ylabel('observations')\n", "fname = args.out_figures / f'0_{group}_hist_features_per_sample'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -684,7 +684,7 @@ "ax.set_ylabel('observations')\n", "fname = args.out_figures / f'0_{group}_feature_prevalence'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -704,14 +704,14 @@ }, "outputs": [], "source": [ - "min_max = vaep.plotting.data.min_max(df.stack())\n", - "ax, bins = vaep.plotting.data.plot_histogram_intensities(\n", + "min_max = pimmslearn.plotting.data.min_max(df.stack())\n", + "ax, bins = pimmslearn.plotting.data.plot_histogram_intensities(\n", " df.stack(), min_max=min_max)\n", "ax.set_xlabel('Intensity binned')\n", "fname = args.out_figures / f'0_{group}_intensity_distribution_overall'\n", "\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -724,14 +724,14 @@ }, "outputs": [], "source": [ - "ax = vaep.plotting.data.plot_feat_median_over_prop_missing(\n", + "ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing(\n", " data=df, type='scatter')\n", "fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_scatter'\n", "ax.set_xlabel(\n", " f'{args.feat_name_display.capitalize()} binned by their median intensity'\n", " f' (N {args.feat_name_display})')\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -744,14 +744,14 @@ }, "outputs": [], "source": [ - "ax, _data_feat_median_over_prop_missing = vaep.plotting.data.plot_feat_median_over_prop_missing(\n", + "ax, _data_feat_median_over_prop_missing = pimmslearn.plotting.data.plot_feat_median_over_prop_missing(\n", " data=df, type='boxplot', return_plot_data=True)\n", "fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_boxplot'\n", "ax.set_xlabel(\n", " f'{args.feat_name_display.capitalize()} binned by their median intensity'\n", " f' (N {args.feat_name_display})')\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)\n", + "pimmslearn.savefig(ax.get_figure(), fname)\n", "_data_feat_median_over_prop_missing.to_csv(fname.with_suffix('.csv'))\n", "# _data_feat_median_over_prop_missing.to_excel(fname.with_suffix('.xlsx'))\n", "del _data_feat_median_over_prop_missing" @@ -829,7 +829,7 @@ " fname = (args.out_figures\n", " / f'0_{group}_pca_sample_by_{\"_\".join(args.meta_cat_col.split())}')\n", " figures[fname.stem] = fname\n", - " vaep.savefig(fig, fname)" + " pimmslearn.savefig(fig, fname)" ] }, { @@ -848,7 +848,7 @@ " df=pcs[pcs_name], ax=ax, dates=pcs[args.meta_date_col], title=f'by {args.meta_date_col}')\n", " fname = args.out_figures / f'0_{group}_pca_sample_by_date'\n", " figures[fname.stem] = fname\n", - " vaep.savefig(fig, fname)" + " pimmslearn.savefig(fig, fname)" ] }, { @@ -880,7 +880,7 @@ "fname = (args.out_figures\n", " / f'0_{group}_pca_sample_by_{\"_\".join(col_identified_feat.split())}.pdf')\n", 
"figures[fname.stem] = fname\n", - "vaep.savefig(fig, fname)" + "pimmslearn.savefig(fig, fname)" ] }, { @@ -981,12 +981,12 @@ " boxprops=dict(linewidth=.4, color='darkblue'),\n", " flierprops=dict(markersize=.4, color='lightblue'),\n", " )\n", - "_ = vaep.plotting.select_xticks(ax)\n", + "_ = pimmslearn.plotting.select_xticks(ax)\n", "fig = ax.get_figure()\n", "fname = args.out_figures / f'0_{group}_median_boxplot'\n", "df_w_date.to_pickle(fname.with_suffix('.pkl'))\n", "figures[fname.stem] = fname\n", - "vaep.savefig(fig, fname)\n", + "pimmslearn.savefig(fig, fname)\n", "del df_w_date" ] }, @@ -1041,13 +1041,13 @@ " # fontsize=6,\n", " figsize=(8, 2),\n", " s=5,\n", - " xticks=vaep.plotting.select_dates(\n", + " xticks=pimmslearn.plotting.select_dates(\n", " median_sample_intensity[dates.name])\n", " )\n", " fig = ax.get_figure()\n", " fname = args.out_figures / f'0_{group}_median_scatter'\n", " figures[fname.stem] = fname\n", - " vaep.savefig(fig, fname)" + " pimmslearn.savefig(fig, fname)" ] }, { @@ -1172,7 +1172,7 @@ }, "outputs": [], "source": [ - "df_long = vaep.io.datasplits.long_format(df)\n", + "df_long = pimmslearn.io.datasplits.long_format(df)\n", "df_long.head()" ] }, @@ -1188,7 +1188,7 @@ "source": [ "group = 2\n", "\n", - "splits, thresholds, fake_na_mcar, fake_na_mnar = vaep.sampling.sample_mnar_mcar(\n", + "splits, thresholds, fake_na_mcar, fake_na_mnar = pimmslearn.sampling.sample_mnar_mcar(\n", " df_long=df_long,\n", " frac_non_train=args.frac_non_train,\n", " frac_mnar=args.frac_mnar,\n", @@ -1213,7 +1213,7 @@ "\n", "fig, axes = plt.subplots(1, 2, figsize=(6, 2))\n", "ax = axes[0]\n", - "plot_histogram_intensities = partial(vaep.plotting.data.plot_histogram_intensities,\n", + "plot_histogram_intensities = partial(pimmslearn.plotting.data.plot_histogram_intensities,\n", " min_max=min_max,\n", " alpha=0.8)\n", "plot_histogram_intensities(\n", @@ -1243,7 +1243,7 @@ "ax.legend()\n", "fname = args.out_figures / f'0_{group}_mnar_mcar_histograms.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(fig, fname)" + "pimmslearn.savefig(fig, fname)" ] }, { @@ -1257,7 +1257,7 @@ }, "outputs": [], "source": [ - "counts_per_bin = vaep.pandas.get_counts_per_bin(\n", + "counts_per_bin = pimmslearn.pandas.get_counts_per_bin(\n", " df=pd.concat(\n", " [df_long.squeeze().to_frame('observed'),\n", " thresholds.to_frame('threshold'),\n", @@ -1370,7 +1370,7 @@ "# -> or raise error as feature completness treshold is so low that less than 3 samples\n", "# per feature are allowd.\n", "\n", - "splits = vaep.sampling.check_split_integrity(splits)" + "splits = pimmslearn.sampling.check_split_integrity(splits)" ] }, { @@ -1540,7 +1540,7 @@ "ax.set_xlabel('Intensity bins')\n", "fname = args.out_figures / f'0_{group}_val_over_train_split.pdf'\n", "figures[fname.name] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -1553,7 +1553,7 @@ }, "outputs": [], "source": [ - "min_bin, max_bin = vaep.plotting.data.min_max(splits.val_y)\n", + "min_bin, max_bin = pimmslearn.plotting.data.min_max(splits.val_y)\n", "bins = range(int(min_bin), int(max_bin) + 1, 1)\n", "ax = splits_df.plot.hist(bins=bins,\n", " xticks=list(bins),\n", @@ -1568,7 +1568,7 @@ "ax.yaxis.set_major_formatter(\"{x:,.0f}\")\n", "fname = args.out_figures / f'0_{group}_splits_freq_stacked.pdf'\n", "figures[fname.name] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -1581,7 +1581,7 @@ }, "outputs": [], "source": [ - 
"counts_per_bin = vaep.pandas.get_counts_per_bin(df=splits_df, bins=bins)\n", + "counts_per_bin = pimmslearn.pandas.get_counts_per_bin(df=splits_df, bins=bins)\n", "counts_per_bin.to_excel(fname.with_suffix('.xlsx'))\n", "counts_per_bin" ] @@ -1610,7 +1610,7 @@ "ax.yaxis.set_major_formatter(\"{x:,.0f}\")\n", "fname = args.out_figures / f'0_{group}_val_test_split_freq_stacked_.pdf'\n", "figures[fname.name] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -1643,11 +1643,11 @@ }, "outputs": [], "source": [ - "ax = vaep.plotting.data.plot_feat_median_over_prop_missing(\n", + "ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing(\n", " data=splits.train_X, type='scatter')\n", "fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_scatter_train'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -1660,11 +1660,11 @@ }, "outputs": [], "source": [ - "ax = vaep.plotting.data.plot_feat_median_over_prop_missing(\n", + "ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing(\n", " data=splits.train_X, type='boxplot')\n", "fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_boxplot_train'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { @@ -1705,7 +1705,7 @@ " _ = ax.set_ylabel('Frequency')\n", "fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_boxplot_val_train'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { diff --git a/project/01_0_split_data.py b/project/01_0_split_data.py index e4b4c2450..4142e0995 100644 --- a/project/01_0_split_data.py +++ b/project/01_0_split_data.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -31,14 +31,14 @@ from IPython.display import display from sklearn.model_selection import train_test_split -import vaep -import vaep.io.load -from vaep.analyzers import analyzers -from vaep.io.datasplits import DataSplits -from vaep.sampling import feature_frequency -from vaep.sklearn import get_PCA +import pimmslearn +import pimmslearn.io.load +from pimmslearn.analyzers import analyzers +from pimmslearn.io.datasplits import DataSplits +from pimmslearn.sampling import feature_frequency +from pimmslearn.sklearn import get_PCA -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logger.info("Split data and make diagnostic plots") logging.getLogger('fontTools').setLevel(logging.WARNING) @@ -57,7 +57,7 @@ def align_meta_data(df: pd.DataFrame, df_meta: pd.DataFrame): pd.options.display.max_columns = 32 plt.rcParams['figure.figsize'] = [4, 2] -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) figures = {} # collection of ax or figures dumps = {} # collection of data dumps @@ -96,11 +96,11 @@ def align_meta_data(df: pd.DataFrame, df_meta: pd.DataFrame): # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) args # %% tags=["hide-input"] @@ -134,7 +134,7 @@ def align_meta_data(df: pd.DataFrame, df_meta: 
pd.DataFrame): # %% tags=["hide-input"] # # ! factor out file reading to a separate module, not class # AnalyzePeptides.from_csv -constructor = getattr(vaep.io.load, FILE_FORMAT_TO_CONSTRUCTOR[FILE_EXT]) +constructor = getattr(pimmslearn.io.load, FILE_FORMAT_TO_CONSTRUCTOR[FILE_EXT]) df = constructor(fname=args.FN_INTENSITIES, index_col=args.index_col, ) @@ -364,7 +364,7 @@ def join_as_str(seq): ax.set_ylabel('observations') fname = args.out_figures / f'0_{group}_hist_features_per_sample' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] ax = df.notna().sum(axis=0).sort_values().plot() @@ -375,41 +375,41 @@ def join_as_str(seq): ax.set_ylabel('observations') fname = args.out_figures / f'0_{group}_feature_prevalence' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% [markdown] # ### Number off observations accross feature value # %% tags=["hide-input"] -min_max = vaep.plotting.data.min_max(df.stack()) -ax, bins = vaep.plotting.data.plot_histogram_intensities( +min_max = pimmslearn.plotting.data.min_max(df.stack()) +ax, bins = pimmslearn.plotting.data.plot_histogram_intensities( df.stack(), min_max=min_max) ax.set_xlabel('Intensity binned') fname = args.out_figures / f'0_{group}_intensity_distribution_overall' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] -ax = vaep.plotting.data.plot_feat_median_over_prop_missing( +ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing( data=df, type='scatter') fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_scatter' ax.set_xlabel( f'{args.feat_name_display.capitalize()} binned by their median intensity' f' (N {args.feat_name_display})') figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] -ax, _data_feat_median_over_prop_missing = vaep.plotting.data.plot_feat_median_over_prop_missing( +ax, _data_feat_median_over_prop_missing = pimmslearn.plotting.data.plot_feat_median_over_prop_missing( data=df, type='boxplot', return_plot_data=True) fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_boxplot' ax.set_xlabel( f'{args.feat_name_display.capitalize()} binned by their median intensity' f' (N {args.feat_name_display})') figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) _data_feat_median_over_prop_missing.to_csv(fname.with_suffix('.csv')) # _data_feat_median_over_prop_missing.to_excel(fname.with_suffix('.xlsx')) del _data_feat_median_over_prop_missing @@ -443,7 +443,7 @@ def join_as_str(seq): fname = (args.out_figures / f'0_{group}_pca_sample_by_{"_".join(args.meta_cat_col.split())}') figures[fname.stem] = fname - vaep.savefig(fig, fname) + pimmslearn.savefig(fig, fname) # %% tags=["hide-input"] if args.meta_date_col != 'PlaceholderTime': @@ -452,7 +452,7 @@ def join_as_str(seq): df=pcs[pcs_name], ax=ax, dates=pcs[args.meta_date_col], title=f'by {args.meta_date_col}') fname = args.out_figures / f'0_{group}_pca_sample_by_date' figures[fname.stem] = fname - vaep.savefig(fig, fname) + pimmslearn.savefig(fig, fname) # %% [markdown] # - size: number of features in a single sample @@ -470,7 +470,7 @@ def join_as_str(seq): fname = (args.out_figures / f'0_{group}_pca_sample_by_{"_".join(col_identified_feat.split())}.pdf') figures[fname.stem] = 
fname -vaep.savefig(fig, fname) +pimmslearn.savefig(fig, fname) # %% tags=["hide-input"] # # ! write principal components to excel (if needed) @@ -517,12 +517,12 @@ def join_as_str(seq): boxprops=dict(linewidth=.4, color='darkblue'), flierprops=dict(markersize=.4, color='lightblue'), ) -_ = vaep.plotting.select_xticks(ax) +_ = pimmslearn.plotting.select_xticks(ax) fig = ax.get_figure() fname = args.out_figures / f'0_{group}_median_boxplot' df_w_date.to_pickle(fname.with_suffix('.pkl')) figures[fname.stem] = fname -vaep.savefig(fig, fname) +pimmslearn.savefig(fig, fname) del df_w_date # %% [markdown] @@ -549,13 +549,13 @@ def join_as_str(seq): # fontsize=6, figsize=(8, 2), s=5, - xticks=vaep.plotting.select_dates( + xticks=pimmslearn.plotting.select_dates( median_sample_intensity[dates.name]) ) fig = ax.get_figure() fname = args.out_figures / f'0_{group}_median_scatter' figures[fname.stem] = fname - vaep.savefig(fig, fname) + pimmslearn.savefig(fig, fname) # %% [markdown] # - the closer the labels are there denser the samples are measured around that time. @@ -610,13 +610,13 @@ def join_as_str(seq): # Simulated missing values are not used for validation and testing. # %% tags=["hide-input"] -df_long = vaep.io.datasplits.long_format(df) +df_long = pimmslearn.io.datasplits.long_format(df) df_long.head() # %% tags=["hide-input"] group = 2 -splits, thresholds, fake_na_mcar, fake_na_mnar = vaep.sampling.sample_mnar_mcar( +splits, thresholds, fake_na_mcar, fake_na_mnar = pimmslearn.sampling.sample_mnar_mcar( df_long=df_long, frac_non_train=args.frac_non_train, frac_mnar=args.frac_mnar, @@ -631,7 +631,7 @@ def join_as_str(seq): fig, axes = plt.subplots(1, 2, figsize=(6, 2)) ax = axes[0] -plot_histogram_intensities = partial(vaep.plotting.data.plot_histogram_intensities, +plot_histogram_intensities = partial(pimmslearn.plotting.data.plot_histogram_intensities, min_max=min_max, alpha=0.8) plot_histogram_intensities( @@ -661,10 +661,10 @@ def join_as_str(seq): ax.legend() fname = args.out_figures / f'0_{group}_mnar_mcar_histograms.pdf' figures[fname.stem] = fname -vaep.savefig(fig, fname) +pimmslearn.savefig(fig, fname) # %% tags=["hide-input"] -counts_per_bin = vaep.pandas.get_counts_per_bin( +counts_per_bin = pimmslearn.pandas.get_counts_per_bin( df=pd.concat( [df_long.squeeze().to_frame('observed'), thresholds.to_frame('threshold'), @@ -724,7 +724,7 @@ def join_as_str(seq): # -> or raise error as feature completness treshold is so low that less than 3 samples # per feature are allowd. 
-splits = vaep.sampling.check_split_integrity(splits) +splits = pimmslearn.sampling.check_split_integrity(splits) # %% [markdown] # Some tools require at least 4 observation in the training data, @@ -818,10 +818,10 @@ def join_as_str(seq): ax.set_xlabel('Intensity bins') fname = args.out_figures / f'0_{group}_val_over_train_split.pdf' figures[fname.name] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] -min_bin, max_bin = vaep.plotting.data.min_max(splits.val_y) +min_bin, max_bin = pimmslearn.plotting.data.min_max(splits.val_y) bins = range(int(min_bin), int(max_bin) + 1, 1) ax = splits_df.plot.hist(bins=bins, xticks=list(bins), @@ -836,10 +836,10 @@ def join_as_str(seq): ax.yaxis.set_major_formatter("{x:,.0f}") fname = args.out_figures / f'0_{group}_splits_freq_stacked.pdf' figures[fname.name] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] -counts_per_bin = vaep.pandas.get_counts_per_bin(df=splits_df, bins=bins) +counts_per_bin = pimmslearn.pandas.get_counts_per_bin(df=splits_df, bins=bins) counts_per_bin.to_excel(fname.with_suffix('.xlsx')) counts_per_bin @@ -857,7 +857,7 @@ def join_as_str(seq): ax.yaxis.set_major_formatter("{x:,.0f}") fname = args.out_figures / f'0_{group}_val_test_split_freq_stacked_.pdf' figures[fname.name] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% [markdown] @@ -867,18 +867,18 @@ def join_as_str(seq): splits.to_wide_format() # %% tags=["hide-input"] -ax = vaep.plotting.data.plot_feat_median_over_prop_missing( +ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing( data=splits.train_X, type='scatter') fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_scatter_train' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] -ax = vaep.plotting.data.plot_feat_median_over_prop_missing( +ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing( data=splits.train_X, type='boxplot') fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_boxplot_train' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] medians = (splits @@ -909,7 +909,7 @@ def join_as_str(seq): _ = ax.set_ylabel('Frequency') fname = args.out_figures / f'0_{group}_intensity_median_vs_prop_missing_boxplot_val_train' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% [markdown] # ## Save parameters diff --git a/project/01_0_transform_data_to_wide_format.ipynb b/project/01_0_transform_data_to_wide_format.ipynb index bcd8a0bf1..587b1b281 100644 --- a/project/01_0_transform_data_to_wide_format.ipynb +++ b/project/01_0_transform_data_to_wide_format.ipynb @@ -21,9 +21,9 @@ "source": [ "import pandas as pd\n", "\n", - "import vaep\n", - "import vaep.models\n", - "from vaep.io import datasplits" + "import pimmslearn\n", + "import pimmslearn.models\n", + "from pimmslearn.io import datasplits" ] }, { @@ -80,7 +80,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -95,7 +95,7 @@ }, "outputs": [], "source": [ - "params = vaep.nb.args_from_dict(args)\n", + "params = pimmslearn.nb.args_from_dict(args)\n", "# params = OmegaConf.create(args)\n", 
"params" ] diff --git a/project/01_0_transform_data_to_wide_format.py b/project/01_0_transform_data_to_wide_format.py index 5e1cee1cd..541e2365d 100644 --- a/project/01_0_transform_data_to_wide_format.py +++ b/project/01_0_transform_data_to_wide_format.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -20,9 +20,9 @@ # %% tags=["hide-input"] import pandas as pd -import vaep -import vaep.models -from vaep.io import datasplits +import pimmslearn +import pimmslearn.models +from pimmslearn.io import datasplits # %% tags=["hide-input"] # catch passed parameters @@ -41,11 +41,11 @@ file_format_out: str = 'csv' # file format of transformed splits, default csv # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -params = vaep.nb.args_from_dict(args) +params = pimmslearn.nb.args_from_dict(args) # params = OmegaConf.create(args) params diff --git a/project/01_1_train_CF.ipynb b/project/01_1_train_CF.ipynb index b1d3eac71..b4220d45e 100644 --- a/project/01_1_train_CF.ipynb +++ b/project/01_1_train_CF.ipynb @@ -30,19 +30,19 @@ " MSELossFlat, default_device)\n", "from fastai.tabular.all import *\n", "\n", - "import vaep\n", - "import vaep.model\n", - "import vaep.models as models\n", - "import vaep.nb\n", - "from vaep.io import datasplits\n", - "from vaep.logging import setup_logger\n", - "from vaep.models import RecorderDump, plot_loss\n", + "import pimmslearn\n", + "import pimmslearn.model\n", + "import pimmslearn.models as models\n", + "import pimmslearn.nb\n", + "from pimmslearn.io import datasplits\n", + "from pimmslearn.logging import setup_logger\n", + "from pimmslearn.models import RecorderDump, plot_loss\n", "\n", "learner.Recorder.plot_loss = plot_loss\n", "# import fastai.callback.hook # Learner.summary\n", "\n", "\n", - "logger = setup_logger(logger=logging.getLogger('vaep'))\n", + "logger = setup_logger(logger=logging.getLogger('pimmslearn'))\n", "logger.info(\n", " \"Experiment 03 - Analysis of latent spaces and performance comparisions\")\n", "\n", @@ -123,7 +123,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -139,7 +139,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "\n", "# # Currently not needed -> DotProduct used, not a FNN\n", "# if isinstance(args.hidden_layers, str):\n", @@ -433,11 +433,11 @@ " recorder=ana_collab.learn.recorder, name='CF')\n", "recorder_dump.save(args.out_figures)\n", "del recorder_dump\n", - "vaep.savefig(fig, name='collab_training',\n", - " folder=args.out_figures)\n", + "pimmslearn.savefig(fig, name='collab_training',\n", + " folder=args.out_figures)\n", "ana_collab.model_kwargs['batch_size'] = ana_collab.batch_size\n", - "vaep.io.dump_json(ana_collab.model_kwargs, args.out_models /\n", - " TEMPLATE_MODEL_PARAMS.format('CF'))" + "pimmslearn.io.dump_json(ana_collab.model_kwargs, args.out_models /\n", + " TEMPLATE_MODEL_PARAMS.format('CF'))" ] }, { @@ -644,8 +644,8 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(d_metrics.metrics, args.out_metrics /\n", - " f'metrics_{args.model_key}.json')" + "pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics /\n", + " f'metrics_{args.model_key}.json')" 
] }, { diff --git a/project/01_1_train_CF.py b/project/01_1_train_CF.py index 847860649..6a4909a67 100644 --- a/project/01_1_train_CF.py +++ b/project/01_1_train_CF.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -28,19 +28,19 @@ MSELossFlat, default_device) from fastai.tabular.all import * -import vaep -import vaep.model -import vaep.models as models -import vaep.nb -from vaep.io import datasplits -from vaep.logging import setup_logger -from vaep.models import RecorderDump, plot_loss +import pimmslearn +import pimmslearn.model +import pimmslearn.models as models +import pimmslearn.nb +from pimmslearn.io import datasplits +from pimmslearn.logging import setup_logger +from pimmslearn.models import RecorderDump, plot_loss learner.Recorder.plot_loss = plot_loss # import fastai.callback.hook # Learner.summary -logger = setup_logger(logger=logging.getLogger('vaep')) +logger = setup_logger(logger=logging.getLogger('pimmslearn')) logger.info( "Experiment 03 - Analysis of latent spaces and performance comparisions") @@ -78,11 +78,11 @@ # Some argument transformations # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) # # Currently not needed -> DotProduct used, not a FNN # if isinstance(args.hidden_layers, str): @@ -216,11 +216,11 @@ recorder=ana_collab.learn.recorder, name='CF') recorder_dump.save(args.out_figures) del recorder_dump -vaep.savefig(fig, name='collab_training', - folder=args.out_figures) +pimmslearn.savefig(fig, name='collab_training', + folder=args.out_figures) ana_collab.model_kwargs['batch_size'] = ana_collab.batch_size -vaep.io.dump_json(ana_collab.model_kwargs, args.out_models / - TEMPLATE_MODEL_PARAMS.format('CF')) +pimmslearn.io.dump_json(ana_collab.model_kwargs, args.out_models / + TEMPLATE_MODEL_PARAMS.format('CF')) # %% [markdown] # ### Predictions @@ -300,8 +300,8 @@ # Save all metrics as json # %% -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / - f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / + f'metrics_{args.model_key}.json') # %% diff --git a/project/01_1_train_DAE.ipynb b/project/01_1_train_DAE.ipynb index ac07f5a79..bd616e654 100644 --- a/project/01_1_train_DAE.ipynb +++ b/project/01_1_train_DAE.ipynb @@ -31,18 +31,18 @@ "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import StandardScaler\n", "\n", - "import vaep\n", - "import vaep.model\n", - "import vaep.models as models\n", - "from vaep.analyzers import analyzers\n", - "from vaep.io import datasplits\n", + "import pimmslearn\n", + "import pimmslearn.model\n", + "import pimmslearn.models as models\n", + "from pimmslearn.analyzers import analyzers\n", + "from pimmslearn.io import datasplits\n", "# overwriting Recorder callback with custom plot_loss\n", - "from vaep.models import ae, plot_loss\n", + "from pimmslearn.models import ae, plot_loss\n", "\n", "learner.Recorder.plot_loss = plot_loss\n", "\n", "\n", - "logger = vaep.logging.setup_logger(logging.getLogger('vaep'))\n", + "logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))\n", "logger.info(\n", " \"Experiment 03 - Analysis of latent spaces and performance comparisions\")\n", "\n", @@ -134,7 +134,7 @@ }, "outputs": 
[], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -150,7 +150,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "\n", "if isinstance(args.hidden_layers, str):\n", " args.overwrite_entry(\"hidden_layers\", [int(x)\n", @@ -556,8 +556,8 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(analysis.params, args.out_models /\n", - " TEMPLATE_MODEL_PARAMS.format(args.model_key))" + "pimmslearn.io.dump_json(analysis.params, args.out_models /\n", + " TEMPLATE_MODEL_PARAMS.format(args.model_key))" ] }, { @@ -750,9 +750,9 @@ "outputs": [], "source": [ "analysis.model.cpu()\n", - "df_latent = vaep.model.get_latent_space(analysis.model.encoder,\n", - " dl=analysis.dls.valid,\n", - " dl_index=analysis.dls.valid.data.index)\n", + "df_latent = pimmslearn.model.get_latent_space(analysis.model.encoder,\n", + " dl=analysis.dls.valid,\n", + " dl_index=analysis.dls.valid.data.index)\n", "df_latent" ] }, @@ -895,8 +895,8 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(d_metrics.metrics, args.out_metrics /\n", - " f'metrics_{args.model_key}.json')\n", + "pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics /\n", + " f'metrics_{args.model_key}.json')\n", "d_metrics" ] }, diff --git a/project/01_1_train_DAE.py b/project/01_1_train_DAE.py index 069b02a0e..59a19e508 100644 --- a/project/01_1_train_DAE.py +++ b/project/01_1_train_DAE.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -28,18 +28,18 @@ from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler -import vaep -import vaep.model -import vaep.models as models -from vaep.analyzers import analyzers -from vaep.io import datasplits +import pimmslearn +import pimmslearn.model +import pimmslearn.models as models +from pimmslearn.analyzers import analyzers +from pimmslearn.io import datasplits # overwriting Recorder callback with custom plot_loss -from vaep.models import ae, plot_loss +from pimmslearn.models import ae, plot_loss learner.Recorder.plot_loss = plot_loss -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) logger.info( "Experiment 03 - Analysis of latent spaces and performance comparisions") @@ -88,11 +88,11 @@ # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) if isinstance(args.hidden_layers, str): args.overwrite_entry("hidden_layers", [int(x) @@ -252,8 +252,8 @@ # dump model config # %% tags=["hide-input"] -vaep.io.dump_json(analysis.params, args.out_models / - TEMPLATE_MODEL_PARAMS.format(args.model_key)) +pimmslearn.io.dump_json(analysis.params, args.out_models / + TEMPLATE_MODEL_PARAMS.format(args.model_key)) # %% tags=["hide-input"] @@ -327,9 +327,9 @@ # %% tags=["hide-input"] analysis.model.cpu() -df_latent = vaep.model.get_latent_space(analysis.model.encoder, - dl=analysis.dls.valid, - dl_index=analysis.dls.valid.data.index) +df_latent = pimmslearn.model.get_latent_space(analysis.model.encoder, + dl=analysis.dls.valid, + dl_index=analysis.dls.valid.data.index) df_latent # %% tags=["hide-input"] @@ 
-381,8 +381,8 @@ # Save all metrics as json # %% tags=["hide-input"] -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / - f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / + f'metrics_{args.model_key}.json') d_metrics # %% tags=["hide-input"] diff --git a/project/01_1_train_KNN.ipynb b/project/01_1_train_KNN.ipynb index ebd24a6d0..ac881a0e2 100644 --- a/project/01_1_train_KNN.ipynb +++ b/project/01_1_train_KNN.ipynb @@ -27,15 +27,15 @@ "import sklearn.impute\n", "from IPython.display import display\n", "\n", - "import vaep\n", - "import vaep.model\n", - "import vaep.models as models\n", - "import vaep.nb\n", - "from vaep import sampling\n", - "from vaep.io import datasplits\n", - "from vaep.models import ae\n", + "import pimmslearn\n", + "import pimmslearn.model\n", + "import pimmslearn.models as models\n", + "import pimmslearn.nb\n", + "from pimmslearn import sampling\n", + "from pimmslearn.io import datasplits\n", + "from pimmslearn.models import ae\n", "\n", - "logger = vaep.logging.setup_logger(logging.getLogger('vaep'))\n", + "logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))\n", "logger.info(\"Experiment 03 - Analysis of latent spaces and performance comparisions\")\n", "\n", "figures = {} # collection of ax or figures" @@ -119,8 +119,8 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "args" ] }, @@ -549,7 +549,7 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json')\n", + "pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json')\n", "d_metrics" ] }, diff --git a/project/01_1_train_KNN.py b/project/01_1_train_KNN.py index ddd21c2aa..f81f16c51 100644 --- a/project/01_1_train_KNN.py +++ b/project/01_1_train_KNN.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -24,15 +24,15 @@ import sklearn.impute from IPython.display import display -import vaep -import vaep.model -import vaep.models as models -import vaep.nb -from vaep import sampling -from vaep.io import datasplits -from vaep.models import ae +import pimmslearn +import pimmslearn.model +import pimmslearn.models as models +import pimmslearn.nb +from pimmslearn import sampling +from pimmslearn.io import datasplits +from pimmslearn.models import ae -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) logger.info("Experiment 03 - Analysis of latent spaces and performance comparisions") figures = {} # collection of ax or figures @@ -73,8 +73,8 @@ # Some argument transformations # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.args_from_dict(args) args @@ -222,7 +222,7 @@ # Save all metrics as json # %% tags=["hide-input"] -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json') d_metrics # %% tags=["hide-input"] diff --git 
a/project/01_1_train_KNN_unique_samples.py b/project/01_1_train_KNN_unique_samples.py index 1cd24fe26..0adf97c95 100644 --- a/project/01_1_train_KNN_unique_samples.py +++ b/project/01_1_train_KNN_unique_samples.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -23,15 +23,15 @@ import sklearn from sklearn.model_selection import train_test_split -import vaep -import vaep.model -import vaep.models as models -import vaep.nb -from vaep import sampling -from vaep.io import datasplits -from vaep.models import ae +import pimmslearn +import pimmslearn.model +import pimmslearn.models as models +import pimmslearn.nb +from pimmslearn import sampling +from pimmslearn.io import datasplits +from pimmslearn.models import ae -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) logger.info("Experiment 03 - Analysis of latent spaces and performance comparisions") figures = {} # collection of ax or figures @@ -80,8 +80,8 @@ # Some argument transformations # %% -args = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.args_from_dict(args) args @@ -271,7 +271,7 @@ # Save all metrics as json # %% -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json') d_metrics # %% diff --git a/project/01_1_train_Median.ipynb b/project/01_1_train_Median.ipynb index 406691aab..ac1425012 100644 --- a/project/01_1_train_Median.ipynb +++ b/project/01_1_train_Median.ipynb @@ -25,13 +25,13 @@ "import pandas as pd\n", "from IPython.display import display\n", "\n", - "import vaep\n", - "import vaep.model\n", - "import vaep.models as models\n", - "import vaep.nb\n", - "from vaep.io import datasplits\n", + "import pimmslearn\n", + "import pimmslearn.model\n", + "import pimmslearn.models as models\n", + "import pimmslearn.nb\n", + "from pimmslearn.io import datasplits\n", "\n", - "logger = vaep.logging.setup_logger(logging.getLogger('vaep'))\n", + "logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))\n", "logger.info(\"Median Imputation\")\n", "\n", "figures = {} # collection of ax or figures" ] }, @@ -108,7 +108,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -124,7 +124,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "args" ] }, @@ -279,7 +279,7 @@ }, "outputs": [], "source": [ - "freq_feat = vaep.io.datasplits.load_freq(args.data)\n", + "freq_feat = pimmslearn.io.datasplits.load_freq(args.data)\n", "freq_feat.head() # training data" ] }, @@ -376,7 +376,7 @@ }, "outputs": [], "source": [ - "# interpolated = vaep.pandas.interpolate(wide_df = data.train_X)\n", + "# interpolated = pimmslearn.pandas.interpolate(wide_df = data.train_X)\n", "# val_pred_fake_na['interpolated'] = interpolated\n", "# test_pred_fake_na['interpolated'] = interpolated\n", "# del interpolated\n", @@ -663,7 +663,7 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json')\n", + "pimmslearn.io.dump_json(d_metrics.metrics,
args.out_metrics / f'metrics_{args.model_key}.json')\n", "d_metrics" ] }, diff --git a/project/01_1_train_Median.py b/project/01_1_train_Median.py index cf43e2e1b..0c698beed 100644 --- a/project/01_1_train_Median.py +++ b/project/01_1_train_Median.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -22,13 +22,13 @@ import pandas as pd from IPython.display import display -import vaep -import vaep.model -import vaep.models as models -import vaep.nb -from vaep.io import datasplits +import pimmslearn +import pimmslearn.model +import pimmslearn.models as models +import pimmslearn.nb +from pimmslearn.io import datasplits -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) logger.info("Median Imputation") figures = {} # collection of ax or figures @@ -62,11 +62,11 @@ # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) args @@ -126,7 +126,7 @@ # - [x] add some additional NAs based on distribution of data # %% tags=["hide-input"] -freq_feat = vaep.io.datasplits.load_freq(args.data) +freq_feat = pimmslearn.io.datasplits.load_freq(args.data) freq_feat.head() # training data # %% [markdown] @@ -159,7 +159,7 @@ # ### Add interpolation performance # %% tags=["hide-input"] -# interpolated = vaep.pandas.interpolate(wide_df = data.train_X) +# interpolated = pimmslearn.pandas.interpolate(wide_df = data.train_X) # val_pred_fake_na['interpolated'] = interpolated # test_pred_fake_na['interpolated'] = interpolated # del interpolated @@ -267,7 +267,7 @@ # ### Save all metrics as json # %% tags=["hide-input"] -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / f'metrics_{args.model_key}.json') d_metrics diff --git a/project/01_1_train_RSN.ipynb b/project/01_1_train_RSN.ipynb index fb5fc7b67..04a184004 100644 --- a/project/01_1_train_RSN.ipynb +++ b/project/01_1_train_RSN.ipynb @@ -25,14 +25,14 @@ "import pandas as pd\n", "from IPython.display import display\n", "\n", - "import vaep\n", - "import vaep.imputation\n", - "import vaep.model\n", - "import vaep.models as models\n", - "import vaep.nb\n", - "from vaep.io import datasplits\n", + "import pimmslearn\n", + "import pimmslearn.imputation\n", + "import pimmslearn.model\n", + "import pimmslearn.models as models\n", + "import pimmslearn.nb\n", + "from pimmslearn.io import datasplits\n", "\n", - "logger = vaep.logging.setup_logger(logging.getLogger('vaep'))\n", + "logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))\n", "logger.info(\"Median Imputation\")\n", "\n", "figures = {} # collection of ax or figures" @@ -114,7 +114,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -130,7 +130,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "args" ] }, @@ -281,7 +281,7 @@ }, "outputs": [], "source": [ - "freq_feat = vaep.io.datasplits.load_freq(args.data)\n", + "freq_feat = pimmslearn.io.datasplits.load_freq(args.data)\n", 
"freq_feat.head() # training data" ] }, @@ -375,7 +375,7 @@ }, "outputs": [], "source": [ - "imputed_shifted_normal = vaep.imputation.impute_shifted_normal(\n", + "imputed_shifted_normal = pimmslearn.imputation.impute_shifted_normal(\n", " data.train_X,\n", " mean_shift=1.8,\n", " std_shrinkage=0.3,\n", @@ -453,7 +453,7 @@ }, "outputs": [], "source": [ - "ax, _ = vaep.plotting.errors.plot_errors_binned(val_pred_fake_na)" + "ax, _ = pimmslearn.plotting.errors.plot_errors_binned(val_pred_fake_na)" ] }, { @@ -467,7 +467,7 @@ }, "outputs": [], "source": [ - "ax, _ = vaep.plotting.errors.plot_errors_binned(test_pred_fake_na)" + "ax, _ = pimmslearn.plotting.errors.plot_errors_binned(test_pred_fake_na)" ] }, { @@ -582,8 +582,8 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(d_metrics.metrics, args.out_metrics /\n", - " f'metrics_{args.model_key}.json')\n", + "pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics /\n", + " f'metrics_{args.model_key}.json')\n", "d_metrics" ] }, diff --git a/project/01_1_train_RSN.py b/project/01_1_train_RSN.py index c21769ac2..c4f608c10 100644 --- a/project/01_1_train_RSN.py +++ b/project/01_1_train_RSN.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -22,14 +22,14 @@ import pandas as pd from IPython.display import display -import vaep -import vaep.imputation -import vaep.model -import vaep.models as models -import vaep.nb -from vaep.io import datasplits +import pimmslearn +import pimmslearn.imputation +import pimmslearn.model +import pimmslearn.models as models +import pimmslearn.nb +from pimmslearn.io import datasplits -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) logger.info("Median Imputation") figures = {} # collection of ax or figures @@ -68,11 +68,11 @@ # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) args @@ -129,7 +129,7 @@ # # %% tags=["hide-input"] -freq_feat = vaep.io.datasplits.load_freq(args.data) +freq_feat = pimmslearn.io.datasplits.load_freq(args.data) freq_feat.head() # training data # %% [markdown] @@ -159,7 +159,7 @@ # ### Impute using shifted normal distribution # %% tags=["hide-input"] -imputed_shifted_normal = vaep.imputation.impute_shifted_normal( +imputed_shifted_normal = pimmslearn.imputation.impute_shifted_normal( data.train_X, mean_shift=1.8, std_shrinkage=0.3, @@ -197,10 +197,10 @@ # ### Plots # # %% tags=["hide-input"] -ax, _ = vaep.plotting.errors.plot_errors_binned(val_pred_fake_na) +ax, _ = pimmslearn.plotting.errors.plot_errors_binned(val_pred_fake_na) # %% tags=["hide-input"] -ax, _ = vaep.plotting.errors.plot_errors_binned(test_pred_fake_na) +ax, _ = pimmslearn.plotting.errors.plot_errors_binned(test_pred_fake_na) # %% [markdown] # ## Comparisons @@ -240,8 +240,8 @@ # ### Save all metrics as json # %% tags=["hide-input"] -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / - f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / + f'metrics_{args.model_key}.json') d_metrics # %% tags=["hide-input"] diff --git a/project/01_1_train_VAE.ipynb b/project/01_1_train_VAE.ipynb index be38aa642..c2f8ccd8f 100644 --- a/project/01_1_train_VAE.ipynb +++ 
b/project/01_1_train_VAE.ipynb @@ -38,19 +38,19 @@ "from sklearn.preprocessing import StandardScaler\n", "from torch.nn import Sigmoid\n", "\n", - "import vaep\n", - "import vaep.model\n", - "import vaep.models as models\n", - "import vaep.nb\n", - "from vaep.analyzers import analyzers\n", - "from vaep.io import datasplits\n", + "import pimmslearn\n", + "import pimmslearn.model\n", + "import pimmslearn.models as models\n", + "import pimmslearn.nb\n", + "from pimmslearn.analyzers import analyzers\n", + "from pimmslearn.io import datasplits\n", "# overwriting Recorder callback with custom plot_loss\n", - "from vaep.models import ae, plot_loss\n", + "from pimmslearn.models import ae, plot_loss\n", "\n", "learner.Recorder.plot_loss = plot_loss\n", "\n", "\n", - "logger = vaep.logging.setup_logger(logging.getLogger('vaep'))\n", + "logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))\n", "logger.info(\n", " \"Experiment 03 - Analysis of latent spaces and performance comparisions\")\n", "\n", @@ -140,7 +140,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -156,7 +156,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "\n", "if isinstance(args.hidden_layers, str):\n", " args.overwrite_entry(\"hidden_layers\", [int(x)\n", @@ -320,7 +320,7 @@ }, "outputs": [], "source": [ - "freq_feat = vaep.io.datasplits.load_freq(args.data)\n", + "freq_feat = pimmslearn.io.datasplits.load_freq(args.data)\n", "freq_feat.head() # training data" ] }, @@ -651,8 +651,8 @@ "# needs class as argument, not instance, but serialization needs instance\n", "analysis.params['last_decoder_activation'] = Sigmoid()\n", "\n", - "vaep.io.dump_json(\n", - " vaep.io.parse_dict(\n", + "pimmslearn.io.dump_json(\n", + " pimmslearn.io.parse_dict(\n", " analysis.params, types=[\n", " (torch.nn.modules.module.Module, lambda m: str(m))\n", " ]),\n", @@ -840,13 +840,13 @@ "# assert analysis.dls.valid.data.equals(analysis.dls.train.data)\n", "# Reconstruct DataLoader for case that during training singleton batches were dropped\n", "_dl = torch.utils.data.DataLoader(\n", - " vaep.io.datasets.DatasetWithTarget(\n", + " pimmslearn.io.datasets.DatasetWithTarget(\n", " analysis.dls.valid.data),\n", " batch_size=args.batch_size,\n", " shuffle=False)\n", - "df_latent = vaep.model.get_latent_space(analysis.model.get_mu_and_logvar,\n", - " dl=_dl,\n", - " dl_index=analysis.dls.valid.data.index)\n", + "df_latent = pimmslearn.model.get_latent_space(analysis.model.get_mu_and_logvar,\n", + " dl=_dl,\n", + " dl_index=analysis.dls.valid.data.index)\n", "df_latent" ] }, @@ -1075,8 +1075,8 @@ }, "outputs": [], "source": [ - "vaep.io.dump_json(d_metrics.metrics, args.out_metrics /\n", - " f'metrics_{args.model_key}.json')\n", + "pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics /\n", + " f'metrics_{args.model_key}.json')\n", "d_metrics" ] }, diff --git a/project/01_1_train_VAE.py b/project/01_1_train_VAE.py index d68428410..f28c5d012 100644 --- a/project/01_1_train_VAE.py +++ b/project/01_1_train_VAE.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -35,19 +35,19 @@ from sklearn.preprocessing import StandardScaler from torch.nn import Sigmoid -import vaep -import vaep.model -import 
vaep.models as models -import vaep.nb -from vaep.analyzers import analyzers -from vaep.io import datasplits +import pimmslearn +import pimmslearn.model +import pimmslearn.models as models +import pimmslearn.nb +from pimmslearn.analyzers import analyzers +from pimmslearn.io import datasplits # overwriting Recorder callback with custom plot_loss -from vaep.models import ae, plot_loss +from pimmslearn.models import ae, plot_loss learner.Recorder.plot_loss = plot_loss -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) logger.info( "Experiment 03 - Analysis of latent spaces and performance comparisions") @@ -94,11 +94,11 @@ # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) if isinstance(args.hidden_layers, str): args.overwrite_entry("hidden_layers", [int(x) @@ -167,7 +167,7 @@ # - [x] add some additional NAs based on distribution of data # %% tags=["hide-input"] -freq_feat = vaep.io.datasplits.load_freq(args.data) +freq_feat = pimmslearn.io.datasplits.load_freq(args.data) freq_feat.head() # training data # %% [markdown] @@ -288,8 +288,8 @@ # needs class as argument, not instance, but serialization needs instance analysis.params['last_decoder_activation'] = Sigmoid() -vaep.io.dump_json( - vaep.io.parse_dict( +pimmslearn.io.dump_json( + pimmslearn.io.parse_dict( analysis.params, types=[ (torch.nn.modules.module.Module, lambda m: str(m)) ]), @@ -364,13 +364,13 @@ # assert analysis.dls.valid.data.equals(analysis.dls.train.data) # Reconstruct DataLoader for case that during training singleton batches were dropped _dl = torch.utils.data.DataLoader( - vaep.io.datasets.DatasetWithTarget( + pimmslearn.io.datasets.DatasetWithTarget( analysis.dls.valid.data), batch_size=args.batch_size, shuffle=False) -df_latent = vaep.model.get_latent_space(analysis.model.get_mu_and_logvar, - dl=_dl, - dl_index=analysis.dls.valid.data.index) +df_latent = pimmslearn.model.get_latent_space(analysis.model.get_mu_and_logvar, + dl=_dl, + dl_index=analysis.dls.valid.data.index) df_latent # %% tags=["hide-input"] @@ -453,8 +453,8 @@ # Save all metrics as json # %% tags=["hide-input"] -vaep.io.dump_json(d_metrics.metrics, args.out_metrics / - f'metrics_{args.model_key}.json') +pimmslearn.io.dump_json(d_metrics.metrics, args.out_metrics / + f'metrics_{args.model_key}.json') d_metrics # %% tags=["hide-input"] diff --git a/project/01_1_transfer_NAGuideR_pred.ipynb b/project/01_1_transfer_NAGuideR_pred.ipynb index a4fbd5e14..adba71b08 100644 --- a/project/01_1_transfer_NAGuideR_pred.ipynb +++ b/project/01_1_transfer_NAGuideR_pred.ipynb @@ -25,14 +25,14 @@ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", - "import vaep\n", - "import vaep.models\n", - "import vaep.pandas\n", - "from vaep.io import datasplits\n", + "import pimmslearn\n", + "import pimmslearn.models\n", + "import pimmslearn.pandas\n", + "from pimmslearn.io import datasplits\n", "\n", - "vaep.plotting.make_large_descriptors(5)\n", + "pimmslearn.plotting.make_large_descriptors(5)\n", "\n", - "logger = vaep.logging.setup_logger(logging.getLogger('vaep'))" + "logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))" ] }, { @@ -100,8 +100,8 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", - "args = 
vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "args" ] }, @@ -295,7 +295,7 @@ "outputs": [], "source": [ "# papermill_description=metrics\n", - "d_metrics = vaep.models.Metrics()" + "d_metrics = pimmslearn.models.Metrics()" ] }, { @@ -347,7 +347,7 @@ }, "outputs": [], "source": [ - "metrics_df = vaep.models.get_df_from_nested_dict(\n", + "metrics_df = pimmslearn.models.get_df_from_nested_dict(\n", " d_metrics.metrics, column_levels=['model', 'metric_name']).T\n", "metrics_df" ] @@ -395,13 +395,13 @@ "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 2))\n", - "ax, errors_bind = vaep.plotting.errors.plot_errors_binned(\n", + "ax, errors_bind = pimmslearn.plotting.errors.plot_errors_binned(\n", " val_pred_fake_na[top_5],\n", " ax=ax,\n", ")\n", "fname = args.out_figures / 'NAGuideR_errors_per_bin_val.png'\n", "files_out[fname.name] = fname.as_posix()\n", - "vaep.savefig(ax.get_figure(), fname)" + "pimmslearn.savefig(ax.get_figure(), fname)" ] }, { diff --git a/project/01_1_transfer_NAGuideR_pred.py b/project/01_1_transfer_NAGuideR_pred.py index bddf8f604..5cb05a555 100644 --- a/project/01_1_transfer_NAGuideR_pred.py +++ b/project/01_1_transfer_NAGuideR_pred.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -24,14 +24,14 @@ import matplotlib.pyplot as plt import pandas as pd -import vaep -import vaep.models -import vaep.pandas -from vaep.io import datasplits +import pimmslearn +import pimmslearn.models +import pimmslearn.pandas +from pimmslearn.io import datasplits -vaep.plotting.make_large_descriptors(5) +pimmslearn.plotting.make_large_descriptors(5) -logger = vaep.logging.setup_logger(logging.getLogger('vaep')) +logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn')) # %% tags=["hide-input"] # catch passed parameters @@ -55,8 +55,8 @@ # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.args_from_dict(args) args # %% tags=["hide-input"] @@ -134,7 +134,7 @@ # %% tags=["hide-input"] # papermill_description=metrics -d_metrics = vaep.models.Metrics() +d_metrics = pimmslearn.models.Metrics() # %% tags=["hide-input"] added_metrics = d_metrics.add_metrics(val_pred_fake_na.dropna(how='all', axis=1), 'valid_fake_na') @@ -148,7 +148,7 @@ pd.DataFrame(added_metrics) # %% tags=["hide-input"] -metrics_df = vaep.models.get_df_from_nested_dict( +metrics_df = pimmslearn.models.get_df_from_nested_dict( d_metrics.metrics, column_levels=['model', 'metric_name']).T metrics_df @@ -163,13 +163,13 @@ # %% tags=["hide-input"] fig, ax = plt.subplots(figsize=(8, 2)) -ax, errors_bind = vaep.plotting.errors.plot_errors_binned( +ax, errors_bind = pimmslearn.plotting.errors.plot_errors_binned( val_pred_fake_na[top_5], ax=ax, ) fname = args.out_figures / 'NAGuideR_errors_per_bin_val.png' files_out[fname.name] = fname.as_posix() -vaep.savefig(ax.get_figure(), fname) +pimmslearn.savefig(ax.get_figure(), fname) # %% tags=["hide-input"] files_out diff --git a/project/01_2_performance_plots.ipynb b/project/01_2_performance_plots.ipynb index 3d48f53a9..4e7acb9a5 100644 --- a/project/01_2_performance_plots.ipynb +++ b/project/01_2_performance_plots.ipynb @@ -39,22 +39,22 @@ "import yaml\n", "from IPython.display 
import display\n", "\n", - "import vaep\n", - "import vaep.imputation\n", - "import vaep.models\n", - "import vaep.nb\n", - "from vaep.analyzers import compare_predictions\n", - "from vaep.io import datasplits\n", - "from vaep.models.collect_dumps import collect, select_content\n", + "import pimmslearn\n", + "import pimmslearn.imputation\n", + "import pimmslearn.models\n", + "import pimmslearn.nb\n", + "from pimmslearn.analyzers import compare_predictions\n", + "from pimmslearn.io import datasplits\n", + "from pimmslearn.models.collect_dumps import collect, select_content\n", "\n", "pd.options.display.max_rows = 30\n", "pd.options.display.min_rows = 10\n", "pd.options.display.max_colwidth = 100\n", "\n", "plt.rcParams.update({'figure.figsize': (4, 2)})\n", - "vaep.plotting.make_large_descriptors(7)\n", + "pimmslearn.plotting.make_large_descriptors(7)\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "logging.getLogger('fontTools').setLevel(logging.WARNING)\n", "\n", "\n", @@ -149,7 +149,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.get_params(args, globals=globals())\n", "args" ] }, @@ -164,7 +164,7 @@ }, "outputs": [], "source": [ - "args = vaep.nb.args_from_dict(args)\n", + "args = pimmslearn.nb.args_from_dict(args)\n", "args" ] }, @@ -251,9 +251,9 @@ "source": [ "fig, axes = plt.subplots(1, 2, sharey=True, sharex=True)\n", "\n", - "vaep.plotting.data.plot_observations(data.val_y.unstack(), ax=axes[0],\n", + "pimmslearn.plotting.data.plot_observations(data.val_y.unstack(), ax=axes[0],\n", " title='Validation split', size=1, xlabel='')\n", - "vaep.plotting.data.plot_observations(data.test_y.unstack(), ax=axes[1],\n", + "pimmslearn.plotting.data.plot_observations(data.test_y.unstack(), ax=axes[1],\n", " title='Test split', size=1, xlabel='')\n", "fig.suptitle(\"Simulated missing values per sample\", size=8)\n", "# hide axis and use only for common x label\n", @@ -263,7 +263,7 @@ "group = 1\n", "fname = args.out_figures / f'2_{group}_fake_na_val_test_splits.png'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -287,7 +287,7 @@ "source": [ "# load frequency of training features...\n", "# needs to be pickle -> index.name needed\n", - "freq_feat = vaep.io.datasplits.load_freq(args.data, file='freq_features.json')\n", + "freq_feat = pimmslearn.io.datasplits.load_freq(args.data, file='freq_features.json')\n", "freq_feat.head() # training data" ] }, @@ -593,8 +593,8 @@ }, "outputs": [], "source": [ - "COLORS_TO_USE = vaep.plotting.defaults.assign_colors(list(k.upper() for k in ORDER_MODELS))\n", - "vaep.plotting.defaults.ModelColorVisualizer(ORDER_MODELS, COLORS_TO_USE)" + "COLORS_TO_USE = pimmslearn.plotting.defaults.assign_colors(list(k.upper() for k in ORDER_MODELS))\n", + "pimmslearn.plotting.defaults.ModelColorVisualizer(ORDER_MODELS, COLORS_TO_USE)" ] }, { @@ -651,7 +651,7 @@ " horizontalalignment='right')\n", "fname = args.out_figures / f'2_{group}_pred_corr_val_per_sample.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), name=fname)\n", + "pimmslearn.savefig(ax.get_figure(), name=fname)\n", "\n", "fname = args.out_figures / f'2_{group}_pred_corr_val_per_sample.xlsx'\n", "dumps[fname.stem] = fname\n", @@ -680,7 +680,7 @@ }, "outputs": [], "source": [ - "treshold = vaep.pandas.get_lower_whiskers(\n", + "treshold = pimmslearn.pandas.get_lower_whiskers(\n", " 
corr_per_sample_val[TOP_N_ORDER]).min()\n", "mask = (corr_per_sample_val[TOP_N_ORDER] < treshold).any(axis=1)\n", "corr_per_sample_val.loc[mask].style.highlight_min(\n", @@ -782,7 +782,7 @@ "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 3))\n", - "ax, errors_binned = vaep.plotting.errors.plot_errors_by_median(\n", + "ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median(\n", " pred_val[\n", " [TARGET_COL] + TOP_N_ORDER\n", " ],\n", @@ -795,7 +795,7 @@ "ax.legend(loc='best', ncols=len(TOP_N_ORDER))\n", "fname = args.out_figures / f'2_{group}_errors_binned_by_feat_median_val.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), name=fname)" + "pimmslearn.savefig(ax.get_figure(), name=fname)" ] }, { @@ -811,7 +811,7 @@ "outputs": [], "source": [ "# ! only used for reporting\n", - "plotted = vaep.plotting.errors.get_data_for_errors_by_median(\n", + "plotted = pimmslearn.plotting.errors.get_data_for_errors_by_median(\n", " errors=errors_binned,\n", " feat_name=FEAT_NAME_DISPLAY,\n", " metric_name=METRIC\n", @@ -890,7 +890,7 @@ }, "outputs": [], "source": [ - "errors_test_mae = vaep.pandas.calc_errors.get_absolute_error(\n", + "errors_test_mae = pimmslearn.pandas.calc_errors.get_absolute_error(\n", " pred_test\n", ")\n", "mae_stats_ordered_test = errors_test_mae.describe()[ORDER_MODELS]\n", @@ -969,7 +969,7 @@ }, "outputs": [], "source": [ - "min_max = vaep.plotting.data.min_max(pred_test[TARGET_COL])\n", + "min_max = pimmslearn.plotting.data.min_max(pred_test[TARGET_COL])\n", "top_n = 4\n", "fig, axes = plt.subplots(ncols=top_n, figsize=(8, 2), sharey=True)\n", "\n", @@ -978,13 +978,13 @@ " COLORS_TO_USE[:top_n],\n", " axes):\n", "\n", - " ax, bins = vaep.plotting.data.plot_histogram_intensities(\n", + " ax, bins = pimmslearn.plotting.data.plot_histogram_intensities(\n", " pred_test[TARGET_COL],\n", " color='grey',\n", " min_max=min_max,\n", " ax=ax\n", " )\n", - " ax, _ = vaep.plotting.data.plot_histogram_intensities(\n", + " ax, _ = pimmslearn.plotting.data.plot_histogram_intensities(\n", " pred_test[model],\n", " color=color,\n", " min_max=min_max,\n", @@ -999,7 +999,7 @@ "\n", "fname = args.out_figures / f'2_{group}_intensity_binned_top_{top_n}_models_test.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -1013,7 +1013,7 @@ }, "outputs": [], "source": [ - "counts_per_bin = vaep.pandas.get_counts_per_bin(df=pred_test,\n", + "counts_per_bin = pimmslearn.pandas.get_counts_per_bin(df=pred_test,\n", " bins=bins,\n", " columns=[TARGET_COL, *ORDER_MODELS[:top_n]])\n", "\n", @@ -1078,7 +1078,7 @@ " horizontalalignment='right')\n", "fname = args.out_figures / f'2_{group}_pred_corr_test_per_sample.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), name=fname)\n", + "pimmslearn.savefig(ax.get_figure(), name=fname)\n", "\n", "dumps[fname.stem] = fname.with_suffix('.xlsx')\n", "with pd.ExcelWriter(fname.with_suffix('.xlsx')) as w:\n", @@ -1106,7 +1106,7 @@ }, "outputs": [], "source": [ - "treshold = vaep.pandas.get_lower_whiskers(\n", + "treshold = pimmslearn.pandas.get_lower_whiskers(\n", " corr_per_sample_test[TOP_N_ORDER]).min()\n", "mask = (corr_per_sample_test[TOP_N_ORDER] < treshold).any(axis=1)\n", "corr_per_sample_test.loc[mask].style.highlight_min(\n", @@ -1210,7 +1210,7 @@ " horizontalalignment='right')\n", "fname = args.out_figures / f'2_{group}_pred_corr_test_per_feat.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), 
name=fname)\n", + "pimmslearn.savefig(ax.get_figure(), name=fname)\n", "dumps[fname.stem] = fname.with_suffix('.xlsx')\n", "with pd.ExcelWriter(fname.with_suffix('.xlsx')) as w:\n", " corr_per_feat_test.loc[~too_few_obs].describe().to_excel(\n", @@ -1246,7 +1246,7 @@ }, "outputs": [], "source": [ - "treshold = vaep.pandas.get_lower_whiskers(\n", + "treshold = pimmslearn.pandas.get_lower_whiskers(\n", " corr_per_feat_test[TOP_N_ORDER]).min()\n", "mask = (corr_per_feat_test[TOP_N_ORDER] < treshold).any(axis=1)\n", "\n", @@ -1289,7 +1289,7 @@ }, "outputs": [], "source": [ - "metrics = vaep.models.Metrics()\n", + "metrics = pimmslearn.models.Metrics()\n", "test_metrics = metrics.add_metrics(\n", " pred_test[['observed', *TOP_N_ORDER]], key='test data')\n", "test_metrics = pd.DataFrame(test_metrics)[TOP_N_ORDER]\n", @@ -1371,12 +1371,12 @@ " color=COLORS_TO_USE,\n", " ax=ax,\n", " width=.7)\n", - "ax = vaep.plotting.add_height_to_barplot(ax, size=7)\n", - "ax = vaep.plotting.add_text_to_barplot(ax, _to_plot.loc[\"text\"], size=7)\n", + "ax = pimmslearn.plotting.add_height_to_barplot(ax, size=7)\n", + "ax = pimmslearn.plotting.add_text_to_barplot(ax, _to_plot.loc[\"text\"], size=7)\n", "ax.set_xticklabels([])\n", "fname = args.out_figures / f'2_{group}_performance_test.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -1419,10 +1419,10 @@ }, "outputs": [], "source": [ - "vaep.plotting.make_large_descriptors(7)\n", + "pimmslearn.plotting.make_large_descriptors(7)\n", "fig, ax = plt.subplots(figsize=(8, 2))\n", "\n", - "ax, errors_binned = vaep.plotting.errors.plot_errors_by_median(\n", + "ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median(\n", " pred=pred_test[\n", " [TARGET_COL] + TOP_N_ORDER\n", " ],\n", @@ -1433,10 +1433,10 @@ " palette=COLORS_TO_USE\n", ")\n", "ax.legend(loc='best', ncols=len(TOP_N_ORDER))\n", - "vaep.plotting.make_large_descriptors(6)\n", + "pimmslearn.plotting.make_large_descriptors(6)\n", "fname = args.out_figures / f'2_{group}_test_errors_binned_by_feat_medians.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), name=fname)\n", + "pimmslearn.savefig(ax.get_figure(), name=fname)\n", "\n", "dumps[fname.stem] = fname.with_suffix('.csv')\n", "errors_binned.to_csv(fname.with_suffix('.csv'))\n", @@ -1456,7 +1456,7 @@ "outputs": [], "source": [ "# ! 
only used for reporting\n", - "plotted = vaep.plotting.errors.get_data_for_errors_by_median(\n", + "plotted = pimmslearn.plotting.errors.get_data_for_errors_by_median(\n", " errors=errors_binned,\n", " feat_name=FEAT_NAME_DISPLAY,\n", " metric_name=METRIC\n", @@ -1505,7 +1505,7 @@ "outputs": [], "source": [ "if SEL_MODELS:\n", - " metrics = vaep.models.Metrics()\n", + " metrics = pimmslearn.models.Metrics()\n", " test_metrics = metrics.add_metrics(\n", " pred_test[['observed', *SEL_MODELS]], key='test data')\n", " test_metrics = pd.DataFrame(test_metrics)[SEL_MODELS]\n", @@ -1535,18 +1535,18 @@ " rot=0,\n", " ylabel=f\"{METRIC} for {FEAT_NAME_DISPLAY} ({n_in_comparison:,} intensities)\",\n", " # title=f'performance on test data (based on {n_in_comparison:,} measurements)',\n", - " color=vaep.plotting.defaults.assign_colors(\n", + " color=pimmslearn.plotting.defaults.assign_colors(\n", " list(k.upper() for k in SEL_MODELS)),\n", " ax=ax,\n", " width=.7)\n", " ax.legend(loc='best', ncols=len(SEL_MODELS))\n", - " ax = vaep.plotting.add_height_to_barplot(ax, size=5)\n", - " ax = vaep.plotting.add_text_to_barplot(ax, _to_plot.loc[\"text\"], size=5)\n", + " ax = pimmslearn.plotting.add_height_to_barplot(ax, size=5)\n", + " ax = pimmslearn.plotting.add_text_to_barplot(ax, _to_plot.loc[\"text\"], size=5)\n", " ax.set_xticklabels([])\n", "\n", " fname = args.out_figures / f'2_{group}_performance_test_sel.pdf'\n", " figures[fname.stem] = fname\n", - " vaep.savefig(fig, name=fname)\n", + " pimmslearn.savefig(fig, name=fname)\n", "\n", " dumps[fname.stem] = fname.with_suffix('.csv')\n", " _to_plot_long = _to_plot.T\n", @@ -1571,10 +1571,10 @@ "source": [ "# custom selection\n", "if SEL_MODELS:\n", - " vaep.plotting.make_large_descriptors(7)\n", + " pimmslearn.plotting.make_large_descriptors(7)\n", " fig, ax = plt.subplots(figsize=(8, 2))\n", "\n", - " ax, errors_binned = vaep.plotting.errors.plot_errors_by_median(\n", + " ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median(\n", " pred=pred_test[\n", " [TARGET_COL] + SEL_MODELS\n", " ],\n", @@ -1582,7 +1582,7 @@ " ax=ax,\n", " metric_name=METRIC,\n", " feat_name=FEAT_NAME_DISPLAY,\n", - " palette=vaep.plotting.defaults.assign_colors(\n", + " palette=pimmslearn.plotting.defaults.assign_colors(\n", " list(k.upper() for k in SEL_MODELS))\n", " )\n", " # ax.set_ylim(0, 1.5)\n", @@ -1591,16 +1591,16 @@ " # text.set_fontsize(6)\n", " fname = args.out_figures / f'2_{group}_test_errors_binned_by_feat_medians_sel.pdf'\n", " figures[fname.stem] = fname\n", - " vaep.savefig(ax.get_figure(), name=fname)\n", + " pimmslearn.savefig(ax.get_figure(), name=fname)\n", " plt.show(fig)\n", "\n", " dumps[fname.stem] = fname.with_suffix('.csv')\n", " errors_binned.to_csv(fname.with_suffix('.csv'))\n", - " vaep.plotting.make_large_descriptors(6)\n", + " pimmslearn.plotting.make_large_descriptors(6)\n", " # ax.xaxis.set_tick_params(rotation=0) # horizontal\n", "\n", " # ! 
only used for reporting\n", - " plotted = vaep.plotting.errors.get_data_for_errors_by_median(\n", + " plotted = pimmslearn.plotting.errors.get_data_for_errors_by_median(\n", " errors=errors_binned,\n", " feat_name=FEAT_NAME_DISPLAY,\n", " metric_name=METRIC\n", @@ -1631,7 +1631,7 @@ "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 2))\n", - "ax, errors_binned = vaep.plotting.errors.plot_errors_binned(\n", + "ax, errors_binned = pimmslearn.plotting.errors.plot_errors_binned(\n", " pred_test[\n", " [TARGET_COL] + TOP_N_ORDER\n", " ],\n", @@ -1642,7 +1642,7 @@ "ax.legend(loc='best', ncols=len(TOP_N_ORDER))\n", "fname = args.out_figures / f'2_{group}_test_errors_binned_by_int.pdf'\n", "figures[fname.stem] = fname\n", - "vaep.savefig(ax.get_figure(), name=fname)" + "pimmslearn.savefig(ax.get_figure(), name=fname)" ] }, { diff --git a/project/01_2_performance_plots.py b/project/01_2_performance_plots.py index fd175a7f1..ee4c5ed09 100644 --- a/project/01_2_performance_plots.py +++ b/project/01_2_performance_plots.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 (ipykernel) # language: python @@ -37,22 +37,22 @@ import yaml from IPython.display import display -import vaep -import vaep.imputation -import vaep.models -import vaep.nb -from vaep.analyzers import compare_predictions -from vaep.io import datasplits -from vaep.models.collect_dumps import collect, select_content +import pimmslearn +import pimmslearn.imputation +import pimmslearn.models +import pimmslearn.nb +from pimmslearn.analyzers import compare_predictions +from pimmslearn.io import datasplits +from pimmslearn.models.collect_dumps import collect, select_content pd.options.display.max_rows = 30 pd.options.display.min_rows = 10 pd.options.display.max_colwidth = 100 plt.rcParams.update({'figure.figsize': (4, 2)}) -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.WARNING) @@ -105,11 +105,11 @@ def build_text(s): # Some argument transformations # %% tags=["hide-input"] -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% tags=["hide-input"] -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) args # %% tags=["hide-input"] @@ -141,9 +141,9 @@ def build_text(s): # %% tags=["hide-input"] fig, axes = plt.subplots(1, 2, sharey=True, sharex=True) -vaep.plotting.data.plot_observations(data.val_y.unstack(), ax=axes[0], +pimmslearn.plotting.data.plot_observations(data.val_y.unstack(), ax=axes[0], title='Validation split', size=1, xlabel='') -vaep.plotting.data.plot_observations(data.test_y.unstack(), ax=axes[1], +pimmslearn.plotting.data.plot_observations(data.test_y.unstack(), ax=axes[1], title='Test split', size=1, xlabel='') fig.suptitle("Simulated missing values per sample", size=8) # hide axis and use only for common x label @@ -153,7 +153,7 @@ def build_text(s): group = 1 fname = args.out_figures / f'2_{group}_fake_na_val_test_splits.png' figures[fname.stem] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # ## data completeness across entire data @@ -161,7 +161,7 @@ def build_text(s): # %% tags=["hide-input"] # load frequency of training features... 
# needs to be pickle -> index.name needed -freq_feat = vaep.io.datasplits.load_freq(args.data, file='freq_features.json') +freq_feat = pimmslearn.io.datasplits.load_freq(args.data, file='freq_features.json') freq_feat.head() # training data # %% tags=["hide-input"] @@ -283,8 +283,8 @@ def build_text(s): # > 2. User defined model keys for the same model with two configuration will yield different colors. # %% tags=["hide-input"] -COLORS_TO_USE = vaep.plotting.defaults.assign_colors(list(k.upper() for k in ORDER_MODELS)) -vaep.plotting.defaults.ModelColorVisualizer(ORDER_MODELS, COLORS_TO_USE) +COLORS_TO_USE = pimmslearn.plotting.defaults.assign_colors(list(k.upper() for k in ORDER_MODELS)) +pimmslearn.plotting.defaults.ModelColorVisualizer(ORDER_MODELS, COLORS_TO_USE) # %% tags=["hide-input"] TOP_N_ORDER = ORDER_MODELS[:args.plot_to_n] @@ -314,7 +314,7 @@ def build_text(s): horizontalalignment='right') fname = args.out_figures / f'2_{group}_pred_corr_val_per_sample.pdf' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) fname = args.out_figures / f'2_{group}_pred_corr_val_per_sample.xlsx' dumps[fname.stem] = fname @@ -327,7 +327,7 @@ def build_text(s): # identify samples which are below lower whisker for models # %% tags=["hide-input"] -treshold = vaep.pandas.get_lower_whiskers( +treshold = pimmslearn.pandas.get_lower_whiskers( corr_per_sample_val[TOP_N_ORDER]).min() mask = (corr_per_sample_val[TOP_N_ORDER] < treshold).any(axis=1) corr_per_sample_val.loc[mask].style.highlight_min( @@ -364,7 +364,7 @@ def build_text(s): # %% tags=["hide-input"] fig, ax = plt.subplots(figsize=(8, 3)) -ax, errors_binned = vaep.plotting.errors.plot_errors_by_median( +ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median( pred_val[ [TARGET_COL] + TOP_N_ORDER ], @@ -377,11 +377,11 @@ def build_text(s): ax.legend(loc='best', ncols=len(TOP_N_ORDER)) fname = args.out_figures / f'2_{group}_errors_binned_by_feat_median_val.pdf' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) # %% tags=["hide-input"] # # ! only used for reporting -plotted = vaep.plotting.errors.get_data_for_errors_by_median( +plotted = pimmslearn.plotting.errors.get_data_for_errors_by_median( errors=errors_binned, feat_name=FEAT_NAME_DISPLAY, metric_name=METRIC @@ -418,7 +418,7 @@ def build_text(s): # Write averages for all models to excel (from before?) 
# %% tags=["hide-input"] -errors_test_mae = vaep.pandas.calc_errors.get_absolute_error( +errors_test_mae = pimmslearn.pandas.calc_errors.get_absolute_error( pred_test ) mae_stats_ordered_test = errors_test_mae.describe()[ORDER_MODELS] @@ -445,7 +445,7 @@ def build_text(s): # ### Intensity distribution as histogram # Plot top 4 models predictions for intensities in test data # %% tags=["hide-input"] -min_max = vaep.plotting.data.min_max(pred_test[TARGET_COL]) +min_max = pimmslearn.plotting.data.min_max(pred_test[TARGET_COL]) top_n = 4 fig, axes = plt.subplots(ncols=top_n, figsize=(8, 2), sharey=True) @@ -454,13 +454,13 @@ def build_text(s): COLORS_TO_USE[:top_n], axes): - ax, bins = vaep.plotting.data.plot_histogram_intensities( + ax, bins = pimmslearn.plotting.data.plot_histogram_intensities( pred_test[TARGET_COL], color='grey', min_max=min_max, ax=ax ) - ax, _ = vaep.plotting.data.plot_histogram_intensities( + ax, _ = pimmslearn.plotting.data.plot_histogram_intensities( pred_test[model], color=color, min_max=min_max, @@ -475,10 +475,10 @@ def build_text(s): fname = args.out_figures / f'2_{group}_intensity_binned_top_{top_n}_models_test.pdf' figures[fname.stem] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% tags=["hide-input"] -counts_per_bin = vaep.pandas.get_counts_per_bin(df=pred_test, +counts_per_bin = pimmslearn.pandas.get_counts_per_bin(df=pred_test, bins=bins, columns=[TARGET_COL, *ORDER_MODELS[:top_n]]) @@ -516,7 +516,7 @@ def build_text(s): horizontalalignment='right') fname = args.out_figures / f'2_{group}_pred_corr_test_per_sample.pdf' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) dumps[fname.stem] = fname.with_suffix('.xlsx') with pd.ExcelWriter(fname.with_suffix('.xlsx')) as w: @@ -528,7 +528,7 @@ def build_text(s): # identify samples which are below lower whisker for models # %% tags=["hide-input"] -treshold = vaep.pandas.get_lower_whiskers( +treshold = pimmslearn.pandas.get_lower_whiskers( corr_per_sample_test[TOP_N_ORDER]).min() mask = (corr_per_sample_test[TOP_N_ORDER] < treshold).any(axis=1) corr_per_sample_test.loc[mask].style.highlight_min( @@ -572,7 +572,7 @@ def build_text(s): horizontalalignment='right') fname = args.out_figures / f'2_{group}_pred_corr_test_per_feat.pdf' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) dumps[fname.stem] = fname.with_suffix('.xlsx') with pd.ExcelWriter(fname.with_suffix('.xlsx')) as w: corr_per_feat_test.loc[~too_few_obs].describe().to_excel( @@ -586,7 +586,7 @@ def build_text(s): feat_count_test.head() # %% tags=["hide-input"] -treshold = vaep.pandas.get_lower_whiskers( +treshold = pimmslearn.pandas.get_lower_whiskers( corr_per_feat_test[TOP_N_ORDER]).min() mask = (corr_per_feat_test[TOP_N_ORDER] < treshold).any(axis=1) @@ -613,7 +613,7 @@ def highlight_min(s, color, tolerence=0.00001): # ### Error plot # %% tags=["hide-input"] -metrics = vaep.models.Metrics() +metrics = pimmslearn.models.Metrics() test_metrics = metrics.add_metrics( pred_test[['observed', *TOP_N_ORDER]], key='test data') test_metrics = pd.DataFrame(test_metrics)[TOP_N_ORDER] @@ -651,12 +651,12 @@ def highlight_min(s, color, tolerence=0.00001): color=COLORS_TO_USE, ax=ax, width=.7) -ax = vaep.plotting.add_height_to_barplot(ax, size=7) -ax = vaep.plotting.add_text_to_barplot(ax, _to_plot.loc["text"], size=7) +ax = pimmslearn.plotting.add_height_to_barplot(ax, size=7) +ax = 
pimmslearn.plotting.add_text_to_barplot(ax, _to_plot.loc["text"], size=7) ax.set_xticklabels([]) fname = args.out_figures / f'2_{group}_performance_test.pdf' figures[fname.stem] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% tags=["hide-input"] dumps[fname.stem] = fname.with_suffix('.csv') @@ -672,10 +672,10 @@ def highlight_min(s, color, tolerence=0.00001): # ### Plot error by median feature intensity # %% tags=["hide-input"] -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) fig, ax = plt.subplots(figsize=(8, 2)) -ax, errors_binned = vaep.plotting.errors.plot_errors_by_median( +ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median( pred=pred_test[ [TARGET_COL] + TOP_N_ORDER ], @@ -686,10 +686,10 @@ def highlight_min(s, color, tolerence=0.00001): palette=COLORS_TO_USE ) ax.legend(loc='best', ncols=len(TOP_N_ORDER)) -vaep.plotting.make_large_descriptors(6) +pimmslearn.plotting.make_large_descriptors(6) fname = args.out_figures / f'2_{group}_test_errors_binned_by_feat_medians.pdf' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) dumps[fname.stem] = fname.with_suffix('.csv') errors_binned.to_csv(fname.with_suffix('.csv')) @@ -697,7 +697,7 @@ def highlight_min(s, color, tolerence=0.00001): # %% tags=["hide-input"] # # ! only used for reporting -plotted = vaep.plotting.errors.get_data_for_errors_by_median( +plotted = pimmslearn.plotting.errors.get_data_for_errors_by_median( errors=errors_binned, feat_name=FEAT_NAME_DISPLAY, metric_name=METRIC @@ -719,7 +719,7 @@ def highlight_min(s, color, tolerence=0.00001): # %% tags=["hide-input"] if SEL_MODELS: - metrics = vaep.models.Metrics() + metrics = pimmslearn.models.Metrics() test_metrics = metrics.add_metrics( pred_test[['observed', *SEL_MODELS]], key='test data') test_metrics = pd.DataFrame(test_metrics)[SEL_MODELS] @@ -749,18 +749,18 @@ def highlight_min(s, color, tolerence=0.00001): rot=0, ylabel=f"{METRIC} for {FEAT_NAME_DISPLAY} ({n_in_comparison:,} intensities)", # title=f'performance on test data (based on {n_in_comparison:,} measurements)', - color=vaep.plotting.defaults.assign_colors( + color=pimmslearn.plotting.defaults.assign_colors( list(k.upper() for k in SEL_MODELS)), ax=ax, width=.7) ax.legend(loc='best', ncols=len(SEL_MODELS)) - ax = vaep.plotting.add_height_to_barplot(ax, size=5) - ax = vaep.plotting.add_text_to_barplot(ax, _to_plot.loc["text"], size=5) + ax = pimmslearn.plotting.add_height_to_barplot(ax, size=5) + ax = pimmslearn.plotting.add_text_to_barplot(ax, _to_plot.loc["text"], size=5) ax.set_xticklabels([]) fname = args.out_figures / f'2_{group}_performance_test_sel.pdf' figures[fname.stem] = fname - vaep.savefig(fig, name=fname) + pimmslearn.savefig(fig, name=fname) dumps[fname.stem] = fname.with_suffix('.csv') _to_plot_long = _to_plot.T @@ -774,10 +774,10 @@ def highlight_min(s, color, tolerence=0.00001): # %% tags=["hide-input"] # custom selection if SEL_MODELS: - vaep.plotting.make_large_descriptors(7) + pimmslearn.plotting.make_large_descriptors(7) fig, ax = plt.subplots(figsize=(8, 2)) - ax, errors_binned = vaep.plotting.errors.plot_errors_by_median( + ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median( pred=pred_test[ [TARGET_COL] + SEL_MODELS ], @@ -785,7 +785,7 @@ def highlight_min(s, color, tolerence=0.00001): ax=ax, metric_name=METRIC, feat_name=FEAT_NAME_DISPLAY, - palette=vaep.plotting.defaults.assign_colors( + 
palette=pimmslearn.plotting.defaults.assign_colors( list(k.upper() for k in SEL_MODELS)) ) # ax.set_ylim(0, 1.5) @@ -794,16 +794,16 @@ def highlight_min(s, color, tolerence=0.00001): # text.set_fontsize(6) fname = args.out_figures / f'2_{group}_test_errors_binned_by_feat_medians_sel.pdf' figures[fname.stem] = fname - vaep.savefig(ax.get_figure(), name=fname) + pimmslearn.savefig(ax.get_figure(), name=fname) plt.show(fig) dumps[fname.stem] = fname.with_suffix('.csv') errors_binned.to_csv(fname.with_suffix('.csv')) - vaep.plotting.make_large_descriptors(6) + pimmslearn.plotting.make_large_descriptors(6) # ax.xaxis.set_tick_params(rotation=0) # horizontal # # ! only used for reporting - plotted = vaep.plotting.errors.get_data_for_errors_by_median( + plotted = pimmslearn.plotting.errors.get_data_for_errors_by_median( errors=errors_binned, feat_name=FEAT_NAME_DISPLAY, metric_name=METRIC @@ -819,7 +819,7 @@ def highlight_min(s, color, tolerence=0.00001): # %% tags=["hide-input"] fig, ax = plt.subplots(figsize=(8, 2)) -ax, errors_binned = vaep.plotting.errors.plot_errors_binned( +ax, errors_binned = pimmslearn.plotting.errors.plot_errors_binned( pred_test[ [TARGET_COL] + TOP_N_ORDER ], @@ -830,7 +830,7 @@ def highlight_min(s, color, tolerence=0.00001): ax.legend(loc='best', ncols=len(TOP_N_ORDER)) fname = args.out_figures / f'2_{group}_test_errors_binned_by_int.pdf' figures[fname.stem] = fname -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) # %% tags=["hide-input"] dumps[fname.stem] = fname.with_suffix('.csv') diff --git a/project/01_3_revision3.py b/project/01_3_revision3.py index 9362de276..9100d35aa 100644 --- a/project/01_3_revision3.py +++ b/project/01_3_revision3.py @@ -27,21 +27,21 @@ import pandas as pd import yaml -import vaep -import vaep.imputation -import vaep.models -import vaep.nb -from vaep.analyzers import compare_predictions -from vaep.models.collect_dumps import select_content +import pimmslearn +import pimmslearn.imputation +import pimmslearn.models +import pimmslearn.nb +from pimmslearn.analyzers import compare_predictions +from pimmslearn.models.collect_dumps import select_content pd.options.display.max_rows = 30 pd.options.display.min_rows = 10 pd.options.display.max_colwidth = 100 plt.rcParams.update({'figure.figsize': (3, 2)}) -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.WARNING) @@ -97,11 +97,11 @@ def build_text(s): # Some argument transformations # %% -args = vaep.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.get_params(args, globals=globals()) args # %% -args = vaep.nb.args_from_dict(args) +args = pimmslearn.nb.args_from_dict(args) args # %% @@ -154,14 +154,14 @@ def build_text(s): pred_test = pred_test.dropna() # %% -metrics = vaep.models.Metrics() +metrics = pimmslearn.models.Metrics() test_metrics = metrics.add_metrics( pred_test, key='test data') test_metrics = pd.DataFrame(test_metrics) test_metrics # %% -metrics = vaep.models.Metrics() +metrics = pimmslearn.models.Metrics() val_metrics = metrics.add_metrics( pred_val, key='validation data') val_metrics = pd.DataFrame(val_metrics) diff --git a/project/02_1_aggregate_metrics.py.ipynb b/project/02_1_aggregate_metrics.py.ipynb index 32421032d..16f37781c 100644 --- a/project/02_1_aggregate_metrics.py.ipynb +++ b/project/02_1_aggregate_metrics.py.ipynb @@ -10,7 +10,7 @@ "from pathlib 
import Path\n", "import pandas as pd\n", "\n", - "from vaep.models.collect_dumps import collect_metrics" + "from pimmslearn.models.collect_dumps import collect_metrics" ] }, { diff --git a/project/02_1_aggregate_metrics.py.py b/project/02_1_aggregate_metrics.py.py index ea11f334a..575a89fce 100644 --- a/project/02_1_aggregate_metrics.py.py +++ b/project/02_1_aggregate_metrics.py.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -16,7 +16,7 @@ from pathlib import Path import pandas as pd -from vaep.models.collect_dumps import collect_metrics +from pimmslearn.models.collect_dumps import collect_metrics # %% all_metrics = collect_metrics(snakemake.input) diff --git a/project/02_1_join_metrics.py.py b/project/02_1_join_metrics.py.py index 8b395c187..4b3aacb1e 100644 --- a/project/02_1_join_metrics.py.py +++ b/project/02_1_join_metrics.py.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python diff --git a/project/02_2_aggregate_configs.py.ipynb b/project/02_2_aggregate_configs.py.ipynb index cdb9d77fb..e396665db 100644 --- a/project/02_2_aggregate_configs.py.ipynb +++ b/project/02_2_aggregate_configs.py.ipynb @@ -21,8 +21,8 @@ "from pathlib import Path\n", "import pandas as pd\n", "\n", - "from vaep.logging import setup_nb_logger\n", - "from vaep.models.collect_dumps import collect_configs\n", + "from pimmslearn.logging import setup_nb_logger\n", + "from pimmslearn.models.collect_dumps import collect_configs\n", "\n", "pd.options.display.max_columns = 30\n", "\n", diff --git a/project/02_2_aggregate_configs.py.py b/project/02_2_aggregate_configs.py.py index dc8ba3a3a..5f31369e4 100644 --- a/project/02_2_aggregate_configs.py.py +++ b/project/02_2_aggregate_configs.py.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -21,8 +21,8 @@ from pathlib import Path import pandas as pd -from vaep.logging import setup_nb_logger -from vaep.models.collect_dumps import collect_configs +from pimmslearn.logging import setup_nb_logger +from pimmslearn.models.collect_dumps import collect_configs pd.options.display.max_columns = 30 diff --git a/project/02_2_join_configs.py.py b/project/02_2_join_configs.py.py index d8381e119..1d6aaa5d8 100644 --- a/project/02_2_join_configs.py.py +++ b/project/02_2_join_configs.py.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python diff --git a/project/02_3_grid_search_analysis.ipynb b/project/02_3_grid_search_analysis.ipynb index 4a3028aa6..32ca396d9 100644 --- a/project/02_3_grid_search_analysis.ipynb +++ b/project/02_3_grid_search_analysis.ipynb @@ -16,7 +16,6 @@ "metadata": {}, "outputs": [], "source": [ - "import snakemake\n", "import logging\n", "import pathlib\n", "\n", @@ -25,15 +24,16 @@ "import pandas as pd\n", "import plotly.express as px\n", "import seaborn as sns\n", + "import snakemake\n", "\n", - "import vaep.io\n", - "import vaep.nb\n", - "import vaep.pandas\n", - "import vaep.plotting.plotly as px_vaep\n", - "import vaep.utils\n", - "from vaep import sampling\n", - "from vaep.analyzers 
import compare_predictions\n", - "from vaep.io import datasplits\n", + "import pimmslearn.io\n", + "import pimmslearn.nb\n", + "import pimmslearn.pandas\n", + "import pimmslearn.plotting.plotly as px_pimmslearn\n", + "import pimmslearn.utils\n", + "from pimmslearn import sampling\n", + "from pimmslearn.analyzers import compare_predictions\n", + "from pimmslearn.io import datasplits\n", "\n", "matplotlib.rcParams['figure.figsize'] = [12.0, 6.0]\n", "\n", @@ -42,7 +42,7 @@ "pd.options.display.max_rows = 100\n", "pd.options.display.multi_sparse = False\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "logging.getLogger('fontTools').setLevel(logging.WARNING)" ] }, @@ -340,7 +340,7 @@ " horizontalalignment='right')\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, name='top_10_models_validation_fake_na', folder=FOLDER)" + "pimmslearn.savefig(fig, name='top_10_models_validation_fake_na', folder=FOLDER)" ] }, { @@ -627,7 +627,7 @@ "plt.rcParams['figure.figsize'] = (7, 4)\n", "plt.rcParams['lines.linewidth'] = 2\n", "plt.rcParams['lines.markersize'] = 3\n", - "vaep.plotting.make_large_descriptors(7)\n", + "pimmslearn.plotting.make_large_descriptors(7)\n", "\n", "col_order = ('valid_fake_na', 'test_fake_na')\n", "row_order = ('MAE', 'MSE')\n", @@ -641,7 +641,7 @@ " row_order=row_order,\n", " hue=\"model\",\n", " # style=\"day\",\n", - " palette=vaep.plotting.defaults.color_model_mapping,\n", + " palette=pimmslearn.plotting.defaults.color_model_mapping,\n", " height=2,\n", " aspect=1.8,\n", " kind=\"scatter\",\n", @@ -1041,7 +1041,7 @@ "metadata": {}, "outputs": [], "source": [ - "errors = vaep.pandas.calc_errors_per_feat(\n", + "errors = pimmslearn.pandas.calc_errors_per_feat(\n", " pred=pred_split, freq_feat=freq_feat, target_col='observed')\n", "errors" ] @@ -1062,7 +1062,7 @@ " xlabel='number of samples',\n", " ylabel='observations')\n", " )\n", - "vaep.savefig(ax.get_figure(), files_out[f'n_obs_error_counts_{dataset}.pdf'])" + "pimmslearn.savefig(ax.get_figure(), files_out[f'n_obs_error_counts_{dataset}.pdf'])" ] }, { @@ -1147,7 +1147,7 @@ "\n", "files_out[f'best_models_ld_{min_latent}_rolling_errors_by_freq'] = (\n", " FOLDER / f'best_models_ld_{min_latent}_rolling_errors_by_freq')\n", - "vaep.savefig(\n", + "pimmslearn.savefig(\n", " ax.get_figure(),\n", " name=files_out[f'best_models_ld_{min_latent}_rolling_errors_by_freq'])" ] @@ -1182,19 +1182,19 @@ }, "outputs": [], "source": [ - "fig = px_vaep.line((errors_smoothed_long\n", - " .loc[errors_smoothed_long[freq_feat.name] >= FREQ_MIN]\n", - " .join(n_obs_error_is_based_on)\n", - " .sort_values(by='freq')),\n", - " x=freq_feat.name,\n", - " color='model',\n", - " y='rolling error average',\n", - " hover_data=['n_obs'],\n", - " # title=f'Rolling average error by feature frequency {msg_annotation}',\n", - " labels=labels_dict,\n", - " category_orders={'model': order},\n", - " )\n", - "fig = px_vaep.apply_default_layout(fig)\n", + "fig = px.line((errors_smoothed_long\n", + " .loc[errors_smoothed_long[freq_feat.name] >= FREQ_MIN]\n", + " .join(n_obs_error_is_based_on)\n", + " .sort_values(by='freq')),\n", + " x=freq_feat.name,\n", + " color='model',\n", + " y='rolling error average',\n", + " hover_data=['n_obs'],\n", + " # title=f'Rolling average error by feature frequency {msg_annotation}',\n", + " labels=labels_dict,\n", + " category_orders={'model': order},\n", + " )\n", + "fig = px_pimmslearn.apply_default_layout(fig)\n", 
"fig.update_layout(legend_title_text='') # remove legend title\n", "files_out[f'best_models_ld_{min_latent}_errors_by_freq_plotly.html'] = (\n", " FOLDER / f'best_models_ld_{min_latent}_errors_by_freq_plotly.html')\n", @@ -1238,7 +1238,7 @@ ")\n", "files_out[f'best_models_ld_{min_latent}_errors_by_freq_averaged'] = (\n", " FOLDER / f'best_models_ld_{min_latent}_errors_by_freq_averaged')\n", - "vaep.savefig(\n", + "pimmslearn.savefig(\n", " ax.get_figure(),\n", " files_out[f'best_models_ld_{min_latent}_errors_by_freq_averaged'])" ] @@ -1397,7 +1397,7 @@ "metadata": {}, "outputs": [], "source": [ - "errors = vaep.pandas.calc_errors_per_feat(\n", + "errors = pimmslearn.pandas.calc_errors_per_feat(\n", " pred=pred_split, freq_feat=freq_feat, target_col='observed')\n", "idx_name = errors.index.name\n", "errors" @@ -1413,8 +1413,8 @@ "files_out[f'best_models_errors_counts_obs_{dataset}.pdf'] = (FOLDER /\n", " f'n_obs_error_counts_{dataset}.pdf')\n", "ax = errors['n_obs'].value_counts().sort_index().plot(style='.')\n", - "vaep.savefig(ax.get_figure(),\n", - " files_out[f'best_models_errors_counts_obs_{dataset}.pdf'])" + "pimmslearn.savefig(ax.get_figure(),\n", + " files_out[f'best_models_errors_counts_obs_{dataset}.pdf'])" ] }, { @@ -1465,7 +1465,7 @@ " # title=f'Rolling average error by feature frequency {msg_annotation}'\n", " ))\n", "\n", - "vaep.savefig(\n", + "pimmslearn.savefig(\n", " ax.get_figure(),\n", " folder=FOLDER,\n", " name=f'best_models_rolling_errors_{dataset}')" @@ -1600,7 +1600,7 @@ " horizontalalignment='right')\n", "files_out[f'pred_corr_per_feat_{dataset}'] = (FOLDER /\n", " f'pred_corr_per_feat_{dataset}')\n", - "vaep.savefig(ax.get_figure(), name=files_out[f'pred_corr_per_feat_{dataset}'])" + "pimmslearn.savefig(ax.get_figure(), name=files_out[f'pred_corr_per_feat_{dataset}'])" ] }, { @@ -1659,8 +1659,8 @@ " horizontalalignment='right')\n", "files_out[f'pred_corr_per_sample_{dataset}'] = (FOLDER /\n", " f'pred_corr_per_sample_{dataset}')\n", - "vaep.savefig(ax.get_figure(),\n", - " name=files_out[f'pred_corr_per_sample_{dataset}'])" + "pimmslearn.savefig(ax.get_figure(),\n", + " name=files_out[f'pred_corr_per_sample_{dataset}'])" ] }, { diff --git a/project/02_3_grid_search_analysis.py b/project/02_3_grid_search_analysis.py index 540bebc05..99891a7b3 100644 --- a/project/02_3_grid_search_analysis.py +++ b/project/02_3_grid_search_analysis.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -17,7 +17,6 @@ # # Analyis of grid hyperparameter search # %% -import snakemake import logging import pathlib @@ -26,15 +25,16 @@ import pandas as pd import plotly.express as px import seaborn as sns +import snakemake -import vaep.io -import vaep.nb -import vaep.pandas -import vaep.plotting.plotly as px_vaep -import vaep.utils -from vaep import sampling -from vaep.analyzers import compare_predictions -from vaep.io import datasplits +import pimmslearn.io +import pimmslearn.nb +import pimmslearn.pandas +import pimmslearn.plotting.plotly as px_pimmslearn +import pimmslearn.utils +from pimmslearn import sampling +from pimmslearn.analyzers import compare_predictions +from pimmslearn.io import datasplits matplotlib.rcParams['figure.figsize'] = [12.0, 6.0] @@ -43,7 +43,7 @@ pd.options.display.max_rows = 100 pd.options.display.multi_sparse = False -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() 
logging.getLogger('fontTools').setLevel(logging.WARNING) # %% [markdown] @@ -187,7 +187,7 @@ horizontalalignment='right') fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, name='top_10_models_validation_fake_na', folder=FOLDER) +pimmslearn.savefig(fig, name='top_10_models_validation_fake_na', folder=FOLDER) # %% [markdown] # ## Create metrics in long format @@ -330,7 +330,7 @@ plt.rcParams['figure.figsize'] = (7, 4) plt.rcParams['lines.linewidth'] = 2 plt.rcParams['lines.markersize'] = 3 -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) col_order = ('valid_fake_na', 'test_fake_na') row_order = ('MAE', 'MSE') @@ -344,7 +344,7 @@ row_order=row_order, hue="model", # style="day", - palette=vaep.plotting.defaults.color_model_mapping, + palette=pimmslearn.plotting.defaults.color_model_mapping, height=2, aspect=1.8, kind="scatter", @@ -589,7 +589,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): freq_feat.head() # training data # %% -errors = vaep.pandas.calc_errors_per_feat( +errors = pimmslearn.pandas.calc_errors_per_feat( pred=pred_split, freq_feat=freq_feat, target_col='observed') errors @@ -603,7 +603,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): xlabel='number of samples', ylabel='observations') ) -vaep.savefig(ax.get_figure(), files_out[f'n_obs_error_counts_{dataset}.pdf']) +pimmslearn.savefig(ax.get_figure(), files_out[f'n_obs_error_counts_{dataset}.pdf']) # %% ax = errors.plot.scatter('freq', 'n_obs') @@ -645,7 +645,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): files_out[f'best_models_ld_{min_latent}_rolling_errors_by_freq'] = ( FOLDER / f'best_models_ld_{min_latent}_rolling_errors_by_freq') -vaep.savefig( +pimmslearn.savefig( ax.get_figure(), name=files_out[f'best_models_ld_{min_latent}_rolling_errors_by_freq']) @@ -658,19 +658,19 @@ def get_plotly_figure(dataset: str, x='latent_dim'): # Save html versin of curve with annotation of errors # %% -fig = px_vaep.line((errors_smoothed_long - .loc[errors_smoothed_long[freq_feat.name] >= FREQ_MIN] - .join(n_obs_error_is_based_on) - .sort_values(by='freq')), - x=freq_feat.name, - color='model', - y='rolling error average', - hover_data=['n_obs'], - # title=f'Rolling average error by feature frequency {msg_annotation}', - labels=labels_dict, - category_orders={'model': order}, - ) -fig = px_vaep.apply_default_layout(fig) +fig = px.line((errors_smoothed_long + .loc[errors_smoothed_long[freq_feat.name] >= FREQ_MIN] + .join(n_obs_error_is_based_on) + .sort_values(by='freq')), + x=freq_feat.name, + color='model', + y='rolling error average', + hover_data=['n_obs'], + # title=f'Rolling average error by feature frequency {msg_annotation}', + labels=labels_dict, + category_orders={'model': order}, + ) +fig = px_pimmslearn.apply_default_layout(fig) fig.update_layout(legend_title_text='') # remove legend title files_out[f'best_models_ld_{min_latent}_errors_by_freq_plotly.html'] = ( FOLDER / f'best_models_ld_{min_latent}_errors_by_freq_plotly.html') @@ -700,7 +700,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): ) files_out[f'best_models_ld_{min_latent}_errors_by_freq_averaged'] = ( FOLDER / f'best_models_ld_{min_latent}_errors_by_freq_averaged') -vaep.savefig( +pimmslearn.savefig( ax.get_figure(), files_out[f'best_models_ld_{min_latent}_errors_by_freq_averaged']) @@ -783,7 +783,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): freq_feat # %% -errors = vaep.pandas.calc_errors_per_feat( +errors = pimmslearn.pandas.calc_errors_per_feat( pred=pred_split, 
freq_feat=freq_feat, target_col='observed') idx_name = errors.index.name errors @@ -792,8 +792,8 @@ def get_plotly_figure(dataset: str, x='latent_dim'): files_out[f'best_models_errors_counts_obs_{dataset}.pdf'] = (FOLDER / f'n_obs_error_counts_{dataset}.pdf') ax = errors['n_obs'].value_counts().sort_index().plot(style='.') -vaep.savefig(ax.get_figure(), - files_out[f'best_models_errors_counts_obs_{dataset}.pdf']) +pimmslearn.savefig(ax.get_figure(), + files_out[f'best_models_errors_counts_obs_{dataset}.pdf']) # %% n_obs_error_is_based_on = errors['n_obs'] @@ -823,7 +823,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): # title=f'Rolling average error by feature frequency {msg_annotation}' )) -vaep.savefig( +pimmslearn.savefig( ax.get_figure(), folder=FOLDER, name=f'best_models_rolling_errors_{dataset}') @@ -891,7 +891,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'): horizontalalignment='right') files_out[f'pred_corr_per_feat_{dataset}'] = (FOLDER / f'pred_corr_per_feat_{dataset}') -vaep.savefig(ax.get_figure(), name=files_out[f'pred_corr_per_feat_{dataset}']) +pimmslearn.savefig(ax.get_figure(), name=files_out[f'pred_corr_per_feat_{dataset}']) # %% files_out[f'pred_corr_per_feat_{dataset}.xlsx'] = (FOLDER / @@ -923,8 +923,8 @@ def get_plotly_figure(dataset: str, x='latent_dim'): horizontalalignment='right') files_out[f'pred_corr_per_sample_{dataset}'] = (FOLDER / f'pred_corr_per_sample_{dataset}') -vaep.savefig(ax.get_figure(), - name=files_out[f'pred_corr_per_sample_{dataset}']) +pimmslearn.savefig(ax.get_figure(), + name=files_out[f'pred_corr_per_sample_{dataset}']) # %% files_out[f'pred_corr_per_sample_{dataset}.xlsx'] = (FOLDER / diff --git a/project/02_4_best_models_over_all_data.ipynb b/project/02_4_best_models_over_all_data.ipynb index f252973d2..96830da20 100644 --- a/project/02_4_best_models_over_all_data.ipynb +++ b/project/02_4_best_models_over_all_data.ipynb @@ -20,8 +20,8 @@ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import plotly.express as px\n", - "import vaep.plotting\n", - "import vaep.nb\n", + "import pimmslearn.plotting\n", + "import pimmslearn.nb\n", "\n", "\n", "pd.options.display.max_columns = 45\n", @@ -30,9 +30,9 @@ "\n", "plt.rcParams['figure.figsize'] = [12.0, 6.0]\n", "\n", - "vaep.plotting.make_large_descriptors()\n", + "pimmslearn.plotting.make_large_descriptors()\n", "\n", - "logger = vaep.logging.setup_nb_logger()" + "logger = pimmslearn.logging.setup_nb_logger()" ] }, { @@ -316,11 +316,11 @@ " xlabel='',\n", " ylabel=f\"{METRIC} (log2 intensities)\",\n", " width=.8)\n", - "ax = vaep.plotting.add_height_to_barplot(ax, size=12)\n", - "ax = vaep.plotting.add_text_to_barplot(ax, text, size=12)\n", + "ax = pimmslearn.plotting.add_height_to_barplot(ax, size=12)\n", + "ax = pimmslearn.plotting.add_text_to_barplot(ax, text, size=12)\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, fname, folder=FOLDER)" + "pimmslearn.savefig(fig, fname, folder=FOLDER)" ] }, { @@ -366,11 +366,11 @@ " xlabel='',\n", " ylabel=f\"{METRIC} (log2 intensities)\",\n", " width=.8)\n", - "ax = vaep.plotting.add_height_to_barplot(ax, size=12)\n", - "ax = vaep.plotting.add_text_to_barplot(ax, text, size=12)\n", + "ax = pimmslearn.plotting.add_height_to_barplot(ax, size=12)\n", + "ax = pimmslearn.plotting.add_text_to_barplot(ax, text, size=12)\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, fname, folder=FOLDER)" + "pimmslearn.savefig(fig, fname, folder=FOLDER)" ] }, { @@ -533,10 +533,10 @@ " 
'#b15928']\n", " )\n", " )\n", - "ax = vaep.plotting.add_height_to_barplot(ax, size=11)\n", + "ax = pimmslearn.plotting.add_height_to_barplot(ax, size=11)\n", "ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')\n", "fig.tight_layout()\n", - "vaep.savefig(fig, fname, folder=FOLDER)" + "pimmslearn.savefig(fig, fname, folder=FOLDER)" ] }, { @@ -644,10 +644,10 @@ " '#b15928']\n", " )\n", " )\n", - "ax = vaep.plotting.add_height_to_barplot(ax, size=11)\n", + "ax = pimmslearn.plotting.add_height_to_barplot(ax, size=11)\n", "ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')\n", "fig.tight_layout()\n", - "vaep.savefig(fig, fname, folder=FOLDER)" + "pimmslearn.savefig(fig, fname, folder=FOLDER)" ] }, { diff --git a/project/02_4_best_models_over_all_data.py b/project/02_4_best_models_over_all_data.py index 3aea8ecae..972da3dc6 100644 --- a/project/02_4_best_models_over_all_data.py +++ b/project/02_4_best_models_over_all_data.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -21,8 +21,8 @@ import pandas as pd import matplotlib.pyplot as plt import plotly.express as px -import vaep.plotting -import vaep.nb +import pimmslearn.plotting +import pimmslearn.nb pd.options.display.max_columns = 45 @@ -31,9 +31,9 @@ plt.rcParams['figure.figsize'] = [12.0, 6.0] -vaep.plotting.make_large_descriptors() +pimmslearn.plotting.make_large_descriptors() -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() # %% [markdown] # ## Read input @@ -190,11 +190,11 @@ xlabel='', ylabel=f"{METRIC} (log2 intensities)", width=.8) -ax = vaep.plotting.add_height_to_barplot(ax, size=12) -ax = vaep.plotting.add_text_to_barplot(ax, text, size=12) +ax = pimmslearn.plotting.add_height_to_barplot(ax, size=12) +ax = pimmslearn.plotting.add_text_to_barplot(ax, text, size=12) fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, fname, folder=FOLDER) +pimmslearn.savefig(fig, fname, folder=FOLDER) # %% [markdown] # ### Validation data results @@ -220,11 +220,11 @@ xlabel='', ylabel=f"{METRIC} (log2 intensities)", width=.8) -ax = vaep.plotting.add_height_to_barplot(ax, size=12) -ax = vaep.plotting.add_text_to_barplot(ax, text, size=12) +ax = pimmslearn.plotting.add_height_to_barplot(ax, size=12) +ax = pimmslearn.plotting.add_text_to_barplot(ax, text, size=12) fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, fname, folder=FOLDER) +pimmslearn.savefig(fig, fname, folder=FOLDER) # %% fname = 'best_models_1_val_plotly' @@ -333,10 +333,10 @@ '#b15928'] ) ) -ax = vaep.plotting.add_height_to_barplot(ax, size=11) +ax = pimmslearn.plotting.add_height_to_barplot(ax, size=11) ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right') fig.tight_layout() -vaep.savefig(fig, fname, folder=FOLDER) +pimmslearn.savefig(fig, fname, folder=FOLDER) # %% [markdown] # plotly version with additional information @@ -411,10 +411,10 @@ '#b15928'] ) ) -ax = vaep.plotting.add_height_to_barplot(ax, size=11) +ax = pimmslearn.plotting.add_height_to_barplot(ax, size=11) ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right') fig.tight_layout() -vaep.savefig(fig, fname, folder=FOLDER) +pimmslearn.savefig(fig, fname, folder=FOLDER) # %% [markdown] # plotly version with additional information diff --git a/project/03_1_best_models_comparison.ipynb 
b/project/03_1_best_models_comparison.ipynb index af3948834..dc1433b03 100644 --- a/project/03_1_best_models_comparison.ipynb +++ b/project/03_1_best_models_comparison.ipynb @@ -13,15 +13,15 @@ "import pandas as pd\n", "import seaborn as sns\n", "\n", - "import vaep.nb\n", - "import vaep.pandas\n", - "import vaep.plotting\n", - "from vaep.logging import setup_logger\n", + "import pimmslearn.nb\n", + "import pimmslearn.pandas\n", + "import pimmslearn.plotting\n", + "from pimmslearn.logging import setup_logger\n", "\n", "logger = setup_logger(logger=logging.getLogger('vaep'), level=10)\n", "\n", "plt.rcParams['figure.figsize'] = [4.0, 2.0]\n", - "vaep.plotting.make_large_descriptors(7)" + "pimmslearn.plotting.make_large_descriptors(7)" ] }, { @@ -123,7 +123,7 @@ "logger.setLevel(20) # reset debug\n", "ax = to_plot['mean'].plot.bar(rot=0,\n", " width=.8,\n", - " color=vaep.plotting.defaults.color_model_mapping,\n", + " color=pimmslearn.plotting.defaults.color_model_mapping,\n", " yerr=to_plot['std'])\n", "ax.set_xlabel('')" ] @@ -176,7 +176,7 @@ " errcolor=\"black\",\n", " hue_order=IDX[1],\n", " order=IDX[0],\n", - " palette=vaep.plotting.defaults.color_model_mapping,\n", + " palette=pimmslearn.plotting.defaults.color_model_mapping,\n", " alpha=0.9,\n", " height=2, # set the height of the figure\n", " aspect=1.8 # set the aspect ratio of the figure\n", @@ -185,7 +185,7 @@ "# map data to stripplot\n", "g.map(sns.stripplot, 'data level', 'MAE', 'model',\n", " hue_order=IDX[1], order=IDX[0],\n", - " palette=vaep.plotting.defaults.color_model_mapping,\n", + " palette=pimmslearn.plotting.defaults.color_model_mapping,\n", " dodge=True, alpha=1, ec='k', linewidth=1,\n", " s=2)\n", "\n", @@ -200,7 +200,7 @@ "metadata": {}, "outputs": [], "source": [ - "vaep.savefig(fig, FOLDER / \"model_performance_repeated_runs.pdf\", tight_layout=False)" + "pimmslearn.savefig(fig, FOLDER / \"model_performance_repeated_runs.pdf\", tight_layout=False)" ] }, { diff --git a/project/03_1_best_models_comparison.py b/project/03_1_best_models_comparison.py index 6a2daadce..1c6c31b23 100644 --- a/project/03_1_best_models_comparison.py +++ b/project/03_1_best_models_comparison.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -21,15 +21,15 @@ import pandas as pd import seaborn as sns -import vaep.nb -import vaep.pandas -import vaep.plotting -from vaep.logging import setup_logger +import pimmslearn.nb +import pimmslearn.pandas +import pimmslearn.plotting +from pimmslearn.logging import setup_logger logger = setup_logger(logger=logging.getLogger('vaep'), level=10) plt.rcParams['figure.figsize'] = [4.0, 2.0] -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) # %% IDX = [['proteinGroups', 'peptides', 'evidence'], @@ -81,7 +81,7 @@ logger.setLevel(20) # reset debug ax = to_plot['mean'].plot.bar(rot=0, width=.8, - color=vaep.plotting.defaults.color_model_mapping, + color=pimmslearn.plotting.defaults.color_model_mapping, yerr=to_plot['std']) ax.set_xlabel('') @@ -114,7 +114,7 @@ errcolor="black", hue_order=IDX[1], order=IDX[0], - palette=vaep.plotting.defaults.color_model_mapping, + palette=pimmslearn.plotting.defaults.color_model_mapping, alpha=0.9, height=2, # set the height of the figure aspect=1.8 # set the aspect ratio of the figure @@ -123,7 +123,7 @@ # map data to stripplot g.map(sns.stripplot, 'data level', 'MAE', 'model', 
hue_order=IDX[1], order=IDX[0], - palette=vaep.plotting.defaults.color_model_mapping, + palette=pimmslearn.plotting.defaults.color_model_mapping, dodge=True, alpha=1, ec='k', linewidth=1, s=2) @@ -132,7 +132,7 @@ _ = ax.set_xlabel('') # %% -vaep.savefig(fig, FOLDER / "model_performance_repeated_runs.pdf", tight_layout=False) +pimmslearn.savefig(fig, FOLDER / "model_performance_repeated_runs.pdf", tight_layout=False) # %% writer.close() diff --git a/project/03_2_best_models_comparison_fig2.ipynb b/project/03_2_best_models_comparison_fig2.ipynb index 6370691f4..5685696f7 100644 --- a/project/03_2_best_models_comparison_fig2.ipynb +++ b/project/03_2_best_models_comparison_fig2.ipynb @@ -17,12 +17,12 @@ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", - "import vaep.plotting\n", - "import vaep.pandas\n", - "import vaep.nb\n", + "import pimmslearn.plotting\n", + "import pimmslearn.pandas\n", + "import pimmslearn.nb\n", "\n", "import logging\n", - "from vaep.logging import setup_logger\n", + "from pimmslearn.logging import setup_logger\n", "logger = setup_logger(logger=logging.getLogger('vaep'), level=10)" ] }, @@ -102,7 +102,7 @@ "metadata": {}, "outputs": [], "source": [ - "COLORS_TO_USE_MAPPTING = vaep.plotting.defaults.color_model_mapping\n", + "COLORS_TO_USE_MAPPTING = pimmslearn.plotting.defaults.color_model_mapping\n", "print(COLORS_TO_USE_MAPPTING.keys())\n", "sns.color_palette(palette=COLORS_TO_USE_MAPPTING.values())" ] @@ -155,7 +155,7 @@ " ))\n", "\n", "\n", - "ax = vaep.plotting.add_height_to_barplot(ax, size=6, rotated=True)\n", + "ax = pimmslearn.plotting.add_height_to_barplot(ax, size=6, rotated=True)\n", "ax.set_ylim((0, 0.75))\n", "ax.legend(fontsize=5, loc='lower right')\n", "text = (\n", @@ -165,10 +165,10 @@ " .stack().loc[pd.IndexSlice[ORDER_MODELS, ORDER_DATA]]\n", "\n", ")\n", - "ax = vaep.plotting.add_text_to_barplot(ax, text, size=6)\n", + "ax = pimmslearn.plotting.add_text_to_barplot(ax, text, size=6)\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, fname)" + "pimmslearn.savefig(fig, fname)" ] }, { diff --git a/project/03_2_best_models_comparison_fig2.py b/project/03_2_best_models_comparison_fig2.py index d5b69eae1..f5901e9c0 100644 --- a/project/03_2_best_models_comparison_fig2.py +++ b/project/03_2_best_models_comparison_fig2.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.2 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -22,12 +22,12 @@ import matplotlib.pyplot as plt import seaborn as sns -import vaep.plotting -import vaep.pandas -import vaep.nb +import pimmslearn.plotting +import pimmslearn.pandas +import pimmslearn.nb import logging -from vaep.logging import setup_logger +from pimmslearn.logging import setup_logger logger = setup_logger(logger=logging.getLogger('vaep'), level=10) @@ -68,7 +68,7 @@ # color mapping globally defined for article figures # %% -COLORS_TO_USE_MAPPTING = vaep.plotting.defaults.color_model_mapping +COLORS_TO_USE_MAPPTING = pimmslearn.plotting.defaults.color_model_mapping print(COLORS_TO_USE_MAPPTING.keys()) sns.color_palette(palette=COLORS_TO_USE_MAPPTING.values()) @@ -105,7 +105,7 @@ )) -ax = vaep.plotting.add_height_to_barplot(ax, size=6, rotated=True) +ax = pimmslearn.plotting.add_height_to_barplot(ax, size=6, rotated=True) ax.set_ylim((0, 0.75)) ax.legend(fontsize=5, loc='lower right') text = ( @@ -115,10 +115,10 @@ .stack().loc[pd.IndexSlice[ORDER_MODELS, ORDER_DATA]] ) -ax = 
vaep.plotting.add_text_to_barplot(ax, text, size=6) +ax = pimmslearn.plotting.add_text_to_barplot(ax, text, size=6) fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, fname) +pimmslearn.savefig(fig, fname) # %% diff --git a/project/03_3_combine_experiment_result_tables.py b/project/03_3_combine_experiment_result_tables.py index 37dd49f26..1a0bbad75 100644 --- a/project/03_3_combine_experiment_result_tables.py +++ b/project/03_3_combine_experiment_result_tables.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python diff --git a/project/03_6_setup_comparison_rev3.py b/project/03_6_setup_comparison_rev3.py index 50bd6f522..805c4678e 100644 --- a/project/03_6_setup_comparison_rev3.py +++ b/project/03_6_setup_comparison_rev3.py @@ -1,20 +1,21 @@ -# %% [markdown] +# %% [markdown] # # Compare setup of different samling strategies of simulated data # # 1. sampling from all samples # 2. sampling from subset of samples # %% +import logging from pathlib import Path + import pandas as pd -import vaep.plotting -import vaep.pandas -import vaep.nb +import pimmslearn.nb +import pimmslearn.pandas +import pimmslearn.plotting +from pimmslearn.logging import setup_logger -import logging -from vaep.logging import setup_logger -logger = setup_logger(logger=logging.getLogger('vaep'), level=10) +logger = setup_logger(logger=logging.getLogger('pimmslearn'), level=10) # %% @@ -55,7 +56,7 @@ for key, file_in in pred_in.items(): _ = (pd.read_csv(file_in, index_col=[0, 1]) ).dropna(axis=1, how='all') - _ = vaep.pandas.calc_errors.get_absolute_error(_) + _ = pimmslearn.pandas.calc_errors.get_absolute_error(_) _.columns = pd.MultiIndex.from_tuples((key, k) for k in _.columns) pred.append(_) pred = pd.concat(pred, axis=1) diff --git a/project/04_1_train_pimms_models.ipynb b/project/04_1_train_pimms_models.ipynb index 657874a23..c6c77a27e 100644 --- a/project/04_1_train_pimms_models.ipynb +++ b/project/04_1_train_pimms_models.ipynb @@ -92,17 +92,17 @@ "import pandas as pd\n", "from IPython.display import display\n", "\n", - "import vaep.filter\n", - "import vaep.plotting.data\n", - "import vaep.sampling\n", - "from vaep.plotting.defaults import color_model_mapping\n", - "from vaep.sklearn.ae_transformer import AETransformer\n", - "from vaep.sklearn.cf_transformer import CollaborativeFilteringTransformer\n", + "import pimmslearn.filter\n", + "import pimmslearn.plotting.data\n", + "import pimmslearn.sampling\n", + "from pimmslearn.plotting.defaults import color_model_mapping\n", + "from pimmslearn.sklearn.ae_transformer import AETransformer\n", + "from pimmslearn.sklearn.cf_transformer import CollaborativeFilteringTransformer\n", "\n", - "vaep.plotting.make_large_descriptors(8)\n", + "pimmslearn.plotting.make_large_descriptors(8)\n", "\n", "\n", - "logger = logger = vaep.logging.setup_nb_logger()\n", + "logger = logger = pimmslearn.logging.setup_nb_logger()\n", "logging.getLogger('fontTools').setLevel(logging.ERROR)" ] }, @@ -235,7 +235,7 @@ }, "outputs": [], "source": [ - "ax = vaep.plotting.data.plot_feat_median_over_prop_missing(\n", + "ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing(\n", " data=df, type='boxplot')" ] }, @@ -277,8 +277,8 @@ "source": [ "if select_features:\n", " # potentially this can take a few iterations to stabilize.\n", - " df = vaep.filter.select_features(df, feat_prevalence=feat_prevalence)\n", - " df = 
vaep.filter.select_features(df=df, feat_prevalence=sample_completeness, axis=1)\n", + " df = pimmslearn.filter.select_features(df, feat_prevalence=feat_prevalence)\n", + " df = pimmslearn.filter.select_features(df=df, feat_prevalence=sample_completeness, axis=1)\n", "df.shape" ] }, @@ -327,15 +327,15 @@ "outputs": [], "source": [ "if sample_splits:\n", - " splits, thresholds, fake_na_mcar, fake_na_mnar = vaep.sampling.sample_mnar_mcar(\n", + " splits, thresholds, fake_na_mcar, fake_na_mnar = pimmslearn.sampling.sample_mnar_mcar(\n", " df_long=df,\n", " frac_non_train=frac_non_train,\n", " frac_mnar=frac_mnar,\n", " random_state=random_state,\n", " )\n", - " splits = vaep.sampling.check_split_integrity(splits)\n", + " splits = pimmslearn.sampling.check_split_integrity(splits)\n", "else:\n", - " splits = vaep.sampling.DataSplits(is_wide_format=False)\n", + " splits = pimmslearn.sampling.DataSplits(is_wide_format=False)\n", " splits.train_X = df" ] }, @@ -464,10 +464,10 @@ "\n", "fig, axes = plt.subplots(2, figsize=(8, 4))\n", "\n", - "min_max = vaep.plotting.data.get_min_max_iterable(\n", + "min_max = pimmslearn.plotting.data.get_min_max_iterable(\n", " [observed, imputed])\n", "label_template = '{method} (N={n:,d})'\n", - "ax, _ = vaep.plotting.data.plot_histogram_intensities(\n", + "ax, _ = pimmslearn.plotting.data.plot_histogram_intensities(\n", " observed,\n", " ax=axes[0],\n", " min_max=min_max,\n", @@ -477,7 +477,7 @@ " color='grey',\n", " alpha=1)\n", "_ = ax.legend()\n", - "ax, _ = vaep.plotting.data.plot_histogram_intensities(\n", + "ax, _ = pimmslearn.plotting.data.plot_histogram_intensities(\n", " imputed,\n", " ax=axes[1],\n", " min_max=min_max,\n", @@ -640,12 +640,12 @@ "if splits.val_y is not None:\n", " pred_val = splits.val_y.stack().to_frame('observed')\n", " pred_val[model_selected] = df_imputed\n", - " val_metrics = vaep.models.calculte_metrics(pred_val, 'observed')\n", + " val_metrics = pimmslearn.models.calculte_metrics(pred_val, 'observed')\n", " display(val_metrics)\n", "\n", " fig, ax = plt.subplots(figsize=(8, 2))\n", "\n", - " ax, errors_binned = vaep.plotting.errors.plot_errors_by_median(\n", + " ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median(\n", " pred=pred_val,\n", " target_col='observed',\n", " feat_medians=splits.train_X.median(),\n", @@ -702,10 +702,10 @@ "\n", "fig, axes = plt.subplots(2, figsize=(8, 4))\n", "\n", - "min_max = vaep.plotting.data.get_min_max_iterable([observed, imputed])\n", + "min_max = pimmslearn.plotting.data.get_min_max_iterable([observed, imputed])\n", "\n", "label_template = '{method} (N={n:,d})'\n", - "ax, _ = vaep.plotting.data.plot_histogram_intensities(\n", + "ax, _ = pimmslearn.plotting.data.plot_histogram_intensities(\n", " observed,\n", " ax=axes[0],\n", " min_max=min_max,\n", @@ -715,7 +715,7 @@ " color='grey',\n", " alpha=1)\n", "_ = ax.legend()\n", - "ax, _ = vaep.plotting.data.plot_histogram_intensities(\n", + "ax, _ = pimmslearn.plotting.data.plot_histogram_intensities(\n", " imputed,\n", " ax=axes[1],\n", " min_max=min_max,\n", diff --git a/project/04_1_train_pimms_models.py b/project/04_1_train_pimms_models.py index 79db2634b..8b7d11826 100644 --- a/project/04_1_train_pimms_models.py +++ b/project/04_1_train_pimms_models.py @@ -51,17 +51,17 @@ import pandas as pd from IPython.display import display -import vaep.filter -import vaep.plotting.data -import vaep.sampling -from vaep.plotting.defaults import color_model_mapping -from vaep.sklearn.ae_transformer import AETransformer -from 
vaep.sklearn.cf_transformer import CollaborativeFilteringTransformer +import pimmslearn.filter +import pimmslearn.plotting.data +import pimmslearn.sampling +from pimmslearn.plotting.defaults import color_model_mapping +from pimmslearn.sklearn.ae_transformer import AETransformer +from pimmslearn.sklearn.cf_transformer import CollaborativeFilteringTransformer -vaep.plotting.make_large_descriptors(8) +pimmslearn.plotting.make_large_descriptors(8) -logger = logger = vaep.logging.setup_nb_logger() +logger = logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.ERROR) # %% [markdown] @@ -113,7 +113,7 @@ # 2. CDF of available intensities per protein group # %% -ax = vaep.plotting.data.plot_feat_median_over_prop_missing( +ax = pimmslearn.plotting.data.plot_feat_median_over_prop_missing( data=df, type='boxplot') @@ -128,8 +128,8 @@ # %% if select_features: # potentially this can take a few iterations to stabilize. - df = vaep.filter.select_features(df, feat_prevalence=feat_prevalence) - df = vaep.filter.select_features(df=df, feat_prevalence=sample_completeness, axis=1) + df = pimmslearn.filter.select_features(df, feat_prevalence=feat_prevalence) + df = pimmslearn.filter.select_features(df=df, feat_prevalence=sample_completeness, axis=1) df.shape @@ -147,15 +147,15 @@ # %% if sample_splits: - splits, thresholds, fake_na_mcar, fake_na_mnar = vaep.sampling.sample_mnar_mcar( + splits, thresholds, fake_na_mcar, fake_na_mnar = pimmslearn.sampling.sample_mnar_mcar( df_long=df, frac_non_train=frac_non_train, frac_mnar=frac_mnar, random_state=random_state, ) - splits = vaep.sampling.check_split_integrity(splits) + splits = pimmslearn.sampling.check_split_integrity(splits) else: - splits = vaep.sampling.DataSplits(is_wide_format=False) + splits = pimmslearn.sampling.DataSplits(is_wide_format=False) splits.train_X = df # %% [markdown] @@ -215,10 +215,10 @@ fig, axes = plt.subplots(2, figsize=(8, 4)) -min_max = vaep.plotting.data.get_min_max_iterable( +min_max = pimmslearn.plotting.data.get_min_max_iterable( [observed, imputed]) label_template = '{method} (N={n:,d})' -ax, _ = vaep.plotting.data.plot_histogram_intensities( +ax, _ = pimmslearn.plotting.data.plot_histogram_intensities( observed, ax=axes[0], min_max=min_max, @@ -228,7 +228,7 @@ color='grey', alpha=1) _ = ax.legend() -ax, _ = vaep.plotting.data.plot_histogram_intensities( +ax, _ = pimmslearn.plotting.data.plot_histogram_intensities( imputed, ax=axes[1], min_max=min_max, @@ -296,12 +296,12 @@ if splits.val_y is not None: pred_val = splits.val_y.stack().to_frame('observed') pred_val[model_selected] = df_imputed - val_metrics = vaep.models.calculte_metrics(pred_val, 'observed') + val_metrics = pimmslearn.models.calculte_metrics(pred_val, 'observed') display(val_metrics) fig, ax = plt.subplots(figsize=(8, 2)) - ax, errors_binned = vaep.plotting.errors.plot_errors_by_median( + ax, errors_binned = pimmslearn.plotting.errors.plot_errors_by_median( pred=pred_val, target_col='observed', feat_medians=splits.train_X.median(), @@ -326,10 +326,10 @@ fig, axes = plt.subplots(2, figsize=(8, 4)) -min_max = vaep.plotting.data.get_min_max_iterable([observed, imputed]) +min_max = pimmslearn.plotting.data.get_min_max_iterable([observed, imputed]) label_template = '{method} (N={n:,d})' -ax, _ = vaep.plotting.data.plot_histogram_intensities( +ax, _ = pimmslearn.plotting.data.plot_histogram_intensities( observed, ax=axes[0], min_max=min_max, @@ -339,7 +339,7 @@ color='grey', alpha=1) _ = ax.legend() -ax, _ = 
vaep.plotting.data.plot_histogram_intensities( +ax, _ = pimmslearn.plotting.data.plot_histogram_intensities( imputed, ax=axes[1], min_max=min_max, diff --git a/project/10_0_ald_data.ipynb b/project/10_0_ald_data.ipynb index e006f94e4..29ea91f23 100644 --- a/project/10_0_ald_data.ipynb +++ b/project/10_0_ald_data.ipynb @@ -21,9 +21,9 @@ "import yaml\n", "import numpy as np\n", "import pandas as pd\n", - "import vaep\n", + "import pimmslearn\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "\n", "pd.options.display.max_columns = 50\n", "pd.options.display.max_rows = 100" @@ -54,7 +54,7 @@ " annotations=folder_data / 'ald_experiment_annotations.csv',\n", " clinic=folder_data / 'labtest_integrated_numeric.csv',\n", " raw_meta=folder_data / 'ald_metadata_rawfiles.csv')\n", - "fnames = vaep.nb.Config.from_dict(fnames) # could be handeled kwargs as in normal dict" + "fnames = pimmslearn.nb.Config.from_dict(fnames) # could be handeled kwargs as in normal dict" ] }, { @@ -759,12 +759,12 @@ " 'ylabel': 'peptide was found in # samples',\n", " 'title': 'peptide measurement distribution'}\n", "\n", - "ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", + "ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", "), feat_col_name='count', feature_name='Aggregated peptides', n_samples=len(df), ax=None, **kwargs)\n", "\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, name='data_aggPeptides_completness', folder=folder_run)" + "pimmslearn.savefig(fig, name='data_aggPeptides_completness', folder=folder_run)" ] }, { @@ -1137,12 +1137,12 @@ " 'ylabel': 'peptide was found in # samples',\n", " 'title': 'protein group measurement distribution'}\n", "\n", - "ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", + "ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", "), feat_col_name='count', n_samples=len(df), ax=None, min_feat_prop=.0, **kwargs)\n", "\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, name='data_proteinGroups_completness', folder=folder_run)" + "pimmslearn.savefig(fig, name='data_proteinGroups_completness', folder=folder_run)" ] }, { @@ -1539,7 +1539,7 @@ "metadata": {}, "outputs": [], "source": [ - "df = vaep.pandas.select_max_by(df=df.reset_index(),\n", + "df = pimmslearn.pandas.select_max_by(df=df.reset_index(),\n", " grouping_columns=sel_cols[:-1],\n", " selection_column=sel_cols[-1]).set_index(sel_cols[:-1])" ] @@ -1662,12 +1662,12 @@ " 'ylabel': 'peptide was found in # samples',\n", " 'title': 'peptide measurement distribution'}\n", "\n", - "ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", + "ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", "), feat_col_name='count', feature_name='Aggregated peptides', n_samples=len(df), ax=None, **kwargs)\n", "\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", - "vaep.savefig(fig, name='data_liver_aggPeptides_completness', folder=folder_run)" + "pimmslearn.savefig(fig, name='data_liver_aggPeptides_completness', folder=folder_run)" ] }, { @@ -2035,13 +2035,13 @@ " 'ylabel': 'peptide was found in # samples',\n", " 'title': 'protein group measurement distribution'}\n", "\n", - "ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', 
ascending=False).reset_index(\n", + "ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index(\n", "), feat_col_name='count', n_samples=len(df), ax=None, **kwargs)\n", "\n", "fig = ax.get_figure()\n", "fig.tight_layout()\n", "fnames.fig_liver_pg_completness = folder_run / 'data_liver_proteinGroups_completness'\n", - "vaep.savefig(fig, name=fnames.fig_liver_pg_completness)" + "pimmslearn.savefig(fig, name=fnames.fig_liver_pg_completness)" ] }, { diff --git a/project/10_0_ald_data.py b/project/10_0_ald_data.py index 1e179b55e..5caeb2a7c 100644 --- a/project/10_0_ald_data.py +++ b/project/10_0_ald_data.py @@ -6,7 +6,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -21,9 +21,9 @@ import yaml import numpy as np import pandas as pd -import vaep +import pimmslearn -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() pd.options.display.max_columns = 50 pd.options.display.max_rows = 100 @@ -45,7 +45,7 @@ annotations=folder_data / 'ald_experiment_annotations.csv', clinic=folder_data / 'labtest_integrated_numeric.csv', raw_meta=folder_data / 'ald_metadata_rawfiles.csv') -fnames = vaep.nb.Config.from_dict(fnames) # could be handeled kwargs as in normal dict +fnames = pimmslearn.nb.Config.from_dict(fnames) # could be handeled kwargs as in normal dict # %% @@ -343,12 +343,12 @@ 'ylabel': 'peptide was found in # samples', 'title': 'peptide measurement distribution'} -ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( +ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( ), feat_col_name='count', feature_name='Aggregated peptides', n_samples=len(df), ax=None, **kwargs) fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, name='data_aggPeptides_completness', folder=folder_run) +pimmslearn.savefig(fig, name='data_aggPeptides_completness', folder=folder_run) # %% [markdown] # ### Select features which are present in at least 25% of the samples @@ -505,12 +505,12 @@ def find_idx_to_drop(df: pd.DataFrame, idx_to_keep: list): 'ylabel': 'peptide was found in # samples', 'title': 'protein group measurement distribution'} -ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( +ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( ), feat_col_name='count', n_samples=len(df), ax=None, min_feat_prop=.0, **kwargs) fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, name='data_proteinGroups_completness', folder=folder_run) +pimmslearn.savefig(fig, name='data_proteinGroups_completness', folder=folder_run) # %% [markdown] @@ -678,7 +678,7 @@ def find_idx_to_drop(df: pd.DataFrame, idx_to_keep: list): df.loc[mask_idx_duplicated].sort_index() # %% -df = vaep.pandas.select_max_by(df=df.reset_index(), +df = pimmslearn.pandas.select_max_by(df=df.reset_index(), grouping_columns=sel_cols[:-1], selection_column=sel_cols[-1]).set_index(sel_cols[:-1]) @@ -730,12 +730,12 @@ def find_idx_to_drop(df: pd.DataFrame, idx_to_keep: list): 'ylabel': 'peptide was found in # samples', 'title': 'peptide measurement distribution'} -ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( +ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( ), 
feat_col_name='count', feature_name='Aggregated peptides', n_samples=len(df), ax=None, **kwargs) fig = ax.get_figure() fig.tight_layout() -vaep.savefig(fig, name='data_liver_aggPeptides_completness', folder=folder_run) +pimmslearn.savefig(fig, name='data_liver_aggPeptides_completness', folder=folder_run) # %% [markdown] # ### Select features which are present in at least 25% of the samples @@ -884,13 +884,13 @@ def find_idx_to_drop(df: pd.DataFrame, idx_to_keep: list): 'ylabel': 'peptide was found in # samples', 'title': 'protein group measurement distribution'} -ax = vaep.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( +ax = pimmslearn.plotting.plot_counts(des_data.T.sort_values(by='count', ascending=False).reset_index( ), feat_col_name='count', n_samples=len(df), ax=None, **kwargs) fig = ax.get_figure() fig.tight_layout() fnames.fig_liver_pg_completness = folder_run / 'data_liver_proteinGroups_completness' -vaep.savefig(fig, name=fnames.fig_liver_pg_completness) +pimmslearn.savefig(fig, name=fnames.fig_liver_pg_completness) # %% [markdown] # ### Select features which are present in at least 25% of the samples diff --git a/project/10_1_ald_diff_analysis.ipynb b/project/10_1_ald_diff_analysis.ipynb index 24dd7ce90..fc260c472 100644 --- a/project/10_1_ald_diff_analysis.ipynb +++ b/project/10_1_ald_diff_analysis.ipynb @@ -30,13 +30,13 @@ "import pandas as pd\n", "from IPython.display import display\n", "\n", - "import vaep\n", - "import vaep.analyzers\n", - "import vaep.imputation\n", - "import vaep.io.datasplits\n", - "import vaep.nb\n", + "import pimmslearn\n", + "import pimmslearn.analyzers\n", + "import pimmslearn.imputation\n", + "import pimmslearn.io.datasplits\n", + "import pimmslearn.nb\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "logging.getLogger('fontTools').setLevel(logging.WARNING)" ] }, @@ -111,7 +111,7 @@ "source": [ "if not model:\n", " model = model_key\n", - "params = vaep.nb.get_params(args, globals=globals(), remove=True)\n", + "params = pimmslearn.nb.get_params(args, globals=globals(), remove=True)\n", "params" ] }, @@ -125,10 +125,10 @@ }, "outputs": [], "source": [ - "args = vaep.nb.Config()\n", + "args = pimmslearn.nb.Config()\n", "args.fn_clinical_data = Path(params[\"fn_clinical_data\"])\n", "args.folder_experiment = Path(params[\"folder_experiment\"])\n", - "args = vaep.nb.add_default_paths(args,\n", + "args = pimmslearn.nb.add_default_paths(args,\n", " out_root=(args.folder_experiment\n", " / params[\"out_folder\"]\n", " / params[\"target\"]\n", @@ -184,7 +184,7 @@ }, "outputs": [], "source": [ - "data = vaep.io.datasplits.DataSplits.from_folder(\n", + "data = pimmslearn.io.datasplits.DataSplits.from_folder(\n", " args.data, file_format=args.file_format)" ] }, @@ -223,7 +223,7 @@ "source": [ "df_clinic = pd.read_csv(args.fn_clinical_data, index_col=0)\n", "df_clinic = df_clinic.loc[observed.index.levels[0]]\n", - "cols_clinic = vaep.pandas.get_columns_accessor(df_clinic)\n", + "cols_clinic = pimmslearn.pandas.get_columns_accessor(df_clinic)\n", "df_clinic[[args.target, *args.covar]].describe()" ] }, @@ -389,7 +389,7 @@ "FRAC_PROTEIN_GROUPS: int = 0.622\n", "CV_QC_SAMPLE: float = 0.4 # Coef. 
of variation on 13 QC samples\n", "\n", - "ald_study, cutoffs = vaep.analyzers.diff_analysis.select_raw_data(observed.unstack(\n", + "ald_study, cutoffs = pimmslearn.analyzers.diff_analysis.select_raw_data(observed.unstack(\n", "), data_completeness=DATA_COMPLETENESS, frac_protein_groups=FRAC_PROTEIN_GROUPS)\n", "\n", "ald_study" @@ -413,10 +413,10 @@ " fig, ax = plt.subplots(figsize=(4, 7))\n", " ax = qc_cv_feat.plot.box(ax=ax)\n", " ax.set_ylabel('Coefficient of Variation')\n", - " vaep.savefig(fig, name='cv_qc_samples', folder=args.out_figures)\n", + " pimmslearn.savefig(fig, name='cv_qc_samples', folder=args.out_figures)\n", " print((qc_cv_feat < CV_QC_SAMPLE).value_counts())\n", " # only to ald_study data\n", - " ald_study = ald_study[vaep.analyzers.diff_analysis.select_feat(qc_samples[ald_study.columns])]\n", + " ald_study = ald_study[pimmslearn.analyzers.diff_analysis.select_feat(qc_samples[ald_study.columns])]\n", "\n", "ald_study" ] @@ -432,10 +432,10 @@ }, "outputs": [], "source": [ - "fig, axes = vaep.plotting.plot_cutoffs(observed.unstack(),\n", + "fig, axes = pimmslearn.plotting.plot_cutoffs(observed.unstack(),\n", " feat_completness_over_samples=cutoffs.feat_completness_over_samples,\n", " min_feat_in_sample=cutoffs.min_feat_in_sample)\n", - "vaep.savefig(fig, name='tresholds_normal_imputation', folder=args.out_figures)" + "pimmslearn.savefig(fig, name='tresholds_normal_imputation', folder=args.out_figures)" ] }, { @@ -489,7 +489,7 @@ "source": [ "pred_real_na = None\n", "if args.model_key and str(args.model_key) != 'None':\n", - " pred_real_na = (vaep\n", + " pred_real_na = (pimmslearn\n", " .analyzers\n", " .compare_predictions\n", " .load_single_csv_pred_file(fname)\n", @@ -531,7 +531,7 @@ " sharex=True):\n", " \"\"\"Plots distributions of intensities provided as dictionary of labels to pd.Series.\"\"\"\n", " series_ = [observed, imputation] if imputation is not None else [observed]\n", - " min_bin, max_bin = vaep.plotting.data.get_min_max_iterable([observed])\n", + " min_bin, max_bin = pimmslearn.plotting.data.get_min_max_iterable([observed])\n", "\n", " if imputation is not None:\n", " fig, axes = plt.subplots(len(series_), figsize=figsize, sharex=sharex)\n", @@ -551,7 +551,7 @@ " if imputation is not None:\n", " ax = axes[1]\n", " label = f'Missing values imputed using {model_key.upper()}'\n", - " color = vaep.plotting.defaults.color_model_mapping.get(model_key, None)\n", + " color = pimmslearn.plotting.defaults.color_model_mapping.get(model_key, None)\n", " if color is None:\n", " color = f'C{1}'\n", " ax = imputation.hist(ax=ax, bins=bins, color=color)\n", @@ -562,13 +562,13 @@ " return fig, bins\n", "\n", "\n", - "vaep.plotting.make_large_descriptors(6)\n", + "pimmslearn.plotting.make_large_descriptors(6)\n", "fig, bins = plot_distributions(observed,\n", " imputation=pred_real_na,\n", " model_key=args.model_key, figsize=(2.5, 2))\n", "fname = args.out_folder / 'dist_plots' / f'real_na_obs_vs_{args.model_key}.pdf'\n", "files_out[fname.name] = fname.as_posix()\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -591,8 +591,8 @@ "source": [ "if pred_real_na is not None:\n", " counts_per_bin = pd.concat([\n", - " vaep.pandas.get_counts_per_bin(observed.to_frame('observed'), bins=bins),\n", - " vaep.pandas.get_counts_per_bin(pred_real_na.to_frame(args.model_key), bins=bins)\n", + " pimmslearn.pandas.get_counts_per_bin(observed.to_frame('observed'), bins=bins),\n", + " 
pimmslearn.pandas.get_counts_per_bin(pred_real_na.to_frame(args.model_key), bins=bins)\n", " ], axis=1)\n", " counts_per_bin.to_excel(fname.with_suffix('.xlsx'))\n", " logger.info(\"Counts per bin saved to %s\", fname.with_suffix('.xlsx'))\n", @@ -620,7 +620,7 @@ "outputs": [], "source": [ "if pred_real_na is not None:\n", - " shifts = (vaep.imputation.compute_moments_shift(observed, pred_real_na,\n", + " shifts = (pimmslearn.imputation.compute_moments_shift(observed, pred_real_na,\n", " names=('observed', args.model_key)))\n", " display(pd.DataFrame(shifts).T)" ] @@ -645,8 +645,8 @@ "if pred_real_na is not None:\n", " index_level = 0 # per sample\n", " mean_by_sample = pd.DataFrame(\n", - " {'observed': vaep.imputation.stats_by_level(observed, index_level=index_level),\n", - " args.model_key: vaep.imputation.stats_by_level(pred_real_na, index_level=index_level)\n", + " {'observed': pimmslearn.imputation.stats_by_level(observed, index_level=index_level),\n", + " args.model_key: pimmslearn.imputation.stats_by_level(pred_real_na, index_level=index_level)\n", " })\n", " mean_by_sample.loc['mean_shift'] = (mean_by_sample.loc['mean', 'observed'] -\n", " mean_by_sample.loc['mean']).abs() / mean_by_sample.loc['std', 'observed']\n", diff --git a/project/10_1_ald_diff_analysis.py b/project/10_1_ald_diff_analysis.py index 9bb4de7f1..44258408a 100644 --- a/project/10_1_ald_diff_analysis.py +++ b/project/10_1_ald_diff_analysis.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -29,13 +29,13 @@ import pandas as pd from IPython.display import display -import vaep -import vaep.analyzers -import vaep.imputation -import vaep.io.datasplits -import vaep.nb +import pimmslearn +import pimmslearn.analyzers +import pimmslearn.imputation +import pimmslearn.io.datasplits +import pimmslearn.nb -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.WARNING) # %% tags=["hide-input"] @@ -71,14 +71,14 @@ # %% tags=["hide-input"] if not model: model = model_key -params = vaep.nb.get_params(args, globals=globals(), remove=True) +params = pimmslearn.nb.get_params(args, globals=globals(), remove=True) params # %% tags=["hide-input"] -args = vaep.nb.Config() +args = pimmslearn.nb.Config() args.fn_clinical_data = Path(params["fn_clinical_data"]) args.folder_experiment = Path(params["folder_experiment"]) -args = vaep.nb.add_default_paths(args, +args = pimmslearn.nb.add_default_paths(args, out_root=(args.folder_experiment / params["out_folder"] / params["target"] @@ -102,7 +102,7 @@ # Aggregated from data splits of the imputation workflow run before. # %% tags=["hide-input"] -data = vaep.io.datasplits.DataSplits.from_folder( +data = pimmslearn.io.datasplits.DataSplits.from_folder( args.data, file_format=args.file_format) # %% tags=["hide-input"] @@ -116,7 +116,7 @@ # %% tags=["hide-input"] df_clinic = pd.read_csv(args.fn_clinical_data, index_col=0) df_clinic = df_clinic.loc[observed.index.levels[0]] -cols_clinic = vaep.pandas.get_columns_accessor(df_clinic) +cols_clinic = pimmslearn.pandas.get_columns_accessor(df_clinic) df_clinic[[args.target, *args.covar]].describe() @@ -193,7 +193,7 @@ FRAC_PROTEIN_GROUPS: int = 0.622 CV_QC_SAMPLE: float = 0.4 # Coef. 
of variation on 13 QC samples -ald_study, cutoffs = vaep.analyzers.diff_analysis.select_raw_data(observed.unstack( +ald_study, cutoffs = pimmslearn.analyzers.diff_analysis.select_raw_data(observed.unstack( ), data_completeness=DATA_COMPLETENESS, frac_protein_groups=FRAC_PROTEIN_GROUPS) ald_study @@ -207,18 +207,18 @@ fig, ax = plt.subplots(figsize=(4, 7)) ax = qc_cv_feat.plot.box(ax=ax) ax.set_ylabel('Coefficient of Variation') - vaep.savefig(fig, name='cv_qc_samples', folder=args.out_figures) + pimmslearn.savefig(fig, name='cv_qc_samples', folder=args.out_figures) print((qc_cv_feat < CV_QC_SAMPLE).value_counts()) # only to ald_study data - ald_study = ald_study[vaep.analyzers.diff_analysis.select_feat(qc_samples[ald_study.columns])] + ald_study = ald_study[pimmslearn.analyzers.diff_analysis.select_feat(qc_samples[ald_study.columns])] ald_study # %% tags=["hide-input"] -fig, axes = vaep.plotting.plot_cutoffs(observed.unstack(), +fig, axes = pimmslearn.plotting.plot_cutoffs(observed.unstack(), feat_completness_over_samples=cutoffs.feat_completness_over_samples, min_feat_in_sample=cutoffs.min_feat_in_sample) -vaep.savefig(fig, name='tresholds_normal_imputation', folder=args.out_figures) +pimmslearn.savefig(fig, name='tresholds_normal_imputation', folder=args.out_figures) # %% [markdown] @@ -240,7 +240,7 @@ # %% tags=["hide-input"] pred_real_na = None if args.model_key and str(args.model_key) != 'None': - pred_real_na = (vaep + pred_real_na = (pimmslearn .analyzers .compare_predictions .load_single_csv_pred_file(fname) @@ -268,7 +268,7 @@ def plot_distributions(observed: pd.Series, sharex=True): """Plots distributions of intensities provided as dictionary of labels to pd.Series.""" series_ = [observed, imputation] if imputation is not None else [observed] - min_bin, max_bin = vaep.plotting.data.get_min_max_iterable([observed]) + min_bin, max_bin = pimmslearn.plotting.data.get_min_max_iterable([observed]) if imputation is not None: fig, axes = plt.subplots(len(series_), figsize=figsize, sharex=sharex) @@ -288,7 +288,7 @@ def plot_distributions(observed: pd.Series, if imputation is not None: ax = axes[1] label = f'Missing values imputed using {model_key.upper()}' - color = vaep.plotting.defaults.color_model_mapping.get(model_key, None) + color = pimmslearn.plotting.defaults.color_model_mapping.get(model_key, None) if color is None: color = f'C{1}' ax = imputation.hist(ax=ax, bins=bins, color=color) @@ -299,13 +299,13 @@ def plot_distributions(observed: pd.Series, return fig, bins -vaep.plotting.make_large_descriptors(6) +pimmslearn.plotting.make_large_descriptors(6) fig, bins = plot_distributions(observed, imputation=pred_real_na, model_key=args.model_key, figsize=(2.5, 2)) fname = args.out_folder / 'dist_plots' / f'real_na_obs_vs_{args.model_key}.pdf' files_out[fname.name] = fname.as_posix() -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # Dump frequency of histograms to file for reporting (if imputed values are used) @@ -313,8 +313,8 @@ def plot_distributions(observed: pd.Series, # %% tags=["hide-input"] if pred_real_na is not None: counts_per_bin = pd.concat([ - vaep.pandas.get_counts_per_bin(observed.to_frame('observed'), bins=bins), - vaep.pandas.get_counts_per_bin(pred_real_na.to_frame(args.model_key), bins=bins) + pimmslearn.pandas.get_counts_per_bin(observed.to_frame('observed'), bins=bins), + pimmslearn.pandas.get_counts_per_bin(pred_real_na.to_frame(args.model_key), bins=bins) ], axis=1) counts_per_bin.to_excel(fname.with_suffix('.xlsx')) 
logger.info("Counts per bin saved to %s", fname.with_suffix('.xlsx')) @@ -328,7 +328,7 @@ def plot_distributions(observed: pd.Series, # %% tags=["hide-input"] if pred_real_na is not None: - shifts = (vaep.imputation.compute_moments_shift(observed, pred_real_na, + shifts = (pimmslearn.imputation.compute_moments_shift(observed, pred_real_na, names=('observed', args.model_key))) display(pd.DataFrame(shifts).T) @@ -339,8 +339,8 @@ def plot_distributions(observed: pd.Series, if pred_real_na is not None: index_level = 0 # per sample mean_by_sample = pd.DataFrame( - {'observed': vaep.imputation.stats_by_level(observed, index_level=index_level), - args.model_key: vaep.imputation.stats_by_level(pred_real_na, index_level=index_level) + {'observed': pimmslearn.imputation.stats_by_level(observed, index_level=index_level), + args.model_key: pimmslearn.imputation.stats_by_level(pred_real_na, index_level=index_level) }) mean_by_sample.loc['mean_shift'] = (mean_by_sample.loc['mean', 'observed'] - mean_by_sample.loc['mean']).abs() / mean_by_sample.loc['std', 'observed'] diff --git a/project/10_2_ald_compare_methods.ipynb b/project/10_2_ald_compare_methods.ipynb index 02ff40448..fe890b062 100644 --- a/project/10_2_ald_compare_methods.ipynb +++ b/project/10_2_ald_compare_methods.ipynb @@ -29,14 +29,14 @@ "import seaborn as sns\n", "from IPython.display import display\n", "\n", - "import vaep\n", - "import vaep.databases.diseases\n", + "import pimmslearn\n", + "import pimmslearn.databases.diseases\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "\n", "plt.rcParams['figure.figsize'] = (2, 2)\n", "fontsize = 5\n", - "vaep.plotting.make_large_descriptors(fontsize)\n", + "pimmslearn.plotting.make_large_descriptors(fontsize)\n", "logging.getLogger('fontTools').setLevel(logging.ERROR)\n", "\n", "# catch passed parameters\n", @@ -96,10 +96,10 @@ }, "outputs": [], "source": [ - "params = vaep.nb.get_params(args, globals=globals())\n", - "args = vaep.nb.Config()\n", + "params = pimmslearn.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.Config()\n", "args.folder_experiment = Path(params[\"folder_experiment\"])\n", - "args = vaep.nb.add_default_paths(args,\n", + "args = pimmslearn.nb.add_default_paths(args,\n", " out_root=(\n", " args.folder_experiment\n", " / params[\"out_folder\"]\n", @@ -242,8 +242,8 @@ }, "outputs": [], "source": [ - "models = vaep.nb.Config.from_dict(\n", - " vaep.pandas.index_to_dict(scores.columns.get_level_values(0)))\n", + "models = pimmslearn.nb.Config.from_dict(\n", + " pimmslearn.pandas.index_to_dict(scores.columns.get_level_values(0)))\n", "vars(models)" ] }, @@ -557,7 +557,7 @@ " args.out_folder /\n", " f'diff_analysis_comparision_1_{args.model_key}')\n", "fname = files_out[f'diff_analysis_comparision_1_{args.model_key}']\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -599,7 +599,7 @@ "sns.move_legend(ax, \"upper right\")\n", "files_out[f'diff_analysis_comparision_2_{args.model_key}'] = (\n", " args.out_folder / f'diff_analysis_comparision_2_{args.model_key}')\n", - "vaep.savefig(\n", + "pimmslearn.savefig(\n", " fig, name=files_out[f'diff_analysis_comparision_2_{args.model_key}'])" ] }, @@ -668,7 +668,7 @@ }, "outputs": [], "source": [ - "data = vaep.databases.diseases.get_disease_association(\n", + "data = pimmslearn.databases.diseases.get_disease_association(\n", " doid=args.disease_ontology, limit=10000)\n", "data = pd.DataFrame.from_dict(data, 
orient='index').rename_axis('ENSP', axis=0)\n", "data = data.rename(columns={'name': args.annotaitons_gene_col}).reset_index(\n", diff --git a/project/10_2_ald_compare_methods.py b/project/10_2_ald_compare_methods.py index 9268c089c..314c6fcac 100644 --- a/project/10_2_ald_compare_methods.py +++ b/project/10_2_ald_compare_methods.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -26,14 +26,14 @@ import seaborn as sns from IPython.display import display -import vaep -import vaep.databases.diseases +import pimmslearn +import pimmslearn.databases.diseases -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() plt.rcParams['figure.figsize'] = (2, 2) fontsize = 5 -vaep.plotting.make_large_descriptors(fontsize) +pimmslearn.plotting.make_large_descriptors(fontsize) logging.getLogger('fontTools').setLevel(logging.ERROR) # catch passed parameters @@ -61,10 +61,10 @@ # Add set parameters to configuration # %% tags=["hide-input"] -params = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.Config() +params = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.Config() args.folder_experiment = Path(params["folder_experiment"]) -args = vaep.nb.add_default_paths(args, +args = pimmslearn.nb.add_default_paths(args, out_root=( args.folder_experiment / params["out_folder"] @@ -124,8 +124,8 @@ # Models in comparison (name mapping) # %% tags=["hide-input"] -models = vaep.nb.Config.from_dict( - vaep.pandas.index_to_dict(scores.columns.get_level_values(0))) +models = pimmslearn.nb.Config.from_dict( + pimmslearn.pandas.index_to_dict(scores.columns.get_level_values(0))) vars(models) # %% [markdown] @@ -262,7 +262,7 @@ def annotate_decision(scores, model, model_column): args.out_folder / f'diff_analysis_comparision_1_{args.model_key}') fname = files_out[f'diff_analysis_comparision_1_{args.model_key}'] -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # - also showing how many features were measured ("observed") by size of circle @@ -288,7 +288,7 @@ def annotate_decision(scores, model, model_column): sns.move_legend(ax, "upper right") files_out[f'diff_analysis_comparision_2_{args.model_key}'] = ( args.out_folder / f'diff_analysis_comparision_2_{args.model_key}') -vaep.savefig( +pimmslearn.savefig( fig, name=files_out[f'diff_analysis_comparision_2_{args.model_key}']) # %% [markdown] @@ -325,7 +325,7 @@ def annotate_decision(scores, model, model_column): # Query diseases database for gene associations with specified disease ontology id. 
# %% tags=["hide-input"] -data = vaep.databases.diseases.get_disease_association( +data = pimmslearn.databases.diseases.get_disease_association( doid=args.disease_ontology, limit=10000) data = pd.DataFrame.from_dict(data, orient='index').rename_axis('ENSP', axis=0) data = data.rename(columns={'name': args.annotaitons_gene_col}).reset_index( diff --git a/project/10_3_ald_ml_new_feat.ipynb b/project/10_3_ald_ml_new_feat.ipynb index 8d9ebd273..10223a536 100644 --- a/project/10_3_ald_ml_new_feat.ipynb +++ b/project/10_3_ald_ml_new_feat.ipynb @@ -36,19 +36,19 @@ "from njab.plotting.metrics import plot_split_auc, plot_split_prc\n", "from njab.sklearn.types import Splits\n", "\n", - "import vaep\n", - "import vaep.analyzers\n", - "import vaep.io.datasplits\n", + "import pimmslearn\n", + "import pimmslearn.analyzers\n", + "import pimmslearn.io.datasplits\n", "\n", "plt.rcParams['figure.figsize'] = (2.5, 2.5)\n", "plt.rcParams['lines.linewidth'] = 1\n", "plt.rcParams['lines.markersize'] = 2\n", "fontsize = 5\n", "figsize = (2.5, 2.5)\n", - "vaep.plotting.make_large_descriptors(fontsize)\n", + "pimmslearn.plotting.make_large_descriptors(fontsize)\n", "\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = pimmslearn.logging.setup_nb_logger()\n", "logging.getLogger('fontTools').setLevel(logging.ERROR)\n", "\n", "\n", @@ -130,10 +130,10 @@ }, "outputs": [], "source": [ - "params = vaep.nb.get_params(args, globals=globals())\n", - "args = vaep.nb.Config()\n", + "params = pimmslearn.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.Config()\n", "args.folder_experiment = Path(params[\"folder_experiment\"])\n", - "args = vaep.nb.add_default_paths(args,\n", + "args = pimmslearn.nb.add_default_paths(args,\n", " out_root=(args.folder_experiment\n", " / params[\"out_folder\"]\n", " / params[\"target\"]\n", @@ -189,7 +189,7 @@ }, "outputs": [], "source": [ - "data = vaep.io.datasplits.DataSplits.from_folder(\n", + "data = pimmslearn.io.datasplits.DataSplits.from_folder(\n", " args.data, file_format=args.file_format)\n", "data = pd.concat([data.train_X, data.val_y, data.test_y])\n", "data.sample(5)" @@ -228,7 +228,7 @@ "FRAC_PROTEIN_GROUPS: int = 0.622\n", "CV_QC_SAMPLE: float = 0.4\n", "\n", - "ald_study, cutoffs = vaep.analyzers.diff_analysis.select_raw_data(data.unstack(\n", + "ald_study, cutoffs = pimmslearn.analyzers.diff_analysis.select_raw_data(data.unstack(\n", "), data_completeness=DATA_COMPLETENESS, frac_protein_groups=FRAC_PROTEIN_GROUPS)\n", "\n", "if args.fn_qc_samples:\n", @@ -240,7 +240,7 @@ " ax = qc_cv_feat.plot.box(ax=ax)\n", " ax.set_ylabel('Coefficient of Variation')\n", " print((qc_cv_feat < CV_QC_SAMPLE).value_counts())\n", - " ald_study = ald_study[vaep.analyzers.diff_analysis.select_feat(qc_samples)]\n", + " ald_study = ald_study[pimmslearn.analyzers.diff_analysis.select_feat(qc_samples)]\n", "\n", "column_name_first_prot_to_pg = {\n", " pg.split(';')[0]: pg for pg in data.unstack().columns}\n", @@ -296,7 +296,7 @@ "source": [ "fname = args.out_preds / args.template_pred.format(args.model_key)\n", "print(f\"missing values pred. 
by {args.model_key}: {fname}\")\n", - "load_single_csv_pred_file = vaep.analyzers.compare_predictions.load_single_csv_pred_file\n", + "load_single_csv_pred_file = pimmslearn.analyzers.compare_predictions.load_single_csv_pred_file\n", "pred_real_na = load_single_csv_pred_file(fname).loc[mask_has_target]\n", "pred_real_na.sample(3)" ] @@ -621,7 +621,7 @@ "results_model_full.name = f'{args.model_key} all'\n", "fname = args.out_folder / f'results_{results_model_full.name}.pkl'\n", "files_out[fname.name] = fname\n", - "vaep.io.to_pickle(results_model_full, fname)\n", + "pimmslearn.io.to_pickle(results_model_full, fname)\n", "\n", "splits = Splits(X_train=X.loc[idx_train, new_features],\n", " X_test=X.loc[idx_test, new_features],\n", @@ -633,7 +633,7 @@ "results_model_new.name = f'{args.model_key} new'\n", "fname = args.out_folder / f'results_{results_model_new.name}.pkl'\n", "files_out[fname.name] = fname\n", - "vaep.io.to_pickle(results_model_new, fname)\n", + "pimmslearn.io.to_pickle(results_model_new, fname)\n", "\n", "splits_ald = Splits(\n", " X_train=ald_study.loc[idx_train],\n", @@ -646,7 +646,7 @@ "results_ald_full.name = 'ALD study all'\n", "fname = args.out_folder / f'results_{results_ald_full.name}.pkl'\n", "files_out[fname.name] = fname\n", - "vaep.io.to_pickle(results_ald_full, fname)" + "pimmslearn.io.to_pickle(results_ald_full, fname)" ] }, { @@ -674,7 +674,7 @@ "plot_split_auc(results_model_new.test, results_model_new.name, ax)\n", "fname = args.out_folder / 'auc_roc_curve.pdf'\n", "files_out[fname.name] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -764,7 +764,7 @@ "ax = plot_split_prc(results_model_new.test, results_model_new.name, ax)\n", "fname = folder = args.out_folder / 'prec_recall_curve.pdf'\n", "files_out[fname.name] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -817,7 +817,7 @@ "ax = plot_split_prc(results_model_new.train, results_model_new.name, ax)\n", "fname = folder = args.out_folder / 'prec_recall_curve_train.pdf'\n", "files_out[fname.name] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { @@ -837,7 +837,7 @@ "plot_split_auc(results_model_new.train, results_model_new.name, ax)\n", "fname = folder = args.out_folder / 'auc_roc_curve_train.pdf'\n", "files_out[fname.name] = fname\n", - "vaep.savefig(fig, name=fname)" + "pimmslearn.savefig(fig, name=fname)" ] }, { diff --git a/project/10_3_ald_ml_new_feat.py b/project/10_3_ald_ml_new_feat.py index 40bef222b..09403d94d 100644 --- a/project/10_3_ald_ml_new_feat.py +++ b/project/10_3_ald_ml_new_feat.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -32,19 +32,19 @@ from njab.plotting.metrics import plot_split_auc, plot_split_prc from njab.sklearn.types import Splits -import vaep -import vaep.analyzers -import vaep.io.datasplits +import pimmslearn +import pimmslearn.analyzers +import pimmslearn.io.datasplits plt.rcParams['figure.figsize'] = (2.5, 2.5) plt.rcParams['lines.linewidth'] = 1 plt.rcParams['lines.markersize'] = 2 fontsize = 5 figsize = (2.5, 2.5) -vaep.plotting.make_large_descriptors(fontsize) +pimmslearn.plotting.make_large_descriptors(fontsize) -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.ERROR) @@ -99,10 +99,10 @@ def 
parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: # %% tags=["hide-input"] -params = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.Config() +params = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.Config() args.folder_experiment = Path(params["folder_experiment"]) -args = vaep.nb.add_default_paths(args, +args = pimmslearn.nb.add_default_paths(args, out_root=(args.folder_experiment / params["out_folder"] / params["target"] @@ -128,7 +128,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: # Aggregated from data splits of the imputation workflow run before. # %% tags=["hide-input"] -data = vaep.io.datasplits.DataSplits.from_folder( +data = pimmslearn.io.datasplits.DataSplits.from_folder( args.data, file_format=args.file_format) data = pd.concat([data.train_X, data.val_y, data.test_y]) data.sample(5) @@ -146,7 +146,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: FRAC_PROTEIN_GROUPS: int = 0.622 CV_QC_SAMPLE: float = 0.4 -ald_study, cutoffs = vaep.analyzers.diff_analysis.select_raw_data(data.unstack( +ald_study, cutoffs = pimmslearn.analyzers.diff_analysis.select_raw_data(data.unstack( ), data_completeness=DATA_COMPLETENESS, frac_protein_groups=FRAC_PROTEIN_GROUPS) if args.fn_qc_samples: @@ -158,7 +158,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: ax = qc_cv_feat.plot.box(ax=ax) ax.set_ylabel('Coefficient of Variation') print((qc_cv_feat < CV_QC_SAMPLE).value_counts()) - ald_study = ald_study[vaep.analyzers.diff_analysis.select_feat(qc_samples)] + ald_study = ald_study[pimmslearn.analyzers.diff_analysis.select_feat(qc_samples)] column_name_first_prot_to_pg = { pg.split(';')[0]: pg for pg in data.unstack().columns} @@ -182,7 +182,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: # %% tags=["hide-input"] fname = args.out_preds / args.template_pred.format(args.model_key) print(f"missing values pred. 
by {args.model_key}: {fname}") -load_single_csv_pred_file = vaep.analyzers.compare_predictions.load_single_csv_pred_file +load_single_csv_pred_file = pimmslearn.analyzers.compare_predictions.load_single_csv_pred_file pred_real_na = load_single_csv_pred_file(fname).loc[mask_has_target] pred_real_na.sample(3) @@ -329,7 +329,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: results_model_full.name = f'{args.model_key} all' fname = args.out_folder / f'results_{results_model_full.name}.pkl' files_out[fname.name] = fname -vaep.io.to_pickle(results_model_full, fname) +pimmslearn.io.to_pickle(results_model_full, fname) splits = Splits(X_train=X.loc[idx_train, new_features], X_test=X.loc[idx_test, new_features], @@ -341,7 +341,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: results_model_new.name = f'{args.model_key} new' fname = args.out_folder / f'results_{results_model_new.name}.pkl' files_out[fname.name] = fname -vaep.io.to_pickle(results_model_new, fname) +pimmslearn.io.to_pickle(results_model_new, fname) splits_ald = Splits( X_train=ald_study.loc[idx_train], @@ -354,7 +354,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: results_ald_full.name = 'ALD study all' fname = args.out_folder / f'results_{results_ald_full.name}.pkl' files_out[fname.name] = fname -vaep.io.to_pickle(results_ald_full, fname) +pimmslearn.io.to_pickle(results_ald_full, fname) # %% [markdown] # ### ROC-AUC on test split @@ -366,7 +366,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: plot_split_auc(results_model_new.test, results_model_new.name, ax) fname = args.out_folder / 'auc_roc_curve.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # Data used to plot ROC: @@ -408,7 +408,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: ax = plot_split_prc(results_model_new.test, results_model_new.name, ax) fname = folder = args.out_folder / 'prec_recall_curve.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # Data used to plot PRC: @@ -429,7 +429,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: ax = plot_split_prc(results_model_new.train, results_model_new.name, ax) fname = folder = args.out_folder / 'prec_recall_curve_train.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% tags=["hide-input"] fig, ax = plt.subplots(1, 1, figsize=figsize) @@ -438,7 +438,7 @@ def parse_prc(*res: List[njab.sklearn.types.Results]) -> pd.DataFrame: plot_split_auc(results_model_new.train, results_model_new.name, ax) fname = folder = args.out_folder / 'auc_roc_curve_train.pdf' files_out[fname.name] = fname -vaep.savefig(fig, name=fname) +pimmslearn.savefig(fig, name=fname) # %% [markdown] # Output files: diff --git a/project/10_4_ald_compare_single_pg.ipynb b/project/10_4_ald_compare_single_pg.ipynb index 6a1644d68..ee68a478e 100644 --- a/project/10_4_ald_compare_single_pg.ipynb +++ b/project/10_4_ald_compare_single_pg.ipynb @@ -31,16 +31,16 @@ "import pandas as pd\n", "import seaborn\n", "\n", - "import vaep\n", - "import vaep.analyzers\n", - "import vaep.imputation\n", - "import vaep.io.datasplits\n", + "import pimmslearn\n", + "import pimmslearn.analyzers\n", + "import pimmslearn.imputation\n", + "import pimmslearn.io.datasplits\n", "\n", - "logger = vaep.logging.setup_nb_logger()\n", + "logger = 
pimmslearn.logging.setup_nb_logger()\n", "logging.getLogger('fontTools').setLevel(logging.WARNING)\n", "\n", "plt.rcParams['figure.figsize'] = [4, 2.5] # [16.0, 7.0] , [4, 3]\n", - "vaep.plotting.make_large_descriptors(7)\n", + "pimmslearn.plotting.make_large_descriptors(7)\n", "\n", "# catch passed parameters\n", "args = None\n", @@ -92,10 +92,10 @@ }, "outputs": [], "source": [ - "params = vaep.nb.get_params(args, globals=globals())\n", - "args = vaep.nb.Config()\n", + "params = pimmslearn.nb.get_params(args, globals=globals())\n", + "args = pimmslearn.nb.Config()\n", "args.folder_experiment = Path(params[\"folder_experiment\"])\n", - "args = vaep.nb.add_default_paths(args,\n", + "args = pimmslearn.nb.add_default_paths(args,\n", " out_root=(args.folder_experiment\n", " / params[\"out_folder\"]\n", " / params[\"target\"]))\n", @@ -589,7 +589,7 @@ }, "outputs": [], "source": [ - "data = vaep.io.datasplits.DataSplits.from_folder(\n", + "data = pimmslearn.io.datasplits.DataSplits.from_folder(\n", " args.data,\n", " file_format=args.file_format)\n", "data = pd.concat([data.train_X, data.val_y, data.test_y]).unstack()\n", @@ -697,7 +697,7 @@ }, "outputs": [], "source": [ - "load_single_csv_pred_file = vaep.analyzers.compare_predictions.load_single_csv_pred_file\n", + "load_single_csv_pred_file = pimmslearn.analyzers.compare_predictions.load_single_csv_pred_file\n", "pred_real_na = dict()\n", "for method in model_keys:\n", " fname = args.out_preds / args.template_pred.format(method)\n", @@ -795,7 +795,7 @@ }, "outputs": [], "source": [ - "min_y_int, max_y_int = vaep.plotting.data.get_min_max_iterable(\n", + "min_y_int, max_y_int = pimmslearn.plotting.data.get_min_max_iterable(\n", " [data.stack(), pred_real_na.stack()])\n", "min_max = min_y_int, max_y_int\n", "\n", @@ -921,7 +921,7 @@ " fname = (folder /\n", " f'{first_pg}_swarmplot.pdf')\n", " files_out[fname.name] = fname.as_posix()\n", - " vaep.savefig(\n", + " pimmslearn.savefig(\n", " fig,\n", " name=fname)\n", " plt.close()" diff --git a/project/10_4_ald_compare_single_pg.py b/project/10_4_ald_compare_single_pg.py index 3bb21e39d..00d2357e3 100644 --- a/project/10_4_ald_compare_single_pg.py +++ b/project/10_4_ald_compare_single_pg.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python @@ -28,16 +28,16 @@ import pandas as pd import seaborn -import vaep -import vaep.analyzers -import vaep.imputation -import vaep.io.datasplits +import pimmslearn +import pimmslearn.analyzers +import pimmslearn.imputation +import pimmslearn.io.datasplits -logger = vaep.logging.setup_nb_logger() +logger = pimmslearn.logging.setup_nb_logger() logging.getLogger('fontTools').setLevel(logging.WARNING) plt.rcParams['figure.figsize'] = [4, 2.5] # [16.0, 7.0] , [4, 3] -vaep.plotting.make_large_descriptors(7) +pimmslearn.plotting.make_large_descriptors(7) # catch passed parameters args = None @@ -62,10 +62,10 @@ # %% tags=["hide-input"] -params = vaep.nb.get_params(args, globals=globals()) -args = vaep.nb.Config() +params = pimmslearn.nb.get_params(args, globals=globals()) +args = pimmslearn.nb.Config() args.folder_experiment = Path(params["folder_experiment"]) -args = vaep.nb.add_default_paths(args, +args = pimmslearn.nb.add_default_paths(args, out_root=(args.folder_experiment / params["out_folder"] / params["target"])) @@ -276,7 +276,7 @@ # ## Measurments # %% tags=["hide-input"] -data = 
vaep.io.datasplits.DataSplits.from_folder( +data = pimmslearn.io.datasplits.DataSplits.from_folder( args.data, file_format=args.file_format) data = pd.concat([data.train_X, data.val_y, data.test_y]).unstack() @@ -318,7 +318,7 @@ pred_paths # %% tags=["hide-input"] -load_single_csv_pred_file = vaep.analyzers.compare_predictions.load_single_csv_pred_file +load_single_csv_pred_file = pimmslearn.analyzers.compare_predictions.load_single_csv_pred_file pred_real_na = dict() for method in model_keys: fname = args.out_preds / args.template_pred.format(method) @@ -356,7 +356,7 @@ # %% tags=["hide-input"] -min_y_int, max_y_int = vaep.plotting.data.get_min_max_iterable( +min_y_int, max_y_int = pimmslearn.plotting.data.get_min_max_iterable( [data.stack(), pred_real_na.stack()]) min_max = min_y_int, max_y_int @@ -467,7 +467,7 @@ def get_centered_label(method, n, q): fname = (folder / f'{first_pg}_swarmplot.pdf') files_out[fname.name] = fname.as_posix() - vaep.savefig( + pimmslearn.savefig( fig, name=fname) plt.close() diff --git a/project/10_5_comp_diff_analysis_repetitions.ipynb b/project/10_5_comp_diff_analysis_repetitions.ipynb index 24036f61c..92e40ec8b 100644 --- a/project/10_5_comp_diff_analysis_repetitions.ipynb +++ b/project/10_5_comp_diff_analysis_repetitions.ipynb @@ -12,7 +12,7 @@ "import njab\n", "import pandas as pd\n", "\n", - "import vaep" + "import pimmslearn" ] }, { diff --git a/project/10_5_comp_diff_analysis_repetitions.py b/project/10_5_comp_diff_analysis_repetitions.py index e49aa2b42..f75c9b59a 100644 --- a/project/10_5_comp_diff_analysis_repetitions.py +++ b/project/10_5_comp_diff_analysis_repetitions.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: vaep # language: python @@ -18,7 +18,7 @@ import njab import pandas as pd -import vaep +import pimmslearn # %% pickled_qvalues = snakemake.input.qvalues diff --git a/project/10_6_interpret_repeated_ald_da.py b/project/10_6_interpret_repeated_ald_da.py index bb685cb21..902e02968 100644 --- a/project/10_6_interpret_repeated_ald_da.py +++ b/project/10_6_interpret_repeated_ald_da.py @@ -2,11 +2,11 @@ import matplotlib.pyplot as plt from pathlib import Path import pandas as pd -import vaep -from vaep.analyzers.compare_predictions import load_single_csv_pred_file +import pimmslearn +from pimmslearn.analyzers.compare_predictions import load_single_csv_pred_file plt.rcParams['figure.figsize'] = (4, 2) -vaep.plotting.make_large_descriptors(5) +pimmslearn.plotting.make_large_descriptors(5) def load_pred_from_run(run_folder: Path, @@ -56,12 +56,12 @@ def load_pred_from_run(run_folder: Path, pred_real_na_cvs.to_excel(writer, float_format='%.3f', sheet_name='CVs') ax = pred_real_na_cvs.plot.hist(bins=15, - color=vaep.plotting.defaults.assign_colors(model_keys), + color=pimmslearn.plotting.defaults.assign_colors(model_keys), alpha=0.5) ax.yaxis.set_major_formatter('{x:,.0f}') ax.set_xlabel(f'Coefficient of variation of imputed intensites (N={len(pred_real_na):,d})') fname = reps_folder / 'pred_real_na_cvs.png' -vaep.savefig(ax.get_figure(), name=fname) +pimmslearn.savefig(ax.get_figure(), name=fname) # %% writer.close() diff --git a/project/10_7_ald_reduced_dataset_plots.ipynb b/project/10_7_ald_reduced_dataset_plots.ipynb index bdd4a796a..e520993e3 100644 --- a/project/10_7_ald_reduced_dataset_plots.ipynb +++ b/project/10_7_ald_reduced_dataset_plots.ipynb @@ -21,10 +21,10 @@ "import njab\n", "import pandas as pd\n", "\n", - 
"import vaep\n", + "import pimmslearn\n", "\n", "plt.rcParams['figure.figsize'] = [4, 2] # [16.0, 7.0] , [4, 3]\n", - "vaep.plotting.make_large_descriptors(6)\n", + "pimmslearn.plotting.make_large_descriptors(6)\n", "\n", "\n", "NONE_COL_NAME = 'No imputation\\n(None)'\n", @@ -35,7 +35,7 @@ "REF_MODEL = 'None (100%)'\n", "CUTOFF = 0.05\n", "\n", - "COLORS_TO_USE_MAPPTING = vaep.plotting.defaults.color_model_mapping\n", + "COLORS_TO_USE_MAPPTING = pimmslearn.plotting.defaults.color_model_mapping\n", "COLORS_TO_USE_MAPPTING[NONE_COL_NAME] = COLORS_TO_USE_MAPPTING['None']\n", "\n", "COLORS_CONTIGENCY_TABLE = {\n", @@ -253,7 +253,7 @@ "fname = out_folder / 'lost_signal_da_counts.pdf'\n", "da_target_sel_counts.fillna(0).to_excel(writer, sheet_name=fname.stem)\n", "files_out[fname.name] = fname.as_posix()\n", - "vaep.savefig(ax.figure, fname)" + "pimmslearn.savefig(ax.figure, fname)" ] }, { @@ -275,7 +275,7 @@ "ax.set_ylabel(\"q-value using 80% of the data\")\n", "fname = out_folder / 'lost_signal_qvalues.pdf'\n", "files_out[fname.name] = fname.as_posix()\n", - "vaep.savefig(ax.figure, fname)" + "pimmslearn.savefig(ax.figure, fname)" ] }, { @@ -327,7 +327,7 @@ "fname = out_folder / 'gained_signal_da_counts.pdf'\n", "da_target_sel_counts.fillna(0).to_excel(writer, sheet_name=fname.stem)\n", "files_out[fname.name] = fname.as_posix()\n", - "vaep.savefig(ax.figure, fname)" + "pimmslearn.savefig(ax.figure, fname)" ] }, { @@ -346,7 +346,7 @@ "ax.legend(loc='upper right')\n", "fname = out_folder / 'gained_signal_qvalues.pdf'\n", "files_out[fname.name] = fname.as_posix()\n", - "vaep.savefig(ax.figure, fname)" + "pimmslearn.savefig(ax.figure, fname)" ] }, { diff --git a/project/10_7_ald_reduced_dataset_plots.py b/project/10_7_ald_reduced_dataset_plots.py index 30c2750f2..f9d223e03 100644 --- a/project/10_7_ald_reduced_dataset_plots.py +++ b/project/10_7_ald_reduced_dataset_plots.py @@ -8,10 +8,10 @@ import njab import pandas as pd -import vaep +import pimmslearn plt.rcParams['figure.figsize'] = [4, 2] # [16.0, 7.0] , [4, 3] -vaep.plotting.make_large_descriptors(6) +pimmslearn.plotting.make_large_descriptors(6) NONE_COL_NAME = 'No imputation\n(None)' @@ -22,7 +22,7 @@ REF_MODEL = 'None (100%)' CUTOFF = 0.05 -COLORS_TO_USE_MAPPTING = vaep.plotting.defaults.color_model_mapping +COLORS_TO_USE_MAPPTING = pimmslearn.plotting.defaults.color_model_mapping COLORS_TO_USE_MAPPTING[NONE_COL_NAME] = COLORS_TO_USE_MAPPTING['None'] COLORS_CONTIGENCY_TABLE = { @@ -138,7 +138,7 @@ def plot_qvalues(df, x: str, y: list, ax=None, cutoff=0.05, fname = out_folder / 'lost_signal_da_counts.pdf' da_target_sel_counts.fillna(0).to_excel(writer, sheet_name=fname.stem) files_out[fname.name] = fname.as_posix() -vaep.savefig(ax.figure, fname) +pimmslearn.savefig(ax.figure, fname) # %% ax = plot_qvalues(df=sel, @@ -151,7 +151,7 @@ def plot_qvalues(df, x: str, y: list, ax=None, cutoff=0.05, ax.set_ylabel("q-value using 80% of the data") fname = out_folder / 'lost_signal_qvalues.pdf' files_out[fname.name] = fname.as_posix() -vaep.savefig(ax.figure, fname) +pimmslearn.savefig(ax.figure, fname) # %% [markdown] @@ -185,7 +185,7 @@ def plot_qvalues(df, x: str, y: list, ax=None, cutoff=0.05, fname = out_folder / 'gained_signal_da_counts.pdf' da_target_sel_counts.fillna(0).to_excel(writer, sheet_name=fname.stem) files_out[fname.name] = fname.as_posix() -vaep.savefig(ax.figure, fname) +pimmslearn.savefig(ax.figure, fname) # %% ax = plot_qvalues(sel, @@ -197,7 +197,7 @@ def plot_qvalues(df, x: str, y: list, ax=None, cutoff=0.05, 
ax.legend(loc='upper right') fname = out_folder / 'gained_signal_qvalues.pdf' files_out[fname.name] = fname.as_posix() -vaep.savefig(ax.figure, fname) +pimmslearn.savefig(ax.figure, fname) # %% [markdown] # Saved files diff --git a/project/misc_embeddings.py b/project/misc_embeddings.py index fcbc27eb4..dd7f26578 100644 --- a/project/misc_embeddings.py +++ b/project/misc_embeddings.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.2 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 (ipykernel) # language: python diff --git a/project/misc_illustrations.py b/project/misc_illustrations.py index 2e5c50b9f..00408cd5c 100644 --- a/project/misc_illustrations.py +++ b/project/misc_illustrations.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.0 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 # language: python diff --git a/project/misc_json_formats.ipynb b/project/misc_json_formats.ipynb index b503d2a04..9cf7a6878 100644 --- a/project/misc_json_formats.ipynb +++ b/project/misc_json_formats.ipynb @@ -19,8 +19,8 @@ "outputs": [], "source": [ "import pandas as pd\n", - "from vaep.io.data_objects import MqAllSummaries\n", - "from vaep.pandas import get_unique_non_unique_columns\n", + "from pimmslearn.io.data_objects import MqAllSummaries\n", + "from pimmslearn.pandas import get_unique_non_unique_columns\n", "\n", "mq_all_summaries = MqAllSummaries()" ] diff --git a/project/misc_json_formats.py b/project/misc_json_formats.py index b589ea91c..e0381b748 100644 --- a/project/misc_json_formats.py +++ b/project/misc_json_formats.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.2 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 (ipykernel) # language: python @@ -22,8 +22,8 @@ # %% import pandas as pd -from vaep.io.data_objects import MqAllSummaries -from vaep.pandas import get_unique_non_unique_columns +from pimmslearn.io.data_objects import MqAllSummaries +from pimmslearn.pandas import get_unique_non_unique_columns mq_all_summaries = MqAllSummaries() diff --git a/project/misc_pytorch_fastai_dataloaders.ipynb b/project/misc_pytorch_fastai_dataloaders.ipynb index 71ab4b573..a4fd0b9bf 100644 --- a/project/misc_pytorch_fastai_dataloaders.ipynb +++ b/project/misc_pytorch_fastai_dataloaders.ipynb @@ -31,14 +31,14 @@ "\n", "import torch\n", "\n", - "from vaep.logging import setup_nb_logger\n", + "from pimmslearn.logging import setup_nb_logger\n", "setup_nb_logger()\n", "\n", - "from vaep.io.datasplits import DataSplits\n", - "from vaep.io.datasets import DatasetWithMaskAndNoTarget, to_tensor\n", - "from vaep.transform import VaepPipeline\n", - "from vaep.models import ae\n", - "from vaep.utils import create_random_df\n", + "from pimmslearn.io.datasplits import DataSplits\n", + "from pimmslearn.io.datasets import DatasetWithMaskAndNoTarget, to_tensor\n", + "from pimmslearn.transform import VaepPipeline\n", + "from pimmslearn.models import ae\n", + "from pimmslearn.utils import create_random_df\n", "\n", "np.random.seed(42)\n", "print(f\"fastai version: {fastai.__version__}\")\n", @@ -557,7 +557,7 @@ "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import StandardScaler\n", "\n", - "import vaep\n", + "import pimmslearn\n", "# import importlib; importlib.reload(vaep); importlib.reload(vaep.transform)\n", "\n", "dae_default_pipeline = sklearn.pipeline.Pipeline(\n", 
@@ -588,7 +588,7 @@ "metadata": {}, "outputs": [], "source": [ - "from vaep.io.dataloaders import get_dls\n", + "from pimmslearn.io.dataloaders import get_dls\n", "dls = get_dls(data.train_X, data.val_y, dae_transforms, bs=4) \n", "dls.valid.one_batch()" ] @@ -698,7 +698,7 @@ "metadata": {}, "outputs": [], "source": [ - "from vaep.transform import MinMaxScaler\n", + "from pimmslearn.transform import MinMaxScaler\n", "\n", "args_vae = {}\n", "args_vae['SCALER'] = MinMaxScaler\n", diff --git a/project/misc_pytorch_fastai_dataloaders.py b/project/misc_pytorch_fastai_dataloaders.py index 0dddfeef6..05e252391 100644 --- a/project/misc_pytorch_fastai_dataloaders.py +++ b/project/misc_pytorch_fastai_dataloaders.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.2 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 (ipykernel) # language: python @@ -32,14 +32,14 @@ import torch -from vaep.logging import setup_nb_logger +from pimmslearn.logging import setup_nb_logger setup_nb_logger() -from vaep.io.datasplits import DataSplits -from vaep.io.datasets import DatasetWithMaskAndNoTarget, to_tensor -from vaep.transform import VaepPipeline -from vaep.models import ae -from vaep.utils import create_random_df +from pimmslearn.io.datasplits import DataSplits +from pimmslearn.io.datasets import DatasetWithMaskAndNoTarget, to_tensor +from pimmslearn.transform import VaepPipeline +from pimmslearn.models import ae +from pimmslearn.utils import create_random_df np.random.seed(42) print(f"fastai version: {fastai.__version__}") @@ -312,7 +312,7 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor: from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler -import vaep +import pimmslearn # import importlib; importlib.reload(vaep); importlib.reload(vaep.transform) dae_default_pipeline = sklearn.pipeline.Pipeline( @@ -329,7 +329,7 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor: valid_ds[:4] # %% -from vaep.io.dataloaders import get_dls +from pimmslearn.io.dataloaders import get_dls dls = get_dls(data.train_X, data.val_y, dae_transforms, bs=4) dls.valid.one_batch() @@ -394,7 +394,7 @@ def to_tensor(self, s: pd.Series) -> torch.Tensor: # ## Variational Autoencoder # %% -from vaep.transform import MinMaxScaler +from pimmslearn.transform import MinMaxScaler args_vae = {} args_vae['SCALER'] = MinMaxScaler diff --git a/project/misc_pytorch_fastai_dataset.ipynb b/project/misc_pytorch_fastai_dataset.ipynb index 622f2ac4d..681244ffe 100644 --- a/project/misc_pytorch_fastai_dataset.ipynb +++ b/project/misc_pytorch_fastai_dataset.ipynb @@ -16,11 +16,13 @@ "metadata": {}, "outputs": [], "source": [ + "from pimmslearn.io.datasplits import long_format\n", + "from fastai.collab import CollabDataLoaders\n", "import random\n", - "import numpy as np\n", + "\n", "import pandas as pd\n", - "import vaep.io.datasets as datasets\n", - "import vaep.utils as test_data" + "import pimmslearn.io.datasets as datasets\n", + "import pimmslearn.utils as test_data" ] }, { @@ -39,9 +41,9 @@ "source": [ "## Datasets\n", "\n", - "- `PeptideDatasetInMemory`\n", - "- `PeptideDatasetInMemoryMasked`\n", - "- `PeptideDatasetInMemoryNoMissings`" + "- `DatasetWithMaskAndNoTarget`\n", + "- `DatasetWithTarget`\n", + "- `DatasetWithTargetSpecifyTarget`" ] }, { @@ -50,7 +52,8 @@ "tags": [] }, "source": [ - "## `DatasetWithMaskAndNoTarget`" + "### `DatasetWithMaskAndNoTarget`\n", + "- base class for datasets with missing values and no target" ] }, 
{ @@ -65,60 +68,6 @@ "_array, _mask" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### `PeptideDatasetInMemory`\n", - "\n", - "- with duplicated target in memory" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = datasets.PeptideDatasetInMemory(data)\n", - "for _array, _mask, _target in dataset:\n", - " break\n", - "_array, _mask, _target" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "id(_array), id(_mask), id(_target) " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "_array is _target # should be true" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data = test_data.create_random_missing_data(N, M, prop_missing=0.3)\n", - "dataset = datasets.PeptideDatasetInMemoryMasked(df=pd.DataFrame(data), fill_na=25.0)\n", - "\n", - "for _array, _mask in dataset:\n", - " if any(_mask):\n", - " print(_array, _mask)\n", - " break" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -159,8 +108,7 @@ "df = pd.DataFrame(data)\n", "\n", "val_y = df.stack().groupby(level=0).sample(frac=0.2)\n", - "# targets = val_y.unstack().sort_index()\n", - "targets = val_y.unstack()\n", + "targets = val_y.unstack().sort_index(axis=1)\n", "\n", "df[targets.notna()] = pd.NA\n", "df" @@ -201,32 +149,21 @@ "metadata": {}, "outputs": [], "source": [ - "row = random.randint(0,len(dataset)-1)\n", + "row = random.randint(0, len(dataset) - 1)\n", "print(f\"{row = }\")\n", "dataset[row]" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### `PeptideDatasetInMemoryNoMissings`" - ] - }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "id": "37ba1a81", + "metadata": { + "lines_to_next_cell": 2 + }, "outputs": [], "source": [ - "# data and pd.DataFrame.data share the same memory\n", - "try:\n", - " dataset = datasets.PeptideDatasetInMemoryNoMissings(data)\n", - " for _array in dataset:\n", - " print(_array)\n", - " break\n", - "except AssertionError as e:\n", - " print(e)" + "dataset[row:row + 2]" ] }, { @@ -244,12 +181,9 @@ "metadata": {}, "outputs": [], "source": [ - "from fastai.collab import CollabDataLoaders\n", "# , MSELossFlat, Learner\n", "# from fastai.collab import EmbeddingDotBias\n", "\n", - "from vaep.io.datasplits import long_format\n", - "\n", "\n", "data = pd.DataFrame(data)\n", "data.index.name, data.columns.name = ('Sample ID', 'peptide')\n", @@ -264,10 +198,10 @@ "metadata": {}, "outputs": [], "source": [ - "dls = CollabDataLoaders.from_df(df_long, valid_pct=0.15, \n", + "dls = CollabDataLoaders.from_df(df_long, valid_pct=0.15,\n", " user_name='Sample ID', item_name='peptide', rating_name='intensity',\n", - " bs=4)\n", - "type(dls.dataset), dls.dataset._dl_type # no __mro__?" + " bs=4)\n", + "type(dls.dataset), dls.dataset._dl_type # no __mro__?" ] }, { @@ -381,7 +315,6 @@ "metadata": {}, "outputs": [], "source": [ - "from torch.utils.data.dataloader import _SingleProcessDataLoaderIter\n", "_SingleProcessDataLoaderIter??" ] }, @@ -391,7 +324,7 @@ "source": [ "So.. 
It seems too complicated\n", "- the `_collate_fn` seems to aggrete the data from the DataFrame\n", - "- should be possible to keep track of that " + "- should be possible to keep track of that" ] }, { diff --git a/project/misc_pytorch_fastai_dataset.py b/project/misc_pytorch_fastai_dataset.py index eede3262e..203951f7e 100644 --- a/project/misc_pytorch_fastai_dataset.py +++ b/project/misc_pytorch_fastai_dataset.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.2 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 (ipykernel) # language: python @@ -19,13 +19,13 @@ # Datasets are provided to `DataLoaders` which perform the aggreation to batches. # %% -from vaep.io.datasplits import long_format +from pimmslearn.io.datasplits import long_format from fastai.collab import CollabDataLoaders import random import pandas as pd -import vaep.io.datasets as datasets -import vaep.utils as test_data +import pimmslearn.io.datasets as datasets +import pimmslearn.utils as test_data # %% N, M = 15, 7 diff --git a/project/misc_sampling_in_pandas.ipynb b/project/misc_sampling_in_pandas.ipynb index 05500f95d..a932d90c6 100644 --- a/project/misc_sampling_in_pandas.ipynb +++ b/project/misc_sampling_in_pandas.ipynb @@ -26,7 +26,7 @@ "metadata": {}, "outputs": [], "source": [ - "from vaep.utils import create_random_df\n", + "from pimmslearn.utils import create_random_df\n", "X = create_random_df(100, 15, prop_na=0.1).stack().to_frame(\n", " 'intensity').reset_index()\n", "\n", diff --git a/project/misc_sampling_in_pandas.py b/project/misc_sampling_in_pandas.py index 0924f7453..0ffdaaf2c 100644 --- a/project/misc_sampling_in_pandas.py +++ b/project/misc_sampling_in_pandas.py @@ -5,7 +5,7 @@ # extension: .py # format_name: percent # format_version: '1.3' -# jupytext_version: 1.15.2 +# jupytext_version: 1.16.2 # kernelspec: # display_name: Python 3 (ipykernel) # language: python @@ -22,7 +22,7 @@ # ## Some random data # %% -from vaep.utils import create_random_df +from pimmslearn.utils import create_random_df X = create_random_df(100, 15, prop_na=0.1).stack().to_frame( 'intensity').reset_index() diff --git a/project/workflow/notebooks/best_repeated_split_collect_metrics.py b/project/workflow/notebooks/best_repeated_split_collect_metrics.py index d62346e49..026a4b4d8 100644 --- a/project/workflow/notebooks/best_repeated_split_collect_metrics.py +++ b/project/workflow/notebooks/best_repeated_split_collect_metrics.py @@ -16,7 +16,7 @@ import json from pathlib import Path import pandas as pd -import vaep.models.collect_dumps +import pimmslearn.models.collect_dumps REPITITION_NAME = snakemake.params.repitition_name @@ -28,14 +28,14 @@ def load_metric_file(fname: Path): fname = Path(fname) with open(fname) as f: loaded = json.load(f) - loaded = vaep.pandas.flatten_dict_of_dicts(loaded) + loaded = pimmslearn.pandas.flatten_dict_of_dicts(loaded) key = key_from_fname(fname) # '_'.join(key_from_fname(fname)) return key, loaded load_metric_file(snakemake.input.metrics[0]) # %% -all_metrics = vaep.models.collect_dumps.collect(snakemake.input.metrics, load_metric_file) +all_metrics = pimmslearn.models.collect_dumps.collect(snakemake.input.metrics, load_metric_file) metrics = pd.DataFrame(all_metrics) metrics = metrics.set_index('id') metrics.index = pd.MultiIndex.from_tuples( diff --git a/project/workflow/notebooks/best_repeated_train_collect_metrics.py b/project/workflow/notebooks/best_repeated_train_collect_metrics.py index 9e252d73e..36e2899be 100644 --- 
a/project/workflow/notebooks/best_repeated_train_collect_metrics.py +++ b/project/workflow/notebooks/best_repeated_train_collect_metrics.py @@ -16,7 +16,7 @@ import json from pathlib import Path import pandas as pd -import vaep.models.collect_dumps +import pimmslearn.models.collect_dumps REPITITION_NAME = snakemake.params.repitition_name @@ -36,7 +36,7 @@ def load_metric_file(fname: Path, frist_split='metrics'): fname = Path(fname) with open(fname) as f: loaded = json.load(f) - loaded = vaep.pandas.flatten_dict_of_dicts(loaded) + loaded = pimmslearn.pandas.flatten_dict_of_dicts(loaded) key = key_from_fname(fname) # '_'.join(key_from_fname(fname)) return key, loaded @@ -45,7 +45,7 @@ def load_metric_file(fname: Path, frist_split='metrics'): # %% -all_metrics = vaep.models.collect_dumps.collect(snakemake.input.metrics, load_metric_file) +all_metrics = pimmslearn.models.collect_dumps.collect(snakemake.input.metrics, load_metric_file) metrics = pd.DataFrame(all_metrics) metrics = metrics.set_index('id') metrics.index = pd.MultiIndex.from_tuples( From da886b95e73d44621f8ff9bd853d560922d55de8 Mon Sep 17 00:00:00 2001 From: Henry Date: Tue, 2 Jul 2024 16:52:48 +0200 Subject: [PATCH 08/13] :white_check_mark: Test sklearn transformer interfaces --- tests/models/test_transformers.py | 48 +++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 tests/models/test_transformers.py diff --git a/tests/models/test_transformers.py b/tests/models/test_transformers.py new file mode 100644 index 000000000..b0744bb01 --- /dev/null +++ b/tests/models/test_transformers.py @@ -0,0 +1,48 @@ +"""Test scikit-learn transformers provided by PIMMS.""" +import numpy as np +import pandas as pd +import pytest + +from pimmslearn.sklearn.ae_transformer import AETransformer +from pimmslearn.sklearn.cf_transformer import CollaborativeFilteringTransformer + +test_data = 'project/data/dev_datasets/HeLa_6070/protein_groups_wide_N50_M227.csv' +index_name = 'Sample ID' +column_name = 'protein group' +value_name = 'intensity' + + +def test_CollaborativeFilteringTransformer(): + model = CollaborativeFilteringTransformer( + target_column=value_name, + sample_column=index_name, + item_column=column_name,) + # read data, name index and columns + df = pd.read_csv(test_data, index_col=0) + df = np.log2(df + 1) + df.index.name = index_name # already set + df.columns.name = column_name # not set due to csv disk file format + series = df.stack() + series.name = value_name # ! 
important
+    # run for 2 epochs
+    model.fit(series, cuda=False, epochs_max=2)
+
+
+@pytest.mark.parametrize("model", ['DAE', 'VAE'])
+def test_AETransformer(model):
+    df = pd.read_csv(test_data, index_col=0)
+    df = np.log2(df + 1)
+
+    df.index.name = index_name  # already set
+    df.columns.name = column_name  # not set due to csv disk file format
+    model = AETransformer(
+        model=model,
+        hidden_layers=[512,],
+        latent_dim=50,
+        out_folder='runs/scikit_interface',
+        batch_size=10,
+    )
+    model.fit(df,
+              cuda=False,
+              epochs_max=2,
+              )

From f44eeb8ee85b21cda147f397218c05bda62f5f9b Mon Sep 17 00:00:00 2001
From: Henry
Date: Wed, 3 Jul 2024 10:30:28 +0200
Subject: [PATCH 09/13] :bug: pimmslearn as pkg name and pimms as environment
 everywhere in code

- fix building of documentation on readthedocs
- update environment names
---
 .github/workflows/ci.yaml                     |   2 +-
 .github/workflows/workflow_website.yaml       |   2 +-
 docs/README.md                                |  12 +-
 docs/conf.py                                  |   2 +-
 environment.yml                               |   2 +-
 project/00_5_training_data_exploration.py     |   4 +-
 project/01_1_train_KNN_unique_samples.py      |   2 +-
 project/02_3_grid_search_analysis.py          |  24 +--
 project/03_1_best_models_comparison.py        |   2 +-
 project/03_2_best_models_comparison_fig2.py   |   2 +-
 .../10_5_comp_diff_analysis_repetitions.ipynb |   4 +-
 .../10_5_comp_diff_analysis_repetitions.py    |   4 +-
 project/bin/run_snakemake.sh                  |   2 +-
 project/bin/run_snakemake_cluster.sh          |   2 +-
 project/misc_pytorch_fastai_dataloaders.py    | 148 +++++++++---------
 .../Snakefile_best_across_datasets.smk        |   4 +-
 project/workflow/Snakefile_grid.smk           |   2 +-
 project/workflow/Snakefile_small_N.smk        |   2 +-
 project/workflow/Snakefile_v2.smk             |   4 +-
 project/workflow/TestNotebooks.smk            |   2 +-
 20 files changed, 113 insertions(+), 115 deletions(-)

diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index 8c7944b41..79ccd52f5 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -35,7 +35,7 @@ jobs:
           channel-priority: disabled
           python-version: ${{ matrix.python-version }}
           environment-file: environment.yml
-          activate-environment: vaep
+          activate-environment: pimms
           auto-activate-base: true
           # auto-update-conda: true
       - name: inspect-conda-environment
diff --git a/.github/workflows/workflow_website.yaml b/.github/workflows/workflow_website.yaml
index bc3cef6d9..101f4ead0 100644
--- a/.github/workflows/workflow_website.yaml
+++ b/.github/workflows/workflow_website.yaml
@@ -27,7 +27,7 @@ jobs:
           channel-priority: disabled
           python-version: "3.8"
           environment-file: environment.yml
-          activate-environment: vaep
+          activate-environment: pimms
           auto-activate-base: true
           # auto-update-conda: true
       - name: Dry-run workflow
diff --git a/docs/README.md b/docs/README.md
index 53ff5ce29..ce6ab620e 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -8,16 +8,10 @@ In order to build the docs you need to
 
 Command to be run from `path/to/pimms/docs`, i.e. 
from within the `docs` package folder:
 
-```bash
-# pip install pimms[docs]
-# pwd: ./vaep/docs
-conda env update -f environment.yml
-```
-
-If you prefer pip, run
+Install pimms-learn locally with the docs option:
 
 ```bash
-# pwd: ./vaep
+# pwd: ./pimms
 pip install .[docs]
 ```
 
@@ -31,7 +25,7 @@ Options:
 ```bash
 # pwd: ./pimms/docs
 # apidoc
-sphinx-apidoc --force --implicit-namespaces --module-first -o reference ../vaep
+sphinx-apidoc --force --implicit-namespaces --module-first -o reference ../pimmslearn
 # build docs
 sphinx-build -n -W --keep-going -b html ./ ./_build/
 ```
diff --git a/docs/conf.py b/docs/conf.py
index 0f4551115..76dcc264d 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -131,7 +131,7 @@
 from pathlib import Path
 
 PROJECT_ROOT = Path(__file__).parent.parent
-PACKAGE_ROOT = PROJECT_ROOT / "vaep"
+PACKAGE_ROOT = PROJECT_ROOT / "pimmslearn"
 
 def run_apidoc(_):
     from sphinx.ext import apidoc
diff --git a/environment.yml b/environment.yml
index 8415e1b58..a1ab25782 100644
--- a/environment.yml
+++ b/environment.yml
@@ -1,5 +1,5 @@
 # Dev Environment
-name: vaep
+name: pimms
 channels:
   - conda-forge
   - pytorch
diff --git a/project/00_5_training_data_exploration.py b/project/00_5_training_data_exploration.py
index f5033f015..8e45733ce 100644
--- a/project/00_5_training_data_exploration.py
+++ b/project/00_5_training_data_exploration.py
@@ -7,9 +7,9 @@
 #       format_version: '1.3'
 #       jupytext_version: 1.16.2
 #   kernelspec:
-#     display_name: vaep
+#     display_name: Python 3
 #     language: python
-#     name: vaep
+#     name: python3
 # ---
 
 # %% [markdown]
diff --git a/project/01_1_train_KNN_unique_samples.py b/project/01_1_train_KNN_unique_samples.py
index 0adf97c95..4276ea9e7 100644
--- a/project/01_1_train_KNN_unique_samples.py
+++ b/project/01_1_train_KNN_unique_samples.py
@@ -31,7 +31,7 @@
 from pimmslearn.io import datasplits
 from pimmslearn.models import ae
 
-logger = pimmslearn.logging.setup_logger(logging.getLogger('vaep'))
+logger = pimmslearn.logging.setup_logger(logging.getLogger('pimmslearn'))
 logger.info("Experiment 03 - Analysis of latent spaces and performance comparisions")
 
 figures = {}  # collection of ax or figures
diff --git a/project/02_3_grid_search_analysis.py b/project/02_3_grid_search_analysis.py
index 99891a7b3..4c7dbe890 100644
--- a/project/02_3_grid_search_analysis.py
+++ b/project/02_3_grid_search_analysis.py
@@ -585,7 +585,7 @@ def get_plotly_figure(dataset: str, x='latent_dim'):
 # %%
 freq_feat = sampling.frequency_by_index(data.train_X, 0)
 freq_feat.name = 'freq'
-# freq_feat = vaep.io.datasplits.load_freq(data_folder) # could be loaded from datafolder
+# freq_feat = pimmslearn.io.datasplits.load_freq(data_folder) # could be loaded from datafolder
 freq_feat.head()  # training data
 
 # %%
@@ -837,17 +837,17 @@ def get_plotly_figure(dataset: str, x='latent_dim'):
 # Save html versin of curve with annotation of errors
 
 # %%
-fig = px_vaep.line((errors_smoothed_long.loc[errors_smoothed_long[freq_feat.name] >= FREQ_MIN]
-                    .join(n_obs_error_is_based_on)
-                    .sort_values(by='freq')),
-                   x=freq_feat.name,
-                   color='model',
-                   y='rolling error average',
-                   title=f'Rolling average error by feature frequency {msg_annotation}',
-                   labels=labels_dict,
-                   hover_data=[feat_count.name, idx_name, 'n_obs'],
-                   category_orders={'model': order_models})
-fig = px_vaep.apply_default_layout(fig)
+fig = px.line((errors_smoothed_long.loc[errors_smoothed_long[freq_feat.name] >= FREQ_MIN]
+              .join(n_obs_error_is_based_on)
+              .sort_values(by='freq')),
+              x=freq_feat.name,
+              color='model',
+              y='rolling 
error average', + title=f'Rolling average error by feature frequency {msg_annotation}', + labels=labels_dict, + hover_data=[feat_count.name, idx_name, 'n_obs'], + category_orders={'model': order_models}) +fig = px.apply_default_layout(fig) fig.update_layout(legend_title_text='') # remove legend title files_out[f'best_models_errors_{dataset}_by_freq_plotly.html'] = (FOLDER / f'best_models_errors_{dataset}_by_freq_plotly.html') diff --git a/project/03_1_best_models_comparison.py b/project/03_1_best_models_comparison.py index 1c6c31b23..2314f612f 100644 --- a/project/03_1_best_models_comparison.py +++ b/project/03_1_best_models_comparison.py @@ -26,7 +26,7 @@ import pimmslearn.plotting from pimmslearn.logging import setup_logger -logger = setup_logger(logger=logging.getLogger('vaep'), level=10) +logger = setup_logger(logger=logging.getLogger('pimmslearn'), level=10) plt.rcParams['figure.figsize'] = [4.0, 2.0] pimmslearn.plotting.make_large_descriptors(7) diff --git a/project/03_2_best_models_comparison_fig2.py b/project/03_2_best_models_comparison_fig2.py index f5901e9c0..cb7205db1 100644 --- a/project/03_2_best_models_comparison_fig2.py +++ b/project/03_2_best_models_comparison_fig2.py @@ -28,7 +28,7 @@ import logging from pimmslearn.logging import setup_logger -logger = setup_logger(logger=logging.getLogger('vaep'), level=10) +logger = setup_logger(logger=logging.getLogger('pimmslearn'), level=10) # %% diff --git a/project/10_5_comp_diff_analysis_repetitions.ipynb b/project/10_5_comp_diff_analysis_repetitions.ipynb index 92e40ec8b..188ebda99 100644 --- a/project/10_5_comp_diff_analysis_repetitions.ipynb +++ b/project/10_5_comp_diff_analysis_repetitions.ipynb @@ -368,9 +368,9 @@ ], "metadata": { "kernelspec": { - "display_name": "vaep", + "display_name": "Python 3", "language": "python", - "name": "vaep" + "name": "python3" }, "language_info": { "codemirror_mode": { diff --git a/project/10_5_comp_diff_analysis_repetitions.py b/project/10_5_comp_diff_analysis_repetitions.py index f75c9b59a..301394504 100644 --- a/project/10_5_comp_diff_analysis_repetitions.py +++ b/project/10_5_comp_diff_analysis_repetitions.py @@ -7,9 +7,9 @@ # format_version: '1.3' # jupytext_version: 1.16.2 # kernelspec: -# display_name: vaep +# display_name: Python 3 # language: python -# name: vaep +# name: python3 # --- # %% diff --git a/project/bin/run_snakemake.sh b/project/bin/run_snakemake.sh index 9b7b2d032..c5e4a11e0 100644 --- a/project/bin/run_snakemake.sh +++ b/project/bin/run_snakemake.sh @@ -25,7 +25,7 @@ cd $PBS_O_WORKDIR # start_conda . ~/setup_conda.sh -conda activate vaep +conda activate pimms # try to influence how many jobs are run in parallel in one job training a model export MKL_NUM_THREADS=5 diff --git a/project/bin/run_snakemake_cluster.sh b/project/bin/run_snakemake_cluster.sh index bc96a46a6..cc632bba1 100644 --- a/project/bin/run_snakemake_cluster.sh +++ b/project/bin/run_snakemake_cluster.sh @@ -46,7 +46,7 @@ echo config_split $config_split echo config_train $config_train . 
~/setup_conda.sh -conda activate vaep +conda activate pimms snakemake -s workflow/Snakefile_v2.smk --jobs 10 -k -p -c2 --latency-wait 60 --rerun-incomplete \ --configfile $configfile \ diff --git a/project/misc_pytorch_fastai_dataloaders.py b/project/misc_pytorch_fastai_dataloaders.py index 05e252391..a0fd7fdf9 100644 --- a/project/misc_pytorch_fastai_dataloaders.py +++ b/project/misc_pytorch_fastai_dataloaders.py @@ -15,70 +15,76 @@ # %% [markdown] # # `DataLoaders` for feeding data into models + # %% +import fastai import numpy as np import pandas as pd +import pytest +import sklearn +import torch +from fastai.data.core import DataLoaders +# from fastai.tabular.all import * +from fastai.tabular.all import * +from fastai.tabular.core import (FillMissing, IndexSplitter, Normalize, + TabularPandas) +from fastcore.basics import store_attr +from sklearn.impute import SimpleImputer +from sklearn.preprocessing import StandardScaler -import fastai -from fastai.tabular.core import Normalize -from fastai.tabular.core import FillMissing -from fastai.tabular.core import TabularPandas -from fastai.tabular.core import IndexSplitter -# make DataLoaders.test_dl work for DataFrames as test_items: +from pimmslearn.io.dataloaders import get_dls +from pimmslearn.io.datasets import DatasetWithMaskAndNoTarget +from pimmslearn.io.datasplits import DataSplits +from pimmslearn.logging import setup_nb_logger +from pimmslearn.models import ae +from pimmslearn.transform import MinMaxScaler, VaepPipeline +from pimmslearn.utils import create_random_df -# from fastai.tabular.all import * -from fastai.tabular.all import TabularDataLoaders -from fastcore.transform import Pipeline +# make DataLoaders.test_dl work for DataFrames as test_items: -import torch -from pimmslearn.logging import setup_nb_logger setup_nb_logger() -from pimmslearn.io.datasplits import DataSplits -from pimmslearn.io.datasets import DatasetWithMaskAndNoTarget, to_tensor -from pimmslearn.transform import VaepPipeline -from pimmslearn.models import ae -from pimmslearn.utils import create_random_df np.random.seed(42) print(f"fastai version: {fastai.__version__}") print(f"torch version: {torch.__version__}") # %% -from fastcore.transform import Pipeline -from fastcore.basics import store_attr + class FillMissingKeepAll(FillMissing): """Replacement for `FillMissing` including also non-missing features in the training data which might be missing in the validation or test data. """ + def setups(self, to): - store_attr(but='to', na_dict={n:self.fill_strategy(to[n], self.fill_vals[n]) - for n in to.conts.keys()}) + store_attr(but='to', na_dict={n: self.fill_strategy(to[n], self.fill_vals[n]) + for n in to.conts.keys()}) self.fill_strategy = self.fill_strategy.__name__ - # %% [markdown] # Create data # # - train data without missings # - validation and test data with missings # -# Could be adapted to have more or less missing in training, validation or test data. Choosen as in current version the validation data cannot contain features with missing values which were not missing in the training data. - +# Could be adapted to have more or less missing in training, validation or +# test data. Choosen as in current version the validation data cannot +# contain features with missing values which were not missing in the +# training data. 
# %% N, M = 150, 15 create_df = create_random_df X = create_df(N, M) -X = X.append(create_df(int(N*0.3), M, prop_na=.1, start_idx=len(X))) +X = pd.concat([X, create_df(int(N * 0.3), M, prop_na=.1, start_idx=len(X))]) -idx_val = X.index[N:] # RandomSplitter could be used, but used to show IndexSplitter usage with Tabular +idx_val = X.index[N:] # RandomSplitter could be used, but used to show IndexSplitter usage with Tabular -X_test = create_df(int(N*0.1), M, prop_na=.1, start_idx=len(X)) +X_test = create_df(int(N * 0.1), M, prop_na=.1, start_idx=len(X)) data = DataSplits(train_X=X.loc[X.index.difference(idx_val)], val_y=X.loc[idx_val], @@ -114,7 +120,8 @@ def setups(self, to): # _ = tf_fillna.setup(to) # ``` # -# No added in a manuel pipeline. See [opened issue](https://github.com/fastai/fastai/issues/3530) on `Tabular` behaviour. +# No added in a manuel pipeline. See [opened issue](https://github.com/fastai/fastai/issues/3530) +# on `Tabular` behaviour. # Setting transformation (procs) in the constructor is somehow not persistent, although very similar code is called. # # ``` @@ -123,14 +130,14 @@ def setups(self, to): # ``` # %% -X = data.train_X.append(data.val_y) +X = pd.concat([data.train_X, data.val_y]) + +splits = X.index.get_indexer(data.val_y.index) # In Tabular iloc is used, not loc for splitting +splits = IndexSplitter(splits)(X) # splits is are to list of integer indicies (for iloc) -splits = X.index.get_indexer(data.val_y.index) # In Tabular iloc is used, not loc for splitting -splits = IndexSplitter(splits)(X) # splits is are to list of integer indicies (for iloc) - procs = [Normalize, FillMissingKeepAll] -to = TabularPandas(X, procs=procs, cont_names=X.columns.to_list(), splits=splits) # to = tabular object +to = TabularPandas(X, procs=procs, cont_names=X.columns.to_list(), splits=splits) # to = tabular object print("Tabular object:", type(to)) to.items.head() @@ -148,14 +155,14 @@ def setups(self, to): # ```python # # (#2) # [ -# FillMissingKeepAll -- -# {'fill_strategy': , -# 'add_col': True, +# FillMissingKeepAll -- +# {'fill_strategy': , +# 'add_col': True, # 'fill_vals': defaultdict(, {'feat_00': 0, 'feat_01': 0, 'feat_02': 0, ..., 'feat_14': 13.972452} # }: # encodes: (object,object) -> encodes # decodes: , -# Normalize -- +# Normalize -- # {'mean': None, 'std': None, 'axes': (0, 2, 3), # 'means': {'feat_00': 14.982738, 'feat_01': 13.158741, 'feat_02': 14.800485, ..., 'feat_14': 8.372757} # }: @@ -202,8 +209,13 @@ def setups(self, to): # #### Transform test data manuelly # %% -to_test = TabularPandas(data.test_y.copy(), procs=None, cont_names=data.test_y.columns.to_list(), splits=None, do_setup=True) -_ = procs(to_test) # inplace operation +to_test = TabularPandas( + data.test_y.copy(), + procs=None, + cont_names=data.test_y.columns.to_list(), + splits=None, + do_setup=True) +_ = procs(to_test) # inplace operation to_test.items.head() # %% @@ -213,11 +225,11 @@ def setups(self, to): # #### Feeding one batch to the model # %% -cats, conts, ys = dls.one_batch() +cats, conts, ys = dls.one_batch() # %% model = ae.Autoencoder(n_features=M, n_neurons=int( - M/2), last_decoder_activation=None, dim_latent=10) + M / 2), last_decoder_activation=None, dim_latent=10) model # %% [markdown] @@ -228,7 +240,8 @@ def setups(self, to): # %% [markdown] # #### target -# - missing puzzle piece is to have a `callable` y-block which transforms part of the input. 
+# - missing puzzle piece is to have a `callable` y-block which transforms part of the
+# input. In principle it could be the same as the continuous features
 
 # %% [markdown]
 # ### PyTorch Dataset
@@ -242,7 +255,6 @@ def setups(self, to):
 # #### DataLoaders
 
 # %%
-from fastai.data.core import DataLoaders
 
 dls = DataLoaders.from_dsets(train_ds, valid_ds, bs=4)
 
@@ -253,27 +265,29 @@ def setups(self, to):
 # #### DataLoaders with Normalization fastai Transform
 
 # %%
-from fastai.tabular.all import *
+
+
 class Normalize(Transform):
     def setup(self, array):
         self.mean = array.mean()
         # this assumes tensor, numpy arrays and alike
         # should be applied along axis 0 (over the samples)
         self.std = array.std()  # ddof=0 in scikit-learn
-
-    def encodes(self, x): # -> torch.Tensor: # with type annotation this throws an error
+
+    def encodes(self, x):  # -> torch.Tensor: # with type annotation this throws an error
         x_enc = (x - self.mean) / self.std
         return x_enc
 
-    def decodes(self, x_enc:torch.tensor) -> torch.Tensor:
+    def decodes(self, x_enc: torch.tensor) -> torch.Tensor:
         x = (self.std * x_enc) + self.mean
         return x
-
+
+
 o_tf_norm = Normalize()
 o_tf_norm.setup(data.train_X)
 
-o_tf_norm(data.val_y.head()) # apply this manueally to each dataset
+o_tf_norm(data.val_y.head())  # apply this manually to each dataset
 
 # %%
-o_tf_norm.encodes # object= everything
+o_tf_norm.encodes  # object= everything
 
 # %%
 train_ds = DatasetWithMaskAndNoTarget(df=o_tf_norm(data.train_X))
@@ -289,14 +303,13 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor:
 dls.valid.one_batch()
 
 # %%
-import pytest
-from numpy.testing import assert_array_almost_equal, assert_array_less
 assert (dls.valid.one_batch()[1] < 0.0).any(), "Normalization did not work."
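+# note (added comment): `DatasetWithMaskAndNoTarget` apparently expects a
+# scikit-learn style transformer with a `transform` method; the fastai
+# `Transform` above only defines `encodes`/`decodes`, hence the AttributeError
+# checked below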
with pytest.raises(AttributeError): DatasetWithMaskAndNoTarget(df=data.val_y, transformer=o_tf_norm) - -# assert_array_almost_equal(DatasetWithMaskAndNoTarget(df=data.val_y, transformer=o_tf_norm)[0][1], DatasetWithMaskAndNoTarget(df=o_tf_norm(data.val_y))[0][1]) + +# assert_array_almost_equal(DatasetWithMaskAndNoTarget +# (df=data.val_y, transformer=o_tf_norm)[0][1], DatasetWithMaskAndNoTarget(df=o_tf_norm(data.val_y))[0][1]) # with pytest.raises(AttributeError): # valid_ds.inverse_transform(dls.valid.one_batch()[1]) @@ -304,17 +317,9 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor: # #### DataLoaders with Normalization sklearn transform # # - solve transformation problem by composition -# - inverse transform only used for +# - inverse transform only used for # %% -import sklearn -# from sklearn import preprocessing -from sklearn.impute import SimpleImputer -from sklearn.preprocessing import StandardScaler - -import pimmslearn -# import importlib; importlib.reload(vaep); importlib.reload(vaep.transform) - dae_default_pipeline = sklearn.pipeline.Pipeline( [ ('normalize', StandardScaler()), @@ -329,8 +334,8 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor: valid_ds[:4] # %% -from pimmslearn.io.dataloaders import get_dls -dls = get_dls(data.train_X, data.val_y, dae_transforms, bs=4) + +dls = get_dls(data.train_X, data.val_y, dae_transforms, bs=4) dls.valid.one_batch() # %% @@ -341,7 +346,7 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor: test_dl.one_batch() # %% -dae_transforms.inverse_transform(test_dl.one_batch()[1]) # here the missings are not replaced +dae_transforms.inverse_transform(test_dl.one_batch()[1]) # here the missings are not replaced # %% data.test_y.head(4) @@ -352,8 +357,8 @@ def decodes(self, x_enc:torch.tensor) -> torch.Tensor: # - adding `Transforms` not possible, I openend a [discussion](https://forums.fast.ai/t/correct-output-type-for-tensor-created-from-dataframe-custom-new-task-tutorial/92564) # %% -from typing import Tuple -from fastai.tabular.all import * + + # from fastai.torch_core import TensorBase @@ -365,12 +370,12 @@ def __init__(self, df: pd.DataFrame): self.mask_obs = df.isna() # .astype('uint8') # in case 0,1 is preferred self.data = df - def encodes(self, idx): # -> Tuple[torch.Tensor, torch.Tensor]: # annotation is interpreted + def encodes(self, idx): # -> Tuple[torch.Tensor, torch.Tensor]: # annotation is interpreted mask = self.mask_obs.iloc[idx] data = self.data.iloc[idx] # return (self.to_tensor(mask), self.to_tensor(data)) # return (Tensor(mask), Tensor(data)) - return (tensor(data), tensor(mask)) #TabData, TabMask + return (tensor(data), tensor(mask)) # TabData, TabMask def to_tensor(self, s: pd.Series) -> torch.Tensor: return torch.from_numpy(s.values) @@ -384,8 +389,8 @@ def to_tensor(self, s: pd.Series) -> torch.Tensor: DatasetTransform(data.val_y)) dls = DataLoaders.from_dsets(train_tl, valid_tl, -# after_item=[Normalize], -# after_batch=[Normalize], + # after_item=[Normalize], + # after_batch=[Normalize], bs=4) print(f"\n{DatasetTransform.encodes = }") dls.one_batch() @@ -394,7 +399,6 @@ def to_tensor(self, s: pd.Series) -> torch.Tensor: # ## Variational Autoencoder # %% -from pimmslearn.transform import MinMaxScaler args_vae = {} args_vae['SCALER'] = MinMaxScaler diff --git a/project/workflow/Snakefile_best_across_datasets.smk b/project/workflow/Snakefile_best_across_datasets.smk index ea208a1c4..4a09a91a1 100644 --- a/project/workflow/Snakefile_best_across_datasets.smk +++ 
b/project/workflow/Snakefile_best_across_datasets.smk @@ -51,7 +51,7 @@ rule collect_metrics: run: from pathlib import Path import pandas as pd - import vaep.models + import pimmslearn.models REPITITION_NAME = params.repitition_name @@ -62,7 +62,7 @@ rule collect_metrics: return key - all_metrics = vaep.models.collect_metrics(input.metrics, key_from_fname) + all_metrics = pimmslearn.models.collect_metrics(input.metrics, key_from_fname) metrics = pd.DataFrame(all_metrics).T metrics.index.names = ("data level", REPITITION_NAME) metrics diff --git a/project/workflow/Snakefile_grid.smk b/project/workflow/Snakefile_grid.smk index 299a512ea..137b046e4 100644 --- a/project/workflow/Snakefile_grid.smk +++ b/project/workflow/Snakefile_grid.smk @@ -1,4 +1,4 @@ -from vaep.io.types import resolve_type +from pimmslearn.io.types import resolve_type from snakemake.utils import min_version min_version("6.0") diff --git a/project/workflow/Snakefile_small_N.smk b/project/workflow/Snakefile_small_N.smk index 807f7d211..07b611a91 100644 --- a/project/workflow/Snakefile_small_N.smk +++ b/project/workflow/Snakefile_small_N.smk @@ -1,5 +1,5 @@ from pathlib import Path -from vaep.io.types import resolve_type +from pimmslearn.io.types import resolve_type from snakemake.utils import min_version from snakemake.logging import logger diff --git a/project/workflow/Snakefile_v2.smk b/project/workflow/Snakefile_v2.smk index 1db8774af..1e5e13f45 100644 --- a/project/workflow/Snakefile_v2.smk +++ b/project/workflow/Snakefile_v2.smk @@ -138,7 +138,7 @@ rule train_NAGuideR_model: method="{method}", name="{method}", conda: - "vaep" + "pimms" shell: "papermill {input.nb} {output.nb}" " -r train_split {input.train_split}" @@ -191,7 +191,7 @@ rule train_models: out="{folder_experiment}/01_1_train_{model}.o", name="{model}", conda: - "vaep" + "pimms" shell: "papermill {input.nb} {output.nb}" " -f {input.configfile}" diff --git a/project/workflow/TestNotebooks.smk b/project/workflow/TestNotebooks.smk index 9d8399827..9c29355a2 100644 --- a/project/workflow/TestNotebooks.smk +++ b/project/workflow/TestNotebooks.smk @@ -18,6 +18,6 @@ rule execute: output: nb="test_nb/{file}", # conda: - # vaep + # pimms shell: "papermill {input.nb} {output.nb}" From b0f5ef6195673bb925f330034dbde0f30246b386 Mon Sep 17 00:00:00 2001 From: Henry Date: Tue, 9 Jul 2024 14:04:20 +0200 Subject: [PATCH 10/13] :white_check_mark: test transform method of Transformers --- tests/models/test_transformers.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tests/models/test_transformers.py b/tests/models/test_transformers.py index b0744bb01..4e4aebb8f 100644 --- a/tests/models/test_transformers.py +++ b/tests/models/test_transformers.py @@ -26,6 +26,8 @@ def test_CollaborativeFilteringTransformer(): series.name = value_name # ! 
important
     # run for 2 epochs
     model.fit(series, cuda=False, epochs_max=2)
+    df_imputed = model.transform(series).unstack()
+    assert df_imputed.isna().sum().sum() == 0
 
 
 @pytest.mark.parametrize("model", ['DAE', 'VAE'])
@@ -46,3 +48,5 @@ def test_AETransformer(model):
         cuda=False,
         epochs_max=2,
     )
+    df_imputed = model.transform(df)
+    assert df_imputed.isna().sum().sum() == 0

From fb4c802d7d134f76e89a11942313ff5a95db2855 Mon Sep 17 00:00:00 2001
From: Henry
Date: Tue, 9 Jul 2024 14:05:18 +0200
Subject: [PATCH 11/13] :memo: add minimal usage example to README

---
 README.md | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 84 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 86ec184b6..2f014780b 100644
--- a/README.md
+++ b/README.md
@@ -31,17 +31,89 @@ In our experiments overfitting wasn't a big issue, but it's easy to check.
 
 For interactive use of the models provided in PIMMS, you can use our
 [python package `pimms-learn`](https://pypi.org/project/pimms-learn/).
-The interface is similar to scikit-learn.
+The interface is similar to scikit-learn. The package is then available as `pimmslearn`
+for import in your Python session.
 
 ```
 pip install pimms-learn
+# import pimmslearn # in your python script
 ```
 
-Then you can use the models on a pandas DataFrame with missing values. You can try this in the tutorial on Colab by uploading your data:
-[![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)
+The most basic usage is imputing missing values in a pandas DataFrame:
+
+```python
+import numpy as np
+import pandas as pd
+from pimmslearn.sklearn.ae_transformer import AETransformer
+from pimmslearn.sklearn.cf_transformer import CollaborativeFilteringTransformer
+
+fn_intensities = ('https://raw.githubusercontent.com/RasmussenLab/pimms/main/'
+                  'project/data/dev_datasets/HeLa_6070/protein_groups_wide_N50.csv')
+index_name = 'Sample ID'
+column_name = 'protein group'
+value_name = 'intensity'
+
+df = pd.read_csv(fn_intensities, index_col=0)
+df = np.log2(df + 1)
+
+df.index.name = index_name  # already set
+df.columns.name = column_name  # not set due to csv disk file format
+
+# df  # see the details section below for a preview of the DataFrame
+
+# use the Denoising or Variational Autoencoder
+model = AETransformer(
+    model='DAE',  # or 'VAE'
+    hidden_layers=[512,],
+    latent_dim=50,  # dimension of joint sample and item embedding
+    batch_size=10,
+)
+model.fit(df,
+          cuda=False,
+          epochs_max=100,
+          )
+df_imputed = model.transform(df)
+
+# or use the collaborative filtering model
+series = df.stack()
+series.name = value_name  # ! important
+model = CollaborativeFilteringTransformer(
+    target_column=value_name,
+    sample_column=index_name,
+    item_column=column_name,
+    n_factors=30,  # dimension of separate sample and item embedding
+    batch_size=4096,
+)
+model.fit(series, cuda=False, epochs_max=20)
+df_imputed = model.transform(series).unstack()
+```
+
+<details>
+  <summary>see log2 transformed DataFrame</summary>
+
+  First 10 rows and 10 columns. Notice that both indices are named: the row
+  index is `Sample ID` and the column index is `protein group`.
+
+  | Sample ID | AAAS | AACS | AAMDC | AAMP | AAR2 | AARS | AARS2 | AASDHPPT | AATF | ABCB10 |
+  |:-----------------------------------------------|--------:|---------:|---------:|---------:|---------:|--------:|---------:|-----------:|--------:|---------:|
+  | 2019_12_18_14_35_Q-Exactive-HF-X-Orbitrap_6070 | 28.3493 | 26.1332 | nan | 26.7769 | 27.2478 | 32.1949 | 27.1526 | 27.8721 | 28.6025 | 26.1103 |
+  | 2019_12_19_19_48_Q-Exactive-HF-X-Orbitrap_6070 | 27.6574 | 25.0186 | 24.2362 | 26.2707 | 27.2107 | 31.9792 | 26.5302 | 28.1915 | 27.9419 | 25.7349 |
+  | 2019_12_20_14_15_Q-Exactive-HF-X-Orbitrap_6070 | 28.3522 | 23.7405 | nan | 27.0979 | 27.3774 | 32.8845 | 27.5145 | 28.4756 | 28.7709 | 26.7868 |
+  | 2019_12_27_12_29_Q-Exactive-HF-X-Orbitrap_6070 | 26.8255 | nan | nan | 26.2563 | nan | 31.9264 | 26.1569 | 27.6349 | 27.8508 | 25.346 |
+  | 2019_12_29_15_06_Q-Exactive-HF-X-Orbitrap_6070 | 27.4037 | 26.9485 | 23.8644 | 26.9816 | 26.5198 | 31.8438 | 25.3421 | 27.4164 | 27.4741 | nan |
+  | 2019_12_29_18_18_Q-Exactive-HF-X-Orbitrap_6070 | 27.8913 | 26.481 | 26.3475 | 27.8494 | 26.917 | 32.2737 | nan | 27.4041 | 28.0811 | nan |
+  | 2020_01_02_17_38_Q-Exactive-HF-X-Orbitrap_6070 | 25.4983 | nan | nan | nan | nan | 30.2256 | nan | 23.8013 | 25.1304 | nan |
+  | 2020_01_03_11_17_Q-Exactive-HF-X-Orbitrap_6070 | 27.3519 | nan | 24.4331 | 25.2752 | 24.8459 | 30.9793 | nan | 24.893 | 25.3238 | nan |
+  | 2020_01_03_16_58_Q-Exactive-HF-X-Orbitrap_6070 | 27.6197 | 25.6238 | 23.5204 | 27.1356 | 25.9713 | 31.4154 | 25.3596 | 25.1191 | 25.75 | nan |
+  | 2020_01_03_20_10_Q-Exactive-HF-X-Orbitrap_6070 | 27.2998 | nan | 25.6604 | 27.7328 | 26.8965 | 31.4546 | 25.4369 | 26.8135 | 26.2008 | nan |
+  ...
+
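+  (A preview like this can be generated with e.g. `df.iloc[:10, :10].to_markdown()`,
+  which requires the optional `tabulate` dependency.)
+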
+</details>
+
+
+For hints on how to add validation (and optionally test) data in order to use early stopping,
+see the tutorial: [![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)
 
-> `PIMMS` was called `vaep` during development.
-> Before entire refactoring has been completed the imported package will be `vaep`.
 
 ## PIMMS comparison workflow and differential analysis workflow
@@ -169,7 +241,11 @@ python 04_1_train_pimms_models.py # just execute the code
 
 If you only want to execute the workflow, you can use snakemake to build the environments
 for you:
 
-> Snakefile workflow for imputation v1 only support that atm.
+Install snakemake e.g. using the provided [`snakemake_env.yml`](https://github.com/RasmussenLab/pimms/blob/HEAD/snakemake_env.yml)
+file as used in
+[this workflow](https://github.com/RasmussenLab/pimms/blob/HEAD/.github/workflows/ci_workflow.yaml).
+
+> [!NOTE] Only the Snakefile workflow for imputation v1 supports this at the moment.
 
 ```bash
 snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda -n # dry-run
@@ -236,7 +312,7 @@ To combine them with the observed data you can run
 # ipython or python session
 # be in ./pimms/project folder
 folder_data = 'runs/example/data'
-data = vaep.io.datasplits.DataSplits.from_folder(
+data = pimmslearn.io.datasplits.DataSplits.from_folder(
     folder_data, file_format='pkl')
 observed = pd.concat([data.train_X, data.val_y, data.test_y])
 # load predictions for missing values of a certain model
@@ -249,7 +325,7 @@ assert df_imputed.isna().sum().sum() == 0
 df_imputed
 ```
 
-> [!NOTE]: The imputation is simpler if you use the provide scikit-learn Transformer
+> [!NOTE] The imputation is simpler if you use the provided scikit-learn Transformer
 > interface (see [Tutorial](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)).
 
 ## Available imputation methods

From af4f92b8b37a9e8736a89c6b355dab128ad11109 Mon Sep 17 00:00:00 2001
From: Henry
Date: Tue, 9 Jul 2024 14:56:16 +0200
Subject: [PATCH 12/13] :construction: see if emojis are rendered on readthedocs

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 2f014780b..808bbe421 100644
--- a/README.md
+++ b/README.md
@@ -89,7 +89,7 @@ df_imputed = model.transform(series).unstack()
 ```
 
 <details>
-  <summary>see log2 transformed DataFrame</summary>
+  <summary>:mag: see log2 transformed DataFrame</summary>
 
   First 10 rows and 10 columns. Notice that both indices are named: the row
   index is `Sample ID` and the column index is `protein group`.
 
@@ -245,7 +245,7 @@ file as used in
 [this workflow](https://github.com/RasmussenLab/pimms/blob/HEAD/.github/workflows/ci_workflow.yaml).
 
-> [!NOTE] Only the Snakefile workflow for imputation v1 supports this at the moment.
+> :warning: Only the Snakefile workflow for imputation v1 supports this at the moment.
 
 ```bash
 snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda -n # dry-run
@@ -325,7 +325,7 @@ assert df_imputed.isna().sum().sum() == 0
 df_imputed
 ```
 
-> [!NOTE] The imputation is simpler if you use the provided scikit-learn Transformer
+> :warning: The imputation is simpler if you use the provided scikit-learn Transformer
 > interface (see [Tutorial](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)).
 
 ## Available imputation methods

From 83d01019e626e0fa5740eef0ae39d6337ed98cee Mon Sep 17 00:00:00 2001
From: Henry
Date: Tue, 9 Jul 2024 15:42:29 +0200
Subject: [PATCH 13/13] :bug: emojis: replace shortcode with unicode icons

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 808bbe421..0693cac50 100644
--- a/README.md
+++ b/README.md
@@ -89,7 +89,7 @@ df_imputed = model.transform(series).unstack()
 ```
 
 <details>
-  <summary>:mag: see log2 transformed DataFrame</summary>
+  <summary>🔍 see log2 transformed DataFrame</summary>
 
   First 10 rows and 10 columns. Notice that both indices are named: the row
   index is `Sample ID` and the column index is `protein group`.
 
@@ -178,7 +178,7 @@ cd project # project folder as pwd
 papermill 01_0_split_data.ipynb --help-notebook
 papermill 01_1_train_vae.ipynb --help-notebook
 ```
-> :warning: Mistyped argument names won't throw an error when using papermill, but a warning is printed on the console thanks to my contributions:)
+> ⚠️ Mistyped argument names won't throw an error when using papermill, but a warning is printed on the console thanks to my contributions:)
 
 ## Setup workflow and development environment
@@ -211,7 +211,7 @@ If on Mac M1, M2 or having otherwise issue using your accelerator (e.g. GPUs): I
 
 ### Install pytorch first (2)
 
-> :warning: We currently see issues with some installations on M1 chips. A dependency
+> ⚠️ We currently see issues with some installations on M1 chips. A dependency
 > for one workflow is polars, which causes the issue. This should be [fixed now](https://github.com/RasmussenLab/njab/pull/13)
 > for general use by delayed import
 > of `mrmr-selection` in `njab`. If you encounter issues, please open an issue.
@@ -245,7 +245,7 @@ Install snakemake e.g. using the provided [`snakemake_env.yml`](https://github.c
 file as used in
 [this workflow](https://github.com/RasmussenLab/pimms/blob/HEAD/.github/workflows/ci_workflow.yaml).
 
-> :warning: Only the Snakefile workflow for imputation v1 supports this at the moment.
+> ⚠️ Only the Snakefile workflow for imputation v1 supports this at the moment.
 
 ```bash
 snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda -n # dry-run