wilhelm-lab · JSchlensok · Sep 17, 2024 · Jun 20, 2024 · Jul 8, 2024 · Jul 8, 2024
diff --git a/docs/API.rst b/docs/API.rst
@@ -62,28 +62,34 @@ Predicting: :code:`pr`
 
 .. currentmodule:: oktoberfest
 
-Access to functions that communicate with a Koina server to retrieve predictions from various prediction models.
+Access to functions that communicate with a Koina server to retrieve predictions from various prediction models or that serve pre-trained TensorFlow models locally.
 
 High level features
 ~~~~~~~~~~~~~~~~~~~
 
-.. autosummary::
-   :toctree: api/pr
-
-   pr.predict_intensities
-   pr.predict_rt
-   pr.ce_calibration
+.. autoclass:: pr.predictor.Predictor
+    :members: from_config, predict_intensities, predict_irt, ce_calibration
 
 Koina interface
 ~~~~~~~~~~~~~~~
 
 .. autosummary::
    :toctree: api/pr
 
-   pr.predict
-   pr.predict_at_once
-   pr.predict_in_chunks
+   pr.Predictor.predict
+   pr.Predictor.predict_at_once
+   pr.Predictor.predict_in_chunks
+   pr.koina.Koina
+
+DLomix interface
+~~~~~~~~~~~~~~~~
+
+.. autosummary::
+    :toctree: api/pr
 
+    pr.Predictor.predict
+    pr.Predictor.predict_at_once
+    pr.dlomix.DLomix
 
 Rescoring: :code:`re`
 ---------------------

diff --git a/docs/config.rst b/docs/config.rst
@@ -1,7 +1,7 @@
 Configuration
 =============
 
-The following provides an overview of all available flags in the configuration file to use the high level API and run jobs. Parameters may be applicable to more than one job type and are collected within indivdual tables.
+The following provides an overview of all available flags in the configuration file to use the high-level API and run jobs. Parameters may be applicable to more than one job type and are collected within individual tables.
 
 Always applicable
 -----------------
@@ -18,7 +18,7 @@ Always applicable
    +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | models                     | Contains information about the used models for peptide property prediction (see following 2 nested parameters)                                                                                                                                                                             |
    +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-   |     intensity              | Name of the model used for fragment intensity prediction                                                                                                                                                                                                                                   |
+   |     intensity              | Name or path of the model used for fragment intensity prediction                                                                                                                                                                                                                           |
    +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |     irt                    | Name of the model used for indexed retention time prediction                                                                                                                                                                                                                               |
    +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -140,3 +140,15 @@ Applicable to in-silico digestion
    +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |     db                     | Defines whether the digestion should contain only targets, only decoys or both (concatenated); can be "target", "decoy" or "concat"; default = "concat"            |
    +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+Applicable to local prediction and transfer learning
+----------------------------------------------------
+
+.. table::
+   :class: fixed-table local-prediction-config-table
+
+   +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+   | Parameter                  |                             Description                                                                                                                            |
+   +============================+====================================================================================================================================================================+
+   | predictIntensityLocally    | Defines whether an off-line model should be used for predicting intensity; can be True or False; default = False                                                   |
+   +----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
diff --git a/docs/installation.rst b/docs/installation.rst
@@ -17,7 +17,7 @@ The installer script automatically installs dependencies and creates a new conda
    wget https://raw.githubusercontent.com/wilhelm-lab/oktoberfest/main/installer.sh -O install_oktoberfest.sh
    bash install_oktoberfest.sh
 
-The installer searches for existing anaconda / miniconda installation. If none was found, it will download and install miniconda.
+The installer searches for an existing anaconda / miniconda installation. If none is found, it will download and install miniconda.
 
 Docker Image
 ------------

diff --git a/docs/outputs.rst b/docs/outputs.rst
@@ -12,7 +12,9 @@ General directory structure
     +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     | Directory             | Description                                                                                                                                                                                                                                                                                                                                                                                                                   |
     +=======================+===============================================================================================================================================================================================================================================================================================================================================================================================================================+
-    | data/                 | Contains hdf5 files that combine search results, annotated spectra, metadata in <spectra_file>.mzML.hdf5 and predictions in <spectra_file>.mzML.pred.hdf5 where <spectra_file> is replaced with the specific name of the RAW file for which information is stored. The files are updated and store the progress of the current job and enable skipping specific steps when rerunning a job.                                   |
+    | data/                 | Contains spectra processed for usage in machine learning applications with DLomix: preprocessed datasets in Parquet format, as well as lists of ion types and modifications in them in plain text format.                                                                                                                                                                                                                     |
+    +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+    | data/dlomix           | Contains hdf5 files that combine search results, annotated spectra, metadata in <spectra_file>.mzML.hdf5 and predictions in <spectra_file>.mzML.pred.hdf5 where <spectra_file> is replaced with the specific name of the RAW file for which information is stored. The files are updated and store the progress of the current job and enable skipping specific steps when rerunning a job.                                   |
     +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     | msms/                 | Contains combined msms.prosit and separated search results in <spectra_file>.prosit where <spectra_file> is replaced with the spectra file name for which search results are stored. The files are stored in the `internal format <./internal_format.html>`_  and are created as part of preprocessing search results from supported search engines. If a file is present, preprocessing is skipped when rerunning a job.     |
     +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

diff --git a/oktoberfest/data/spectra.py b/oktoberfest/data/spectra.py
@@ -34,19 +34,25 @@ class Spectra(anndata.AnnData):
     INTENSITY_LAYER_NAME = "raw_int"
     MZ_LAYER_NAME = "mz"
     COLUMNS_FRAGMENT_ION = ["Y1+", "Y1++", "Y1+++", "B1+", "B1++", "B1+++"]
+    MAX_CHARGE = 3
 
     @staticmethod
-    def _gen_vars_df() -> pd.DataFrame:
+    def _gen_vars_df(specified_ion_types: Optional[List[str]] = None) -> pd.DataFrame:
         """
         Creates Annotation dataframe for vars in AnnData object.
 
+        :param specified_ion_types: ion types that are expected to be in the spectra. If None, defaults to ["y", "b"].
         :return: pd.Dataframe of Frgment Annotation
         """
-        ion_nums = np.repeat(np.arange(1, 30), 6)
-        ion_charge = np.tile([1, 2, 3], 29 * 2)
+        if not specified_ion_types:
+            specified_ion_types = ["y", "b"]
+
+        number_of_ion_types = len(specified_ion_types)
+        ion_nums = np.repeat(np.arange(1, 30), 3 * number_of_ion_types)
+        ion_charge = np.tile([1, 2, 3], 29 * number_of_ion_types)
         temp_cols = []
         for size in range(1, 30):
-            for typ in ["y", "b"]:
+            for typ in specified_ion_types:
                 for charge in ["+1", "+2", "+3"]:
                     temp_cols.append(f"{typ}{size}{charge}")
         ion_types = [frag[0] for frag in temp_cols]
@@ -55,7 +61,7 @@ def _gen_vars_df() -> pd.DataFrame:
         return var_df
 
     @staticmethod
-    def _gen_column_names(fragment_type: FragmentType) -> List[str]:
+    def _gen_column_names(fragment_type: FragmentType) -> List[str]:  # TODO: also accept fragmentation_methods: Set[str]
         """
         Get column names of the spectra data.
 
@@ -260,7 +266,7 @@ def get_matrix(self, fragment_type: FragmentType) -> Tuple[csr_matrix, List[str]
         layer = self._resolve_layer_name(fragment_type)
         matrix = self.layers[layer]
 
-        return matrix, self._gen_column_names(fragment_type)
+        return matrix, self._gen_column_names(fragment_type)  # , set(self.obs["FRAGMENTATION"]))
 
     def write_as_hdf5(self, output_file: Union[str, Path]):
         """
@@ -291,14 +297,46 @@ def convert_to_df(self) -> pd.DataFrame:
 
         if "mz" in list(self.layers):
             mz_cols = pd.DataFrame(self.get_matrix(FragmentType.MZ)[0].toarray())
-            mz_cols.columns = self._gen_column_names(FragmentType.MZ)
+            mz_cols.columns = self._gen_column_names(FragmentType.MZ)  # , set(self.obs["FRAGMENTATION"]))
             df_merged = pd.concat([df_merged, mz_cols], axis=1)
         if "raw_int" in list(self.layers):
             raw_cols = pd.DataFrame(self.get_matrix(FragmentType.RAW)[0].toarray())
-            raw_cols.columns = self._gen_column_names(FragmentType.RAW)
+            raw_cols.columns = self._gen_column_names(FragmentType.RAW)  # , set(self.obs["FRAGMENTATION"]))
             df_merged = pd.concat([df_merged, raw_cols], axis=1)
         if "pred_int" in list(self.layers):
             pred_cols = pd.DataFrame(self.get_matrix(FragmentType.PRED)[0].toarray())
-            pred_cols.columns = self._gen_column_names(FragmentType.PRED)
+            pred_cols.columns = self._gen_column_names(FragmentType.PRED)  # , set(self.obs["FRAGMENTATION"]))
             df_merged = pd.concat([df_merged, pred_cols], axis=1)
         return df_merged
+
+    def assemble_df_for_parquet(self, include_intensities: bool = False) -> pd.DataFrame:
+        """
+        Returns a Pandas dataframe that can be serialized to Parquet for building a DLomix dataset.
+
+        :param include_intensities: Whether to include raw intensity values (i.e. labels required for training a model,
+            but not for inference)
+
+        :return: Pandas DataFrame with column names matching to those required for DLomix datasets
+        """
+        frag_dict = {
+            "CID": 1,
+            "HCD": 2,
+            "electron transfer dissociation": 3,
+            "ETD": 3,
+        }  # TODO get frag dict from constants in spectrum fundamentals
+
+        ready_to_parquet = pd.DataFrame()
+        ready_to_parquet["modified_sequence"] = self.obs["MODIFIED_SEQUENCE"]
+        ready_to_parquet["precursor_charge_onehot"] = list(
+            np.eye(6, dtype=int)[self.obs["PRECURSOR_CHARGE"].to_numpy() - 1]
+        )
+        ready_to_parquet["collision_energy_aligned_normed"] = 35
+        ready_to_parquet["method_nbr"] = self.obs["FRAGMENTATION"].apply(lambda x: frag_dict[x])
+
+        if include_intensities:
+            raw_int = self.layers["raw_int"].toarray()
+            raw_int[raw_int == 0] = -1
+            raw_int[raw_int == c.EPSILON] = 0
+            ready_to_parquet["intensities_raw"] = list(raw_int)
+
+        return ready_to_parquet
diff --git a/oktoberfest/predict/__init__.py b/oktoberfest/predict/__init__.py
@@ -1,4 +1,5 @@
 """Init predict."""
 
-from .koina import Koina
-from .predict import *
+from .predictor import Predictor
+
+__all__ = ["Predictor"]
diff --git a/oktoberfest/predict/alignment.py b/oktoberfest/predict/alignment.py
@@ -0,0 +1,71 @@
+import logging
+from typing import Tuple
+
+import anndata
+import numpy as np
+from spectrum_fundamentals.fragments import retrieve_ion_types
+from spectrum_fundamentals.metrics.similarity import SimilarityMetrics
+
+from ..data.spectra import FragmentType, Spectra
+
+logger = logging.getLogger(__name__)
+
+
+def _prepare_alignment_df(library: Spectra, ce_range: Tuple[int, int], group_by_charge: bool = False) -> Spectra:
+    """
+    Prepare an alignment DataFrame from the given Spectra library.
+
+    This function creates an alignment DataFrame by keeping only target (non-decoy), HCD-fragmented
+    spectra from the input library, selecting the top 1000 highest-scoring spectra per RAW file, and
+    repeating the DataFrame for each collision energy (CE) in the given range.
+
+    :param library: the library to be propagated
+    :param ce_range: the min and max CE to be propagated for alignment in the dataframe
+    :param group_by_charge: if true, select the top 1000 spectra independently for each precursor charge
+    :return: a library that is modified according to the description above
+    """
+    top_n = 1000
+    hcd_targets = library.obs.query("(FRAGMENTATION == 'HCD') & ~REVERSE")
+    hcd_targets = hcd_targets.sort_values(by="SCORE", ascending=False)
+
+    # A GroupBy object has no .groupby() method, so collect all grouping keys before grouping once.
+    group_keys = ["RAW_FILE", "PRECURSOR_CHARGE"] if group_by_charge else ["RAW_FILE"]
+    top_hcd_targets = hcd_targets.groupby(group_keys).head(top_n)
+
+    alignment_library = library[top_hcd_targets.index]
+    alignment_library = Spectra(
+        anndata.concat([alignment_library for _ in range(*ce_range)], index_unique="_", keys=range(*ce_range))
+    )
+    alignment_library.var = library.var
+    alignment_library.obs.reset_index(inplace=True)
+
+    alignment_library.obs["ORIG_COLLISION_ENERGY"] = alignment_library.obs["COLLISION_ENERGY"]
+    alignment_library.obs["COLLISION_ENERGY"] = np.repeat(range(*ce_range), len(top_hcd_targets))
+
+    alignment_library.uns["ion_types"] = np.array(
+        list(
+            {
+                ion_type
+                for fragmentation_method in library.obs["FRAGMENTATION"].unique()
+                for ion_type in retrieve_ion_types(fragmentation_method)
+            }
+        ),
+        dtype=object,
+    )
+
+    return alignment_library
+
+
+def _alignment(alignment_library: Spectra):
+    """
+    Perform the alignment of predicted versus raw intensities.
+
+    The function calculates the spectral angle between predicted and observed fragment intensities and
+    adds it as a column to the alignment library.
+
+    :param alignment_library: the library to perform the alignment on
+    """
+    pred_intensity = alignment_library.get_matrix(FragmentType.PRED)[0]
+    raw_intensity = alignment_library.get_matrix(FragmentType.RAW)[0]
+    sm = SimilarityMetrics(pred_intensity, raw_intensity)
+    alignment_library.add_column(sm.spectral_angle(raw_intensity, pred_intensity, 0), "SPECTRAL_ANGLE")