Feature: DLomix integration #250

Merged
merged 172 commits into from
Sep 17, 2024
Changes from 11 commits
Commits
afac1e9
Store working state to squash later
JSchlensok Jun 20, 2024
9958289
feat: local intensity prediction using DLomix
JSchlensok Jul 8, 2024
f5f044b
Merge remote-tracking branch 'origin/development' into feature/dlomix…
JSchlensok Jul 8, 2024
0f9c8fe
chore: fix typo in docstrings
JSchlensok Jul 8, 2024
ceb2ed9
chore: Clean up a bit
JSchlensok Jul 8, 2024
2f5ce34
fix: add missing annotation array in DLomix.predict
JSchlensok Jul 8, 2024
0b08966
feat: parametrize DLomix inference batch size
JSchlensok Jul 8, 2024
cf4aae8
chore: formatting
JSchlensok Jul 8, 2024
c40d7f6
feat: parametrize DLomix inference batch size
JSchlensok Jul 8, 2024
67d4200
feat: Implement local intensity prediction via DLomix
JSchlensok Jul 26, 2024
689268a
Merge remote-tracking branch 'origin/development' into feature/dlomix…
JSchlensok Jul 26, 2024
e487c3c
chore(pre-commit): keep runtime typing
JSchlensok Aug 5, 2024
68ab9ca
chore: update dependencies
JSchlensok Aug 5, 2024
5a8f3fd
feat(data): additional preprocessing of spectra
JSchlensok Aug 5, 2024
9d51ff5
chore: more consistent spelling & formatting
JSchlensok Aug 5, 2024
deec9cd
feat: refinement learning of intensity predictor
JSchlensok Aug 5, 2024
4e90290
chore: expose DLomix & Koina interfaces through public prediction API
JSchlensok Aug 5, 2024
613238e
chore: Some housekeeping for mypy & flake8
JSchlensok Aug 5, 2024
49f2d0d
chore: encrypt Koina connection by default
JSchlensok Aug 5, 2024
264fe3d
refactor(config): tidy up config validation
JSchlensok Aug 5, 2024
f0fda1a
feat(config): validate config for local prediction/refinement learning
JSchlensok Aug 5, 2024
0f35f6f
chore: add autosectionlabels to sphinx
JSchlensok Aug 6, 2024
ba720e2
docs: Include local prediction/refinement learning
JSchlensok Aug 6, 2024
db5b149
refactor(tests): Enforce test suite naming convention
JSchlensok Aug 6, 2024
e085829
test(config): Add tests for verification of optional dependencies
JSchlensok Aug 6, 2024
be4ba6d
test(predict): Add stub test cases for local prediction & refinement …
JSchlensok Aug 6, 2024
5d22df4
Merge remote-tracking branch 'origin/development' into feature/dlomix…
JSchlensok Aug 6, 2024
dcec9de
chore: Resolve spectrum_io-side TODO
JSchlensok Aug 6, 2024
00df47a
chore: add ProcessStep for refinement learning
JSchlensok Aug 6, 2024
10ef46f
fix(ML dataset processing): column name handling
JSchlensok Aug 6, 2024
3b53a62
fix(DLomix interface): correctly handle model path
JSchlensok Aug 6, 2024
bddb31e
perf(DLomix): reduce batch size to not blow up GPU
JSchlensok Aug 6, 2024
12f2eee
fix(DLomix): Pin DLomix dependency
JSchlensok Aug 6, 2024
66f4177
fix(DLomix data preprocessing): ensure column name consistency
JSchlensok Aug 6, 2024
4258ab9
fix(config): actually check config
JSchlensok Aug 6, 2024
b1250b7
chore: remove outdated TODO
JSchlensok Aug 6, 2024
b697be8
refactor(config): handle baseline model download more gracefully
JSchlensok Aug 6, 2024
46efe67
build: Specify DLomix as extra instead of optional group
JSchlensok Aug 7, 2024
c6e5f3e
fix: Manually install DLomix in Nox session
JSchlensok Aug 7, 2024
8f03d64
refactor(test): use proper tempfile for garbage config
JSchlensok Aug 7, 2024
8697ca2
Merge remote-tracking branch 'origin/development' into feature/dlomix…
JSchlensok Aug 7, 2024
a5f112a
fix: double dependency from merge mess-up
JSchlensok Aug 7, 2024
1592421
feat(dlomix): separate data & logging directories
JSchlensok Aug 7, 2024
cc85cfd
fix: ETD fragmentation encoding not yet in spectrum_fundamentals
JSchlensok Aug 7, 2024
bf74a34
style: typo
JSchlensok Aug 7, 2024
d1a8e9f
style(data): return type annotations for inplace methods
JSchlensok Aug 7, 2024
57e53e6
fix(data): replace non-abbreviated fragmentation method names
JSchlensok Aug 7, 2024
6dc7323
feat: completely mute TensorFlow output on import
JSchlensok Aug 7, 2024
2a3c086
fix(tests): rename broken test
JSchlensok Aug 7, 2024
09d7298
fix(tests): add missing import
JSchlensok Aug 7, 2024
b8380ee
refactor(tests): switch to class-based fixtures
JSchlensok Aug 7, 2024
20afd9a
fix(tests): manually remove DLomix for optional dependency tests
JSchlensok Aug 7, 2024
58a071a
test: remove obsolete WandB dependency test
JSchlensok Aug 7, 2024
2220d7f
refactor(config): remove obsolete check for WandB installation
JSchlensok Aug 7, 2024
875c649
refactor(dlomix): enforce consistent path of downloaded baseline model
JSchlensok Aug 7, 2024
6b017b1
feat(dlomix): infer model type from name
JSchlensok Aug 7, 2024
b11afd8
style: pre-commit
JSchlensok Aug 7, 2024
eaf3bb3
refactor(runner): remove degenerate kwargs dict
JSchlensok Aug 7, 2024
c27a2f9
refactor(dlomix): streamline local model checking
JSchlensok Aug 7, 2024
27220c6
refactor(predictor): correct type annotations
JSchlensok Aug 7, 2024
007e0cd
refactor(predictor): remove unused prediction method
JSchlensok Aug 7, 2024
433fe84
docs: add missing return type
JSchlensok Aug 7, 2024
98b716f
refactor(data): remove unused return value
JSchlensok Aug 7, 2024
f1f29b6
fix: typos
JSchlensok Aug 7, 2024
ccaea78
docs: typo
JSchlensok Aug 7, 2024
b414f14
fix: more typos
JSchlensok Aug 7, 2024
e82901f
fix: typos galore (need coffee)
JSchlensok Aug 7, 2024
82f71a7
chore: remove dangling TODO
JSchlensok Aug 7, 2024
1f6ccd0
fix: properly pass kwargs to dlomix/koina
JSchlensok Aug 7, 2024
5c0860a
fix: typo
JSchlensok Aug 7, 2024
f7ee4b8
fix: typo
JSchlensok Aug 7, 2024
a2a4d27
chore: Update to spectrum_fundamentals 0.6.1
JSchlensok Aug 8, 2024
4a77a53
chore: more robust type checking
JSchlensok Aug 8, 2024
bb74487
refactor(data): straighten out _gen_vars_df
JSchlensok Aug 8, 2024
82a0d1d
chore (dlomix): Depend on spectrum_fundamentals.constants instead of …
JSchlensok Aug 8, 2024
76c3b90
chore: exclude TYPE_CHECKING blocks from coverage statistics
JSchlensok Aug 8, 2024
b75faa5
feat(dlomix): Improve fragment ion annotation handling
JSchlensok Aug 8, 2024
118cdfa
chore: manually install unreleased spectrum-fundamentals for testing …
JSchlensok Aug 8, 2024
490861d
chore: update packages, fix requirements for pip-based install
JSchlensok Aug 9, 2024
c574422
fix(data): sanitize fragmentation method keys in preprocessing instea…
JSchlensok Aug 9, 2024
e82fc95
fix(data): infer dtype correctly when generating var_df
JSchlensok Aug 9, 2024
07c927d
fix(data): properly handle intensity dataframe to nested array conver…
JSchlensok Aug 9, 2024
80a9d40
tests(spectra): fix broken spectra tests
JSchlensok Aug 9, 2024
6c4be71
fix(preprocessing): typo
JSchlensok Aug 9, 2024
74a7585
fixed custom mods tokens
Aug 9, 2024
dc611a5
chore: update requirements to support z● ions
JSchlensok Aug 9, 2024
761e456
Merge remote-tracking branch 'origin/feature/dlomix-integration' into…
JSchlensok Aug 9, 2024
c4d71ef
style: formatting
JSchlensok Aug 9, 2024
0936121
feat(data): support z● ions
JSchlensok Aug 9, 2024
f6f1be6
chore: remove outdated parameters
JSchlensok Aug 9, 2024
3b5d28f
fix(preprocessing): standardize fragmentation method names in all run…
JSchlensok Aug 9, 2024
59c1fed
fix(dlomix): properly tile annotations of prediction
JSchlensok Aug 9, 2024
1c41922
Include new DLomix changes
JSchlensok Aug 9, 2024
0b6ac74
fix(dlomix): include z● ions in ion type ordering
JSchlensok Aug 9, 2024
9c949e9
chore: switch to revised spectrum_fundamentals fragment ion annotations
JSchlensok Aug 10, 2024
fa46da3
feat(dlomix): parametrize improve_further
JSchlensok Aug 12, 2024
12f5fc7
style: spelling
JSchlensok Aug 12, 2024
b51d0c5
refactor(dlomix): move standard out muting to utils
JSchlensok Aug 12, 2024
e0c6ce7
style(dlomix): spelling
JSchlensok Aug 12, 2024
6092c38
refactor(dlomix): cleanup
JSchlensok Aug 12, 2024
7872ea5
chore: switch to spectrum_fundamentals dev branch
JSchlensok Aug 12, 2024
79c632a
fix(dlomix): typo
JSchlensok Aug 12, 2024
1a4e1b9
style(dlomix): clearer variable naming
JSchlensok Aug 12, 2024
506591d
style(alignment): fix misleading docstring
JSchlensok Aug 12, 2024
5d0ed8b
fix(alignment): handle spectral libraries with <1000 matching spectra…
JSchlensok Aug 12, 2024
237f84c
fix(preprocessing): remove redundant ion type annotation
JSchlensok Aug 12, 2024
ea04152
fix(spectra): outdated constant reference
JSchlensok Aug 12, 2024
feb6a5b
style: pre-commit
JSchlensok Aug 12, 2024
ce635bd
fix(noxfile): Install correct spectrum_fundamentals branch for testing
JSchlensok Aug 12, 2024
a011be2
fix: typos
JSchlensok Aug 13, 2024
6b2cf63
refactor(alignment): adopt alignment df with <1000 spectra from https…
JSchlensok Aug 13, 2024
0e7158d
fix: iron out inconsistencies
JSchlensok Aug 13, 2024
399b2eb
Merge branch 'development' into feature/dlomix-integration
picciama Aug 13, 2024
16f8c03
updated fundamentals dep
picciama Aug 13, 2024
e9826db
fix(dlomix): Column name case in refinement training dataset
JSchlensok Aug 13, 2024
13eef9f
docs(dlomix): update config parameters
JSchlensok Aug 13, 2024
3ac452a
docs: fix config table indentation
JSchlensok Aug 13, 2024
26bf5e8
chore: remove dangling TODO
JSchlensok Aug 13, 2024
ffd0696
tests: install git dependencies for typeguard session
JSchlensok Aug 13, 2024
a41cae6
style: formatting
JSchlensok Aug 13, 2024
b17898b
tests(data): add tests for additional ion types
JSchlensok Aug 13, 2024
1fa5e6b
tests: add unfinished tests for alignment & prediction
JSchlensok Aug 13, 2024
c569834
Merge remote-tracking branch 'origin/feature/dlomix-integration' into…
JSchlensok Aug 13, 2024
bd7bdcd
style: formatting
JSchlensok Aug 13, 2024
8333b0b
chore: remove outdated spectrum_fundamentals git dependency
JSchlensok Aug 13, 2024
24f1aac
added koinapy and extend superclass
picciama Aug 16, 2024
1d84260
Merge remote-tracking branch 'origin/development' into feature/dlomix…
JSchlensok Aug 20, 2024
a75979e
feat(dlomix): include original & modified sequence in refinement dataset
JSchlensok Aug 23, 2024
d46901a
feat(dlomix): skip CE calibration when refinement learning
JSchlensok Aug 23, 2024
440e261
chore: update dependencies
JSchlensok Aug 23, 2024
9d796b0
Revert "feat(dlomix): include original & modified sequence in refinem…
JSchlensok Aug 23, 2024
525fdbf
feat(dlomix): pass raw modified sequence to DLomix for downstream ana…
JSchlensok Aug 23, 2024
09e2462
fix(dlomix): Keep decoys in inference data
JSchlensok Aug 24, 2024
4c93b94
fix(predict): don't predict iRT for citrullination
JSchlensok Aug 27, 2024
2b13564
chore: set spectrum-io dependency to hotfix
JSchlensok Aug 27, 2024
b6f149b
fix: plot_pred_rt_vs_irt failed when having a perfect prediction (art…
juli-p Aug 29, 2024
26d2d2b
style: generalize search engine score threshold variable naming
JSchlensok Sep 11, 2024
aeebef2
chore: upgrade from 3.8 to 3.9 in overlooked spots
JSchlensok Sep 11, 2024
3aef3e6
style: remove dangling commented-out code
JSchlensok Sep 11, 2024
659b488
style: correct grammar
JSchlensok Sep 11, 2024
b9e2ffc
docs: clear up phrasing
JSchlensok Sep 11, 2024
7621db2
docs: explicitly add default parameters to rescoring config example
JSchlensok Sep 11, 2024
8251b43
chore: formatting
JSchlensok Sep 11, 2024
0364c84
refactor: separate batch size between speclib generation & DLomix inf…
JSchlensok Sep 11, 2024
8fb2b0e
fix: typo
JSchlensok Sep 11, 2024
c837c12
tests: remove non-existing model path
JSchlensok Sep 11, 2024
c2aeff3
feat(spectra): implement duplicate filtering
JSchlensok Sep 11, 2024
563b1fc
refactor: remove unnecessary stardardization of fragmentation method …
JSchlensok Sep 11, 2024
814a5fc
fix(predict): make predictor implementations take arbitrary kwargs
JSchlensok Sep 11, 2024
b8ecf3f
fix(predict): only import dlomix module if dlomix is installed
JSchlensok Sep 11, 2024
ebdb12f
chore: fix pre-commit complaints
JSchlensok Sep 11, 2024
76ff7f9
chore: pyupgrade 3.8->3.9
JSchlensok Sep 11, 2024
1d87b0f
Merge branch 'development' into feature/dlomix-integration
picciama Sep 12, 2024
8790992
Merge branch 'feature/dlomix-integration' into chore/switch_to_koinapy
picciama Sep 12, 2024
746eeb4
fixed shape issue when transforming to dict
picciama Sep 12, 2024
e0b4a87
fix: remove obsolete kwarg for Koina
JSchlensok Sep 13, 2024
20af248
fix(dlomix): clean up arbitrary kwarg passing to predictor interface …
JSchlensok Sep 13, 2024
b655dd4
Merge pull request #257 from wilhelm-lab/chore/switch_to_koinapy
JSchlensok Sep 13, 2024
8585705
refactor(dlomix): generate zero iRT predictions through predictor int…
JSchlensok Sep 13, 2024
eb15144
style: ignore complexity score of methods in runner
JSchlensok Sep 13, 2024
09e498a
style: formatting
JSchlensok Sep 13, 2024
0320f10
style: pre-commit
JSchlensok Sep 13, 2024
9222673
tests: comment out unfinished tests
JSchlensok Sep 13, 2024
9484cbe
tests: fix data type
JSchlensok Sep 13, 2024
151f657
tests: fix method call for alphapept
JSchlensok Sep 13, 2024
1af6a94
style: formatting
JSchlensok Sep 13, 2024
4656588
dix speclib: don't pickle global predictor object
picciama Sep 13, 2024
21f4b6a
don't create explicit cast copy
picciama Sep 13, 2024
8d3d6d1
fix xdoctest
picciama Sep 13, 2024
9c24d1c
fix typeguard: use df instead of Spectra object
picciama Sep 13, 2024
3490afa
fix xdoctest
picciama Sep 13, 2024
9be535c
readded Spectra instead of df + added dlomix check
picciama Sep 13, 2024
26 changes: 16 additions & 10 deletions docs/API.rst
@@ -62,28 +62,34 @@ Predicting: :code:`pr`

.. currentmodule:: oktoberfest

Access to functions that communicate with a Koina server to retrieve predictions from various prediction models.
Access to functions that communicate with a Koina server to retrieve predictions from various prediction models, or serve pre-trained TensorFlow models locally.

High level features
~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/pr

pr.predict_intensities
pr.predict_rt
pr.ce_calibration
.. autoclass:: pr.predictor.Predictor
:members: from_config, predict_intensities, predict_irt, ce_calibration

Koina interface
~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/pr

pr.predict
pr.predict_at_once
pr.predict_in_chunks
pr.Predictor.predict
pr.Predictor.predict_at_once
pr.Predictor.predict_in_chunks
pr.koina.Koina

DLomix interface
~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: api/pr

pr.Predictor.predict
pr.Predictor.predict_at_once
pr.dlomix.DLomix

Rescoring: :code:`re`
---------------------
16 changes: 14 additions & 2 deletions docs/config.rst
@@ -1,7 +1,7 @@
Configuration
=============

The following provides an overview of all available flags in the configuration file to use the high level API and run jobs. Parameters may be applicable to more than one job type and are collected within indivdual tables.
The following provides an overview of all available flags in the configuration file to use the high-level API and run jobs. Parameters may be applicable to more than one job type and are collected within individual tables.

Always applicable
-----------------
@@ -18,7 +18,7 @@ Always applicable
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| models | Contains information about the used models for peptide property prediction (see following 2 nested parameters) |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| intensity | Name of the model used for fragment intensity prediction |
| intensity | Name or path of the model used for fragment intensity prediction |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| irt | Name of the model used for indexed retention time prediction |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -140,3 +140,15 @@ Applicable to in-silico digestion
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| db | Defines whether the digestion should contain only targets, only decoys or both (concatenated); can be "target", "decoy" or "concat"; default = "concat" |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Applicable to local prediction and transfer learning
----------------------------------------------------

.. table::
:class: fixed-table local-prediction-config-table

+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Parameter | Description |
+============================+====================================================================================================================================================================+
| predictIntensityLocally | Defines whether an off-line model should be used for predicting intensity; can be True or False; default = False |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
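
As a rough illustration of how the new flag combines with the `models` section described above, the following sketch builds a configuration fragment and round-trips it through JSON. It assumes the configuration file is JSON; the model path and the iRT model name are hypothetical placeholders, not values taken from this pull request.

```python
import json

# Only the keys documented in the tables above are used here.
config = {
    "models": {
        # "Name or path of the model": a local path is assumed for local prediction
        "intensity": "path/to/local_intensity_model",  # hypothetical path
        "irt": "example_irt_model",  # hypothetical model name
    },
    "predictIntensityLocally": True,  # documented default is False
}

config_text = json.dumps(config, indent=4)
print(config_text)

# Reading the fragment back yields the same values.
loaded = json.loads(config_text)
print(loaded["predictIntensityLocally"])
```

Leaving `predictIntensityLocally` out keeps the documented default of False.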
2 changes: 1 addition & 1 deletion docs/installation.rst
@@ -17,7 +17,7 @@ The installer script automatically installs dependencies and creates a new conda
wget https://raw.githubusercontent.com/wilhelm-lab/oktoberfest/main/installer.sh -O install_oktoberfest.sh
bash install_oktoberfest.sh

The installer searches for existing anaconda / miniconda installation. If none was found, it will download and install miniconda.
The installer searches for an existing anaconda / miniconda installation. If none is found, it will download and install miniconda.

Docker Image
------------
4 changes: 3 additions & 1 deletion docs/outputs.rst
@@ -12,7 +12,9 @@ General directory structure
+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Directory | Description |
+=======================+===============================================================================================================================================================================================================================================================================================================================================================================================================================+
| data/ | Contains hdf5 files that combine search results, annotated spectra, metadata in <spectra_file>.mzML.hdf5 and predictions in <spectra_file>.mzML.pred.hdf5 where <spectra_file> is replaced with the specific name of the RAW file for which information is stored. The files are updated and store the progress of the current job and enable skipping specific steps when rerunning a job. |
| data/ | Contains spectra processed for usage in machine learning applications with DLomix: preprocessed datasets in Parquet format, as well as lists of ion types and modifications in them in plain text format. |
+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data/dlomix | Contains hdf5 files that combine search results, annotated spectra, metadata in <spectra_file>.mzML.hdf5 and predictions in <spectra_file>.mzML.pred.hdf5 where <spectra_file> is replaced with the specific name of the RAW file for which information is stored. The files are updated and store the progress of the current job and enable skipping specific steps when rerunning a job. |
+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| msms/ | Contains combined msms.prosit and separated search results in <spectra_file>.prosit where <spectra_file> is replaced with the spectra file name for which search results are stored. The files are stored in the `internal format <./internal_format.html>`_ and are created as part of preprocessing search results from supported search engines. If a file is present, preprocessing is skipped when rerunning a job. |
+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
56 changes: 47 additions & 9 deletions oktoberfest/data/spectra.py
@@ -34,19 +34,25 @@ class Spectra(anndata.AnnData):
INTENSITY_LAYER_NAME = "raw_int"
MZ_LAYER_NAME = "mz"
COLUMNS_FRAGMENT_ION = ["Y1+", "Y1++", "Y1+++", "B1+", "B1++", "B1+++"]
MAX_CHARGE = 3

@staticmethod
def _gen_vars_df() -> pd.DataFrame:
def _gen_vars_df(specified_ion_types: Optional[List[str]] = None) -> pd.DataFrame:
"""
Creates Annotation dataframe for vars in AnnData object.

:param specified_ion_types: ion types that are expected to be in the spectra. If None, defaults to ["y", "b"]
:return: pd.DataFrame of fragment annotations
"""
ion_nums = np.repeat(np.arange(1, 30), 6)
ion_charge = np.tile([1, 2, 3], 29 * 2)
if not specified_ion_types:
specified_ion_types = ["y", "b"]

number_of_ion_types = len(specified_ion_types)
ion_nums = np.repeat(np.arange(1, 30), 3 * number_of_ion_types)
ion_charge = np.tile([1, 2, 3], 29 * number_of_ion_types)
temp_cols = []
for size in range(1, 30):
for typ in ["y", "b"]:
for typ in specified_ion_types:
for charge in ["+1", "+2", "+3"]:
temp_cols.append(f"{typ}{size}{charge}")
ion_types = [frag[0] for frag in temp_cols]
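
To make the effect of the new `specified_ion_types` parameter concrete, here is a standalone sketch of the column ordering that the loop above builds, written without AnnData, NumPy or pandas. The function name `fragment_annotations` and its `max_size` and `max_charge` parameters are hypothetical; the defaults (`y` and `b` ions, fragment sizes 1 to 29, charges 1 to 3) are taken from the diff.

```python
from typing import List, Optional, Tuple


def fragment_annotations(
    specified_ion_types: Optional[List[str]] = None,
    max_size: int = 29,
    max_charge: int = 3,
) -> List[Tuple[str, str, int, int]]:
    """Return (name, ion_type, number, charge) tuples in the loop order used in the diff."""
    if not specified_ion_types:
        specified_ion_types = ["y", "b"]  # same fallback as in the diff

    rows = []
    for size in range(1, max_size + 1):  # fragment number is the outermost loop
        for ion_type in specified_ion_types:
            for charge in range(1, max_charge + 1):  # charge varies fastest
                rows.append((f"{ion_type}{size}+{charge}", ion_type, size, charge))
    return rows


default_rows = fragment_annotations()
print(len(default_rows))  # 29 sizes * 2 ion types * 3 charges = 174
print(default_rows[:4])
```

With this ordering each fragment number repeats `3 * len(specified_ion_types)` times in a row, and the charge pattern 1, 2, 3 repeats `29 * len(specified_ion_types)` times, which is what the `np.repeat` and `np.tile` calls in the diff express.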
@@ -55,7 +61,7 @@ def _gen_vars_df() -> pd.DataFrame:
return var_df

@staticmethod
def _gen_column_names(fragment_type: FragmentType) -> List[str]:
def _gen_column_names(fragment_type: FragmentType): # , fragmentation_methods: Set[str]) -> List[str]:
"""
Get column names of the spectra data.

@@ -260,7 +266,7 @@ def get_matrix(self, fragment_type: FragmentType) -> Tuple[csr_matrix, List[str]
layer = self._resolve_layer_name(fragment_type)
matrix = self.layers[layer]

return matrix, self._gen_column_names(fragment_type)
return matrix, self._gen_column_names(fragment_type) # , set(self.obs["FRAGMENTATION"]))

def write_as_hdf5(self, output_file: Union[str, Path]):
"""
@@ -291,14 +297,46 @@ def convert_to_df(self) -> pd.DataFrame:

if "mz" in list(self.layers):
mz_cols = pd.DataFrame(self.get_matrix(FragmentType.MZ)[0].toarray())
mz_cols.columns = self._gen_column_names(FragmentType.MZ)
mz_cols.columns = self._gen_column_names(FragmentType.MZ) # , set(self.obs["FRAGMENTATION"]))
df_merged = pd.concat([df_merged, mz_cols], axis=1)
if "raw_int" in list(self.layers):
raw_cols = pd.DataFrame(self.get_matrix(FragmentType.RAW)[0].toarray())
raw_cols.columns = self._gen_column_names(FragmentType.RAW)
raw_cols.columns = self._gen_column_names(FragmentType.RAW) # , set(self.obs["FRAGMENTATION"]))
df_merged = pd.concat([df_merged, raw_cols], axis=1)
if "pred_int" in list(self.layers):
pred_cols = pd.DataFrame(self.get_matrix(FragmentType.PRED)[0].toarray())
pred_cols.columns = self._gen_column_names(FragmentType.PRED)
pred_cols.columns = self._gen_column_names(FragmentType.PRED) # , set(self.obs["FRAGMENTATION"]))
df_merged = pd.concat([df_merged, pred_cols], axis=1)
return df_merged

def assemble_df_for_parquet(self, include_intensities: bool = False) -> pd.DataFrame:
"""
Returns a Pandas dataframe that can be serialized to Parquet for building a DLomix dataset.

:param include_intensities: Whether to include raw intensity values (i.e. labels required for training a model,
but not for inference)

:return: Pandas DataFrame with column names matching to those required for DLomix datasets
"""
frag_dict = {
"CID": 1,
"HCD": 2,
"electron transfer dissociation": 3,
"ETD": 3,
} # TODO get frag dict from constants in spectrum fundamentals

ready_to_parquet = pd.DataFrame()
ready_to_parquet["modified_sequence"] = self.obs["MODIFIED_SEQUENCE"]
ready_to_parquet["precursor_charge_onehot"] = list(
np.eye(6, dtype=int)[self.obs["PRECURSOR_CHARGE"].to_numpy() - 1]
)
ready_to_parquet["collision_energy_aligned_normed"] = 35
ready_to_parquet["method_nbr"] = self.obs["FRAGMENTATION"].apply(lambda x: frag_dict[x])

if include_intensities:
raw_int = self.layers["raw_int"].toarray()
raw_int[raw_int == 0] = -1
raw_int[raw_int == c.EPSILON] = 0
ready_to_parquet["intensities_raw"] = list(raw_int)

return ready_to_parquet
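
The two encodings used in `assemble_df_for_parquet` can be sketched without NumPy as follows. This is a dependency-free illustration, not the project's code: `EPSILON` is a stand-in for `c.EPSILON` with an assumed value, and the helper names `one_hot_charge` and `relabel_intensities` are hypothetical.

```python
from typing import List

EPSILON = 1e-7  # assumption: stand-in for c.EPSILON; the real value lives in the project's constants
NUM_CHARGE_STATES = 6  # matches np.eye(6, ...) in the diff


def one_hot_charge(charge: int) -> List[int]:
    """One-hot encode a precursor charge in 1..6, like np.eye(6, dtype=int)[charge - 1]."""
    if not 1 <= charge <= NUM_CHARGE_STATES:
        raise ValueError(f"precursor charge {charge} outside 1..{NUM_CHARGE_STATES}")
    return [1 if idx == charge - 1 else 0 for idx in range(NUM_CHARGE_STATES)]


def relabel_intensities(row: List[float]) -> List[float]:
    """Map stored zeros to -1 and the EPSILON placeholder to 0; keep all other values."""
    relabeled = []
    for value in row:
        if value == 0:
            relabeled.append(-1.0)
        elif value == EPSILON:
            relabeled.append(0.0)
        else:
            relabeled.append(value)
    return relabeled


print(one_hot_charge(2))
print(relabel_intensities([0.0, EPSILON, 0.5]))
```

One difference from the NumPy indexing in the diff: this sketch rejects charges outside 1 to 6, whereas `np.eye(6)[charge - 1]` raises only for charges above 6 and silently wraps around for a charge of 0.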
5 changes: 3 additions & 2 deletions oktoberfest/predict/__init__.py
@@ -1,4 +1,5 @@
"""Init predict."""

from .koina import Koina
from .predict import *
from .predictor import Predictor

__all__ = ["Predictor"]
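
Several commits in this pull request deal with DLomix being an optional dependency (for example "fix(predict): only import dlomix module if dlomix is installed"). A common way to guard such an import is sketched below using only the standard library. The helper name `is_installed` is hypothetical, and the check the project actually performs may look different.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("dlomix"):
    print("dlomix found: local prediction can be offered")
else:
    print("dlomix not found: only server-based prediction is available")
```

Checking with `find_spec` avoids paying the import cost of a heavy dependency (and any output it prints on import) when it is only needed for one code path.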
71 changes: 71 additions & 0 deletions oktoberfest/predict/alignment.py
@@ -0,0 +1,71 @@
import logging
from typing import Tuple

import anndata
import numpy as np
from spectrum_fundamentals.fragments import retrieve_ion_types
from spectrum_fundamentals.metrics.similarity import SimilarityMetrics

from ..data.spectra import FragmentType, Spectra

logger = logging.getLogger(__name__)


def _prepare_alignment_df(library: Spectra, ce_range: Tuple[int, int], group_by_charge: bool = False) -> Spectra:
"""
Prepare an alignment DataFrame from the given Spectra library.

This function creates an alignment DataFrame by removing decoy and non-HCD fragmented spectra
from the input library, selecting the top 1000 highest-scoring spectra per RAW file, and repeating the
DataFrame for each collision energy (CE) in the given range.

:param library: the library to be propagated
:param ce_range: the min and max CE to be propagated for alignment in the dataframe
:param group_by_charge: if true, select the top 1000 spectra independently for each precursor charge
:return: a library that is modified according to the description above
"""
top_n = 1000
hcd_targets = library.obs.query("(FRAGMENTATION == 'HCD') & ~REVERSE")
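# a GroupBy object cannot be grouped a second time, so all grouping keys are passed at once
group_columns = ["RAW_FILE", "PRECURSOR_CHARGE"] if group_by_charge else ["RAW_FILE"]
hcd_targets = hcd_targets.sort_values(by="SCORE", ascending=False).groupby(group_columns)
top_hcd_targets = hcd_targets.head(top_n)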

alignment_library = library[top_hcd_targets.index]
alignment_library = Spectra(
anndata.concat([alignment_library for _ in range(*ce_range)], index_unique="_", keys=range(*ce_range))
)
alignment_library.var = library.var
alignment_library.obs.reset_index(inplace=True)

alignment_library.obs["ORIG_COLLISION_ENERGY"] = alignment_library.obs["COLLISION_ENERGY"]
alignment_library.obs["COLLISION_ENERGY"] = np.repeat(range(*ce_range), len(top_hcd_targets))

alignment_library.uns["ion_types"] = np.array(
list(
{
ion_type
for fragmentation_method in library.obs["FRAGMENTATION"].unique()
for ion_type in retrieve_ion_types(fragmentation_method)
}
),
dtype=object,
)

return alignment_library
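
The replication step in `_prepare_alignment_df` can be pictured with plain Python lists. The sketch below pairs every selected spectrum with every CE in `range(*ce_range)`, one CE block after another, which is the row order that stacking one copy of the library per CE produces. The function name and the spectrum identifiers are hypothetical, and the `<id>_<ce>` label format is an assumption about how `index_unique="_"` combines the original index with the key.

```python
from typing import List, Tuple


def replicate_for_ce_range(spectrum_ids: List[str], ce_range: Tuple[int, int]) -> List[Tuple[str, int]]:
    """Pair each selected spectrum with each CE, grouped into one block per CE."""
    rows = []
    for ce in range(*ce_range):  # note: the upper bound is excluded
        for spectrum_id in spectrum_ids:
            rows.append((f"{spectrum_id}_{ce}", ce))
    return rows


rows = replicate_for_ce_range(["scan_1", "scan_2"], (18, 21))
for row in rows:
    print(row)
```

Each CE appears once per selected spectrum, so the repeat count equals `top_n` only when exactly that many spectra were selected. Because Python's `range` excludes its upper bound, a `ce_range` of `(18, 21)` covers the CEs 18, 19 and 20.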


def _alignment(alignment_library: Spectra):
"""
Perform the alignment of predicted versus raw intensities.

The function calculates the spectral angle between predicted and observed fragment intensities and
adds it as a column to the alignment library.

:param alignment_library: the library to perform the alignment on
"""
pred_intensity = alignment_library.get_matrix(FragmentType.PRED)[0]
raw_intensity = alignment_library.get_matrix(FragmentType.RAW)[0]
sm = SimilarityMetrics(pred_intensity, raw_intensity)
alignment_library.add_column(sm.spectral_angle(raw_intensity, pred_intensity, 0), "SPECTRAL_ANGLE")
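
For readers unfamiliar with the metric, the following is a minimal sketch of the normalized spectral angle between one observed and one predicted intensity vector, using the common definition 1 - 2 * arccos(cosine similarity) / pi. It is not a copy of `SimilarityMetrics.spectral_angle`: the library operates on whole matrices and may mask or normalize values differently, and returning 0 for an all-zero vector is an assumption made here.

```python
import math
from typing import Sequence


def spectral_angle(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Normalized spectral angle in [0, 1] for non-negative intensities; 1 means identical direction."""
    if len(observed) != len(predicted):
        raise ValueError("intensity vectors must have the same length")

    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_observed = math.sqrt(sum(o * o for o in observed))
    norm_predicted = math.sqrt(sum(p * p for p in predicted))
    if norm_observed == 0 or norm_predicted == 0:
        return 0.0  # assumption: an empty spectrum counts as no similarity

    # Clamp to guard against rounding slightly outside [-1, 1] before arccos.
    cosine = max(-1.0, min(1.0, dot / (norm_observed * norm_predicted)))
    return 1 - 2 * math.acos(cosine) / math.pi


print(spectral_angle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(spectral_angle([1.0, 0.0], [0.0, 1.0]))
```

During CE alignment, a value like this is computed per spectrum and per candidate CE, so that the CE giving the best agreement between prediction and observation can be picked afterwards.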