Merge pull request #109 from wilhelm-lab/feature/documentation_usage_principles

Changed hierarchy in usage principles
wilhelm-lab · Aug 11, 2023 · 31c331f
2 parents 3e69012 + 62765d4
commit 31c331f
Showing 11 changed files with 303 additions and 147 deletions.
diff --git a/README.rst b/README.rst
@@ -114,7 +114,7 @@ Create a `config.json` file which should contain the following flags:
 - `tag` = "tmt", "tmtpro", "itraq4" or "itraq8"; default is ""
 - `fdr_estimation_method` = method used for FDR estimation on PSM and peptide level: "percolator" or "mokapot"; default = "mokapot"
 - `allFeatures`` = True if all features should be used for FDR estimation; default = False
-- `regressionMethod` = regression method for curve fitting (mapping from predicted iRT values to experimental retention times): "lowess", "spline" or "logistic"; default = "lowess"
+- `regressionMethod` = regression method for curve fitting (mapping from predicted iRT values to experimental retention times): "lowess", "spline" or "logistic"; default = "spline"
 - `inputs`
    - `search_results` = path to the file containing the search results
    - `search_results_type` = the tool used to produce the search results, can be "Maxquant", "Msfragger", "Mascot" or "Internal"; default = "Maxquant"

diff --git a/ReadMe.md b/ReadMe.md
@@ -90,7 +90,7 @@ Create a `config.json` file which should contain the following flags:
 
 -   `allFeatures` = True if all features should be used for FDR estimation; default = False
 
--   `regressionMethod` = regression method for curve fitting (mapping from predicted iRT values to experimental retention times): "lowess", "spline" or "logistic"; default = "lowess"
+-   `regressionMethod` = regression method for curve fitting (mapping from predicted iRT values to experimental retention times): "lowess", "spline" or "logistic"; default = "spline"
 
 -   `inputs`
 

diff --git a/data/plasma/config.json b/data/plasma/config.json
@@ -14,8 +14,10 @@
     },
     "prediction_server": "koina.proteomicsdb.org:443",
     "ssl": true,
-    "thermoExe": "ThermoRawFileParser.exe",
     "output": "./out",
+    "thermoExe": "ThermoRawFileParser.exe",
+    "regressionMethod": "spline",
+    "fdr_estimation_method": "mokapot",
     "massTolerance": 20,
     "unitMassTolerance": "ppm"
 }
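The two flags added to `data/plasma/config.json` take their values from the short lists given in the README text of this change. A minimal sketch of checking those values before starting a job follows; `ALLOWED` and `check_config` are hypothetical names used for illustration only and are not part of Oktoberfest.

```python
import json

# Allowed values as listed in the README/docs text of this change.
ALLOWED = {
    "regressionMethod": {"lowess", "spline", "logistic"},
    "fdr_estimation_method": {"percolator", "mokapot"},
    "unitMassTolerance": {"da", "ppm"},
}


def check_config(config):
    """Return a list of problems; an empty list means the checked flags look fine."""
    problems = []
    for flag, allowed in ALLOWED.items():
        # Flags that are absent fall back to their documented defaults, so skip them.
        if flag in config and config[flag] not in allowed:
            problems.append(f"{flag}={config[flag]!r} is not one of {sorted(allowed)}")
    return problems


# The tail of the plasma config as it reads after this change.
config = json.loads("""
{
    "prediction_server": "koina.proteomicsdb.org:443",
    "ssl": true,
    "output": "./out",
    "thermoExe": "ThermoRawFileParser.exe",
    "regressionMethod": "spline",
    "fdr_estimation_method": "mokapot",
    "massTolerance": 20,
    "unitMassTolerance": "ppm"
}
""")

print(check_config(config))
```

The check only covers flags whose allowed values are spelled out in the docs; paths and model names would need separate handling.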
diff --git a/docs/config.rst b/docs/config.rst
@@ -0,0 +1,57 @@
+Configuration
+=============
+
+The following provides an overview of all available flags in the configuration file to use the high level API and run jobs.
+
+Mandatory flags
+---------------
+
+- `type` = "CollisionEnergyCalibration", "SpectralLibraryGeneration" or "Rescoring"
+- `tag` = "tmt", "tmtpro", "itraq4" or "itraq8"; default is ""
+- `models`
+   - `intensity` = intensity model
+   - `irt` = irt model
+- `prediction_server` = server for obtaining peptide property predictions
+- `ssl` = Use ssl when making requests to the prediction server, can be true or false; default = true
+- `output` = path to the output folder; if not provided the current working directory will be used.
+
+For spectral library generation and rescoring
+---------------------------------------------
+
+- `inputs`
+   - `search_results` = path to the file containing the search results
+   - `search_results_type` = the tool used to produce the search results, can be "Maxquant", "Msfragger", "Mascot" or "Internal"; default = "Maxquant"
+   - `spectra` = path to a folder or a single file containing mass spectrometry results (raw or mzml files)
+   - `spectra_type` = "raw" or "mzml"; default = "raw"
+- `numThreads` = number of raw files processed in parallel processes; default = 1
+- `thermoExe` = path to ThermoRawFileParser executable; default "ThermoRawFileParser.exe"
+- `massTolerance` = mass tolerance defining the allowed deviation between theoretical and experimentally observed fragment masses during peak filtering and annotation. Default depends on the mass analyzer: 20 (FTMS), 40 (TOF), 0.35 (ITMS)
+- `unitMassTolerance` = unit for the mass tolerance, either "da" or "ppm". Default is "da" for ITMS and "ppm" for FTMS or TOF
+
+For spectral library generation only
+------------------------------------
+
+- `inputs`
+   - `library_input` = path to the FASTA or peptides file
+   - `library_input_type` = library input type: "fasta" or "peptides"
+- `outputFormat` = "spectronaut" or "msp"
+
+For in-silico digestion (spectral library generation) only
+----------------------------------------------------------
+
+- `fastaDigestOptions`
+   - `fragmentation` = fragmentation method: "HCD" or "CID"
+   - `digestion` = digestion mode: "full", "semi" or None; default = "full"
+   - `missedCleavages` = number of allowed missed cleavages used in the search engine; default = 2
+   - `minLength` = minimum peptide length allowed used in the search engine; default = 7
+   - `maxLength` = maximum peptide length allowed used in the search engine; default = 60
+   - `enzyme` = type of enzyme used in the search engine; default = "trypsin"
+   - `specialAas` = special amino acids for decoy generation; default = "KR"
+   - `db` = "target", "decoy" or "concat"; default = "concat"
+
+For rescoring
+-------------
+
+- `fdr_estimation_method` = method used for FDR estimation on PSM and peptide level: "percolator" or "mokapot"; default = "mokapot"
+- `allFeatures` = True if all features should be used for FDR estimation; default = False
+- `regressionMethod` = regression method for curve fitting (mapping from predicted iRT values to experimental retention times): "lowess", "spline" or "logistic"; default = "spline"
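The `massTolerance` and `unitMassTolerance` descriptions in `docs/config.rst` imply a per-analyzer default and a unit-dependent comparison. The sketch below shows one way to read them. The function names are hypothetical, and how Oktoberfest itself resolves the analyzer or applies the tolerance is not stated in this change, so treat the logic as an assumption.

```python
# Defaults quoted from the massTolerance / unitMassTolerance descriptions.
DEFAULT_TOLERANCE = {
    "FTMS": (20, "ppm"),
    "TOF": (40, "ppm"),
    "ITMS": (0.35, "da"),
}


def resolve_mass_tolerance(config, mass_analyzer):
    """Use the config values when present, otherwise the analyzer default."""
    default_value, default_unit = DEFAULT_TOLERANCE[mass_analyzer]
    value = config.get("massTolerance", default_value)
    unit = config.get("unitMassTolerance", default_unit)
    return value, unit


def within_tolerance(theoretical_mz, observed_mz, tolerance, unit):
    """Check whether an observed fragment lies inside the allowed window."""
    delta = abs(observed_mz - theoretical_mz)
    if unit == "ppm":
        # ppm is relative to the theoretical value (assumption of this sketch).
        return delta <= theoretical_mz * tolerance * 1e-6
    return delta <= tolerance


value, unit = resolve_mass_tolerance({}, "FTMS")
print(value, unit, within_tolerance(500.0, 500.009, value, unit))
```

Setting only one of the two flags mixes a user value with a default unit (or the reverse), so it is probably safest to set both together.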
diff --git a/docs/index.rst b/docs/index.rst
@@ -10,7 +10,7 @@
 .. include:: news.rst
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: Contents:
 
    installation

diff --git a/docs/jobs.rst b/docs/jobs.rst
@@ -0,0 +1,164 @@
+Running a job
+=============
+
+The command for executing a job from terminal:
+
+.. code-block:: bash
+
+   python oktoberfest/run_oktoberfest.py --config_path <path/to/config.json>
+
+The command for executing a job within python:
+
+.. code-block:: python
+
+   from oktoberfest.runner import run_job
+   run_job("<path/to/config.json>")
+
+If you instead want to run oktoberfest using the docker image, run:
+
+.. code-block:: bash
+
+   DATA=path/to/data/dir make run_oktoberfest
+
+.. note::
+    When using docker, `DATA` must contain the spectra, the search results that fit the specified `search_results_type` in the config, and a `config.json` file with the configuration. The results will be written to `<DATA>/<output>/results/percolator`.
+
+
+A. Collision Energy Calibration
+-------------------------------
+
+This task estimates the optimal normalised collision energy (NCE) based on a given search result.
+Oktoberfest will:
+
+1. Select the 1000 highest scoring target PSMs
+2. Perform peptide property prediction for NCE 18 to 49 in steps of one.
+3. Calculate the spectral angle between predicted and experimentally observed fragment intensities for each NCE and report the best NCE, i.e. the one that reaches the highest spectral angle.
+
+.. note::
+    Sequences with amino acid U or O are not supported. Modifications except "M(ox)" are not supported.
+
+    Each C is treated as Cysteine with carbamidomethylation (fixed modification).
+
+Example config file:
+
+.. code-block:: python
+
+    task_config_ce_calibration = {
+        "type": "CollisionEnergyCalibration",
+        "tag": "",
+        "output": "./out",
+        "inputs": {
+            "search_results": "./msms.txt",
+            "search_results_type": "Maxquant",
+            "spectra": "./",
+            "spectra_type": "raw"
+        },
+        "models": {
+            "intensity": "Prosit_2020_intensity_HCD",
+            "irt": "Prosit_2019_irt"
+        },
+        "prediction_server": "koina.proteomicsdb.org:443",
+        "regressionMethod": "lowess",
+        "ssl": True,
+        "thermoExe": "ThermoRawFileParser.exe",
+        "massTolerance": 20,
+        "unitMassTolerance": "ppm"
+    }
+
+B. Spectral Library Generation
+------------------------------
+
+This task generates a spectral library either by digesting a given FASTA file, or by predicting a list of peptides given in a CSV file. You need to provide a collision energy (CE) for prediction (see above). Oktoberfest will:
+
+1. Digest the FASTA using a given protease and other parameters and create a peptides.csv file from that.
+2. Predict all spectra at the given collision energy.
+
+In case a CSV with peptides is provided, Oktoberfest will directly predict all spectra and skip the digestion step.
+
+.. note::
+    Sequences with amino acid U or O are not supported. Modifications except "M(ox)" are not supported.
+
+    Each C is treated as Cysteine with carbamidomethylation (fixed modification).
+
+Example config file:
+
+.. code-block:: python
+
+    task_config_spectral_lib = {
+        "type": "SpectralLibraryGeneration",
+        "tag": "",
+        "output": "./out",
+        "inputs": {
+            "search_results": "./msms.txt",
+            "search_results_type": "Maxquant",
+            "library_input": "./peptides.csv",
+            "library_input_type": "peptides"
+        },
+        "models": {
+            "intensity": "Prosit_2020_intensity_HCD",
+            "irt": "Prosit_2019_irt"
+        },
+        "outputFormat": "spectronaut",
+        "prediction_server": "koina.proteomicsdb.org:443",
+        "numThreads": 1,
+        "ssl": True,
+        "thermoExe": "ThermoRawFileParser.exe",
+        "fastaDigestOptions": {
+            "fragmentation": "",
+            "digestion": "full",
+            "missedCleavages": 2,
+            "minLength": 7,
+            "maxLength": 60,
+            "enzyme": "trypsin",
+            "specialAas": "KR",
+            "db": "concat"
+        }
+    }
+
+C. Rescoring
+------------
+
+This task rescores an existing search result using features generated from peptide property prediction.
+Oktoberfest will:
+
+1. Calibrate CE against the provided RAW files.
+2. Perform peptide property prediction for all spectra that have a match in the search results file.
+3. Use predicted spectra and retention time to generate features for rescoring.
+4. Run percolator or mokapot to rescore the search and perform FDR estimation.
+5. Generate summary plots.
+
+.. note::
+    You need to provide search results that were not filtered for a given FDR (i.e. 100% FDR), otherwise valid targets may be filtered out prior to rescoring.
+
+    Sequences with amino acid U or O are not supported. Modifications except "M(ox)" are not supported.
+
+    Each C is treated as Cysteine with carbamidomethylation (fixed modification).
+
+Example config file:
+
+.. code-block:: python
+
+    task_config_rescoring = {
+        "type": "Rescoring",
+        "tag": "",
+        "output": "./out",
+        "inputs": {
+            "search_results": "./msms.txt",
+            "search_results_type": "Maxquant",
+            "spectra": "./",
+            "spectra_type": "raw"
+        },
+        "models": {
+            "intensity": "Prosit_2020_intensity_HCD",
+            "irt": "Prosit_2019_irt"
+        },
+        "prediction_server": "koina.proteomicsdb.org:443",
+        "numThreads": 1,
+        "fdr_estimation_method": "mokapot",
+        "allFeatures": False,
+        "regressionMethod": "spline",
+        "ssl": True,
+        "thermoExe": "ThermoRawFileParser.exe",
+        "massTolerance": 20,
+        "unitMassTolerance": "ppm"
+    }
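The collision energy calibration steps in `docs/jobs.rst` (scan NCE 18 to 49, score each by spectral angle, keep the best) can be sketched as below. This is an illustration under assumptions, not Oktoberfest's implementation. `predict` is a hypothetical stand-in for the prediction server. The normalized spectral angle formula, 1 - 2·arccos(cosine)/π, is the commonly used definition and is assumed here. Averaging over PSMs is also an assumption, since the docs do not say how the per-PSM angles are aggregated.

```python
import math


def spectral_angle(predicted, observed):
    """Normalized spectral angle in [0, 1]; 1 means identical intensity patterns."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0 or norm_o == 0:
        return 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1 - 2 * math.acos(cosine) / math.pi


def best_nce(observed_spectra, predict, nce_values=range(18, 50)):
    """Return the NCE with the highest mean spectral angle, plus all scores.

    `predict(nce)` is a hypothetical callable returning one predicted
    intensity vector per observed spectrum, in the same order.
    """
    mean_angle = {}
    for nce in nce_values:
        predicted_spectra = predict(nce)
        angles = [
            spectral_angle(pred, obs)
            for pred, obs in zip(predicted_spectra, observed_spectra)
        ]
        mean_angle[nce] = sum(angles) / len(angles)
    return max(mean_angle, key=mean_angle.get), mean_angle


# Toy data: the fake predictor matches the observations exactly at NCE 30.
observed = [[1.0, 0.5, 0.0], [0.2, 1.0, 0.0]]


def toy_predict(nce):
    drift = abs(nce - 30) / 10
    return [[1.0, 0.5, drift], [0.2, 1.0, drift]]


nce, scores = best_nce(observed, toy_predict)
print(nce)
```

In a real run the observed spectra would be the 1000 highest scoring target PSMs mentioned in step 1.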
diff --git a/docs/predictions.rst b/docs/predictions.rst
@@ -0,0 +1,54 @@
+Retrieving Predictions
+======================
+
+Oktoberfest relies on retrieving predictions from a `koina <https://koina.proteomicsdb.org/>`_ server that hosts supported models for peptide property prediction. Users can use any publicly available community server or host their own server.
+
+Connecting to a community server
+--------------------------------
+
+Our community server is publicly available at `koina.proteomicsdb.org:443`.
+If you want to connect to it, you need to have the following flags in your config file (default settings):
+
+.. code-block:: python
+
+   "prediction_server": "koina.proteomicsdb.org:443",
+   "ssl": True,
+
+Once more community servers become available, we will add a list here.
+
+Currently supported models
+--------------------------
+
+This is the list of currently supported and tested models for Oktoberfest provided by our community server:
+
+Intensity models:
+
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+|          Model             |                             Description                                                                                                                |
++============================+========================================================================================================================================================+
+| Prosit_2019_intensity      | deprecated, please use the 2020 model                                                                                                                  |
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Prosit_2020_intensity_HCD  | your go to model for fragment intensity prediction for HCD fragmentation, find out more about this model `here <https://github.com/kusterlab/prosit>`_ |
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Prosit_2020_intensity_CID  | your go to model for fragment intensity prediction for CID fragmentation, find out more about this model `here <https://github.com/kusterlab/prosit>`_ |
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Prosit_2020_intensity_TMT  | your go to model for fragment intensity prediction for TMT, find out more about this model `here <https://github.com/kusterlab/prosit>`_               |
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+iRT models:
+
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+|          Model             |                             Description                                                                                                                |
++============================+========================================================================================================================================================+
+| Prosit_2019_irt            | all purpose model for retention time prediction, find out more about this model `here <https://github.com/kusterlab/prosit>`_                          |
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Prosit_2020_irt_TMT        | your go to model for retention time prediction for TMT, find out more about this model `here <https://github.com/kusterlab/prosit>`_                   |
++----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+Once support for additional models is implemented in Oktoberfest, they will be added here.
+
+Hosting and adding your own models
+----------------------------------
+
+In case you are planning to host your own private or public instance of koina or want us to host your model, please refer to the official `koina documentation <https://koina.proteomicsdb.org/docs#overview>`_.
+
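The model tables in `docs/predictions.rst` pair each model with a use case (HCD, CID, TMT). The sketch below turns those descriptions into the `models`, `prediction_server` and `ssl` entries of a config. `suggest_models` and `server_settings` are hypothetical helpers for illustration. The mapping is read off the table descriptions and is not an official selection rule.

```python
def suggest_models(fragmentation="HCD", tmt_labelled=False):
    """Pick intensity and iRT model names based on the table descriptions."""
    if tmt_labelled:
        # Both tables list a dedicated TMT model.
        return {
            "intensity": "Prosit_2020_intensity_TMT",
            "irt": "Prosit_2020_irt_TMT",
        }
    intensity_models = {
        "HCD": "Prosit_2020_intensity_HCD",
        "CID": "Prosit_2020_intensity_CID",
    }
    if fragmentation not in intensity_models:
        raise ValueError(f"no listed intensity model for {fragmentation!r}")
    return {"intensity": intensity_models[fragmentation], "irt": "Prosit_2019_irt"}


def server_settings(models, server="koina.proteomicsdb.org:443", ssl=True):
    """Assemble the prediction-related part of a config dictionary."""
    return {"models": models, "prediction_server": server, "ssl": ssl}


settings = server_settings(suggest_models("HCD"))
print(settings)
```

The resulting dictionary can be merged into a task config such as the examples in `docs/jobs.rst`.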