CE calibrating FragPipe results generated from mzML formatted data fails #129
Comments
I published a hotfix release for spectrum-io (v0.3.3) because it was only there to check if we have default values for the mass tolerance and unit. As long as you supply these yourself, it should be fine. If you install the newest release of oktoberfest (v0.5.0), this error should be gone. The release will be published tonight and the issue will be closed accordingly. Please reopen should you still encounter the problem. |
Hi @picciama, I updated to v 0.5.0 and get another error:
the
there is no other file in the output folder:
|
Another thing: I can find the
If this is intended and will stay like this in the future, you might update this here and replace it with this |
Concerning the key error you get: This is likely because you provide a folder that contains a raw file with a name that is not present in the msms.prosit file. Please check the following potential issues:
Meanwhile, I will implement a check that prints a warning if no PSMs for a provided filename could be found in the search results. Concerning the second point: |
Hmmm, the
There is actually one pepxml file for each raw file in FragPipe output folder:
but I provided only one in the config file. Why exactly is this a problem? Shouldn't the search result determine which raw files need to be found? |
not found, as expected due to .pepxml file coverage:
|
I changed the source to a single file:
and get a different error:
|
Yes indeed, it shouldn't raise a key error. This is an inconvenience that I will address by printing a warning instead of raising an error. If you want to include all files in the msms.prosit, you can also provide the folder instead of one pepXML file, and Oktoberfest will then include all pepXML files contained in the folder and all subfolders. For now, you have solved this by explicitly providing one raw file.
The mzML file should contain a header that defines a list of instrument types for each MS level. Each spectrum then uses a reference "instrumentConfigurationRef" that defines which instrument was used to measure it. It seems your mzML file does not have this. I will add a check whether the instrument reference is provided and, if not, rely on the user providing mass tolerance and unit. |
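The check described above could look roughly like the following. This is only a minimal sketch and not the spectrum-io implementation: the function name `instrument_refs` and the trimmed example document are assumptions made for illustration. It collects the `instrumentConfigurationRef` of each spectrum's scan, falls back to the run-level default if one exists, and reports the spectra for which no reference is available, so that a caller could fall back to a user-supplied mass tolerance and unit.

```python
import xml.etree.ElementTree as ET

NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Trimmed, illustrative mzML fragment (not a complete, valid file).
EXAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="r1">
    <spectrumList count="2">
      <spectrum id="scan=1" index="0">
        <scanList count="1"><scan instrumentConfigurationRef="IC1"/></scanList>
      </spectrum>
      <spectrum id="scan=2" index="1">
        <scanList count="1"><scan/></scanList>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""


def instrument_refs(mzml_text):
    """Map spectrum id -> instrument configuration reference (None if absent)."""
    root = ET.fromstring(mzml_text)
    run = root.find(".//mz:run", NS)
    default_ref = run.get("defaultInstrumentConfigurationRef") if run is not None else None
    refs = {}
    for spectrum in root.iterfind(".//mz:spectrum", NS):
        scan = spectrum.find("mz:scanList/mz:scan", NS)
        ref = scan.get("instrumentConfigurationRef") if scan is not None else None
        # Fall back to the run-level default when the scan has no own reference.
        refs[spectrum.get("id")] = ref or default_ref
    return refs


refs = instrument_refs(EXAMPLE)
# Spectra without any reference need user-provided mass tolerance and unit.
missing = [spectrum_id for spectrum_id, ref in refs.items() if ref is None]
```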
Can you maybe send me an email with one of the pepXML + corresponding mzML file? Would be helpful to debug this for this particular case. |
The data is available from our ftp server using this url. |
I would honestly say that if anything in the mzML space defines a standard or reference implementation, it is msconvert/the ProteoWizard project, not the ThermoRawFileParser. There is meanwhile a containerized version that also works on Linux, see here (GitHub). This is what we are using on our HPC nodes to convert raw files. Would that also be a solution for Oktoberfest? |
|
But do I get this right: In the end all you want is to extract ~1000 MS2 scans from a raw file to do the spectral angle/similarity calculation, and because you can't request the peak lists of these scans selectively using some sort of API, you need to convert the complete file to |
Ok, I think I found a fix, and I used that one mzML and performed CECalibration and Rescoring with it. The rescoring results on peptide level and spectral angle for the tested CEs are below, so it definitely works now. I have pushed the fix to the fix/mzml_instrumentConfigurationRef branch of spectrum-io, you can install it using pip install git+https://github.com/wilhelm-lab/spectrum_io.git#fix/mzml_instrumentConfigurationRef for now, until I release this. |
Yes, for CE calibration, we currently take the top 1000 scoring target PSMs, so unfortunately, we need to read all of them and then sort by score. If you know a better way of doing this, please let me know :) |
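The selection step mentioned here can be sketched as follows. This is an illustration only, not Oktoberfest's code: the field names `score` and `is_decoy` and the function `top_target_psms` are hypothetical. All PSMs still have to be read once, but `heapq.nlargest` keeps only the best `n` targets instead of sorting the full list.

```python
import heapq


def top_target_psms(psms, n=1000):
    """Return the n highest-scoring target (non-decoy) PSMs, best first."""
    targets = (psm for psm in psms if not psm["is_decoy"])
    return heapq.nlargest(n, targets, key=lambda psm: psm["score"])


# Tiny made-up example with hypothetical field names.
example_psms = [
    {"scan": 1, "score": 10.0, "is_decoy": False},
    {"scan": 2, "score": 50.0, "is_decoy": True},
    {"scan": 3, "score": 30.0, "is_decoy": False},
    {"scan": 4, "score": 20.0, "is_decoy": False},
]
top = top_target_psms(example_psms, n=2)
```

The scan numbers of the selected PSMs are what one would pass on if the raw file converter supported extracting individual scans.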
My comment was more about the conversion of 99% of the scan data (the raw file) that is afterwards anyway not needed. If one could selectively request those ~1000 from the binary file without a linear read access...it would save so much time and computation... |
Nice plots! ok. will try to update spectrum-io! |
Yes, good idea, especially in situations with many raw files. This does apparently work with ThermoRawFileParser by providing the scan numbers you want to extract, but it would require many changes. I created an issue for that: #135 |
I updated by
but the error does not disappear:
What am I doing wrong? |
Try again but instead of "#" write "@", i.e. Check while installing, that the log output specifically states it is checking out that branch. Sometimes one also needs to first uninstall the package for whatever reason... |
no difference.
|
not sure why... I checked your commit 0ed85b8 in spectrum_io and on my local system it is still the old code (looked at line 154 in msraw.py) |
BAAM! Looks like it works. |
And the answer to life, the universe, and everything is
and not 42! 😂 |
Very cool. Thanks a lot for your fast help. Now I can do this for all the files. Does multithreading help for this workflow? |
Yes, it does. Parallel processing is realised on the file level: the shared msms.prosit is split by raw file, and the entire annotation and prediction is then performed in parallel. I.e. use as many processes as you have files with the |
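File-level parallelism of this kind can be sketched as below. This is not Oktoberfest's implementation: the field name `raw_file`, the helper names, and the placeholder `process_file` are assumptions for illustration. A thread pool is used only to keep the example self-contained; for CPU-bound annotation work a process pool would be the natural counterpart.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_raw_file(psms):
    """Group PSMs from a shared search result by the raw file they belong to."""
    groups = defaultdict(list)
    for psm in psms:
        groups[psm["raw_file"]].append(psm)
    return dict(groups)


def process_file(item):
    """Placeholder for per-file annotation and prediction; returns a PSM count."""
    raw_file, file_psms = item
    return raw_file, len(file_psms)


def run_per_file(psms, max_workers):
    """Process each raw file's PSMs independently, one task per file."""
    groups = split_by_raw_file(psms)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_file, groups.items()))


example_psms = [
    {"raw_file": "run_a", "scan": 1},
    {"raw_file": "run_a", "scan": 2},
    {"raw_file": "run_b", "scan": 1},
]
counts = run_per_file(example_psms, max_workers=2)
```

Since one task is created per raw file, more workers than files would not speed anything up in this scheme.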
Very cool, 46 files done!
|
Sorry... the problems continue! The output below is from a CEcalibration of FragPipe results. This time the mzML was written by FragPipe (not MSconvert) and the raw data is ddaPASEF style (so .tdf or .d):
Not sure if Oktoberfest expects .raw files and .mzML files to have the exact same base file name (except the very last extension). FragPipe adds this
|
The filename without the extension has to match. I don't know why they add _uncalibrated, but then they should also add this suffix to the search results. We cannot possibly know how to match arbitrary filename manipulations, and I suggest that FragPipe fixes this on their side. For now, you would need to correct the filenames. I could maybe add a check for "_uncalibrated", as long as this is always the case. But it gets difficult if every tool changes the filename somehow. |
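To make the matching rule concrete, here is a minimal sketch. The function `match_spectra_to_search` and the `known_suffixes` parameter are hypothetical and not part of Oktoberfest; the suffix handling only illustrates what such an optional check might look like. A spectra file matches if its name without extension, optionally with a known suffix stripped, equals a raw file name from the search results.

```python
from pathlib import Path


def match_spectra_to_search(spectra_paths, search_names, known_suffixes=("_uncalibrated",)):
    """Return ({search name: spectra file}, [unmatched spectra files])."""
    wanted = set(search_names)
    matched, unmatched = {}, []
    for path in map(Path, spectra_paths):
        stem = path.stem
        # Try the plain stem first, then the stem with a known suffix removed.
        candidates = [stem] + [stem[: -len(s)] for s in known_suffixes if stem.endswith(s)]
        hit = next((c for c in candidates if c in wanted), None)
        if hit is None:
            unmatched.append(path.name)
        else:
            matched[hit] = path.name
    return matched, unmatched


matched, unmatched = match_spectra_to_search(
    ["a_uncalibrated.mzML", "b.mzML", "c.mzML"],
    ["a", "b"],
)
```

Files that end up in `unmatched` are candidates for a warning rather than an error.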
Jip! I totally agree, not their best idea... could I also use a hard link for this purpose? |
Yes Oktoberfest supports links. I would rather use a symlink though, i.e. |
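For many files, the symlinks could be created in one go. The following is a sketch under assumptions: the function name `link_without_suffix`, the directory layout, and the `.mzML` extension pattern are chosen for illustration. It creates, for every `*_uncalibrated.mzML` file, a symlink whose name drops the suffix, leaving the original files untouched.

```python
from pathlib import Path


def link_without_suffix(source_dir, target_dir, suffix="_uncalibrated"):
    """Create symlinks in target_dir that drop `suffix` from the mzML file names."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sorted(source_dir.glob(f"*{suffix}.mzML")):
        link = target_dir / (src.stem[: -len(suffix)] + src.suffix)
        # Skip names that are already taken (including dangling symlinks).
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(src.resolve())
        links.append(link)
    return links
```

The directory containing the links can then be given as the spectra location, so that the names match those in the search results.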
hmmm, I think
but
|
This is not because of symlinks but because of the "filter string" accession, which is not present. This is used in spectrum-io to determine the fragmentation type (supported are HCD and CID at the moment) as well as the m/z range of the spectrum. The problem is that this accession seems to be unique to Thermo instruments, i.e. it is not there all the time, but we need it for annotation of fragment peaks. @WassimG do you have an idea how to do this better? It seems HUPO PSI made the terms HCD/CID obsolete and suggests a different accession, which to me doesn't even make sense: https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo (search for HCD). |
ok. I loooooove mzML! So it is this (searched for HCD):
But why obsolete? |
I don't know, this is just sth. they say in the mzml documentation. I will have to check the mzML to see which accessions are used to define the scan window and fragmentation type and will come back to you once I know how to solve this. |
But why are you so keen on checking that the scans are actually of the HCD/CID fragmentation type? I kind of understood this when the code was still sitting behind Prosit - which only had HCD/CID models trained on Orbitrap data, but now that all kinds of models could become available through Koina... or maybe someone wants to score EtHCD data vs. a model trained on CID data? So in essence, is this check really needed? Maybe, just check if the scan is a fragment ion scan (guess that is the ms level in mzML) and place a warning if the fragmentation method indicated in the scan metadata mismatches the selected model, but even that matching might be difficult to do. |
I sort of agree. The question is now whether we should care about the fragmentation type and FTMS/ITMS/TOF at all, or instead leave this to the user. We really only need the scan window, and can let the user provide the desired mass tolerance used for the search, which is already supported in the config. We realised that mzML converted using MSConvert is actually not working at all. Still looking into this in hopes of finding a solution. |
ok. crazy! Not at all? But the mzML files written by FragPipe work? |
Yes, because you were able to read the information from the filter string attribute, which is not there all the time. What should be there is the scanWindowList and the activation attribute within the precursorList. I pushed to wilhelm-lab/spectrum_io@4a60a9c, which should fix your problem. I removed the dependency on the filter string attribute. I added some unit tests and they pass, but please check this before I merge it. |
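Reading these two pieces of information without the filter string can be sketched as follows. This is not the code from the commit above: the function `scan_metadata` is hypothetical, and the example fragment is trimmed (cvParam accessions omitted, names illustrative). It takes the scan window limits from the scanWindowList and lists the cvParam names found in the activation element of the precursor.

```python
import xml.etree.ElementTree as ET

NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Trimmed, illustrative spectrum element (not a complete mzML file).
EXAMPLE_SPECTRUM = """<spectrum xmlns="http://psi.hupo.org/ms/mzml" id="scan=2" index="0">
  <scanList count="1">
    <scan>
      <scanWindowList count="1">
        <scanWindow>
          <cvParam name="scan window lower limit" value="100.0"/>
          <cvParam name="scan window upper limit" value="2000.0"/>
        </scanWindow>
      </scanWindowList>
    </scan>
  </scanList>
  <precursorList count="1">
    <precursor>
      <activation>
        <cvParam name="beam-type collision-induced dissociation" value=""/>
        <cvParam name="collision energy" value="30.0"/>
      </activation>
    </precursor>
  </precursorList>
</spectrum>"""


def scan_metadata(spectrum):
    """Return (lower limit, upper limit, activation cvParam names) for one spectrum."""
    limits = {}
    window_path = "mz:scanList/mz:scan/mz:scanWindowList/mz:scanWindow/mz:cvParam"
    for cv in spectrum.iterfind(window_path, NS):
        if cv.get("name") in ("scan window lower limit", "scan window upper limit"):
            limits[cv.get("name")] = float(cv.get("value"))
    activation_path = "mz:precursorList/mz:precursor/mz:activation/mz:cvParam"
    activation = [cv.get("name") for cv in spectrum.iterfind(activation_path, NS)]
    return (
        limits.get("scan window lower limit"),
        limits.get("scan window upper limit"),
        activation,
    )


metadata = scan_metadata(ET.fromstring(EXAMPLE_SPECTRUM))
```

Missing limits come back as `None`, which leaves it to the caller to decide whether to warn or to fall back to user-provided settings.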
Hi there, not sure if this is a new problem, or still the old one. I started a CE calibration:
and it fails when Oktoberfest starts reading from the first mzML files:
BUT, this has worked in a previous attempt. |
It's a new one. In order to support different models in Koina which require the instrument type to be read from the mzML file, we introduced a new column in the internal format that contains this information. This is already implemented in the latest version and works with the mzML files we tested. I checked the file LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_01.mzML, which I still had, and found that MSConvert writes "CommonInstrumentParams", i.e. with a capital "C", whereas ThermoRawFileParser writes it in lowercase. I will fix this asap. If it isn't too many files, you can manually change the mzML files for now if you want... |
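One way to tolerate such a difference is a case-insensitive lookup of the parameter group id. The sketch below is an assumption about how this could be handled, not the actual fix: the function `find_param_group`, the default id spelling, and the example dictionary are made up for illustration.

```python
def find_param_group(groups, wanted="commonInstrumentParams"):
    """Look up a parameter group by id, ignoring upper/lower case."""
    by_lower = {group_id.lower(): params for group_id, params in groups.items()}
    return by_lower.get(wanted.lower())


# Hypothetical parsed groups, keyed by the id as written by the converter.
msconvert_groups = {"CommonInstrumentParams": ["instrument model placeholder"]}
params = find_param_group(msconvert_groups)
```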
ok, thx for the fast reply. I will wait for the fix. The last thing we want is to create additional confusion by introducing manual changes in the .mzML files. I 😍 .mzML |
@tobiasko Reading the instrumentConfiguration from mzML files that were converted with MSConvert is now working with the current development branch of oktoberfest but not with the stable release, since the newest spectrum-io isn't supported by the current stable version of oktoberfest due to a breaking change. If sth. else doesn't work with regards to reading the mzML, please consider opening a new issue. |
Describe the bug
The above mzML file was generated by MSconvert (Docker container) on Debian Linux with parameters
--mzML --64 --zlib --filter "peakPicking true 1-
To Reproduce
Expected behavior
no error complaining about unsupported mass analysers.
System [please complete the following information]: