Modification object #6

Open
sgibb opened this issue Jul 11, 2017 · 22 comments

@sgibb
Member

sgibb commented Jul 11, 2017

The current Modification object is a simple class that contains mainly the unimod ID, the composition and the avg/mono mass of the modification as integer/double vectors of length 1. Additionally it contains a data.frame named specificity that stores information about the site and the position of the modification, e.g.:

- General:
  Class                  :   Modification
  Accession number/id    :              1
  PSI-MS/Interim Name    :         Acetyl
  Description            :    Acetylation
  Composition            : H(2) C(2) O(1)
  Delta Average Mass     :        42.0367
  Delta Monoisotopic Mass:      42.010565
  Approved               :           TRUE
- Specificity:
    site       position      classification hidden group
1      K       Anywhere            Multiple  FALSE     1
2 N-term     Any N-term            Multiple  FALSE     2
3      C       Anywhere  Post-translational   TRUE     3
4      S       Anywhere  Post-translational   TRUE     4
5 N-term Protein N-term  Post-translational  FALSE     5
6      T       Anywhere  Post-translational   TRUE     6
7      Y       Anywhere Chemical derivative   TRUE     7
8      H       Anywhere Chemical derivative   TRUE     8
- References: use 'references(object)'

Some specificities have additional entries in the unimod database for neutral loss (#3). These entries have their own avg/mono mass (sometimes different from the general modification mass, e.g. Phosphorylation, id=21).

[Image: screenshot_20170711_220453]

We could create a new class NeutralLoss that stores this information and could be attached to a specificity (which maybe should also be a class, so that we could handle different user-defined locations more easily; see #2). But before creating two new classes I would like to ask whether anyone has a better idea.
Maybe we overcomplicate things. Maybe a data.frame (with some duplicated entries in some columns) would fit and a complicated class hierarchy is just overkill.

Class hierarchy would be:

AbstractModification (VIRTUAL; slots id, name, avgMass, monoMass, composition)
|- NeutralLoss (inherits AbstractModification, no additional slots)
`- Modification (inherits AbstractModification,
                 additional slots: specificity (list of Specificity))

Specificity (slots: id, site, position, classification, hidden, 
                    neutralLoss (list of NeutralLoss))

vs. a data.frame where all these slots would be columns.
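To make the trade-off concrete, here is a minimal S4 sketch of the hierarchy above next to the flat alternative. It only assumes the slot names listed in the hierarchy; the slot types and the example objects are illustrative guesses, not code from the package.

```r
library(methods)

# Virtual base class with the slots shared by modifications and neutral losses.
setClass("AbstractModification",
         representation = representation("VIRTUAL",
                                         id = "integer", name = "character",
                                         avgMass = "numeric", monoMass = "numeric",
                                         composition = "character"))

setClass("NeutralLoss", contains = "AbstractModification")

setClass("Specificity",
         representation = representation(id = "integer", site = "character",
                                         position = "character",
                                         classification = "character",
                                         hidden = "logical",
                                         neutralLoss = "list"))

setClass("Modification", contains = "AbstractModification",
         representation = representation(specificity = "list"))

# One specificity (acetylation of K) attached to the Acetyl modification.
acetylK <- new("Specificity", id = 1L, site = "K", position = "Anywhere",
               classification = "Multiple", hidden = FALSE, neutralLoss = list())
acetyl <- new("Modification", id = 1L, name = "Acetyl", avgMass = 42.0367,
              monoMass = 42.010565, composition = "H(2) C(2) O(1)",
              specificity = list(acetylK))

# The same information as one flat data.frame row per specificity.
flat <- data.frame(id = acetyl@id, name = acetyl@name, monoMass = acetyl@monoMass,
                   site = acetylK@site, position = acetylK@position,
                   stringsAsFactors = FALSE)
```

The class version needs four definitions before a single object exists, while the flat version is one row whose general columns repeat for every additional site.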

There is the mzID package that has a complex class hierarchy and many classes but in fact just turns a mzIdentML file into a data.frame (nearly identical use case). I don't want to create classes just because it is possible. The user should benefit from them and should be allowed to create modifications for calculateFragments and other functions.

@lgatto do you have a better idea for the data structure?

@sgibb sgibb added the question label Jul 11, 2017
@sgibb sgibb added this to the First Bioconductor Release milestone Jul 11, 2017
@lgatto
Member

lgatto commented Jul 11, 2017

I think it depends a bit on the use cases you envision. I am not convinced that several classes are really necessary here, and a data.frame with additional rows, similarly to the table above, seems like it could do the trick.

There could be a helper function to create new modification as new rows in the data.frame, so that the user doesn't need to create them manually - that function could do some checks, fill out values that can be calculated automatically, ...
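A minimal sketch of such a helper, assuming a composition string in the unimod style ("H(2) C(2) O"). The function names addModification and compositionMass, the column names and the small element table are hypothetical; the monoisotopic mass is the value that gets filled in automatically.

```r
# Monoisotopic element masses (standard values, only four elements for brevity).
elementMono <- c(H = 1.007825, C = 12.000000, N = 14.003074, O = 15.994915)

# Hypothetical parser: "H(2) C(2) O" -> 2 * H + 2 * C + 1 * O.
compositionMass <- function(composition) {
    tokens <- strsplit(composition, " +")[[1L]]
    symbol <- sub("\\(.*$", "", tokens)
    count <- rep(1, length(tokens))
    hasCount <- grepl("(", tokens, fixed = TRUE)
    count[hasCount] <- as.numeric(sub("^.*\\((-?[0-9]+)\\)$", "\\1",
                                      tokens[hasCount]))
    stopifnot(all(symbol %in% names(elementMono)))
    sum(elementMono[symbol] * count)
}

# Hypothetical helper: check the input and append one row per site.
addModification <- function(mods, id, name, composition, site,
                            position = "Anywhere") {
    stopifnot(is.data.frame(mods), length(id) == 1L, !id %in% mods$Id,
              nchar(composition) > 0L)
    rbind(mods,
          data.frame(Id = id, Name = name, Composition = composition,
                     MonoMass = compositionMass(composition),
                     Site = site, Position = position,
                     stringsAsFactors = FALSE))
}

mods <- data.frame(Id = integer(), Name = character(), Composition = character(),
                   MonoMass = numeric(), Site = character(),
                   Position = character(), stringsAsFactors = FALSE)
mods <- addModification(mods, 1L, "Acetyl", "H(2) C(2) O", site = c("K", "T"))
```

Passing two sites yields two rows that share the general columns, which is the duplication a flat table implies.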

@sgibb
Member Author

sgibb commented Aug 6, 2017

Maybe you are right and we should not overcomplicate things. We could avoid a class completely.

I converted all the unimod entries into a data.frame (currently without the NeutralLoss and reference information) and despite a lot of data duplication it takes just around 1 MB of memory:

head(d)
#     id   name description        lastModified approved avgMass  monoMass
# 1    1 Acetyl Acetylation 2008-02-15 05:20:02        1 42.0367 42.010565
# 1.1  1 Acetyl Acetylation 2008-02-15 05:20:02        1 42.0367 42.010565
# 1.2  1 Acetyl Acetylation 2008-02-15 05:20:02        1 42.0367 42.010565
# 1.3  1 Acetyl Acetylation 2008-02-15 05:20:02        1 42.0367 42.010565
# 1.4  1 Acetyl Acetylation 2008-02-15 05:20:02        1 42.0367 42.010565
# 1.5  1 Acetyl Acetylation 2008-02-15 05:20:02        1 42.0367 42.010565
#      composition   site       position     classification hidden group
# 1   H(2)C(2)O(1)      K       Anywhere           Multiple  FALSE     1
# 1.1 H(2)C(2)O(1) N-term     Any N-term           Multiple  FALSE     2
# 1.2 H(2)C(2)O(1)      C       Anywhere Post-translational   TRUE     3
# 1.3 H(2)C(2)O(1)      S       Anywhere Post-translational   TRUE     4
# 1.4 H(2)C(2)O(1) N-term Protein N-term Post-translational  FALSE     5
# 1.5 H(2)C(2)O(1)      T       Anywhere Post-translational   TRUE     6
dim(d)
# [1] 2370   13
print(object.size(d), units="Kb")
# 908.1 Kb

Converting some of the columns into factor and Rle, removing some useless columns (lastModified, group) and adding around 1000 rows to incorporate the NeutralLoss information would change that a bit. Nevertheless I think it would be acceptable to store the whole unimod.xml in a data.frame (and keep it with the amino acid and element information in data). In that case we could also move xml2 from Depends to Suggests. The reference information is IMHO negligible. If anybody wants to know where a modification was described/published they could look it up at http://unimod.org.

Instead of a Modification class there could be a simple function that creates a modification data.frame for calculateFragments, etc. This function could look up the unimod information in the unimod data.frame.

@lgatto
Member

lgatto commented Aug 6, 2017

I think it's good to keep things as simple as possible, at least in a first stage. If necessary, it's possible to encapsulate the data in a class if the need becomes clear.

@sgibb
Member Author

sgibb commented Jan 12, 2018

There are three data.frames in the /data directory now (containing all information from unimod except the references and notes):

library("unimod")

data("elements")
head(elements)
#     Name  FullName   AvgMass  MonoMass
# H      H  Hydrogen  1.007940  1.007825
# 2H    2H Deuterium  2.014102  2.014102
# Li    Li   Lithium  6.941000  7.016003
# C      C    Carbon 12.010700 12.000000
# 13C  13C  Carbon13 13.003355 13.003355
# N      N  Nitrogen 14.006700 14.003074

data("aminoacids")
head(aminoacids)
#   OneLetter ThreeLetter      FullName  AvgMass  MonoMass  H C N O S Se
# -         -                             0.0000   0.00000  0 0 0 0 0  0
# A         A         Ala       Alanine  71.0779  71.03711  5 3 1 1 0  0
# R         R         Arg      Arginine 156.1857 156.10111 12 6 4 1 0  0
# N         N         Asn    Asparagine 114.1026 114.04293  6 4 2 2 0  0
# D         D         Asp Aspartic acid 115.0874 115.02694  5 4 1 3 0  0
# C         C         Cys      Cysteine 103.1429 103.00919  5 3 1 1 1  0

data("modifications")
head(modifications)
#   Id   Name Description Composition AvgMass MonoMass   Site       Position
# 1  1 Acetyl Acetylation H(2) C(2) O 42.0367 42.01056      K       Anywhere
# 2  1 Acetyl Acetylation H(2) C(2) O 42.0367 42.01056 N-term     Any N-term
# 3  1 Acetyl Acetylation H(2) C(2) O 42.0367 42.01056      C       Anywhere
# 4  1 Acetyl Acetylation H(2) C(2) O 42.0367 42.01056      S       Anywhere
# 5  1 Acetyl Acetylation H(2) C(2) O 42.0367 42.01056 N-term Protein N-term
# 6  1 Acetyl Acetylation H(2) C(2) O 42.0367 42.01056      T       Anywhere
#       Classification SpecGroup        LastModified Approved Hidden
# 1           Multiple         1 2017-11-08 16:08:56     TRUE  FALSE
# 2           Multiple         2 2017-11-08 16:08:56     TRUE  FALSE
# 3 Post-translational         3 2017-11-08 16:08:56     TRUE   TRUE
# 4 Post-translational         4 2017-11-08 16:08:56     TRUE   TRUE
# 5 Post-translational         5 2017-11-08 16:08:56     TRUE  FALSE
# 6 Post-translational         6 2017-11-08 16:08:56     TRUE   TRUE

We could turn each of these data.frames into a DataFrame (with Rle or similar) or into a tibble but I don't think it is necessary because the whole database is small:

print(object.size(modifications), units="KB")
# 725.4 Kb

Currently unimod is a very small package providing just these 3 data.frames and has no dependencies (the hidden functions to create the data.frames need the xml2 package that's why it is in Suggests:).

The aminoacids and elements data.frame could replace MSnbase's amino.acids data.frame and atomic.mass vector (in R/environments.R).

Do you like these data.frames and their format or should we provide something different?

@lgatto
Member

lgatto commented Jan 13, 2018

For anything that is like a data.frame, tidy tools are superior when it comes to data wrangling. Still, I don't think we need to depend on tibble, as the conversion can be done by the user, if required. Unless, of course, we envision some sort of direct analysis ourselves where tibbles would be a better fit.

Yes, I would suggest to use the data in MSnbase and make use of unimod. The latter would probably have to be submitted to Bioconductor first, though.

@sgibb
Member Author

sgibb commented Jan 15, 2018

While the elements and aminoacids data.frames are very useful now, the
modifications data.frame is more or less useless. E.g. in topdownr we
support 3 modifications (Carbamidomethyl, Acetyl, Met-loss; unimod id 4, 1,
765).

library("unimod")
data("modifications")
subset(modifications, Id %in% c(1, 4, 765) & Classification != "Artefact")
#       Id            Name                                             Description
# 1      1          Acetyl                                             Acetylation
# 2      1          Acetyl                                             Acetylation
# 3      1          Acetyl                                             Acetylation
# 4      1          Acetyl                                             Acetylation
# 5      1          Acetyl                                             Acetylation
# 6      1          Acetyl                                             Acetylation
# 7      1          Acetyl                                             Acetylation
# 8      1          Acetyl                                             Acetylation
# 14     4 Carbamidomethyl                                Iodoacetamide derivative
# 23     4 Carbamidomethyl                                Iodoacetamide derivative
# 24     4 Carbamidomethyl                                Iodoacetamide derivative
# 25     4 Carbamidomethyl                                Iodoacetamide derivative
# 1170 765        Met-loss Removal of initiator methionine from protein N-terminus
#                        Composition   AvgMass   MonoMass   Site       Position
# 1                      H(2) C(2) O   42.0367   42.01056      K       Anywhere
# 2                      H(2) C(2) O   42.0367   42.01056 N-term     Any N-term
# 3                      H(2) C(2) O   42.0367   42.01056      C       Anywhere
# 4                      H(2) C(2) O   42.0367   42.01056      S       Anywhere
# 5                      H(2) C(2) O   42.0367   42.01056 N-term Protein N-term
# 6                      H(2) C(2) O   42.0367   42.01056      T       Anywhere
# 7                      H(2) C(2) O   42.0367   42.01056      Y       Anywhere
# 8                      H(2) C(2) O   42.0367   42.01056      H       Anywhere
# 14                   H(3) C(2) N O   57.0513   57.02146      C       Anywhere
# 23                   H(3) C(2) N O   57.0513   57.02146      U       Anywhere
# 24                   H(3) C(2) N O   57.0513   57.02146      M       Anywhere
# 25                 H(7) C(3) N O S  105.1588  105.02483      M       Anywhere
# 1170 H(-9) C(-5) N(-1) O(-1) S(-1) -131.1961 -131.04048      M Protein N-term
#           Classification SpecGroup        LastModified Approved Hidden
# 1               Multiple         1 2017-11-08 16:08:56     TRUE  FALSE
# 2               Multiple         2 2017-11-08 16:08:56     TRUE  FALSE
# 3     Post-translational         3 2017-11-08 16:08:56     TRUE   TRUE
# 4     Post-translational         4 2017-11-08 16:08:56     TRUE   TRUE
# 5     Post-translational         5 2017-11-08 16:08:56     TRUE  FALSE
# 6     Post-translational         6 2017-11-08 16:08:56     TRUE   TRUE
# 7    Chemical derivative         7 2017-11-08 16:08:56     TRUE   TRUE
# 8    Chemical derivative         8 2017-11-08 16:08:56     TRUE   TRUE
# 14   Chemical derivative         1 2017-10-09 10:27:10     TRUE  FALSE
# 23   Chemical derivative        10 2017-10-09 10:27:10     TRUE   TRUE
# 24   Chemical derivative        11 2017-10-09 10:27:10     TRUE   TRUE
# 25   Chemical derivative        11 2017-10-09 10:27:10     TRUE   TRUE
# 1170    Co-translational         1 2007-07-15 20:01:35    FALSE   TRUE

While we could use the modifications data.frame to find the modification and
its mass difference we also need the "modification rule", e.g. for Acetylation
the "Any N-term" rule (so we add the mass if our fragment starts with the
N-terminal end of the sequence), for Carbamidomethyl the "C|U" rule
(we add the mass if "C" or "U" is present in the sequence) and for Met-loss the
"remove M at the beginning" rule (we remove "M" from the start of the sequence
and subtract 131 from the peptide mass).

Here you could see the implementation in topdownr:

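# Acetyl (unimod 1): add the mass to fragments that start at the
# N-terminal end of the full sequence `s`.
.unimod1 <- function(x, s) {
    i <- startsWith(s, x$seq)
    x$mz[i] <- x$mz[i] + 42.010565
    x
}

# Carbamidomethyl (unimod 4): add the mass to fragments containing C or U.
.unimod4 <- function(x) {
    iCU <- grep("C|U", x$seq)
    x$mz[iCU] <- x$mz[iCU] + 57.021464
    x
}

# Met-loss (unimod 765): drop a leading M if followed by A, C, G, P, S, T or V.
.unimod765 <- function(x) {
    gsub("^M([ACGPSTV])", "\\1", x)
}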

All these rules could be written in regular expressions (one for finding the
pattern and in some cases a second one for replacement). But there is no way for
me to write nrow(modifications) == 3458 regular expressions.
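A minimal sketch of what such a rule table could look like, assuming one pattern column and an optional replacement column. The names rules and applyRule and the column layout are hypothetical. Unlike the grep-based topdownr code above, this sketch adds the delta once per matching residue.

```r
# Hypothetical rule table: Pattern locates the site; Replacement (if not NA)
# rewrites the sequence instead of adding a delta mass.
rules <- data.frame(
    Name = c("Acetyl", "Carbamidomethyl", "Met-loss"),
    Pattern = c("^.", "C|U", "^M([ACGPSTV])"),
    Replacement = c(NA, NA, "\\1"),
    DeltaMono = c(42.010565, 57.021464, NA),
    stringsAsFactors = FALSE)

applyRule <- function(sequence, rule) {
    if (!is.na(rule$Replacement)) {
        # Replacement rule: the mass change follows from the new sequence.
        return(list(sequence = sub(rule$Pattern, rule$Replacement, sequence),
                    delta = 0))
    }
    # Delta rule: add the mass once per match of the pattern.
    hits <- gregexpr(rule$Pattern, sequence)[[1L]]
    list(sequence = sequence, delta = sum(hits > 0L) * rule$DeltaMono)
}

applyRule("MACE", rules[3L, ])$sequence  # initiator M followed by A is removed
applyRule("MACE", rules[2L, ])$delta     # one C, so one Carbamidomethyl delta
```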

Unfortunately sometimes the rule could not be predicted from the Site and
Position column because the details are written in special notes, e.g. for
Met-loss:

N-terminal initiator methionine is removed by a methionine aminopeptidase from proteins where the residue following the methionine is Ala, Cys, Gly, Pro, Ser, Thr or Val. This is generally the final N-terminal state for proteins where the following residue was a Cys, Pro or Val.

But often the notes just contain less useful information (that's why the notes
are not included in the data.frame):

"observed in monoclonal antibodies"
"Covalently bound structure in Manglik et al., Fig. 1b-Fig1c. Chemical formula % Sigma catalog entry."
"Triton X-114"
"GEE (glycine ethyl ester) is a substrate for the enzyme Factor XIII for cross-linking to fibrinogen"
...

I would like to have an interface like calculatePeptideMass(peptideSequence, fixedModifications=unimodIds, variableModifications=unimodIds, neutralLoss=TRUE)
that could be used in MSnbase::calculateFragments and similar functions.
(It would be great to have input from mass spec users here for a better interface regarding the fixed/variable modifications.)

What we could do: Writing the regular expressions for 3-10 often used
modifications and set everything else as NA. If somebody wants to use a
modification with NA pattern he would get a message to open an issue on github
for implementing this rule.
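The NA fallback could be sketched as follows. The lookup vector sitePatterns, the function patternFor and the choice of falling back to the bare site letter are assumptions for illustration only.

```r
# Hypothetical subset of hand-written patterns; NA marks "not written yet".
sitePatterns <- c("Acetyl:N-term" = "^.",
                  "Carbamidomethyl:C" = "C|U",
                  "Phospho:S" = NA)

patternFor <- function(key) {
    pattern <- unname(sitePatterns[key])
    if (is.na(pattern)) {
        # Unknown or unimplemented rule: tell the user and fall back to the site.
        message("No rule for ", key, "; using its site as the pattern. ",
                "Please open an issue to request a proper rule.")
        pattern <- sub("^.*:", "", key)
    }
    pattern
}

patternFor("Carbamidomethyl:C")
patternFor("Phospho:S")
```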

Alternatively we just provide the modification data.frame with the delta mass
(and remove the other columns) and let the user implement the rule himself (as I
did in topdownr). This would at least reduce the need for hardcoding the
delta mass.

@lgatto
Member

lgatto commented Jan 16, 2018

What we could do: Writing the regular expressions for 3-10 often used modifications and set everything else as NA. If somebody wants to use a modification with NA pattern he would get a message to open an issue on github for implementing this rule.

I like this approach because it makes the package useful for what you need right now without overwhelming you with tons of unnecessary stuff, but allows users to extend it or ask for useful extensions.

@sgibb
Member Author

sgibb commented Jan 23, 2018

I implemented the first prototype of a function to calculate the mass for peptides and allow fixed custom and unimod modifications (the unimod modifications are used by their short names, colon, site):

library("unimod")


unimod:::.mass("MACE",
               fixedModifications=c("Acetyl:N-term",
                                    "Carbamidomethyl:C"))
# [1] 533.1614
# attr(,"sequence")
# [1] "MACE"


unimod:::.mass("MACE",
               fixedModifications=c("Met-loss:P-M",
                                    "Acetyl:N-term",
                                    "Carbamidomethyl:C"))
# [1] 402.1209
# attr(,"sequence")
# [1] "ACE"


unimod:::.mass(c("ACE", "MACE", "CDE"),
               fixedModifications=c("Met-loss:P-M",
                                    "Acetyl:N-term",
                                    "Carbamidomethyl:C"))
# [1] 402.1209 402.1209 446.1107
# attr(,"sequence")
# [1] "ACE" "ACE" "CDE"


unimod:::.mass(c("ACE", "MACE", "CDE"), fixedModifications="Unknown:420:N-term")
# [1] 723.1397 854.1802 767.1296
# attr(,"sequence")
# [1] "ACE"  "MACE" "CDE"
#
# Applying the default rule for the modification: Unknown:420:N-term
# Please create an issue on: https://github.com/ComputationalProteomicsUnit/unimod/issues/new
# to let us implement the correct rule or if the default one is already correct we could remove 
# this message.


unimod:::.mass(c("ACE", "MACE", "CDE"),
               fixedModifications=data.frame(
                    Id=c("MyModification1",
                         "MyModification2"),
                    Site=c("C", "D"),
                    MonoMass=c(57, 58),
                    stringsAsFactors=FALSE))
# [1] 360.0889 491.1294 462.0787
# attr(,"sequence")
# [1] "ACE"  "MACE" "CDE"
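The numbers above are consistent with summing monoisotopic residue masses and modification deltas without adding a terminal water. A minimal sketch of that arithmetic, not the actual unimod:::.mass implementation, could look like this; sketchMass and its arguments are hypothetical and only four residues are tabulated.

```r
# Monoisotopic residue masses for the residues used in the examples.
residueMono <- c(A = 71.037114, C = 103.009185, E = 129.042593, M = 131.040485)

# Hypothetical helper: nTermDelta is added once per peptide, siteDelta is a
# named vector (residue letter -> delta) added once per matching residue.
sketchMass <- function(sequence, nTermDelta = 0, siteDelta = numeric()) {
    vapply(sequence, function(s) {
        residues <- strsplit(s, "", fixed = TRUE)[[1L]]
        mass <- sum(residueMono[residues]) + nTermDelta
        for (site in names(siteDelta)) {
            mass <- mass + sum(residues == site) * siteDelta[[site]]
        }
        mass
    }, numeric(1L), USE.NAMES = FALSE)
}

# Acetyl on the N-terminus plus Carbamidomethyl on C, as in the first example.
sketchMass("MACE", nTermDelta = 42.010565, siteDelta = c(C = 57.021464))
```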

I am going to implement the variable modifications next. @pavel-shliaha, @adder, @yafeng any suggestion for the interface?

Currently I am thinking that an additional argument named variableModifications that takes a data.frame with the columns Id, Site (amino acid), Location (position in the peptide chain) and DeltaMass would be sufficient.

Does anyone have a good suggestion for a name for this function? I don't like names that contain more or less useless verbs: calculateMass, determineMass, getMass (I know we have MSnbase::calculateFragments; I am not sure why we did not simply use fragments at that time?!). That's why I vote for mass (but this is very generic).

@lgatto
Member

lgatto commented Jan 23, 2018

If you want mass, you might need to consider a method mass,character and use the generic in ProtGenerics. Otherwise, what about pepmass, to get the mass of a peptide (with optional fixed or variable modifications passed as arguments).

@sgibb
Member Author

sgibb commented Jan 23, 2018

pepmass is much more specific. Thanks. I guess the bioc reviewer will "force" me to provide a method for AAString, AAStringSet and AAStringSetList anyway (they did so for the cleave method in cleaver). So pepmass,character, pepmass,AAString etc. would be fine.

@adder

adder commented Jan 23, 2018

Hey,
Dataframes sound ok for me.
I also usually represent these types of objects as dataframes within my code; it's sufficiently flexible.
It fits nicely with my mainly dplyr/tidyverse oriented workflow :)

The most tricky thing is probably specifying terminal modifications.
Maybe position 0 for N-terminus and length(pep)+1 for C-terminus?
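A minimal sketch of this convention, assuming a Location column in a variable-modification table. The table layout and the placeholder Ids ModA, ModB and ModC are hypothetical.

```r
peptide <- "ACDEK"
n <- nchar(peptide)

# Location 0 addresses the N-terminus, n + 1 the C-terminus,
# 1..n the residues themselves.
variableMods <- data.frame(Id = c("ModA", "ModB", "ModC"),
                           Location = c(0L, 2L, n + 1L),
                           stringsAsFactors = FALSE)

variableMods$Target <- ifelse(
    variableMods$Location == 0L, "N-term",
    ifelse(variableMods$Location == n + 1L, "C-term",
           substring(peptide, variableMods$Location, variableMods$Location)))

variableMods
```

The advantage of this encoding is that Location stays a plain integer column, so terminal and residue modifications can be filtered and sorted the same way.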

Regarding the function name: if mass is too general, you could also call it peptide_mass.

@adder

adder commented Jan 23, 2018

Ok, I was too slow with my comments :)
Sorry

@lgatto
Member

lgatto commented Jan 23, 2018

Let's not get into the CamelCase vs snake_case vs alllowercase debate ;-)

Surely we all agree not to use ALLUPPERCASE.

@sgibb
Member Author

sgibb commented Jan 24, 2018

I like pepmass because it calculates the mass for peptides (ok it could be a protein as well). I assume we need to create methods because we should support character and AAString*.

I am actually wondering what should happen if a variable modification and a fixed modification hit the same site:

  1. both are applied
  2. fixed mod is overwritten (= fixed mod wouldn't be applied but the variable)
  3. the variable mod is ignored

@adder

adder commented Jan 24, 2018

I'm not an expert in these matters and I can't supply real biological examples right now but I would say that both should be applied by default.
In the case that both can be applied, you allow for this.
If the fixed modification blocks the variable modification, it's up to the user to correctly specify the variable modification by not allowing it on the same site that the fixed is on.

A difficult one is the case that the variable blocks a fixed modification, I guess an option variable_only = TRUE could help here. (the default of this option would be FALSE then)

I'm not sure if this is a problem in a real example but what happens if you have 2 variable modifications that can be on the same site?

@lgatto
Member

lgatto commented Jan 24, 2018

I suppose it depends whether the modifications can co-occur or whether they compete. I would say that it is the user's responsibility to make sure that sites undergo only a single modification (whether variable or fixed); if > 1 modifications are provided, I think we should consider all possibilities: mod1 only, mod2 only, mod1 and 2, or none, if both are variable.

I could ask in the lab if this is an issue in practice.

@pavel-shliaha

pavel-shliaha commented Jan 25, 2018

I think that 2 modifications can co-exist in principle (chemically), but I have never seen 2 modifications reported on the same residue. I think if you want them both then just create a new modification that contains both of them. Maybe make a function that combines them. And fixed modification should beat variable modification. This is just my opinion of course

The only real example I can think of is trimethylation of lysine. There are 3 modifications.

  1. K methyl 14.015650
  2. K dimethyl 28.031300
  3. K trimethyl 42.046950

the mass of Kme3 modification is exactly identical to Kme2 + Kme1, however I have never seen an identification with 2 modifications Kme2 + Kme1. All modifications Kme3 are reported as Kme3.
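The arithmetic behind this example can be checked directly, using only the methyl delta quoted in the list above:

```r
methyl <- 14.015650
kMods <- c(Kme1 = methyl, Kme2 = 2 * methyl, Kme3 = 3 * methyl)
kMods

# Kme2 + Kme1 on one residue is indistinguishable by mass from Kme3.
isTRUE(all.equal(kMods[["Kme1"]] + kMods[["Kme2"]], kMods[["Kme3"]]))
```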

@pavel-shliaha

pavel-shliaha commented Jan 25, 2018

I agree you should store NL as a column of a dataframe.

@pavel-shliaha

See below my email exchange with people who work on simultaneous modifications in Mascot. They say Mascot does not put 2 modifications on the same residue.


Dear Pavel,

No, Mascot will never suggest two simultaneous modifications on the same residue. In such a case it would try to allocate one of the modifications to a different residue if another possible target is present in the peptide. If you expect to see this, you should specify it as a separate modification.

Best,
Tina


From: Pavel V. Shliaha
Sent: 25. januar 2018 13:42
To: Tina Nybo; Adelina Rogowska-Wrzesinska
Subject: 2 modifications on the same residue

Dear Tina and Adelina,

I know you work with some very weird modifications in oxidation field and hence I wanted to ask if you have ever come across residues that could be modified in 2 places, e.g. oxidation + chlorination. If so how do you handle that in a database search. Do you specify modifications separately as dynamic and mascot knows a combination on a single residue is possible or do you create a new modification that contains both the (say chlorination + oxidation?)

Pavel

@lgatto
Member

lgatto commented Jan 25, 2018

So, the bottom line is that search engines don't seem to support multiple modifications on the same residue. I suggest that, if such a case arises, we calculate the masses for

  • mod1 only,
  • mod2 only,
  • none, if both are variable,
  • mod1 + mod2, but return a message/warning that two modifications appear on the same amino acid.

I don't like the idea that a user has to create a new virtual modification composed of two individual ones. As search engines won't support this, a warning or message should then inform the user.

And to follow up from Pavel's example, trimethylation would be a single modification, of course.
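The four cases listed above could be enumerated along these lines for a single site with two variable modifications. The delta masses are placeholders and the column names are hypothetical.

```r
# Placeholder delta masses for two variable modifications on the same site.
deltas <- c(mod1 = 10, mod2 = 20)

# All combinations: none, mod1 only, mod2 only, mod1 + mod2.
states <- expand.grid(mod1 = c(FALSE, TRUE), mod2 = c(FALSE, TRUE))
states$delta <- states$mod1 * deltas[["mod1"]] + states$mod2 * deltas[["mod2"]]
states$coOccur <- states$mod1 & states$mod2

if (any(states$coOccur)) {
    message("Two modifications appear on the same amino acid in ",
            sum(states$coOccur), " combination(s).")
}

states
```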

@pavel-shliaha

@sgibb and @lgatto just a quick opinion from a more top-down perspective

  1. @sgibb could you please provide an output you imagine your function will give when you submit a sequence and a variable modification.

unimod:::.mass("KKK",
               varModifications=data.frame(
                    Id=c("acetyl"),
                    Site=c("K"),
                    MonoMass=c(42),
                    stringsAsFactors=FALSE))

will it be a vector of all possible permutations of K modification masses, i.e. singly, doubly and triply acetylated? I can suggest 3 different proteoforms with identical mass: KacKK, KKacK and KKKac. How will this be reflected if the output is just the monoisotopic mass? (please let me know if you are open to suggestions on these points)

  2. As a top-down person can I suggest you create an additional column which by default is all. I sometimes want a fixed/variable modification not on all K, but only on a particular one. E.g. I know the first K is acetylated but the 2nd one can only be trimethylated.

  3. Just to let you know: the vast majority of modifications cannot co-exist with others. The reason for this is simple: a modification needs certain physico-chemical properties to be attached to an amino acid. However, other modifications to this residue destroy these properties. Given that co-existence is extremely rare I suggest not to calculate co-existing modifications by default, but perhaps to provide an interface that allows the user to say which modifications can co-exist.

Let's assume there are 5 K (KKKKK) residues, each of which can be mono-, di-, trimethylated and acetylated. The MS1 mass tells us there are 3 methylations and 1 acetylation. Even without co-existing modifications we already have a huge space of possibilities of proteoform combinations, e.g. KmeKmeKmeKacK and Kme3KacKKK and so on. If you do consider that all can co-exist the number of combinations becomes almost infinite.
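To get a feeling for the size of that space, the KKKKK example can be enumerated by brute force. This is only a sketch under the stated assumption that each K carries at most one modification (none, me, me2, me3 or ac); the variable names are made up.

```r
# Per-residue states and the number of methyl/acetyl groups each one carries.
states <- data.frame(label = c("K", "Kme", "Kme2", "Kme3", "Kac"),
                     methyl = c(0, 1, 2, 3, 0),
                     acetyl = c(0, 0, 0, 0, 1),
                     stringsAsFactors = FALSE)

# All 5^5 assignments of one state to each of the five lysines.
grid <- as.matrix(expand.grid(rep(list(seq_len(nrow(states))), 5L)))
methyl <- rowSums(matrix(states$methyl[grid], nrow = nrow(grid)))
acetyl <- rowSums(matrix(states$acetyl[grid], nrow = nrow(grid)))

# Keep assignments matching the MS1 evidence: 3 methyl groups, 1 acetyl group.
keep <- methyl == 3 & acetyl == 1
proteoforms <- apply(grid[keep, , drop = FALSE], 1L,
                     function(i) paste(states$label[i], collapse = ""))

length(proteoforms)
```

Under these assumptions 100 of the 3125 assignments match (5 positions for the acetyl group times 20 ways to distribute three methyl groups over the remaining four lysines), and all of them share the same monoisotopic mass.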

@sgibb
Member Author

sgibb commented Feb 14, 2018

@adder, @lgatto and @pavel-shliaha thanks for your great input and sorry for the delayed answer.

First I have to admit that my understanding of fixed/variable modifications was quite different. So to have everybody on the same page I would define the terms now as follows:

fixed: modification that is always present, could have two characteristics:

  1. all: each residue has the same modification, e.g. KmeKmeKme
  2. specific: just a few residues or a single residue at a specific position was modified and the position is known a priori, e.g. methyl at K1: KmeKK, or methyl at K1 and K2: KmeKmeK

variable: modification could happen at none, one, multiple or all residues without knowing the position a priori.

Currently unimod just supports fixed/all. I am going to implement fixed/specific next.

@sgibb could you please provide an output you imagine your function will give when you submit a sequence and a variable modification.

unimod:::.mass("KKK",
               varModifications=data.frame(
                    Id=c("acetyl"),
                    Site=c("K"),
                    MonoMass=c(42),
                    stringsAsFactors=FALSE))

Current output would be:

mass sequence modifications
510.3166 KKK Acetyl:K

(because it is KacKacKac)

will it be a vector of all possible permutations of K modification masses, i.e. singly, doubly and triply acetylated? I can suggest 3 different proteoforms with identical mass KacKK, KKacK and KKKac. How will this be reflected if the output is just monoisotopic mass? (please let me know if you are open to suggestions on these points)

As you assumed currently I just return the monoisotopic mass. So if fixed/specific modifications are available the output would be 426 for KacKK, KKacK and KKKac.
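A small sketch of why these three proteoforms collapse into one value: enumerating every placement of acetyl on KKK gives eight proteoforms but only four distinct masses. It uses the monoisotopic lysine residue mass and, like the output above, adds no terminal water; the variable names are hypothetical.

```r
lysine <- 128.094963   # monoisotopic residue mass of K
acetyl <- 42.010565    # monoisotopic delta mass of Acetyl

# Every K is either unmodified (FALSE) or acetylated (TRUE): 2^3 placements.
placements <- expand.grid(K1 = c(FALSE, TRUE), K2 = c(FALSE, TRUE),
                          K3 = c(FALSE, TRUE))
placements$mass <- 3 * lysine + rowSums(placements) * acetyl

# Number of proteoforms per distinct mass.
table(round(placements$mass, 4))
```

The fully acetylated form reproduces the 510.3166 shown above, and the three singly acetylated forms all land on about 426.2955.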

Of course I am open for suggestions and discussions.

As a top-down person can I suggest you create an additional column which by default is all. I sometimes want a fixed/variable modification not on all K, but only on a particular one. E.g. I know the first K is acetylated but the 2nd one can only be trimethylated.

I think that is what I want to provide with the fixed/specific method.

Just to let you know: the vast majority of modifications cannot co-exist with others. The reason for this is simple: a modification needs certain physico-chemical properties to be attached to an amino acid. However, other modifications to this residue destroy these properties. Given that co-existence is extremely rare I suggest not to calculate co-existing modifications by default, but perhaps to provide an interface that allows the user to say which modifications can co-exist.

Good suggestion. That would be easier to implement.

Let's assume there are 5 K (KKKKK) residues, each of which can be mono-, di-, trimethylated and acetylated. The MS1 mass tells us there are 3 methylations and 1 acetylation. Even without co-existing modifications we already have a huge space of possibilities of proteoform combinations, e.g. KmeKmeKmeKacK and Kme3KacKKK and so on. If you do consider that all can co-exist the number of combinations becomes almost infinite.

I see your point. With the current implementation it doesn't matter because it will only return the monoisotopic mass. But I guess that is not the information you are interested in, is it?
