Modification object #6
Comments
I think it depends a bit on the use cases you envision. I am not convinced that several classes are really necessary here, and a data.frame may well be enough. There could be a helper function to create new modifications as new rows in the data.frame.
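A minimal sketch of such a helper, assuming a flat data.frame with one row per modification/site pair. The function name addModification and the reduced column set are hypothetical and not part of the package:

```r
# Hypothetical flat table: one row per modification/site combination.
modifications <- data.frame(
  id = c(1, 1),
  name = c("Acetyl", "Acetyl"),
  monoMass = c(42.010565, 42.010565),
  site = c("K", "N-term"),
  position = c("Anywhere", "Any N-term"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: append a user-defined modification as new rows,
# one row per site (shorter arguments are recycled by data.frame()).
addModification <- function(d, id, name, monoMass, site,
                            position = "Anywhere") {
  new <- data.frame(id = id, name = name, monoMass = monoMass,
                    site = site, position = position,
                    stringsAsFactors = FALSE)
  rbind(d, new)
}

modifications <- addModification(modifications, id = NA_real_,
                                 name = "MyModification",
                                 monoMass = 57, site = c("C", "D"))
nrow(modifications)
```

With this layout a user-defined modification is just more rows, so subset() and friends keep working without any class-specific methods.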
Maybe you are right and we should not overcomplicate things. We could avoid a class completely. I converted all the unimod entries into a data.frame:
> head(d)
# id name description lastModified approved avgMass monoMass
# 1 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.1 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.2 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.3 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.4 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.5 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# composition site position classification hidden group
# 1 H(2)C(2)O(1) K Anywhere Multiple FALSE 1
# 1.1 H(2)C(2)O(1) N-term Any N-term Multiple FALSE 2
# 1.2 H(2)C(2)O(1) C Anywhere Post-translational TRUE 3
# 1.3 H(2)C(2)O(1) S Anywhere Post-translational TRUE 4
# 1.4 H(2)C(2)O(1) N-term Protein N-term Post-translational FALSE 5
# 1.5 H(2)C(2)O(1) T Anywhere Post-translational TRUE 6
> dim(d)
# [1] 2370 13
> print(object.size(d), units="Kb")
# 908.1 Kb
By converting some of the columns into … Instead of a …
I think it's good to keep things as simple as possible, at least in a first stage. If necessary, it's possible to encapsulate the data in a class if the need becomes clear.
There are three data sets now:
library("unimod")
data("elements")
head(elements)
data("aminoacids")
head(aminoacids)
data("modifications")
head(modifications)
We could turn each of these …
print(object.size(modifications), units="KB")
# 725.4 Kb
Currently … The … Do you like these …
For anything that is like a … Yes, I would suggest using the data in …
While the …
library("unimod")
data("modifications")
subset(modifications, Id %in% c(1, 4, 765) & Classification != "Artefact")
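While we could use the … Here you could see the implementation in …
# Unimod 1 (Acetyl): add the acetyl mass to every fragment whose sequence
# is a prefix of the peptide s, i.e. to the N-terminal fragments.
.unimod1 <- function(x, s) {
  i <- startsWith(s, x$seq)
  x$mz[i] <- x$mz[i] + 42.010565
  x
}
# Unimod 4 (Carbamidomethyl): add the mass once to every fragment that
# contains at least one C or U.
.unimod4 <- function(x) {
  iCU <- grep("C|U", x$seq)
  x$mz[iCU] <- x$mz[iCU] + 57.021464
  x
}
# Unimod 765 (Met-loss): drop an N-terminal M followed by A, C, G, P, S, T or V.
.unimod765 <- function(x) {
  gsub("^M([ACGPSTV])", "\\1", x)
}
All these rules could be written in regular expressions (one for finding the … Unfortunately, sometimes the rule could not be predicted from the …
But often the notes just contain less useful information (that's why the notes …
I would like to have an interface like … What we could do: write the regular expressions for the 3-10 most often used … Alternatively we just provide the modification …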
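To illustrate the idea of expressing rules as regular expressions, here is a rough sketch. The rules table and the helpers countMatches, massShift and metLoss are hypothetical names, not functions of the package. Unlike .unimod4 above, this sketch counts every matching residue rather than adding the mass once per sequence:

```r
# Hypothetical rule table: one regular expression per modification/site.
rules <- data.frame(
  name = c("Acetyl:N-term", "Carbamidomethyl:C"),
  pattern = c("^.", "C"),
  monoMass = c(42.010565, 57.021464),
  stringsAsFactors = FALSE
)

# Count regex matches per sequence (gregexpr returns -1 for "no match").
countMatches <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0L), integer(1))
}

# Total mass shift contributed by all rules, per sequence.
massShift <- function(x, rules) {
  shift <- numeric(length(x))
  for (i in seq_len(nrow(rules))) {
    shift <- shift + countMatches(rules$pattern[i], x) * rules$monoMass[i]
  }
  shift
}

# Sequence-rewriting rule, same pattern as .unimod765 (Met-loss).
metLoss <- function(x) sub("^M([ACGPSTV])", "\\1", x)

massShift(metLoss(c("MACE", "CDE")), rules)
```

A table like this keeps the frequently used rules in data rather than in one function per unimod id, which may make it easier for users to add their own.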
I like this approach because it makes the package useful for what you need right now without overwhelming you with tons of unnecessary stuff, but allows users to extend it or ask for useful extensions.
I implemented the first prototype of a function to calculate the mass for peptides and allow fixed custom and unimod modifications (the unimod modifications are specified as short name, colon, site):
library("unimod")
unimod:::.mass("MACE",
fixedModifications=c("Acetyl:N-term",
"Carbamidomethyl:C"))
# [1] 533.1614
# attr(,"sequence")
# [1] "MACE"
unimod:::.mass("MACE",
fixedModifications=c("Met-loss:P-M",
"Acetyl:N-term",
"Carbamidomethyl:C"))
# [1] 402.1209
# attr(,"sequence")
# [1] "ACE"
unimod:::.mass(c("ACE", "MACE", "CDE"),
fixedModifications=c("Met-loss:P-M",
"Acetyl:N-term",
"Carbamidomethyl:C"))
# [1] 402.1209 402.1209 446.1107
# attr(,"sequence")
# [1] "ACE" "ACE" "CDE"
unimod:::.mass(c("ACE", "MACE", "CDE"), fixedModifications="Unknown:420:N-term")
# [1] 723.1397 854.1802 767.1296
# attr(,"sequence")
# [1] "ACE" "MACE" "CDE"
#
# Applying the default rule for the modification: Unknown:420:N-term
# Please create an issue on: https://github.com/ComputationalProteomicsUnit/unimod/issues/new
# to let us implement the correct rule or if the default one is already correct we could remove
# this message.
unimod:::.mass(c("ACE", "MACE", "CDE"),
fixedModifications=data.frame(
Id=c("MyModification1",
"MyModification2"),
Site=c("C", "D"),
MonoMass=c(57, 58),
stringsAsFactors=FALSE))
# [1] 360.0889 491.1294 462.0787
# attr(,"sequence")
# [1] "ACE" "MACE" "CDE"
I am going to implement the variable modifications next. @pavel-shliaha, @adder, @yafeng any suggestion for the interface? Currently I am thinking of an additional argument named … Does anyone have a good suggestion for a name for this function? I don't like names that contain more or less useless verbs …
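As an illustration of the "short name, colon, site" convention used in the calls above, the following sketch splits such strings at the last colon, so that a name like "Unknown:420" stays intact. The function parseModification is hypothetical and only meant to show one possible way to parse the interface:

```r
# Hypothetical parser: split "Name:Site" at the last colon, so that names
# containing a colon themselves (e.g. "Unknown:420") are kept whole.
parseModification <- function(x) {
  data.frame(
    name = sub(":[^:]*$", "", x),  # everything before the last colon
    site = sub("^.*:", "", x),     # everything after the last colon
    stringsAsFactors = FALSE
  )
}

parsed <- parseModification(c("Acetyl:N-term", "Carbamidomethyl:C",
                              "Unknown:420:N-term", "Met-loss:P-M"))
parsed
```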
If you want …
Hey, the trickiest thing is probably specifying terminal modifications. Regarding the function name: if …
Ok, I was too slow with my comments :)
Let's not get into the CamelCase vs snake_case vs alllowercase debate ;-) Surely we all agree not to use ALLUPPERCASE.
I like … I am actually wondering what should happen if a variable modification and a fixed modification hit the same site.
I'm not an expert in these matters and I can't supply real biological examples right now, but I would say that both should be applied by default. A difficult case is when the variable modification blocks a fixed one; I guess an option … I'm not sure whether this is a problem in a real example, but what happens if you have two variable modifications that can be on the same site?
I suppose it depends on whether the modifications can co-occur or whether they compete. I would say that it is the user's responsibility to make sure that sites undergo only a single modification (whether variable or fixed); if more than one modification is provided, I think we should consider all possibilities: mod1 only, mod2 only, mod1 and mod2, or none, if both are variable. I could ask in the lab if this is an issue in practice.
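The four possibilities for two variable modifications on one site can be enumerated as in the sketch below. The names mod1 and mod2 and their mass shifts are placeholders chosen for illustration only:

```r
# Two hypothetical variable modifications competing for the same site.
mods <- c(mod1 = 42.010565, mod2 = 14.015650)

# All on/off combinations: none, mod1 only, mod2 only, both.
combos <- expand.grid(mod1 = c(FALSE, TRUE), mod2 = c(FALSE, TRUE))
combos$massShift <- combos$mod1 * mods[["mod1"]] +
  combos$mod2 * mods[["mod2"]]
combos
```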
I think that two modifications can co-exist in principle (chemically), but I have never seen two modifications reported on the same residue. I think if you want them both then just create a new modification that contains both of them. Maybe make a function that combines them. And a fixed modification should beat a variable modification. This is just my opinion, of course. The only real example I can think of is the methylation of lysine. There are three modifications (Kme1, Kme2 and Kme3).
The mass of the Kme3 modification is exactly identical to Kme2 + Kme1; however, I have never seen an identification with the two modifications Kme2 + Kme1. All Kme3 modifications are reported as Kme3.
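The mass identity mentioned here can be checked with a few lines of arithmetic. This sketch assumes that each methylation adds CH2 and uses the monoisotopic masses C = 12 and H = 1.00782503:

```r
# Monoisotopic mass added per methylation: CH2 (C = 12 by definition).
methyl <- 12 + 2 * 1.00782503

# Mono-, di- and trimethylation as multiples of the methyl shift.
kme <- c(Kme1 = 1, Kme2 = 2, Kme3 = 3) * methyl
kme

# Kme3 cannot be told apart from Kme1 + Kme2 by mass alone.
isTRUE(all.equal(kme[["Kme3"]], kme[["Kme1"]] + kme[["Kme2"]]))
```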
I agree you should store NL as a column of a data.frame.
See below my email exchange with people who work on simultaneous modifications in Mascot. They say Mascot does not put two modifications on the same residue.
Dear Pavel,
No, Mascot will never suggest two simultaneous modifications on the same residue. In such a case it would try to allocate one of the modifications to a different residue if another possible target is present in the peptide. If you expect to see this, you should specify it as a separate modification.
Best,
From: Pavel V. Shliaha [mailto:[email protected]]
Dear Tina and Adelina,
I know you work with some very weird modifications in the oxidation field and hence I wanted to ask if you have ever come across residues that could be modified in two places, e.g. oxidation + chlorination. If so, how do you handle that in a database search? Do you specify the modifications separately as dynamic, and Mascot knows a combination on a single residue is possible, or do you create a new modification that contains both (say, chlorination + oxidation)?
Pavel
So, the bottom line is that search engines don't seem to support multiple modifications on the same residue. I suggest that, if such a case arises, we calculate the masses for …
I don't like the idea that a user has to create a new virtual modification composed of two individual ones. As search engines won't support this, a warning or message should then inform the user. And to follow up on Pavel's example, trimethylation would be a single modification, of course.
@sgibb and @lgatto just a quick opinion from a more top-down perspective
For unimod:::.mass("KKK", …): will the output be a vector of all possible permutations of K modification masses, i.e. singly, doubly and triply acetylated? I can suggest 3 different proteoforms with identical mass: KacKK, KKacK and KKKac. How will this be reflected if the output is just the monoisotopic mass? (Please let me know if you are open to suggestions on these points.)
Let's assume there are 5 K residues (KKKKK), each of which can be mono-, di- or trimethylated, or acetylated. The MS1 mass tells us there are 3 methylations and 1 acetylation. Even without co-existing modifications we already have a huge space of possible proteoform combinations, e.g. KmeKmeKmeKacK and Kme3KacKKK and so on. If you do consider that all can co-exist, the number of combinations becomes almost infinite.
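To get a feeling for the size of this space, the following sketch enumerates all proteoforms of KKKKK that carry exactly 3 methyl groups and 1 acetyl group, assuming at most one modification per residue. The state table and labels are illustrative only:

```r
# Possible states of a single K, with the methyl/acetyl groups each carries.
states <- data.frame(
  label = c("K", "Kme", "Kme2", "Kme3", "Kac"),
  methyl = c(0, 1, 2, 3, 0),
  acetyl = c(0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# All 5^5 assignments of a state (row index) to each of the 5 residues.
m <- as.matrix(expand.grid(rep(list(seq_len(nrow(states))), 5)))

# Total number of methyl and acetyl groups per assignment.
nMethyl <- rowSums(matrix(states$methyl[m], nrow = nrow(m)))
nAcetyl <- rowSums(matrix(states$acetyl[m], nrow = nrow(m)))

# Keep assignments matching the MS1 evidence: 3 methylations, 1 acetylation.
keep <- nMethyl == 3 & nAcetyl == 1
proteoforms <- apply(m[keep, , drop = FALSE], 1,
                     function(i) paste(states$label[i], collapse = ""))
length(proteoforms)
head(proteoforms)
```

All of these candidates share the same monoisotopic mass, so a function that returns only the mass cannot distinguish them.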
@adder, @lgatto and @pavel-shliaha, thanks for your great input and sorry for the delayed answer. First I have to admit that my understanding of fixed/variable modifications was quite different. So, to have everybody on the same page, I would define the terms as follows:
fixed: a modification that is always present; it could have two characteristics: …
variable: a modification that could happen at none, one, multiple or all residues without knowing the position a priori.
Currently …
unimod:::.mass("KKK",
varModifications=data.frame(
Id=c("acetyl"),
Site=c("K"),
MonoMass=c(42),
stringsAsFactors=FALSE))
Current output would be: …
(because it is KacKacKac)
As you assumed, currently I just return the monoisotopic mass. So if fixed/specific modifications are available the output would be … Of course I am open to suggestions and discussions.
I think that is what I want to provide with the fixed/specific method.
Good suggestion. That would be easier to implement.
I see your point. With the current implementation it doesn't matter because it will only return the monoisotopic mass. But I guess that is not the information you are interested in, is it?
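A rough sketch of what a per-proteoform output for the KKK example could look like. It assumes the monoisotopic residue mass of K and the custom shift of 42 from the call above, and sums residue and modification masses only, which appears to be consistent with the outputs shown earlier in this thread. None of this is the package's actual implementation:

```r
# Assumptions: monoisotopic residue mass of K; custom "acetyl" shift of 42.
residueK <- 128.094963
acetyl <- 42

# Every K is either unmodified (FALSE) or acetylated (TRUE): 2^3 proteoforms.
sites <- expand.grid(K1 = c(FALSE, TRUE), K2 = c(FALSE, TRUE),
                     K3 = c(FALSE, TRUE))
nAcetyl <- rowSums(sites)

result <- data.frame(
  proteoform = apply(sites, 1, function(s) {
    paste0(ifelse(s, "Kac", "K"), collapse = "")
  }),
  mass = 3 * residueK + nAcetyl * acetyl,
  stringsAsFactors = FALSE
)
result[order(result$mass), ]

# Eight proteoforms collapse onto four distinct masses.
length(unique(result$mass))
```

Returning a table like this instead of a bare mass vector would keep KacKK, KKacK and KKKac visible as separate entries even though their masses are identical.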
The current Modification object is a simple class that contains mainly the unimod ID, the composition and the avg/mono mass of the modification as integer/double vectors of length 1. Additionally it contains a data.frame named specificity that stores information about the site and the position of the modification, e.g.: …
Some specificities have additional entries in the unimod database for neutral loss (#3). These entries have their own avg/mono mass (sometimes different from the general modification mass, e.g. Phosphorylation, id=21).
We could create a new class NeutralLoss that stores this information and could be attached to a specificity (which maybe should also be a class, so that we could handle different user-defined locations more easily; see #2). But before creating two new classes I would like to ask whether anyone has a better idea.
Maybe we overcomplicate things. Maybe a data.frame (with some duplicated entries in some columns) would fit and a complicated class hierarchy is just overkill.
Class hierarchy would be: … vs. a data.frame where all these slots would be columns.
There is the mzID package that has a complex class hierarchy and many classes but in fact just turns an mzIdentML file into a data.frame (nearly identical use case). I don't want to create classes just because it is possible. The user should benefit from them and should be allowed to create modifications for calculateFragments and other functions.
@lgatto do you have a better idea for the data structure?
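For comparison, a sketch of the flat data.frame alternative with the neutral loss stored as an extra column. The column names are hypothetical, and the masses (HPO3 for the modification, H3PO4 for the neutral loss on S/T) are given for illustration and should be checked against the unimod database:

```r
# Hypothetical flat layout: one row per specificity, with the neutral loss
# stored as an extra column (NA where no neutral loss is assumed).
phospho <- data.frame(
  id = 21,
  name = "Phospho",
  monoMass = 79.966331,                               # HPO3
  site = c("S", "T", "Y"),
  neutralLossMonoMass = c(97.976896, 97.976896, NA),  # H3PO4 on S/T
  stringsAsFactors = FALSE
)

# Specificities for which a neutral loss is recorded.
withLoss <- phospho[!is.na(phospho$neutralLossMonoMass),
                    c("site", "neutralLossMonoMass")]
withLoss
```

The id, name and monoMass values are duplicated across rows, which is the price of avoiding separate Modification, specificity and NeutralLoss classes.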