Implement additional distance metrics #30
Normalized spectral angle. Maybe we should ask somebody from ProteomeTools what is used in practice?
What about having something similar to the functions used by GNPS?
Nice idea @michaelwitting. Is there a publication, source code, or reference implementation available?
I'm not aware of any implementation in R yet.
We use cosine similarity as a measure to compare spectra in LC-MS/MS. Usually I apply the cosine() function from the lsa package, but now I have decided to use compareSpectra() with the following function.
I read in your pull that the cosine similarity measure corresponds to the standard normalized dot product and that you are applying the other definition (the literature definition). So I have modified the code you are using to implement the cosine similarity measure. It seems to work (although I want to check it more thoroughly).
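The two definitions being contrasted here can be sketched as follows. This is a hypothetical Python illustration of the math (not the R code from the thread, and the function names are made up): the "literature" cosine similarity, and the squared form that the normalized dot product corresponds to.

```python
import math

def cosine_similarity(x, y):
    # Cosine similarity between two peak-matched intensity vectors
    # (the "literature" definition mentioned above).
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm

def normalized_dotproduct(x, y):
    # The "normalized dot product" form: the cosine squared.
    return cosine_similarity(x, y) ** 2
```

Since the cosine lies in [-1, 1], squaring it keeps the score in [0, 1] but compresses differences near 1.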
Great @jmbadia! Feel free to make a pull request adding your function.
Ok @jorainer. Happy to help. |
@jorainer @sgibb, let me share what I have (related also to this pull).

A) The Stein and Scott dot product is actually the conventional dot product (cosine) squared. Stein and Scott (1994) based their dot product measure on the technical manual of a commercial software package (The Finnigan Library Search Program, 1978). The Finnigan manual used a variable called purity = 1000 x dotproduct², and I guess Stein and Scott took the formula as given without being aware of the square.

B) The best spectra comparison algorithms in Stein and Scott (1994) are the dot product algorithms (squared or not) with m > 1 and n ~ 0.5 as the optimal exponents. The point is that intensities must somehow be normalized across the mass range (otherwise the same n values give different relative weights), and I assume Stein et al. did it as Finnigan's manual describes ("[...] since the library unknown always has a base peak intensity of 1000", i.e. the base peak of every spectrum = 1000). This normalization is not related to the so-called optional global/local normalization, described in the manual as a useful tool that "corrects for the differences that may exist between two spectra of the same compound acquired in different ways".

So, what now? I guess ...? Sorry if I am too direct here, but my wife is waiting for me for dinner...
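The two transformations discussed in A) and B) can be sketched separately: normalization to a fixed base-peak intensity, and the mass/intensity weighting with exponents m and n. This is a hedged Python illustration with hypothetical function names, not code from the package.

```python
def normalize_to_base_peak(intensities, base=1000.0):
    # Scale intensities so the most intense peak equals `base`
    # (1000 in the Finnigan manual; 100 in many other tools).
    top = max(intensities)
    return [i / top * base for i in intensities]

def weight_peaks(mz, intensities, m=0.0, n=0.5):
    # Stein & Scott style weighted intensity: w = mz^m * intensity^n.
    # Larger m emphasizes high-m/z peaks; n < 1 damps dominant peaks.
    return [(u ** m) * (i ** n) for u, i in zip(mz, intensities)]
```

The weighting is applied to each spectrum before the dot product is computed, which is why the choice of m and n changes the relative influence of large versus small peaks.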
Thanks for the comprehensive description @jmbadia !
I agree. But then we should mention this in the documentation. Regarding B) and C), whether or not the intensity values of a spectrum are normalized relative to the base peak intensity (whether that is set to 100 or 1000) seems to have no influence on the result:

a <- cbind(
mz = c(74.03, 120.01, 122.02, 151.98, 153.99, 177.99, 195.02, 241.03),
intensity = c(15800, 110400, 58100, 117900, 11100, 7700, 15300, 64400))
a_100 <- a
a_100[, 2] <- a[, 2] / max(a[, 2]) * 100
a_1000 <- a
a_1000[, 2] <- a[, 2] / max(a[, 2]) * 1000
b_100 <- cbind(
mz = c(74.03, 120.01, 122.02, 151.98, 153.88, 177.99, 195.02, 241.03),
intensity = c(3.47, 9.63, 8.36, 51.12, 7.75, 3.24, 10.15, 100))
b_1000 <- b_100
b_1000[, 2] <- b_100[, 2] / max(b_100[, 2]) * 1000
library(MsCoreUtils)
## One normalized against one not normalized.
ndotproduct(x = a, y = b_100)
[1] 0.7838264
## Both normalized to 100
ndotproduct(x = a_100, y = b_100)
[1] 0.7838264
## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000)
[1] 0.7838264
## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000)
[1] 0.7838264
## Repeat by changing m and n
ndotproduct(x = a, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457
## Both normalized to 100
ndotproduct(x = a_100, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457
## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457
## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457
It seems that from a mathematical standpoint the similarity calculation is independent of the scale of the intensity values, so I think there is no need to modify it.
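The scale invariance shown in the R session above can be checked independently. Below is a minimal Python sketch of the same formula, (Σ WxWy)² / (Σ Wx² · Σ Wy²) with W = mz^m · intensity^n; the function name is hypothetical. Scaling the intensities of either spectrum by a constant c multiplies both the numerator and the denominator by c^(2n), so the ratio is unchanged for any choice of m and n.

```python
def ndotproduct_sketch(x, y, m=0.0, n=0.5):
    # x, y: lists of (mz, intensity) pairs, assumed already peak-matched.
    wx = [(mz ** m) * (i ** n) for mz, i in x]
    wy = [(mz ** m) * (i ** n) for mz, i in y]
    num = sum(a * b for a, b in zip(wx, wy)) ** 2
    den = sum(a * a for a in wx) * sum(b * b for b in wy)
    return num / den
```

This matches the behavior seen above: the same score regardless of whether either spectrum is normalized to 100, 1000, or not at all.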
Hi @jorainer.

I have to thank you for raising this potential problem. It made me investigate, and now I'm confident that it is working. Things like that increase the trust in the package's functionality!
We used the definition from Stein and cite their paper, thus I would like to keep the dotproduct function as described there. Yes, please add a comment in the documentation. Thanks!
See also #29
See: https://doi.org/10.1016/1044-0305(94)87009-8 for Euclidean/Absolute value distance
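For reference, the Euclidean and absolute value distances compared in that paper can be sketched as follows (a hypothetical Python illustration assuming two already peak-matched intensity vectors, not code from any package):

```python
import math

def euclidean_distance(x, y):
    # Straight-line (L2) distance between matched intensity vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def absolute_value_distance(x, y):
    # Sum of absolute intensity differences (L1 distance).
    return sum(abs(a - b) for a, b in zip(x, y))
```

Unlike the normalized dot product, these are distances (0 for identical spectra, growing with dissimilarity), so intensities would typically be normalized to a common scale first.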
@tnaake, @tobiasko any others needed?