
Implement additional distance metrics #30

Open · 3 tasks done
sgibb opened this issue Dec 11, 2019 · 11 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@sgibb (Member) commented Dec 11, 2019

See also #29

  • Euclidean distance
  • Absolute value distance
  • Normalized spectral angle

See: https://doi.org/10.1016/1044-0305(94)87009-8 for Euclidean/Absolute value distance

@tnaake, @tobiasko: any others needed?
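A minimal sketch of how these three could look as pairwise functions on two aligned peak matrices (two columns, m/z and intensity, rows matched as compareSpectra() would pass them). The function names and the plain, unscaled forms are assumptions for illustration, not the final MsCoreUtils API; the normalized spectral angle follows the 1 - 2·θ/π mapping used e.g. by Toprak et al. 2014:

## Sketches only: x and y are two-column peak matrices (mz, intensity)
## with rows already aligned; unmatched peaks would carry NA intensities.
euclidean_dist <- function(x, y, na.rm = TRUE)
    sqrt(sum((x[, 2L] - y[, 2L])^2, na.rm = na.rm))

## Absolute value (city block) distance on the intensities.
absvalue_dist <- function(x, y, na.rm = TRUE)
    sum(abs(x[, 2L] - y[, 2L]), na.rm = na.rm)

## Normalized spectral angle: the angle between the intensity vectors,
## mapped from [0, pi/2] to a similarity in [0, 1].
spectra_angle <- function(x, y, na.rm = TRUE) {
    cossim <- sum(x[, 2L] * y[, 2L], na.rm = na.rm) /
        (sqrt(sum(x[, 2L]^2, na.rm = na.rm)) *
         sqrt(sum(y[, 2L]^2, na.rm = na.rm)))
    1 - 2 * acos(min(cossim, 1)) / pi
}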

sgibb added the "enhancement" and "help wanted" labels on Dec 11, 2019
@tobiasko commented:

> Normalized spectral angle

Maybe we should ask somebody from ProteomeTools what is used in practice?

@michaelwitting commented:

What about having something similar to the functions used by GNPS?
https://ccms-ucsd.github.io/GNPSDocumentation/massspecbackground/networkingtheory/
Could be nice for the metabolomics community.

@jorainer (Member) commented:

Nice idea @michaelwitting. Is there a publication, source code, or a reference implementation available?

@michaelwitting commented:

I'm not aware of any implementation in R yet.

@jmbadia (Contributor) commented Oct 13, 2020

We use cosine similarity as a measure to compare spectra in LC-MS/MS. Usually I apply the cosine() function from the lsa package, but now I have decided to use compareSpectra() with the following function:

## Stein and Scott-style peak weighting: m/z^m * intensity^n.
.weightxy <- function(x, y, m = 0, n = 0.5) {
    x^m * y^n
}

cosSim <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- .weightxy(x[, 1L], x[, 2L], m, n)
    wy <- .weightxy(y[, 1L], y[, 2L], m, n)
    ## Cosine similarity of the squared weighted intensities; with the
    ## defaults m = 0, n = 0.5 this reduces to the plain cosine of the
    ## raw intensities.
    sum(wx^2L * wy^2L, na.rm = na.rm) /
        (sqrt(sum(wx^4L, na.rm = na.rm)) * sqrt(sum(wy^4L, na.rm = na.rm)))
}

compareSpectra(data[1], data[2], FUN = cosSim)

I read in your pull request that the cosine similarity measure corresponds to the standard normalized dot product and that you are applying the other (literature) definition. So I have modified the code you are using to implement the cosine similarity measure. It seems to work (although I want to check it more thoroughly).
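If I read the code right, a quick sanity check one could run (hypothetical toy matrices; cosSim() is the function defined above): because cosSim() squares the weights, its result with exponents m and n should equal sqrt(ndotproduct()) with both exponents doubled:

library(MsCoreUtils)

## Toy peak matrices with identical m/z values (made-up data, rows matched).
x <- cbind(mz = c(100.1, 150.2, 200.3), intensity = c(10, 50, 100))
y <- cbind(mz = c(100.1, 150.2, 200.3), intensity = c(20, 40, 90))

## cosSim() squares the weights, so m = 0, n = 0.5 here corresponds to
## m = 0, n = 1 in ndotproduct().
all.equal(cosSim(x, y, m = 0, n = 0.5),
          sqrt(ndotproduct(x, y, m = 0, n = 1)))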

@jorainer (Member) commented:

Great @jmbadia! Feel free to make a pull request adding your function to MsCoreUtils (the "distance.R" file) - maybe calling it cosine (or ncosine if it uses the weighting function, @sgibb?). If so, please also include unit tests comparing against results you would get from a reference implementation.

@jmbadia (Contributor) commented Oct 15, 2020

Ok @jorainer. Happy to help.

@jmbadia (Contributor) commented Oct 19, 2020

@jorainer @sgibb, let me share what I have (related also to this pull request):

A) The Stein and Scott dot product is actually the conventional dot product (cosine) squared. Stein and Scott (1994) based their dot product measure on the technical manual of a commercial software package (The Finnigan Library Search Program, 1978). The Finnigan manual used a variable called purity = 1000 × dotproduct², and I guess Stein and Scott took over the formula without being aware of the square.
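In symbols (my restatement of this relationship, with $w_{x,i} = mz_{x,i}^{m}\, I_{x,i}^{n}$ denoting the weighted intensities as in Stein and Scott):

\[
\mathrm{ndotproduct}(x, y)
  = \frac{\bigl(\sum_i w_{x,i}\, w_{y,i}\bigr)^{2}}
         {\sum_i w_{x,i}^{2}\; \sum_i w_{y,i}^{2}}
  = \cos^{2}\theta,
\qquad
\mathrm{purity} = 1000 \cdot \cos^{2}\theta
\]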

B) The best spectra comparison algorithms in Stein and Scott (1994) are the dot product algorithms (squared or not) with m > 1 and n ~ 0.5 as optimal exponents. The point is that intensities must somehow be normalized across the mass range (otherwise the same n values give different relative weights), and I assume Stein et al. did it as Finnigan's manual describes ("[...] since the library unknown always has a base peak intensity of 1000", i.e. base peak of every spectrum = 1000). This normalization is not related to the so-called optional global/local normalization, described in the manual as a useful tool that "corrects for the differences that may exist between two spectra of the same compound acquired in different ways".

So, what now? I guess:
A) there is no point in adding a new ncos() function (it is √(ndotproduct()));
B) ndotproduct() should be modified;
C) intensities must be normalized (base peak = 1000) in order to fit the default m and n values.

Sorry if I am too direct here, but my wife is waiting for me for dinner...

@jorainer (Member) commented:

Thanks for the comprehensive description, @jmbadia!

> A) there is no point in adding a new ncos() function (it is √(ndotproduct()))

I agree. But then we should mention this in the documentation.

Regarding B) and C): whether or not the intensity values of a spectrum are normalized relative to the base peak intensity (whether that is set to 100 or 1000) seems to have no influence on the ndotproduct result:

a <- cbind(
    mz = c(74.03, 120.01, 122.02, 151.98, 153.99, 177.99, 195.02, 241.03),
    intensity = c(15800, 110400, 58100, 117900, 11100, 7700, 15300, 64400))
a_100 <- a
a_100[, 2] <- a[, 2] / max(a[, 2]) * 100
a_1000 <- a
a_1000[, 2] <- a[, 2] / max(a[, 2]) * 1000

b_100 <- cbind(
    mz = c(74.03, 120.01, 122.02, 151.98, 153.88, 177.99, 195.02, 241.03),
    intensity = c(3.47, 9.63, 8.36, 51.12, 7.75, 3.24, 10.15, 100))
b_1000 <- b_100
b_1000[, 2] <- b_100[, 2] / max(b_100[, 2]) * 1000

library(MsCoreUtils)

## One normalized against one not normalized.
ndotproduct(x = a, y = b_100)
[1] 0.7838264

## Both normalized to 100
ndotproduct(x = a_100, y = b_100)
[1] 0.7838264

## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000)
[1] 0.7838264

## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000)
[1] 0.7838264

## Repeat by changing m and n
ndotproduct(x = a, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457

## Both normalized to 100
ndotproduct(x = a_100, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457

## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457

## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457

It seems that, from a mathematical standpoint, the similarity calculation is independent of the scale of the intensity values, so I think there is no need to modify ndotproduct (please correct me if I did something completely wrong in my comparison above).
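One way to see why (my sketch of the reasoning, not part of the original comment): rescaling all intensities of spectrum x by a constant c multiplies each weight $w_{x,i} = mz_{x,i}^{m}\, I_{x,i}^{n}$ by $c^{n}$, and that factor cancels between numerator and denominator:

\[
\frac{\bigl(\sum_i c^{n} w_{x,i}\, w_{y,i}\bigr)^{2}}
     {\sum_i \bigl(c^{n} w_{x,i}\bigr)^{2}\; \sum_i w_{y,i}^{2}}
= \frac{c^{2n}\,\bigl(\sum_i w_{x,i}\, w_{y,i}\bigr)^{2}}
       {c^{2n}\,\sum_i w_{x,i}^{2}\; \sum_i w_{y,i}^{2}}
= \mathrm{ndotproduct}(x, y)
\]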

@jmbadia (Contributor) commented Oct 20, 2020

Hi @jorainer,

  • Clearly I was wrong on C). No intensity normalization is needed, even considering the mass weighting. Thanks for the detailed demo!
  • On B), I was trying to suggest changing the ndotproduct() function to fit the conventional definition of the dot product, but your last comment suggests the opposite to me: that you prefer not to modify ndotproduct() and that I should instead add the proper comment in the documentation. Am I right? Shall I proceed?

@jorainer (Member) commented:

> Clearly...

I have to thank you for raising this potential problem. It made me investigate, and now I'm confident that it is working. Things like this increase the trust in the package's functionality!

> On B)

We used the definition from Stein and Scott and cite their paper - thus I would like to keep the dotproduct function as described there. Yes, please add a comment in the documentation. Thanks!
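A possible wording for such a documentation note (a sketch only, not the final roxygen text in "distance.R"):

#' @note
#' `ndotproduct()` implements the dot product as defined in Stein and Scott
#' (1994), which corresponds to the *squared* conventional cosine similarity
#' (normalized dot product) of the weighted intensity vectors. The
#' conventional cosine similarity can thus be obtained as
#' `sqrt(ndotproduct(x, y, m, n))`.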
