
Implement additional distance metrics #30

Open · 3 tasks done
sgibb opened this issue Dec 11, 2019 · 11 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@sgibb (Member) commented Dec 11, 2019

See also #29

  • Euclidean distance
  • Absolute value distance
  • Normalized spectral angle

See: https://doi.org/10.1016/1044-0305(94)87009-8 for Euclidean/Absolute value distance

@tnaake, @tobiasko: any others needed?
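A minimal sketch of how these three could look as pairwise functions on two aligned peak matrices (two columns, m/z and intensity, rows matched as compareSpectra() would pass them). The function names and the plain, unscaled forms are assumptions for illustration, not the final MsCoreUtils API; the normalized spectral angle follows the 1 - 2·θ/π mapping used e.g. by Toprak et al. 2014:

## Sketches only: x and y are two-column peak matrices (mz, intensity)
## with rows already aligned; unmatched peaks would carry NA intensities.
euclidean_dist <- function(x, y, na.rm = TRUE)
    sqrt(sum((x[, 2L] - y[, 2L])^2, na.rm = na.rm))

## Absolute value (city block) distance on the intensities.
absvalue_dist <- function(x, y, na.rm = TRUE)
    sum(abs(x[, 2L] - y[, 2L]), na.rm = na.rm)

## Normalized spectral angle: the angle between the intensity vectors,
## mapped from [0, pi/2] to a similarity in [0, 1].
spectra_angle <- function(x, y, na.rm = TRUE) {
    cossim <- sum(x[, 2L] * y[, 2L], na.rm = na.rm) /
        (sqrt(sum(x[, 2L]^2, na.rm = na.rm)) *
         sqrt(sum(y[, 2L]^2, na.rm = na.rm)))
    1 - 2 * acos(min(cossim, 1)) / pi
}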

sgibb added the "enhancement" and "help wanted" labels on Dec 11, 2019
@tobiasko commented:

> Normalized spectral angle

Maybe we should ask somebody from ProteomeTools what is used in practice?

@michaelwitting commented:

What about having something similar to the functions used by GNPS?
https://ccms-ucsd.github.io/GNPSDocumentation/massspecbackground/networkingtheory/
Could be nice for the metabolomics community.

@jorainer (Member) commented:

Nice idea @michaelwitting. Is there a publication, source code, or a reference implementation available?

@michaelwitting commented:

I'm not aware of any implementation in R yet.

@jmbadia (Contributor) commented Oct 13, 2020

We use cosine similarity as a measure to compare spectra in LC-MS/MS. Usually I apply the cosine() function from the lsa package, but now I have decided to use compareSpectra() with the following function:

## Stein and Scott-style peak weighting: m/z^m * intensity^n.
.weightxy <- function(x, y, m = 0, n = 0.5) {
    x^m * y^n
}

cosSim <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- .weightxy(x[, 1L], x[, 2L], m, n)
    wy <- .weightxy(y[, 1L], y[, 2L], m, n)
    ## Cosine similarity of the squared weighted intensities; with the
    ## defaults m = 0, n = 0.5 this reduces to the plain cosine of the
    ## raw intensities.
    sum(wx^2L * wy^2L, na.rm = na.rm) /
        (sqrt(sum(wx^4L, na.rm = na.rm)) * sqrt(sum(wy^4L, na.rm = na.rm)))
}

compareSpectra(data[1], data[2], FUN = cosSim)

I read in your pull request that the cosine similarity measure corresponds to the standard normalized dot product and that you are applying the other (literature) definition. So I have modified the code you are using to implement the cosine similarity measure. It seems to work (although I want to check it more thoroughly).
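If I read the code right, a quick sanity check one could run (hypothetical toy matrices; cosSim() is the function defined above): because cosSim() squares the weights, its result with exponents m and n should equal sqrt(ndotproduct()) with both exponents doubled:

library(MsCoreUtils)

## Toy peak matrices with identical m/z values (made-up data, rows matched).
x <- cbind(mz = c(100.1, 150.2, 200.3), intensity = c(10, 50, 100))
y <- cbind(mz = c(100.1, 150.2, 200.3), intensity = c(20, 40, 90))

## cosSim() squares the weights, so m = 0, n = 0.5 here corresponds to
## m = 0, n = 1 in ndotproduct().
all.equal(cosSim(x, y, m = 0, n = 0.5),
          sqrt(ndotproduct(x, y, m = 0, n = 1)))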

@jorainer (Member) commented:

Great @jmbadia! Feel free to make a pull request adding your function to MsCoreUtils (the "distance.R" file) - maybe calling it cosine (or ncosine if it uses the weighting function, @sgibb?). If so, please also include unit tests comparing against results you would get from a reference implementation.

@jmbadia (Contributor) commented Oct 15, 2020

Ok @jorainer. Happy to help.

@jmbadia (Contributor) commented Oct 19, 2020

@jorainer @sgibb, let me share what I have (related also to this pull request):

A) The Stein and Scott dot product is actually the conventional dot product (cosine) squared. Stein and Scott (1994) based their dot product measure on the technical manual of a commercial software package (The Finnigan Library Search Program, 1978). The Finnigan manual used a variable called purity = 1000 × dotproduct², and I guess Stein and Scott took over the formula without being aware of the square.
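In symbols (my restatement of this relationship, with $w_{x,i} = mz_{x,i}^{m}\, I_{x,i}^{n}$ denoting the weighted intensities as in Stein and Scott):

\[
\mathrm{ndotproduct}(x, y)
  = \frac{\bigl(\sum_i w_{x,i}\, w_{y,i}\bigr)^{2}}
         {\sum_i w_{x,i}^{2}\; \sum_i w_{y,i}^{2}}
  = \cos^{2}\theta,
\qquad
\mathrm{purity} = 1000 \cdot \cos^{2}\theta
\]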

B) The best spectra comparison algorithms in Stein and Scott (1994) are the dot product algorithms (squared or not) with m > 1 and n ~ 0.5 as optimal exponents. The point is that intensities must somehow be normalized across the mass range (otherwise the same n values give different relative weights), and I assume Stein et al. did it as Finnigan's manual describes ("[...] since the library unknown always has a base peak intensity of 1000", i.e. base peak of every spectrum = 1000). This normalization is not related to the so-called optional global/local normalization, described in the manual as a useful tool that "corrects for the differences that may exist between two spectra of the same compound acquired in different ways".

So, what now? I guess:
A) there is no point in adding a new ncos() function (it is √(ndotproduct()));
B) ndotproduct() should be modified;
C) intensities must be normalized (base peak = 1000) in order to fit the default m and n values.

Sorry if I am too direct here, but my wife is waiting for me for dinner...

@jorainer (Member) commented:

Thanks for the comprehensive description, @jmbadia!

> A) there is no point in adding a new ncos() function (it is √(ndotproduct()))

I agree. But then we should mention this in the documentation.

Regarding B) and C): whether or not the intensity values of a spectrum are normalized relative to the base peak intensity (whether that is set to 100 or 1000) seems to have no influence on the ndotproduct result:

a <- cbind(
    mz = c(74.03, 120.01, 122.02, 151.98, 153.99, 177.99, 195.02, 241.03),
    intensity = c(15800, 110400, 58100, 117900, 11100, 7700, 15300, 64400))
a_100 <- a
a_100[, 2] <- a[, 2] / max(a[, 2]) * 100
a_1000 <- a
a_1000[, 2] <- a[, 2] / max(a[, 2]) * 1000

b_100 <- cbind(
    mz = c(74.03, 120.01, 122.02, 151.98, 153.88, 177.99, 195.02, 241.03),
    intensity = c(3.47, 9.63, 8.36, 51.12, 7.75, 3.24, 10.15, 100))
b_1000 <- b_100
b_1000[, 2] <- b_100[, 2] / max(b_100[, 2]) * 1000

library(MsCoreUtils)

## One normalized against one not normalized.
ndotproduct(x = a, y = b_100)
[1] 0.7838264

## Both normalized to 100
ndotproduct(x = a_100, y = b_100)
[1] 0.7838264

## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000)
[1] 0.7838264

## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000)
[1] 0.7838264

## Repeat by changing m and n
ndotproduct(x = a, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457

## Both normalized to 100
ndotproduct(x = a_100, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457

## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457

## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457

It seems that, from a mathematical standpoint, the similarity calculation is independent of the scale of the intensity values, so I think there is no need to modify ndotproduct (please correct me if I did something completely wrong in my comparison above).
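One way to see why (my sketch of the reasoning, not part of the original comment): rescaling all intensities of spectrum x by a constant c multiplies each weight $w_{x,i} = mz_{x,i}^{m}\, I_{x,i}^{n}$ by $c^{n}$, and that factor cancels between numerator and denominator:

\[
\frac{\bigl(\sum_i c^{n} w_{x,i}\, w_{y,i}\bigr)^{2}}
     {\sum_i \bigl(c^{n} w_{x,i}\bigr)^{2}\; \sum_i w_{y,i}^{2}}
= \frac{c^{2n}\,\bigl(\sum_i w_{x,i}\, w_{y,i}\bigr)^{2}}
       {c^{2n}\,\sum_i w_{x,i}^{2}\; \sum_i w_{y,i}^{2}}
= \mathrm{ndotproduct}(x, y)
\]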

@jmbadia (Contributor) commented Oct 20, 2020

Hi @jorainer,

  • Clearly I was wrong on C). No intensity normalization is needed, even considering the mass weighting. Thanks for the detailed demo!
  • On B), I was trying to suggest changing the ndotproduct() function to fit the conventional definition of the dot product, but your last comment suggests the opposite to me: that you prefer not to modify ndotproduct() and that I should instead add the proper comment in the documentation. Am I right? Shall I proceed?

@jorainer (Member) commented:

> Clearly...

I have to thank you for raising this potential problem. It made me investigate, and now I'm confident that it is working. Things like this increase the trust in the package's functionality!

> On B)

We used the definition from Stein and Scott and cite their paper - thus I would like to keep the dotproduct function as described there. Yes, please add a comment in the documentation. Thanks!
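A possible wording for such a documentation note (a sketch only, not the final roxygen text in "distance.R"):

#' @note
#' `ndotproduct()` implements the dot product as defined in Stein and Scott
#' (1994), which corresponds to the *squared* conventional cosine similarity
#' (normalized dot product) of the weighted intensity vectors. The
#' conventional cosine similarity can thus be obtained as
#' `sqrt(ndotproduct(x, y, m, n))`.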
