Add cui2vec embeddings #25

souravsingh · 2018-04-06T17:31:05Z

The embeddings for over 100k medical concepts, built using data from 60 million patients, 1.7 million journal articles and 20 million notes, are up and available here: https://figshare.com/s/00d69861786cd0156d81

An explorer is available here: http://ec2-52-14-191-192.us-east-2.compute.amazonaws.com:1234/

piskvorky · 2018-04-07T05:51:53Z

Nice find!

menshikh-iv · 2018-04-07T05:55:24Z

Additional information:

license: CC BY 4.0
paper: https://arxiv.org/abs/1804.01486

beamandrew · 2018-04-16T15:22:58Z

Hey this is my paper, how cool! I'd be happy to contribute these, let me know if they need any clean up first.

menshikh-iv · 2018-04-16T16:01:16Z

Oh, hi @beamandrew, glad to see you here! Please follow the instructions at https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model

beamandrew · 2018-04-16T16:39:53Z

Will do! It might be a couple weeks until I can get it together. I'm teaching a deep learning class right now that won't end until May which keeps me pretty busy.

I'm actually having them use the embeddings from this repo in class to build an RNN (which is how I ended up finding this issue).

You can check it out here if you're interested:
https://colab.research.google.com/drive/1JsdhsiJQP5JPEEGWWFtOMpQajBj4w1KA

menshikh-iv · 2018-04-17T04:36:59Z

@beamandrew can you please give read access to [email protected]? (I can't open your link, lack of permissions.)

beamandrew · 2018-04-17T11:24:17Z

Oops, try this link which should let you view: https://drive.google.com/file/d/1WuoHWf1KyFsNiilbVa7qnKkSDALfch01/view?usp=sharing

matanox · 2018-05-22T09:23:10Z

Last I checked, the actual concept names aren't included in this dataset and/or aren't under the same license, but they are available from a different source which looks legitimately released. I have, in fact, a task to correlate them. Without this correlation, the embeddings discussed here are keyed by arbitrary codes instead of the original (concept) words that you see in the online demo.

hscells · 2018-05-23T06:51:50Z

I currently have some data, from the author of this publication (Section 2), that will allow for the mapping @matanster describes.

If anyone is interested, I can upload a link to it, as I sit next to the author and he has given his permission (@jimmyoentung).

piskvorky · 2018-05-23T08:58:03Z

Thanks guys.

What we want is for users who download this dataset to be able to use it easily.

If the dataset requires users to jump through hoops, it's not a good fit for gensim-data. The experience of applying / using a dataset has to be streamlined and intuitive, including access and code (not just data). That is why we created this repo, and it's a mandatory part of each new contribution.

@hscells and @matanster what does this extra step mean for users? Can we somehow integrate it directly, so it's transparent to people who want to use cui2vec? Is it necessary?

hscells · 2018-05-23T22:38:01Z

The CUI in cui2vec stands for Concept Unique Identifier. A CUI is a single identifier shared by all of the synonymous strings for a particular medical concept.

The dataset I described in my comment maps each CUI to its most commonly used string in the UMLS Metathesaurus. One can simply replace the CUIs in the pre-trained vector file with terms from this mapping file (although I believe not all CUIs are mapped, because the strings in this particular dataset were filtered by semantic type).

One can use QuickUMLS or MetaMap to map a term to a CUI and then, using the method described above, map the CUI to the most commonly used term in UMLS or MetaMap.

I'm not exactly sure how the demo in the OP maps CUIs to strings, but I believe this is most likely how it is done. As for how it could be integrated, @piskvorky: the original data could be modified, or the mapping could be performed in a separate step. However, as I said, because the relationship between a CUI and the strings associated with that concept is one-to-many, this mapping would preferably be performed as two separate steps.
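
A minimal sketch of keeping the mapping as a separate step, as described above: similarity queries stay in CUI space, and readable labels are attached afterwards. The mapping contents, the CUIs, the scores, and the names `CUI_TO_TERMS`, `preferred_term` and `label_neighbours` are all placeholders made up for illustration; they are not real UMLS data or part of any library.

```python
from typing import Dict, List, Optional, Tuple

# Placeholder data: each CUI maps to one or more strings (one-to-many),
# with the assumed most commonly used string listed first.
CUI_TO_TERMS: Dict[str, List[str]] = {
    "C9999991": ["preferred term one", "synonym one"],
    "C9999992": ["preferred term two"],
}


def preferred_term(cui: str) -> Optional[str]:
    """Return the first (assumed most common) string for a CUI, or None if unmapped."""
    terms = CUI_TO_TERMS.get(cui)
    return terms[0] if terms else None


def label_neighbours(
    neighbours: List[Tuple[str, float]],
) -> List[Tuple[str, str, float]]:
    """Attach a readable label to (cui, score) pairs, falling back to the raw CUI."""
    return [(cui, preferred_term(cui) or cui, score) for cui, score in neighbours]


# Step 1: a similarity query result expressed in CUIs (scores are made up).
neighbours = [("C9999992", 0.91), ("C9999993", 0.85)]

# Step 2: label the result; unmapped CUIs keep their code as the label.
labelled = label_neighbours(neighbours)
print(labelled)
```

Keeping the vectors keyed by CUI means that CUIs missing from the mapping file remain usable, and the choice of which string to display can change without rewriting the vector file.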

piskvorky · 2018-05-24T07:40:39Z

No problem, as long as the process is clearly described to users, and the dataset is ready to use out of the box.

juancq · 2018-08-06T01:53:23Z

Just curious, any progress on this issue?

andresrosso · 2018-12-05T00:54:42Z

Hi, does anybody know if the 'cui2vec' dataset is available?
@souravsingh shared the vectors as CSV, but I don't know how to load them in gensim and start using them.
Can anyone help me, or tell me when the dataset will be ready?

andresrosso · 2018-12-05T00:56:20Z

> The embeddings for over 100k medical concepts, built using data from 60 million patients, 1.7 million journal articles and 20 million notes, are up and available here: https://figshare.com/s/00d69861786cd0156d81
>
> An explorer is available here: http://ec2-52-14-191-192.us-east-2.compute.amazonaws.com:1234/

@souravsingh can I load the CSV in gensim?

Can you tell me how to do that?

beamandrew · 2018-12-05T01:00:07Z

Hi everyone,

I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

menshikh-iv · 2018-12-14T11:08:41Z

@juancq @andresrosso sorry for the wait, I can't say when this will be added.
BTW, you can always load it manually (without api.load, just read the file from disk or S3).
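
A minimal sketch of loading the vectors manually from disk, using only the standard library. It assumes the CSV has a header row and the CUI in the first column; the column names, CUIs and values in the in-memory sample are made up, and `read_vectors`, `cosine` and `most_similar` are hypothetical helper names, not gensim functions.

```python
import csv
import io
import math


def read_vectors(handle):
    """Read a CSV with a header row and the CUI in the first column into a dict."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return {row[0]: [float(x) for x in row[1:]] for row in reader if row}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(vectors, cui, topn=3):
    """Return the topn (cui, similarity) pairs closest to the given CUI."""
    query = vectors[cui]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != cui]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Tiny in-memory stand-in for the real file; all values are made up.
sample = io.StringIO(",V1,V2\nC9999991,1.0,0.0\nC9999992,0.9,0.1\nC9999993,0.0,1.0\n")
vectors = read_vectors(sample)
top = most_similar(vectors, "C9999991", topn=1)
print(top)
```

For the real file, pass `open(path, newline="")` in place of the `io.StringIO` sample. A pure-Python scan like this is slow for a vocabulary of 100k concepts, so it is only meant to show the idea.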

menshikh-iv · 2018-12-14T11:10:43Z

@beamandrew great, thanks!

prabhatM · 2019-01-20T07:37:47Z

Is there any model using SNOMED CT data?

Dhanachandra · 2019-03-18T08:15:21Z

> Hi everyone,
>
> I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

Please share the source code for the evaluation metrics used in this work. I would like to evaluate my own embedding trained on EHRs. Thanks in advance.

kaushikacharya · 2019-09-25T14:11:06Z

> Hi, does anybody know if the 'cui2vec' dataset is available?
> @souravsingh shared the vectors as CSV, but I don't know how to load them in gensim and start using them.
> Can anyone help me, or tell me when the dataset will be ready?

@andresrosso
Here are the steps for loading cui2vec in gensim:

1. Download the pre-trained embeddings from the download URL given at http://cui2vec.dbmi.hms.harvard.edu/
2. Dump the embeddings into a text file in word2vec format, in these two steps:

   a. Load the CSV into a pandas DataFrame.

      import pandas as pd
      import numpy as np

      # The first column holds the CUIs, so use it as the index.
      cui2vec_df = pd.read_csv('cui2vec_pretrained.csv', index_col=0)

   b. Dump the embeddings (loaded in the DataFrame) into a text file.

      # word2vec text format: a "<vocab size> <vector size>" header line,
      # then one "<CUI> <v1> <v2> ..." line per concept.
      n_dims = len(cui2vec_df.columns)
      header = "{} {}".format(len(cui2vec_df), n_dims)
      fmt = ["%s"] + ["%.18e"] * n_dims
      np.savetxt('cui2vec_pretrained.txt', cui2vec_df.reset_index().values,
                 delimiter=" ", header=header, comments="", fmt=fmt)

3. Load the word vectors using gensim.models.keyedvectors.KeyedVectors.

      from gensim.models.keyedvectors import KeyedVectors

      word_vectors = KeyedVectors.load_word2vec_format('cui2vec_pretrained.txt', binary=False)

      # An example: nearest neighbours of one CUI
      word_vectors.most_similar('C0034079')

Source: https://stackoverflow.com/questions/46297740/how-to-turn-embeddings-loaded-in-a-pandas-dataframe-into-a-gensim-model (Ken Syme's answer)
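
For readers without pandas, the same conversion can be sketched with the standard library alone. This is only an illustration of the word2vec text layout that the np.savetxt call above produces (a "count dimensions" header line, then one space-separated line per CUI); it assumes the CSV has a header row and the CUI in the first column, the sample values are made up, and `csv_to_word2vec_text` is a hypothetical helper name.

```python
import csv
import io


def csv_to_word2vec_text(src, dst):
    """Convert a CSV (header row, CUI in first column) into word2vec text format."""
    rows = [row for row in csv.reader(src) if row][1:]  # drop the header row
    dims = len(rows[0]) - 1 if rows else 0
    # First line of the word2vec text format: vocabulary size and vector size.
    dst.write("{} {}\n".format(len(rows), dims))
    for row in rows:
        # Each following line: the key (CUI), then its vector components.
        dst.write(" ".join(row) + "\n")


# Tiny in-memory stand-in for the real CSV; CUIs and values are made up.
sample = io.StringIO(",V1,V2,V3\nC9999991,0.1,0.2,0.3\nC9999992,0.4,0.5,0.6\n")
out = io.StringIO()
csv_to_word2vec_text(sample, out)
print(out.getvalue())
```

With real files, open the source with `open(path, newline="")` and the destination in text write mode; the result can then be passed to `KeyedVectors.load_word2vec_format(..., binary=False)` as in the steps above.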

andresrosso · 2019-09-25T15:32:37Z

Great work, thanks a lot.

kaushikacharya mentioned this issue Dec 16, 2020

Lookup dictionary for pretrained embedding beamandrew/cui2vec#3

Open
