Lookup dictionary for pretrained embedding #3

Open
victorconan opened this issue Oct 9, 2020 · 4 comments

@victorconan

Hi Andrew,

Do you have a lookup dictionary for the pretrained embeddings? In the embedding file, the "medical concepts" are in the format "CXXXX", and I am not sure whether they are ICD codes, procedure codes, or something else.

Thanks!

@reality

reality commented Nov 5, 2020

Hello Victor,

I have been looking into this work recently. I think the CUI mapping files and conversion scripts can be found in the embeddings repository: https://github.com/clinicalml/embeddings/tree/master/eval

Cheers

@kaushikacharya

> the "medical concepts" are in the format "CXXXX", and I am not sure whether they are ICD codes, procedure codes, or something else

These are UMLS Concept Unique Identifiers (CUIs).

Examples from https://arxiv.org/pdf/1804.01486.pdf:

Primary condition: premature infant (CUI: C0021294)
Comorbidity: bronchopulmonary dysplasia (CUI: C0006287)

UMLS CUIs can be browsed at https://uts.nlm.nih.gov/metathesaurus.html
(N.B. you need to register first).
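For an offline lookup dictionary, one option is to build a CUI-to-name map from the UMLS Metathesaurus file MRCONSO.RRF, which is available after registering for a UMLS license. The sketch below is only an illustration, not something from this repository: it assumes the standard pipe-delimited MRCONSO layout (CUI in field 0, language in field 1, string in field 14), and the helper name `load_cui_names` is hypothetical. The sample row is synthetic; only the CUI and the concept name come from the example above.

```python
import io

def load_cui_names(lines, language="ENG"):
    """Map each CUI to the first string seen for it in the given language.

    Assumes MRCONSO.RRF-style rows: pipe-delimited, CUI at index 0,
    language (LAT) at index 1, concept string (STR) at index 14.
    """
    names = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 15:
            continue  # skip malformed or truncated rows
        cui, lat, text = fields[0], fields[1], fields[14]
        if lat == language and cui not in names:
            names[cui] = text
    return names

# Synthetic row for illustration; every identifier except the CUI is a placeholder.
sample = io.StringIO(
    "C0006287|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X000|"
    "Bronchopulmonary Dysplasia|0|N||\n"
)
cui_names = load_cui_names(sample)
print(cui_names.get("C0006287"))
```

With a real file you would pass an open handle instead, for example `open("MRCONSO.RRF", encoding="utf-8")`, where the path depends on your local UMLS installation.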

@KrishnaPG

I came across this post while looking for information on the meaning of the columns in the cui2vec_pretrained.csv file. The columns are named v1, v2 ... v500. Where can we find out what these 500 columns stand for?

If we were to load this CSV file into a database, what kind of schema should we create? (Or does it even make sense to load it into a database in the first place?) I have read the paper (https://arxiv.org/pdf/1804.01486.pdf) multiple times but could not find any information on the structure of this pretrained CSV file. Any help is greatly appreciated.

@kaushikacharya

kaushikacharya commented Dec 16, 2020

> The columns are named v1, v2 ... v500. Where can we find out what these 500 columns stand for?

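v1, ..., v500 are the components of the 500-dimensional embedding vector for each CUI. Each row of the file is one CUI, and the individual columns have no standalone meaning; only the vector as a whole (for example, its similarity to other vectors) is interpretable.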
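To make the layout concrete, here is a minimal sketch that reads such a file into a dictionary of vectors and compares two CUIs by cosine similarity. It assumes the first column holds the CUI and every remaining column is one vector component; check this against the header of your copy of cui2vec_pretrained.csv. The function names are hypothetical, and the three-dimensional values below are made up purely for illustration.

```python
import csv
import io
import math

def load_embeddings(handle):
    """Read rows of the form CUI, v1, v2, ... into {cui: [floats]}."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row (v1, v2, ...)
    return {row[0]: [float(x) for x in row[1:]] for row in reader if row}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-in for the real file: 3 dimensions instead of 500, made-up values.
toy_csv = io.StringIO(
    '"","V1","V2","V3"\n'
    '"C0021294",1.0,0.0,0.0\n'
    '"C0006287",0.0,2.0,0.0\n'
)
vectors = load_embeddings(toy_csv)
similarity = cosine_similarity(vectors["C0021294"], vectors["C0006287"])
```

For the real file, replace `toy_csv` with `open("cui2vec_pretrained.csv", newline="")`.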

Quoting Section 4.1 of the paper:

> The 500-dimensional word2vec style embeddings using the combined data are referred to as the cui2vec embeddings in all subsequent experiments.

Loading cui2vec:
You can use gensim as explained in piskvorky/gensim-data#25 (comment)

As a prerequisite, you should read about word embeddings, e.g. word2vec.
That will help you understand vector embeddings of text.
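On the database question raised earlier in the thread: since the 500 columns are only meaningful together, one possible schema is a single table keyed by CUI with the whole vector stored as one binary value, rather than 500 separate numeric columns. The sketch below uses SQLite and packs the vector as 32-bit floats. The table name, column names, and helper functions are all hypothetical, and this is just one reasonable design under those assumptions, not a recommendation from the authors.

```python
import sqlite3
from array import array

def create_store(conn):
    """Create a one-row-per-CUI table with the vector stored as a BLOB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cui_embedding ("
        "cui TEXT PRIMARY KEY, vector BLOB NOT NULL)"
    )

def put_vector(conn, cui, vector):
    """Pack the vector as 32-bit floats and insert or overwrite the row."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO cui_embedding (cui, vector) VALUES (?, ?)",
        (cui, blob),
    )

def get_vector(conn, cui):
    """Return the stored vector as a list of floats, or None if absent."""
    row = conn.execute(
        "SELECT vector FROM cui_embedding WHERE cui = ?", (cui,)
    ).fetchone()
    if row is None:
        return None
    values = array("f")
    values.frombytes(row[0])
    return values.tolist()

conn = sqlite3.connect(":memory:")
create_store(conn)
put_vector(conn, "C0006287", [0.5, -1.25, 2.0])  # made-up values
restored = get_vector(conn, "C0006287")
```

Storing 32-bit floats halves the size compared with 64-bit values at the cost of some precision. Similarity search over all rows would still need to happen in application code, since plain SQLite has no vector operations.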
