This archive provides pre-trained word embeddings built from the text of published biomedical manuscripts. The embeddings have 300 dimensions and were trained with the word2vec algorithm on all available manuscripts in the PMC Open Access Subset. See the paper: https://pubmed.ncbi.nlm.nih.gov/34920127/

Citation:

@article{flamholz2022word,
  title={Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information},
  author={Flamholz, Zachary N and Crane-Droesch, Andrew and Ungar, Lyle H and Weissman, Gary E},
  journal={Journal of Biomedical Informatics},
  volume={125},
  pages={103971},
  year={2022},
  publisher={Elsevier}
}

Quick start

The embeddings are stored in a format compatible with the gensim Python package.

First download the files from this archive. Then load the embeddings into Python.


from gensim.models import FastText, Word2Vec, KeyedVectors  # KeyedVectors is used to load the GloVe models

# Load the model
model = Word2Vec.load('w2v_oa_all_300d.bin')

# Return the 300-dimensional vector representation of each word
model.wv['diabetes']
model.wv['cardiac_arrest']
model.wv['lymphangioleiomyomatosis']

# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
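
The similarity calls above return the cosine similarity between two word vectors, and they raise a KeyError if either token is missing from the vocabulary. The following is a minimal sketch, not part of the original release, of a guarded lookup. The helper names to_token, cosine_similarity, and safe_similarity are hypothetical, and the lowercase, underscore-joined token format is an assumption inferred from the examples above (such as 'cardiac_arrest'). The helpers accept any mapping-like object that supports membership tests and indexing, so a plain dict of toy vectors stands in for the real embeddings here.

```python
import math


def to_token(phrase):
    """Lowercase a phrase and join its words with underscores (assumed token format)."""
    return "_".join(phrase.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, or None for a zero vector."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return None
    return dot / (norm_u * norm_v)


def safe_similarity(vectors, phrase_a, phrase_b):
    """Return the cosine similarity, or None if either token is out of vocabulary."""
    a, b = to_token(phrase_a), to_token(phrase_b)
    if a not in vectors or b not in vectors:
        return None
    return cosine_similarity(vectors[a], vectors[b])


# Toy 3-dimensional vectors stand in for the real 300-dimensional embeddings.
toy_vectors = {
    "heart_attack": [1.0, 0.0, 0.0],
    "myocardial_infarction": [1.0, 1.0, 0.0],
}

print(safe_similarity(toy_vectors, "Heart attack", "myocardial infarction"))
print(safe_similarity(toy_vectors, "heart attack", "not_in_vocab"))  # out of vocabulary
```

With the real model loaded, passing model.wv in place of toy_vectors should work the same way, since gensim's keyed vectors support both membership tests and indexing by token.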