---
license: gpl-3.0
---
Pre-trained word embeddings built from the text of published scientific manuscripts. These embeddings have 300 dimensions and were trained with the fastText algorithm on all available manuscripts in the [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). See the paper here: https://pubmed.ncbi.nlm.nih.gov/34920127/
Citation:
```
@article{flamholz2022word,
title={Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information},
author={Flamholz, Zachary N and Crane-Droesch, Andrew and Ungar, Lyle H and Weissman, Gary E},
journal={Journal of Biomedical Informatics},
volume={125},
pages={103971},
year={2022},
publisher={Elsevier}
}
```
## Quick start
The word embeddings are saved in the format used by the [`gensim` Python package](https://radimrehurek.com/gensim/).
First download the files from this archive. Then load the embeddings into Python.
```python
from gensim.models import FastText, Word2Vec, KeyedVectors # KeyedVectors are used to load the GloVe models
# Load the model
model = FastText.load('ft_oa_all_300d.bin')
# Return the 300-dimensional vector representation of each word
model.wv.word_vec('diabetes')
model.wv.word_vec('cardiac_arrest')
model.wv.word_vec('lymphangioleiomyomatosis')
# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
```
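
The `similarity` calls above return the cosine of the angle between two word vectors. The sketch below shows that calculation in plain Python so the scores are easier to interpret. The helper `cosine_similarity` and the three toy vectors are hypothetical and for illustration only; they are not part of `gensim` or of the released embeddings.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional toy vectors (the real embeddings have 300 dimensions)
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]   # points in the same direction as toy_a
toy_c = [3.0, 0.0, -1.0]  # orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))  # close to 1: same direction
print(cosine_similarity(toy_a, toy_c))  # close to 0: unrelated directions
```

With the model loaded, passing `model.wv['copd']` and `model.wv['chronic_obstructive_pulmonary_disease']` to this helper should give approximately the same score as `model.wv.similarity`, up to floating-point rounding.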