PubMedBERT-base-uncased-sts-combined

This repo contains a version of PubMedBERT fine-tuned to score semantic textual similarity (STS) for text pairs, primarily for use in the sts-select feature selection package detailed here. Details about the model and vocabulary can be found in the paper here.
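
As a rough illustration of STS-based feature selection, the sketch below ranks feature descriptions by their similarity to a description of the prediction target. It is not the sts-select API: `rank_features` and `token_overlap` are hypothetical names, and `token_overlap` is a simple word-overlap placeholder standing in for a similarity score that would, in practice, come from a model such as this one.

```python
from typing import Callable, List, Tuple


def rank_features(
    score_pair: Callable[[str, str], float],
    feature_texts: List[str],
    target_text: str,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return the top_k feature descriptions by similarity to the target."""
    scored = [(text, score_pair(text, target_text)) for text in feature_texts]
    # sorted() is stable, so tied scores keep their original order.
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


def token_overlap(a: str, b: str) -> float:
    """Placeholder scorer (Jaccard overlap of lowercase tokens), NOT the model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


features = [
    "pain intensity after surgery",
    "favorite color",
    "use of opioid medication after surgery",
]
top = rank_features(token_overlap, features, "chronic pain after surgery", top_k=2)
for text, score in top:
    print(f"{score:.2f}  {text}")
```

Any callable that maps two strings to a float can be passed as `score_pair`, so a wrapper around a fine-tuned STS model could replace the placeholder without changing the ranking logic.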

Citation

If you use this model for STS-based feature selection, please cite the following paper:

@misc{warner2023utilizing,
      title={Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection}, 
      author={Benjamin C. Warner and Ziqi Xu and Simon Haroutounian and Thomas Kannampallil and Chenyang Lu},
      year={2023},
      eprint={2308.09892},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Additionally, the original model and fine-tuning papers should be cited as follows:

@article{Gu_Tinn_Cheng_Lucas_Usuyama_Liu_Naumann_Gao_Poon_2021,
      title={Domain-specific language model pretraining for biomedical natural language processing},
      volume={3},
      number={1},
      journal={ACM Transactions on Computing for Healthcare (HEALTH)},
      publisher={ACM New York, NY},
      author={Gu, Yu and Tinn, Robert and Cheng, Hao and Lucas, Michael and Usuyama, Naoto and Liu, Xiaodong and Naumann, Tristan and Gao, Jianfeng and Poon, Hoifung},
      year={2021},
      pages={1--23}
}

@inproceedings{Cer_Diab_Agirre_Lopez-Gazpio_Specia_2017,
      address={Vancouver, Canada},
      title={SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation},
      url={https://aclanthology.org/S17-2001},
      doi={10.18653/v1/S17-2001},
      booktitle={Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
      publisher={Association for Computational Linguistics},
      author={Cer, Daniel and Diab, Mona and Agirre, Eneko and Lopez-Gazpio, Iñigo and Specia, Lucia},
      year={2017},
      month=aug,
      pages={1--14}
}

@article{Chiu_Pyysalo_Vulić_Korhonen_2018,
      title={Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine},
      volume={19},
      number={1},
      journal={BMC bioinformatics},
      publisher={BioMed Central},
      author={Chiu, Billy and Pyysalo, Sampo and Vulić, Ivan and Korhonen, Anna},
      year={2018},
      pages={1--13}
}

@misc{May_2021,
      title={Machine translated multilingual STS benchmark dataset},
      url={https://github.com/PhilipMay/stsb-multi-mt},
      author={May, Philip},
      year={2021}
}

@article{Pedersen_Pakhomov_Patwardhan_Chute_2007,
      title={Measures of semantic similarity and relatedness in the biomedical domain},
      volume={40},
      number={3},
      journal={Journal of biomedical informatics},
      publisher={Elsevier},
      author={Pedersen, Ted and Pakhomov, Serguei VS and Patwardhan, Siddharth and Chute, Christopher G},
      year={2007},
      pages={288--299}
}