---
license: afl-3.0
language:
- es
tags:
- biomedical
- social media
- ner
metrics:
- f1
widget:
- text: "La semana que viene estaremos en el I Congreso para personas con cáncer y familiares ☺ #aecc #Congreso #finde "
  example_title: "Oncology"
- text: "No dejéis de leer esta interesantísima entrada del Dr. Martínez-Lage donde reivindica los errores médicos a la hora de diagnosticar #Alzheimer u otros tipos de #demencias."
  example_title: "Alzheimer"
- text: "Cada vez hay más CCAA que se suman la regulación de #desfibriladores (#DESA) en espacios deportivos, lamentamos este caso de parada cardíaca que afectó de nuevo a un deportista."
  example_title: "cardiac arrest"
- text: "La jaqueca o la migraña puede llegar a ser muy desesperante, algunas veces los remedios para dolor de cabeza de origen farmacéutico son ineficientes y por más analgésicos que tomemos el malestar no cede."
  example_title: "Migraine"
- text: "Os sorprenderíais la de mensajes que me llegan cada día (sobre todo cuando se acerca el verano) preguntándome como eliminar la celulitis, como hacer que desaparezca mágicamente la grasita… "
  example_title: "Celulitis"
---

# Disease mention recognizer for Spanish Social Media texts 🦠💬

This resource derives from the participation of the SINAI team in the [Mining Social Media Content for Disease Mention (SocialDisNER)](https://temu.bsc.es/socialdisner/) shared task. The task focused on recognising disease mentions in tweets written in Spanish, with the aim of using Twitter as a proxy to better understand the societal perception of disease. It directed community effort toward developing named entity recognition (NER) approaches that detect **all kinds** of disease mentions in social media text.

Our approach is based on a [model pre-trained on general-domain text](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne).
To leverage the large-scale additional [Silver Standard data](https://zenodo.org/record/6803567/preview/SocialDisNER_LargeScale_additionaldata.zip#tree_item0) with automatically generated labels provided by the task organisers, we designed a two-stage fine-tuning framework.

# Results

The model contained in this repository forms the foundation of the NER system presented by the SINAI team at SocialDisNER. Combined with [`pysentimiento`](https://github.com/pysentimiento/pysentimiento) data pre-processing and rule-based post-processing of the submission, it obtained encouraging results in the official evaluation, summarised in the table below.

| Precision | Recall | F1-score |
|-----------|--------|----------|
| 0.756     | 0.795  | 0.770    |

# System description paper and citation

[The system description paper](https://aclanthology.org/2022.smm4h-1.8/) was published at the Social Media Mining for Health Applications (#SMM4H) workshop, held at COLING 2022 in October 2022.

```
@inproceedings{chizhikova-etal-2022-sinai,
    title = "{SINAI}@{SMM}4{H}{'}22: Transformers for biomedical social media text mining in {S}panish",
    author = "Chizhikova, Mariia and
      L{\'o}pez-{\'U}beda, Pilar and
      D{\'\i}az-Galiano, Manuel C. and
      Ure{\~n}a-L{\'o}pez, L. Alfonso and
      Mart{\'\i}n-Valdivia, M. Teresa",
    booktitle = "Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop {\&} Shared Task",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.smm4h-1.8",
    pages = "27--30",
    abstract = "This paper covers participation of the SINAI team in Tasks 5 and 10 of the Social Media Mining for Health ({\#}SSM4H) workshop at COLING-2022. These tasks focus on leveraging Twitter posts written in Spanish for healthcare research.
The objective of Task 5 was to classify tweets reporting COVID-19 symptoms, while Task 10 required identifying disease mentions in Twitter posts. The presented systems explore large RoBERTa language models pre-trained on Twitter data in the case of tweet classification task and general-domain data for the disease recognition task. We also present a text pre-processing methodology implemented in both systems and describe an initial weakly-supervised fine-tuning phase alongside with a submission post-processing procedure designed for Task 10. The systems obtained 0.84 F1-score on the Task 5 and 0.77 F1-score on Task 10.",
}
```
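
For readers who want to try the recognizer, the following is a minimal sketch of how a token-classification checkpoint like this one is typically loaded with the Hugging Face `transformers` pipeline. It is not taken from the SINAI system: `MODEL_ID` is a placeholder to be replaced with this repository's Hub id, `load_recognizer` and `mention_spans` are hypothetical helper names, and the sketch omits the `pysentimiento` pre-processing and the rule-based post-processing described above, so its output may differ from the reported results.

```python
from typing import Dict, List, Tuple

# Placeholder (assumption): replace with the Hub id of this repository.
MODEL_ID = "<hub-id-of-this-model>"


def load_recognizer(model_id: str = MODEL_ID):
    """Build a token-classification pipeline for the given model id."""
    # Imported lazily so mention_spans() can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" groups sub-word tokens into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def mention_spans(text: str, entities: List[Dict]) -> List[Tuple[int, int, str]]:
    """Convert aggregated pipeline output into sorted (start, end, surface) tuples.

    The surface form is sliced from the original text rather than read from the
    pipeline's "word" field, so casing and spacing match the input tweet.
    """
    offsets = sorted((int(e["start"]), int(e["end"])) for e in entities)
    return [(start, end, text[start:end]) for start, end in offsets]


# Example usage (downloads the model, so it is left commented out):
# recognizer = load_recognizer("<hub-id-of-this-model>")
# tweet = "..."  # a Spanish tweet
# print(mention_spans(tweet, recognizer(tweet)))
```

The helper does not filter by label name, because the label set of the checkpoint is not stated in this card; check the model's `config.json` before relying on a specific label.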