	---
	license: afl-3.0
	language:
	- es
	tags:
	- biomedical
	- social media
	- ner
	metrics:
	- f1
	widget:
	- text: "La semana que viene estaremos en el I Congreso para personas con cáncer y familiares ☺ #aecc #Congreso #finde "
	  example_title: "Oncology"
	- text: "No dejéis de leer esta interesantísima entrada del Dr. Martínez-Lage donde reivindica los errores médicos a la hora de diagnosticar #Alzheimer u otros tipos de #demencias."
	  example_title: "Alzheimer"
	- text: "Cada vez hay más CCAA que se suman la regulación de #desfibriladores (#DESA) en espacios deportivos, lamentamos este caso de parada cardíaca que afectó de nuevo a un deportista."
	  example_title: "cardiac arrest"
	- text: "La jaqueca o la migraña puede llegar a ser muy desesperante, algunas veces los remedios para dolor de cabeza de origen farmacéutico son ineficientes y por más analgésicos que tomemos el malestar no cede."
	  example_title: "Migraine"
	- text: "Os sorprenderíais la de mensajes que me llegan cada día (sobre todo cuando se acerca el verano) preguntándome como eliminar la celulitis, como hacer que desaparezca mágicamente la grasita… "
	  example_title: "Celulitis"
	---

	# Disease mention recognizer for Spanish Social Media texts 🦠💬
	This resource derives from the participation of the SINAI team in the [Mining Social Media Content for Disease Mention (SocialDisNER)](https://temu.bsc.es/socialdisner/) shared task. The task focused on recognising disease mentions in tweets written in Spanish, with the aim of using Twitter as a proxy to better understand societal perception of disease. It directed community effort toward developing named entity recognition (NER) approaches that detect all kinds of disease mentions in social media text.
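
As an illustration of the task, the sketch below tags disease mentions in a single tweet with the `transformers` token-classification pipeline. It is a minimal example rather than the team's own inference code: `MODEL_ID` is a placeholder to be replaced with this repository's identifier on the Hub, and `to_spans` and `find_disease_mentions` are hypothetical helper names. The label strings returned depend on the model's configuration and are not assumed here.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder: replace with this repository's identifier on the Hub.
MODEL_ID = "<namespace>/<model-name>"


def to_spans(text: str, entities: List[Dict]) -> List[Tuple[int, int, str, str]]:
    """Turn aggregated pipeline output into sorted (start, end, label, mention) tuples."""
    spans = [
        # Slice the original text so the mention keeps its exact surface form.
        (e["start"], e["end"], e["entity_group"], text[e["start"]:e["end"]])
        for e in entities
    ]
    return sorted(spans)


def find_disease_mentions(text: str, model_id: str = MODEL_ID):
    """Tag one tweet; the model is downloaded the first time this runs."""
    # Imported lazily so that to_spans stays usable without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word pieces into entity spans
    )
    return to_spans(text, tagger(text))
```

For more than a handful of tweets, it is cheaper to build the pipeline once and reuse it than to call `find_disease_mentions` repeatedly.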

	Our approach is based on a [model pre-trained on general-domain text](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne). To leverage the large-scale additional [Silver Standard data](https://zenodo.org/record/6803567/preview/SocialDisNER_LargeScale_additionaldata.zip#tree_item0) with automatically generated labels provided by the task organisers, we designed a two-stage fine-tuning framework.
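
The two-stage idea can be sketched as a small training schedule. This is only an outline under stated assumptions, not the team's actual training script: the system description paper mentions an initial weakly-supervised fine-tuning phase, while the second stage on manually annotated data, the `Stage` class, the `train_one_stage` callback, and the learning rates and epoch counts are all illustrative choices made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Stage:
    name: str
    data: Sequence[Any]
    learning_rate: float  # illustrative value, not taken from the paper
    epochs: int           # illustrative value, not taken from the paper


def two_stage_fine_tune(
    model: Any,
    silver: Sequence[Any],
    gold: Sequence[Any],
    train_one_stage: Callable[[Any, Stage], Any],
) -> Any:
    """Fine-tune on automatically labelled data first, then on gold data."""
    stages = [
        # Stage 1: weak supervision from the automatically labelled tweets.
        Stage("silver", silver, learning_rate=5e-5, epochs=1),
        # Stage 2 (assumed): refine on the manually annotated tweets.
        Stage("gold", gold, learning_rate=2e-5, epochs=3),
    ]
    for stage in stages:
        # Each stage starts from the weights produced by the previous one.
        model = train_one_stage(model, stage)
    return model
```

In practice `train_one_stage` could wrap any training loop, for example one built on the `transformers` `Trainer`; keeping it as a callback leaves the schedule independent of the training backend.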

	# Results
	The model in this repository forms the foundation of the NER system presented by the SINAI team at SocialDisNER. Combined with [`pysentimiento`](https://github.com/pysentimiento/pysentimiento) data pre-processing and rule-based post-processing of the submission, it obtained encouraging results in the official evaluation, which are summarised in the table below.

	| Precision | Recall | F1-score |
	|-----------|--------|----------|
	| 0.756 | 0.795 | 0.770 |
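
For readers who want to score their own predictions, the sketch below computes entity-level precision, recall and F1 under exact span matching. It is not the official scorer: the exact-match criterion, the `(tweet_id, start, end)` span representation and the function name `strict_prf` are assumptions made for this example.

```python
from typing import Iterable, Tuple

Span = Tuple[str, int, int]  # (tweet_id, start_offset, end_offset)


def strict_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where a prediction counts only on an exact span match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [("t1", 0, 5), ("t1", 10, 18), ("t2", 3, 9)]
predicted = [("t1", 0, 5), ("t1", 10, 17), ("t2", 3, 9), ("t2", 20, 25)]
precision, recall, f1 = strict_prf(gold, predicted)
```

In this toy input two of the four predicted spans match a gold span exactly, so the prediction that is off by one character counts as both a false positive and a missed mention.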


	# System description paper and citation
	[`The system description paper`](https://aclanthology.org/2022.smm4h-1.8/) was published at the Social Media Mining for Health Applications (#SMM4H) workshop, held at COLING 2022 in October 2022.

	```
	@inproceedings{chizhikova-etal-2022-sinai,
	title = "{SINAI}@{SMM}4{H}{'}22: Transformers for biomedical social media text mining in {S}panish",
	author = "Chizhikova, Mariia and
	L{\'o}pez-{\'U}beda, Pilar and
	D{\'\i}az-Galiano, Manuel C. and
	Ure{\~n}a-L{\'o}pez, L. Alfonso and
	Mart{\'\i}n-Valdivia, M. Teresa",
	booktitle = "Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop {\&} Shared Task",
	month = oct,
	year = "2022",
	address = "Gyeongju, Republic of Korea",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2022.smm4h-1.8",
	pages = "27--30",
	abstract = "This paper covers participation of the SINAI team in Tasks 5 and 10 of the Social Media Mining for Health ({\#}SSM4H) workshop at COLING-2022. These tasks focus on leveraging Twitter posts written in Spanish for healthcare research. The objective of Task 5 was to classify tweets reporting COVID-19 symptoms, while Task 10 required identifying disease mentions in Twitter posts. The presented systems explore large RoBERTa language models pre-trained on Twitter data in the case of tweet classification task and general-domain data for the disease recognition task. We also present a text pre-processing methodology implemented in both systems and describe an initial weakly-supervised fine-tuning phase alongside with a submission post-processing procedure designed for Task 10. The systems obtained 0.84 F1-score on the Task 5 and 0.77 F1-score on Task 10.",
	}
	```