metadata

license: afl-3.0
language:
  - es
tags:
  - biomedical
  - social media
  - ner
metrics:
  - f1
widget:
  - text: >-
      La semana que viene estaremos en el I Congreso para personas con cáncer y
      familiares ☺ #aecc #Congreso #finde 
    example_title: Oncology
  - text: >-
      No dejéis de leer esta interesantísima entrada del Dr. Martínez-Lage donde
      reivindica los errores médicos a la hora de diagnosticar #Alzheimer u
      otros tipos de #demencias.
    example_title: Alzheimer
  - text: >-
      Cada vez hay más CCAA que se suman la regulación de #desfibriladores
      (#DESA) en espacios deportivos, lamentamos este caso de parada cardíaca
      que afectó de nuevo a un deportista.
    example_title: Cardiac arrest
  - text: >-
      La jaqueca o la migraña puede llegar a ser muy desesperante, algunas veces
      los remedios para dolor de cabeza de origen farmacéutico son ineficientes
      y por más analgésicos que tomemos el malestar no cede.
    example_title: Migraine
  - text: >-
      Os sorprenderíais la de mensajes que me llegan cada día (sobre todo cuando
      se acerca el verano) preguntándome como eliminar la celulitis, como hacer
      que desaparezca mágicamente la grasita… 
    example_title: Cellulite

Disease mention recognizer for Spanish Social Media texts 🦠💬

This resource derives from the participation of the SINAI team in the Mining Social Media Content for Disease Mention (SocialDisNER) shared task. The task focused on recognising disease mentions in tweets written in Spanish, with the aim of using Twitter as a proxy to better understand the societal perception of disease. It brought together community efforts to develop named entity recognition (NER) approaches capable of detecting all kinds of disease mentions in social media text.

Our approach is based on a large RoBERTa language model pre-trained on general-domain text. To leverage the large-scale additional Silver Standard data, whose labels were generated automatically and provided by the task organisers, we designed a two-stage fine-tuning framework that begins with a weakly supervised fine-tuning phase on this data.
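For readers new to the task, disease mention recognition is commonly framed as token tagging, where per-token labels are grouped back into mentions. The sketch below illustrates that grouping step only. It assumes a BIO tagging scheme and a single entity label named `ENFERMEDAD`; both are assumptions for illustration, and the tags in the example are written by hand rather than produced by this model.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, mention text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def close_open_mention() -> None:
        # Store the mention collected so far, if any, and reset the buffer.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_tokens = []
        current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new mention.
            close_open_mention()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with the same label extends the open mention.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open mention;
            # a stray I- tag without a preceding B- tag is ignored.
            close_open_mention()

    close_open_mention()
    return spans


# Hand-labelled illustration (hypothetical tags, not model output).
example_tokens = ["lamentamos", "este", "caso", "de", "parada", "cardíaca"]
example_tags = ["O", "O", "O", "O", "B-ENFERMEDAD", "I-ENFERMEDAD"]
print(bio_to_spans(example_tokens, example_tags))
```

Joining tokens with spaces is a simplification; a real system would usually map tokens back to character offsets in the original tweet.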

Results

The model contained in this repository forms the basis of the NER system presented by the SINAI team at SocialDisNER. Combined with pysentimiento-based data pre-processing and rule-based post-processing of the submission, it obtained encouraging results in the official evaluation, which are summarised in the table below.

Precision	Recall	F1-score
0.756	0.795	0.770
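As a rough illustration of how span-level precision, recall and F1 relate to each other, the sketch below scores a set of predicted mentions against a set of gold mentions. It assumes strict exact-match scoring over `(tweet id, start offset, end offset)` triples; this is an assumption and not a reproduction of the official evaluation script, and the spans shown are invented toy data.

```python
from typing import Set, Tuple

# A mention is identified by (tweet id, start offset, end offset).
Span = Tuple[str, int, int]


def strict_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) counting only exact span matches."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): two of four predictions match the three gold spans.
gold_spans = {("t1", 0, 6), ("t1", 10, 18), ("t2", 5, 12)}
pred_spans = {("t1", 0, 6), ("t2", 5, 12), ("t2", 20, 25), ("t3", 1, 4)}

precision, recall, f1 = strict_prf(gold_spans, pred_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under strict matching, a prediction that overlaps a gold mention but differs in either boundary counts as both a false positive and a false negative.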

System description paper and citation

The system description paper was published at the Social Media Mining for Health Applications (#SMM4H) workshop, held at COLING 2022 in October 2022.

@inproceedings{chizhikova-etal-2022-sinai,
    title = "{SINAI}@{SMM}4{H}{'}22: Transformers for biomedical social media text mining in {S}panish",
    author = "Chizhikova, Mariia  and
      L{\'o}pez-{\'U}beda, Pilar  and
      D{\'\i}az-Galiano, Manuel C.  and
      Ure{\~n}a-L{\'o}pez, L. Alfonso  and
      Mart{\'\i}n-Valdivia, M. Teresa",
    booktitle = "Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop {\&} Shared Task",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.smm4h-1.8",
    pages = "27--30",
    abstract = "This paper covers participation of the SINAI team in Tasks 5 and 10 of the Social Media Mining for Health ({\#}SSM4H) workshop at COLING-2022. These tasks focus on leveraging Twitter posts written in Spanish for healthcare research. The objective of Task 5 was to classify tweets reporting COVID-19 symptoms, while Task 10 required identifying disease mentions in Twitter posts. The presented systems explore large RoBERTa language models pre-trained on Twitter data in the case of tweet classification task and general-domain data for the disease recognition task. We also present a text pre-processing methodology implemented in both systems and describe an initial weakly-supervised fine-tuning phase alongside with a submission post-processing procedure designed for Task 10. The systems obtained 0.84 F1-score on the Task 5 and 0.77 F1-score on Task 10.",
}