ajtamayoh committed on
Commit a525e42
1 Parent(s): 66268ff

Update README.md

Files changed (1)
  1. README.md +13 -45
README.md CHANGED
@@ -32,51 +32,7 @@ It achieves the following results on the evaluation set:

 ## Model description

- \section{System description}
-
- In this work, we present a transfer learning approach based on the model proposed by \citet{Tamayo_etal2022DisTEMIST}, a version of multilingual BERT \citep{devlin-etal-2019-bert} fine-tuned for disease mention extraction from clinical texts, and we apply post-processing rules to extract the diseases mentioned in a corpus of Spanish tweets. Our system tackles the problem in three steps: pre-processing, transfer learning, and post-processing. We describe each of them below.
-
- \subsection{Pre-processing}
- To implement the fine-tuning process, the BIO scheme (Begin, Inside, Outside) \citep{ramshaw-marcus-1995-text} was used. Since the dataset provided by SocialDisNER is formatted differently, pre-processing was needed to convert it to the BIO scheme. We used the disease mentions in the provided structured dataset as a reference to annotate the disease mentions in each tweet with their corresponding BIO labels. Tokenization was carried out with SpaCy \citep{honnibal2017spacy} rather than a dedicated NER library such as SciSpacy \citep{neumann-etal-2019-scispacy} because the former supports Spanish.
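To make the conversion concrete, here is a minimal, hypothetical sketch of projecting character-offset annotations onto spaCy tokens as BIO tags; the label name, the example data, and the `to_bio` helper are illustrative assumptions, not the authors' code:

```python
# Hypothetical sketch: project character-span disease annotations onto
# spaCy tokens as BIO tags. Label name and example data are assumptions.
import spacy

nlp = spacy.load("es_core_news_sm")  # Spanish pipeline; chosen for Spanish support

def to_bio(text, spans):
    """spans: list of (start, end) character offsets of disease mentions."""
    doc = nlp(text)
    tags = ["O"] * len(doc)
    for start, end in spans:
        first = True
        for i, tok in enumerate(doc):
            # token lies fully inside the annotated span
            if tok.idx >= start and tok.idx + len(tok) <= end:
                tags[i] = "B-ENFERMEDAD" if first else "I-ENFERMEDAD"
                first = False
    return [tok.text for tok in doc], tags

tokens, tags = to_bio("Me diagnosticaron diabetes tipo 2", [(18, 33)])
# tokens: ['Me', 'diagnosticaron', 'diabetes', 'tipo', '2']
# tags:   ['O', 'O', 'B-ENFERMEDAD', 'I-ENFERMEDAD', 'I-ENFERMEDAD']
```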
-
- \subsection{Transfer learning}
- We tackled disease mention extraction as a sequence labeling problem, using the whole tweet as input and the labels described above as output. We randomly split the training dataset into training (75\%) and validation (25\%) sets; this partitioning was repeated five times with different random seeds. Additionally, we tuned hyperparameters with a grid search over the number of epochs (3, 5, 7) and the learning rate (5e-03, 5e-05, 5e-07); 7 epochs with a learning rate of 5e-05 yielded the best results. Default values were kept for the remaining hyperparameters. For this process, we used the Transformers library and the model available at Hugging Face\footnote{The model is available at: https://bit.ly/3zGlxWy}. All experiments were run on Google Colab Pro with a Tesla P100 GPU and 27.3 GB of available RAM. The data used in our training process, together with the source code to replicate this work, are available in a GitHub repository\footnote{https://github.com/ajtamayoh/NLP-CIC-WFU-Contribution-to-SocialDisNER-shared-task-2022.git}.
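As an illustration of this grid search, here is a hypothetical sketch using the Hugging Face Trainer. The tokenized splits, the seqeval-style metric function, and the base checkpoint shown are assumptions (the exact fine-tuned checkpoint sits behind the shortened link in the footnote above):

```python
# Hypothetical sketch of the epoch/learning-rate grid search with the
# Hugging Face Trainer. Datasets and metric function are passed in as
# arguments; the base checkpoint here is illustrative only.
from itertools import product

from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]  # assumed label set

def grid_search(train_ds, val_ds, compute_f1):
    """Try every (epochs, lr) pair and keep the best validation F1."""
    best = None
    for epochs, lr in product([3, 5, 7], [5e-3, 5e-5, 5e-7]):
        model = AutoModelForTokenClassification.from_pretrained(
            "bert-base-multilingual-cased", num_labels=len(LABELS))
        args = TrainingArguments(
            output_dir=f"runs/e{epochs}_lr{lr}",
            num_train_epochs=epochs,
            learning_rate=lr,
            evaluation_strategy="epoch",  # score on the 25% validation split
        )
        trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                          eval_dataset=val_ds, compute_metrics=compute_f1)
        trainer.train()
        f1 = trainer.evaluate()["eval_f1"]
        if best is None or f1 > best[0]:
            best = (f1, epochs, lr)
    return best  # the paper reports 7 epochs with lr 5e-05 as the winner
```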
-
-
- \subsection{Post-processing plus search by propagation}
- Post-processing was carried out with a custom Python script that cleans up and formats the output as follows: 1) Because mBERT uses a subword tokenization system, we decoded the output containing subwords. 2) We concatenated contiguous named entities detected by the model: if the final character position of one detected entity (or that position plus one) coincided with the first position of the next, our system treated the two as a single entity (see the sketch below). This was necessary because the model extracts parts of some entities separately. 3) We applied simple but effective post-processing based on orthographic and grammatical rules, detailed in Table~\ref{tab:post_processing}. 4) Under the assumption that SocialDisNER participants were required to extract every occurrence of a disease mention in a tweet, we used the entities extracted by the model to identify and extract any repetitions of those entities in the same document. To retrieve misspelled mentions or mentions subsumed by hashtags, URLs, or user names, we carried out a search by propagation with the following steps: a) lowercase both the entity identified by the model and the tweet, b) concatenate multi-word entities, c) delete accents, and d) search for occurrences of the entity throughout the tweet (see the sketch after Table~\ref{tab:post_processing}). Lastly, since we work with the BIO scheme, the final post-processing step decodes the predictions into the data format required by SocialDisNER.
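A hypothetical sketch of step 2, the fusion of contiguous detections; the `(start, end)` span representation is an assumption, not the authors' exact data structure:

```python
# Hypothetical sketch of step 2: fuse detections whose character spans touch
# (same boundary, or separated by exactly one character).
def merge_adjacent(spans, text):
    """spans: list of (start, end) character offsets predicted by the model."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= 1:
            merged[-1] = (merged[-1][0], end)  # extend the previous entity
        else:
            merged.append((start, end))
    return [(s, e, text[s:e]) for s, e in merged]

# The model split "cáncer de pulmón" into two fragments; they are re-joined.
print(merge_adjacent([(10, 16), (17, 26)], "padece de cáncer de pulmón"))
# -> [(10, 26, 'cáncer de pulmón')]
```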
-
- \begin{table}[ht]
- \centering
- \begin{tabular}{p{0.42\linewidth}p{0.42\linewidth}}
- \hline
- \textbf{If the disease mention detected …} & \textbf{… then apply this rule}\\
- \hline
- 1. Starts with a punctuation mark & 1. Delete the match and adjust the entity's beginning index\\
- 2. Contains a new line mark & 2. Replace the match with a space\\
- 3. Contains a space before and/or after a hyphen or a parenthesis & 3. Delete the space(s) and adjust the entity's ending index\\
- 4. Ends with non-content words or punctuation marks & 4. Delete the match and adjust the entity's ending index\\
- 5. Concurs with non-content words or punctuation/hashtag marks & 5. Leave out of the entities detected\\
- \hline
- \end{tabular}
- \caption{Post-processing rules}
- \label{tab:post_processing}
- \end{table}
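To make step 4 concrete, here is a hypothetical sketch of the search by propagation; the helper names are assumptions, and the returned offsets refer to the normalized tweet (mapping them back to raw offsets is elided):

```python
# Hypothetical sketch of the search by propagation: lowercase, concatenate
# multi-word entities, strip accents, then find every occurrence, including
# mentions subsumed by hashtags or user names.
import re
import unicodedata

def normalize(s):
    """Lowercase and strip accents (e.g., 'Cáncer' -> 'cancer')."""
    s = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in s if unicodedata.category(c) != "Mn")

def propagate(entity, tweet):
    """Return (start, end) spans of entity occurrences in the normalized tweet."""
    tweet_norm = normalize(tweet)
    spans = []
    # search both the spaced form and the concatenated multi-word form
    for needle in {normalize(entity), normalize(entity).replace(" ", "")}:
        for m in re.finditer(re.escape(needle), tweet_norm):
            spans.append(m.span())
    return sorted(set(spans))

print(propagate("cáncer de pulmón",
                "El #CancerDePulmon es duro. Mi tía tiene cancer de pulmon."))
# -> [(4, 18), (41, 57)]  (hashtag-subsumed and misspelled mentions recovered)
```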
-
- ## Intended uses & limitations
-
- More information needed
+ For a complete description of our system, please go to: https://aclanthology.org/2022.smm4h-1.6.pdf

  ## Training and evaluation data
 
@@ -108,6 +64,18 @@ The following hyperparameters were used during training:
  | 0.0067 | 7.0 | 3269 | 0.1483 | 0.8699 | 0.8722 | 0.8711 | 0.9771 |
 

+ ### How to cite this work:
+
+ Tamayo, A., Gelbukh, A., & Burgos, D. A. (2022, October). NLP-CIC-WFU at SocialDisNER: Disease mention extraction in Spanish tweets using transfer learning and search by propagation. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task (pp. 19-22).
+
+ @inproceedings{tamayo2022nlp,
+   title={NLP-CIC-WFU at SocialDisNER: Disease mention extraction in Spanish tweets using transfer learning and search by propagation},
+   author={Tamayo, Antonio and Gelbukh, Alexander and Burgos, Diego A},
+   booktitle={Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop \& Shared Task},
+   pages={19--22},
+   year={2022}
+ }
+
  ### Framework versions
 
  - Transformers 4.20.1
 