daveni committed 78a263f (parent 7ed9d38): Create README.md

Files changed: README.md (+88, -0)

---
language:
- es

tags:
- Emotion Analysis

---

**Note**: This model and model card are based on the [finetuned XLM-T for Sentiment Analysis](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment).

# twitter-XLM-roBERTa-base for Emotion Analysis

This is an XLM-roBERTa-base model trained on ~198M tweets and fine-tuned for emotion analysis in Spanish. The model was submitted to the EmoEvalEs competition, part of the [IberLEF 2021 Conference](https://sites.google.com/view/iberlef2021/), where the proposed task was the classification of Spanish tweets into seven classes: *anger*, *disgust*, *fear*, *joy*, *sadness*, *surprise*, and *others*. We achieved first place in the competition with a macro-averaged F1 score of 71.70% (the competition metric; a sketch of its computation follows the links below).
- [Our code for the EmoEvalEs submission](https://github.com/gsi-upm/emoevales-iberlef2021)
- [EmoEvalEs Dataset](https://github.com/pendrag/EmoEvalEs)
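
Since the competition ranking uses the macro-averaged F1 score, a minimal sketch of how that metric is computed with scikit-learn may be useful; the gold and predicted labels below are purely illustrative:

```python
from sklearn.metrics import f1_score

# Illustrative labels only. Macro averaging computes F1 per class and then
# averages with equal weight, so minority emotions count as much as "others".
y_true = ["joy", "anger", "others", "fear", "joy", "others"]
y_pred = ["joy", "anger", "others", "others", "sadness", "others"]
print(f1_score(y_true, y_pred, average="macro"))
```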

## Example Pipeline with a [Tweet from @JaSantaolalla](https://twitter.com/JaSantaolalla/status/1398383243645177860)

```python
from transformers import pipeline

model_path = "daveni/twitter-xlm-roberta-emotion-es"
tokenizer_path = 'cardiffnlp/twitter-xlm-roberta-base'
emotion_analysis = pipeline("text-classification", model=model_path, tokenizer=tokenizer_path)
# Roughly: "Einstein said: there are only two infinite things, the universe and
# the damn bitcoin ads on Twitter. Stop it already, dammit, aaagh, I want to die"
emotion_analysis("Einstein dijo: Solo hay dos cosas infinitas, el universo y los pinches anuncios de bitcoin en Twitter. Paren ya carajo aaaaaaghhgggghhh me quiero murir")
```
```
[{'label': 'anger', 'score': 0.48307016491889954}]
```
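
The pipeline above returns only the top class. If you want a score for each of the seven emotions, recent versions of `transformers` accept `top_k=None` on the text-classification pipeline (older versions used `return_all_scores=True`); a minimal sketch reusing the same model and tweet:

```python
from transformers import pipeline

emotion_analysis = pipeline(
    "text-classification",
    model="daveni/twitter-xlm-roberta-emotion-es",
    tokenizer="cardiffnlp/twitter-xlm-roberta-base",
    top_k=None,  # None = return scores for every class, sorted by score
)
text = ("Einstein dijo: Solo hay dos cosas infinitas, el universo y los pinches "
        "anuncios de bitcoin en Twitter. Paren ya carajo aaaaaaghhgggghhh me quiero murir")
print(emotion_analysis(text))  # one dict per class: anger, disgust, fear, ...
```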

## Full classification example

```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

model_path = "daveni/twitter-xlm-roberta-emotion-es"
tokenizer_path = 'cardiffnlp/twitter-xlm-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
config = AutoConfig.from_pretrained(model_path)

# Load the PyTorch model
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Roughly: "Turned out a nice day to publish a video, hasn't it? Today on the
# most different topic we have covered on the channel."
text = "Se ha quedao bonito día para publicar vídeo, ¿no? Hoy del tema más diferente que hemos tocado en el canal."
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print the input text, then labels and scores ranked from most to least likely
print(text)
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```
Output:

```
Se ha quedao bonito día para publicar vídeo, ¿no? Hoy del tema más diferente que hemos tocado en el canal.
1) joy 0.7887
2) others 0.1679
3) surprise 0.0152
4) sadness 0.0145
5) anger 0.0077
6) disgust 0.0033
7) fear 0.0027
```
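
For scoring several tweets at once, the same model and tokenizer can be used in batched mode; a minimal sketch building on the variables defined above (the example texts are illustrative):

```python
import torch

texts = [preprocess(t) for t in [
    "Qué alegría verte después de tanto tiempo",  # "What a joy to see you after so long"
    "Esto es asqueroso, no pienso volver",        # "This is disgusting, I'm not coming back"
]]
# Pad the batch to a common length and score without tracking gradients
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
probs = torch.softmax(logits, dim=-1)
for text, p in zip(texts, probs):
    print(f"{config.id2label[int(p.argmax())]} ({float(p.max()):.4f}) <- {text}")
```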

#### Limitations and bias

- The dataset we used for fine-tuning was unbalanced: almost half of the records belonged to the *others* class, so there may be a bias towards this class.

## Training data

Pretrained weights were left identical to the original model released by [cardiffnlp](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base). We used the [EmoEvalEs Dataset](https://github.com/pendrag/EmoEvalEs) for fine-tuning.
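
For readers who want to reproduce a similar setup, here is a hypothetical fine-tuning sketch with the `Trainer` API. This is not the authors' actual training script; the file names, column names, and hyperparameters are all assumptions:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["anger", "disgust", "fear", "joy", "others", "sadness", "surprise"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "cardiffnlp/twitter-xlm-roberta-base",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

# Hypothetical TSV files with "tweet" and "emotion" columns
ds = load_dataset("csv", data_files={"train": "train.tsv", "dev": "dev.tsv"},
                  delimiter="\t")

def encode(batch):
    enc = tokenizer(batch["tweet"], truncation=True, max_length=128)
    enc["labels"] = [label2id[e] for e in batch["emotion"]]
    return enc

ds = ds.map(encode, batched=True, remove_columns=["tweet", "emotion"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=ds["train"],
    eval_dataset=ds["dev"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```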

### BibTeX entry and citation info

```bibtex
Coming soon
```