annieske committed on
Commit
c5ce7a5
·
1 Parent(s): 3d78161

Update README.md

Files changed (1)
  1. README.md +41 -0
README.md CHANGED
@@ -1,3 +1,44 @@
  ---
  license: cc-by-sa-4.0
  ---
+
+ ### xlm-roberta-base for register labeling, fine-tuned for question-answer document identification
+
+ This is `xlm-roberta-base` fine-tuned on register-annotated data in English (https://github.com/TurkuNLP/CORE-corpus) and Finnish (https://github.com/TurkuNLP/FinCORE_full), as well as unpublished Swedish and French versions (https://github.com/TurkuNLP/multilingual-register-labeling). The model is trained to predict whether or not a text contains question-and-answer content.
+
+ ### Overview
+ Language model: xlm-roberta-base
+
+ Downstream task: multi-class text classification
+
+
+ ### Usage
+
+ The model can be used through a Hugging Face pipeline:
+ ```
+ import transformers
+
+ # Fine-tuned classifier weights; the tokenizer comes from the base model
+ model = transformers.AutoModelForSequenceClassification.from_pretrained("TurkuNLP/xlmr-qa-register")
+ tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
+ pipe = transformers.pipeline(task="text-classification", model=model, tokenizer=tokenizer)
+ ```
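
As a minimal sketch of how the pipeline above might be applied to a batch of texts: the helper names `load_pipe` and `classify` are hypothetical and not part of the original README, and the label strings returned depend on the model's configuration, which the README does not list. Passing `truncation=True` is an assumption made here so that long documents stay within the 512-token limit given under Hyperparameters.

```python
from typing import Callable, Iterable, List, Tuple

MODEL_NAME = "TurkuNLP/xlmr-qa-register"  # model id from the README
TOKENIZER_NAME = "xlm-roberta-base"       # tokenizer id from the README


def load_pipe():
    """Build the text-classification pipeline (downloads weights on first use)."""
    import transformers  # imported lazily so classify() can be used without it

    model = transformers.AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    tokenizer = transformers.AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    return transformers.pipeline(
        task="text-classification", model=model, tokenizer=tokenizer
    )


def classify(pipe: Callable, texts: Iterable[str]) -> List[Tuple[str, float]]:
    """Return one (label, score) pair per input text.

    `pipe` is any callable that behaves like a text-classification pipeline:
    it takes a list of strings and returns a list of dicts with "label" and
    "score" keys.
    """
    # truncation=True is forwarded to the tokenizer (an assumption of this sketch)
    results = pipe(list(texts), truncation=True)
    return [(r["label"], r["score"]) for r in results]


if __name__ == "__main__":
    pipe = load_pipe()
    examples = [
        "Q: How do I reset my password? A: Click 'Forgot password' on the login page.",
        "The city council approved the new budget on Tuesday.",
    ]
    for text, (label, score) in zip(examples, classify(pipe, examples)):
        print(f"{label}\t{score:.3f}\t{text[:60]}")
```

Keeping the model loading separate from the classification helper means the helper can be exercised with any pipeline-like callable, without downloading the weights.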
+
+ ### Hyperparameters
+ ```
+ batch_size = 8
+ epochs = 10 (trained for 4)
+ base_LM_model = "xlm-roberta-base"
+ max_seq_len = 512
+ learning_rate = 4e-6
+ ```
32
+
33
+ ### Performance
34
+ ```
35
+ F1-micro = 0.98
36
+ F1-macro = 0.79
37
+
38
+ F1 QA label = 0.60
39
+ F1 not QA label = 0.99
40
+ Precision QA label = 0.82
41
+ Precision not QA label = 0.99
42
+ Recall QA label = 0.47
43
+ Recall not QA label = 1.00
44
+ ```
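
The per-label F1 scores above follow from the reported precision and recall as their harmonic mean, and macro F1 is the unweighted average over the two labels. The sketch below recomputes them from the rounded figures in this section; it is a consistency check only, not the evaluation code used for the model, and the variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the Performance section
reported = {
    "QA": {"precision": 0.82, "recall": 0.47, "f1": 0.60},
    "not QA": {"precision": 0.99, "recall": 1.00, "f1": 0.99},
}

# Recompute F1 per label, then average the labels with equal weight
per_label_f1 = {
    name: f1(m["precision"], m["recall"]) for name, m in reported.items()
}
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

for name, value in per_label_f1.items():
    print(f"F1 {name}: recomputed {value:.2f}, reported {reported[name]['f1']:.2f}")
print(f"F1-macro: recomputed {macro_f1:.2f}, reported 0.79")
```

Recomputed from rounded inputs, the per-label scores match the reported values, while macro F1 comes out at about 0.80 rather than 0.79, a gap consistent with rounding of the inputs. The low recall on the QA label (0.47) is what pulls macro F1 well below micro F1, which is dominated by the much more frequent "not QA" label.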