---
tags:
- exbert
- question-answering
language:
- multilingual
- cs
- en
---

# XLM-RoBERTa for Czech+English Extractive Question Answering

This is the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model with a head for extractive question answering, trained on a combination of the [English SQuAD 1.1](https://huggingface.co/datasets/squad) and [Czech SQAD 3.0](https://lindat.cz/repository/xmlui/handle/11234/1-3069) question answering datasets. For Czech SQAD 3.0, the original contexts (whole Wikipedia pages) were limited to fit RoBERTa's context window, which excluded ~3% of the samples.
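
The context-window limiting can be thought of as a simple length filter on the encoded (question, context) pairs. Purely as an illustration, a sketch of such a filter is shown below; the 512-token limit and the filtering logic are assumptions, not the exact procedure used (the actual preprocessing lives in the `parse_czech_squad.py` script referenced at the end of this card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
MAX_LENGTH = 512  # assumed maximum input length of XLM-RoBERTa-large

def fits_context_window(question: str, context: str) -> bool:
    """Check whether the encoded (question, context) pair fits into the model input."""
    encoded = tokenizer(question, context)
    return len(encoded["input_ids"]) <= MAX_LENGTH

# Samples failing this check would be dropped, roughly corresponding to the ~3%
# of excluded Czech SQAD samples mentioned above.
```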

## Intended uses & limitations

This model is intended to extract the segment of a given context that answers a given question (extractive question answering), in English and Czech.
Given the fine-tuning on two languages and the good zero-shot cross-lingual transfer reported for other fine-tuned XLM-RoBERTa models, it will likely also work on other languages, though with some loss of quality.

Note that despite its size, English SQuAD has a variety of reported biases (see, e.g., [L. Mikula (2022)](https://is.muni.cz/th/adh58/?lang=en), Chap. 4.1).

## Usage

Here is how to use this model to answer a question about a given context with 🤗 Transformers in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("gaussalgo/xlm-roberta-large_extractive-QA_en-cs")
model = AutoModelForQuestionAnswering.from_pretrained("gaussalgo/xlm-roberta-large_extractive-QA_en-cs")

context = """
Podle slovenského lidového podání byl Juro Jánošík obdařen magickými předměty (kouzelná valaška, čarovný opasek),
které mu dodávaly nadpřirozené schopnosti. Okrádal především šlechtice,
trestal panské dráby a ze svého lupu vyděloval část pro chudé, tedy bohatým bral a chudým dával.
"""
question = "Jaké schopnosti daly magické předměty Juro Jánošíkovi?"

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

# pick the most likely start and end token positions of the answer span
start_position = outputs.start_logits[0].argmax()
end_position = outputs.end_logits[0].argmax()

# slice the answer tokens out of the input and decode them back to text
answer_ids = inputs["input_ids"][0][start_position: end_position + 1]

print("Answer:")
print(tokenizer.decode(answer_ids))
```
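
Alternatively, the same checkpoint should also work with the higher-level 🤗 Transformers `question-answering` pipeline, which takes care of tokenization, long-context handling, and answer decoding. A minimal sketch:

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="gaussalgo/xlm-roberta-large_extractive-QA_en-cs")

result = qa(question="Jaké schopnosti daly magické předměty Juro Jánošíkovi?",
            context="Podle slovenského lidového podání byl Juro Jánošík obdařen magickými "
                    "předměty (kouzelná valaška, čarovný opasek), které mu dodávaly "
                    "nadpřirozené schopnosti.")

# the pipeline returns the answer text together with its score and character span
print(result["answer"], result["score"])
```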

## Training

The model was trained using the [Adaptor library](https://github.com/gaussalgo/adaptor) v0.1.5, in parallel on both the Czech and English data, with the following parameters:

```python
from adaptor.utils import AdaptationArguments, StoppingStrategy

training_arguments = AdaptationArguments(output_dir="train_dir",
                                         learning_rate=1e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         do_train=True,
                                         do_eval=True,
                                         warmup_steps=1000,
                                         max_steps=100000,
                                         gradient_accumulation_steps=30,
                                         eval_steps=100,
                                         logging_steps=10,
                                         save_steps=1000,
                                         num_train_epochs=30,
                                         evaluation_strategy="steps")
```
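
For orientation, the sketch below shows how arguments like these are typically combined with a language module, per-dataset objectives, and a parallel schedule in Adaptor. The `ExtractiveQA` objective, its parameter names, and the toy data are assumptions made for illustration only; the authoritative setup is the training script referenced below.

```python
# Illustrative sketch only: the ExtractiveQA constructor and its parameters are
# assumptions, not verified against Adaptor v0.1.5.
from adaptor.adapter import Adapter
from adaptor.lang_module import LangModule
from adaptor.objectives.question_answering import ExtractiveQA  # assumed class
from adaptor.schedules import ParallelSchedule

# shared XLM-RoBERTa backbone with a QA head, reused by all objectives
lang_module = LangModule("xlm-roberta-large")

# toy stand-ins for the SQuAD / SQAD question, context and answer columns
en_q, en_c, en_a = ["Where is Norway?"], ["Norway is a country in Northern Europe."], ["Northern Europe"]
cs_q, cs_c, cs_a = ["Kde leží Brno?"], ["Brno leží na jihu Moravy."], ["na jihu Moravy"]

objectives = [
    ExtractiveQA(lang_module, texts_or_path=en_q, text_pair_or_path=en_c,
                 labels_or_path=en_a, batch_size=4, objective_id="SQuAD-en"),
    ExtractiveQA(lang_module, texts_or_path=cs_q, text_pair_or_path=cs_c,
                 labels_or_path=cs_a, batch_size=4, objective_id="SQAD-cs"),
]

# training both objectives in parallel corresponds to the description above
schedule = ParallelSchedule(objectives=objectives, args=training_arguments)
Adapter(lang_module, schedule, args=training_arguments).train()
```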

You can find the full training script in [train_roberta_extractive_qa.py](train_roberta_extractive_qa.py); it is reproducible after the Czech SQAD-specific data preprocessing in [parse_czech_squad.py](parse_czech_squad.py).