Branden Chan committed on
Commit
deedc3e
1 Parent(s): 390d6ae

Update to v2

README.md ADDED
@@ -0,0 +1,120 @@
+ ---
+ datasets:
+ - squad_v2
+ ---
+
+ # roberta-base for QA
+
+ NOTE: This is version 2 of the model. See [this GitHub issue](https://github.com/deepset-ai/FARM/issues/552) in the FARM repository for an explanation of why we updated.
+
+ ## Overview
+ **Language model:** roberta-base
+ **Language:** English
+ **Downstream-task:** Extractive QA
+ **Training data:** SQuAD 2.0
+ **Eval data:** SQuAD 2.0
+ **Code:** See [example](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py) in [FARM](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py)
+ **Infrastructure**: 4x Tesla V100
+
+ ## Hyperparameters
+
+ ```
+ batch_size = 96
+ n_epochs = 2
+ base_LM_model = "roberta-base"
+ max_seq_len = 386
+ learning_rate = 3e-5
+ lr_schedule = LinearWarmup
+ warmup_proportion = 0.2
+ doc_stride = 128
+ max_query_length = 64
+ ```
+
+ ## Performance
+ Evaluated on the SQuAD 2.0 dev set with the [official eval script](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/).
+
+ ```
+ "exact": 79.97136359807968
+ "f1": 83.00449234495325
+
+ "total": 11873
+ "HasAns_exact": 78.03643724696356
+ "HasAns_f1": 84.11139298441825
+ "HasAns_total": 5928
+ "NoAns_exact": 81.90075693860386
+ "NoAns_f1": 81.90075693860386
+ "NoAns_total": 5945
+ ```
+
+ ## Usage
+
+ ### In Transformers
+ ```python
+ # Import from the package root; the `transformers.modeling_auto` and
+ # `transformers.tokenization_auto` module paths are not available in newer releases.
+ from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
+
+ model_name = "deepset/roberta-base-squad2-v2"
+
+ # a) Get predictions
+ nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
+ QA_input = {
+ 'question': 'Why is model conversion important?',
+ 'context': 'The option to convert models between FARM and transformers gives freedom to the user and lets people easily switch between frameworks.'
+ }
+ res = nlp(QA_input)
+
+ # b) Load model & tokenizer
+ model = AutoModelForQuestionAnswering.from_pretrained(model_name)
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ ```
+
+ ### In FARM
+
+ ```python
+ from farm.modeling.adaptive_model import AdaptiveModel
+ from farm.modeling.tokenization import Tokenizer
+ from farm.infer import Inferencer
+
+ model_name = "deepset/roberta-base-squad2-v2"
+
+ # a) Get predictions
+ nlp = Inferencer.load(model_name, task_type="question_answering")
+ QA_input = [{"questions": ["Why is model conversion important?"],
+ "text": "The option to convert models between FARM and transformers gives freedom to the user and lets people easily switch between frameworks."}]
+ res = nlp.inference_from_dicts(dicts=QA_input, rest_api_schema=True)
+
+ # b) Load model & tokenizer
+ model = AdaptiveModel.convert_from_transformers(model_name, device="cpu", task_type="question_answering")
+ tokenizer = Tokenizer.load(model_name)
+ ```
+
+ ### In haystack
+ For doing QA at scale (i.e. many docs instead of a single paragraph), you can also load the model in [haystack](https://github.com/deepset-ai/haystack/):
+ ```python
+ # Import FARMReader / TransformersReader from haystack first;
+ # the module path depends on the installed haystack version.
+ reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
+ # or
+ reader = TransformersReader(model="deepset/roberta-base-squad2", tokenizer="deepset/roberta-base-squad2")
+ ```
+
+
+ ## Authors
+ Branden Chan: `branden.chan [at] deepset.ai`
+ Timo Möller: `timo.moeller [at] deepset.ai`
+ Malte Pietsch: `malte.pietsch [at] deepset.ai`
+ Tanay Soni: `tanay.soni [at] deepset.ai`
+
+ ## About us
+ ![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
+
+ We bring NLP to the industry via open source!
+ Our focus: Industry-specific language models & large-scale QA systems.
+
+ Some of our work:
+ - [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
+ - [FARM](https://github.com/deepset-ai/FARM)
+ - [Haystack](https://github.com/deepset-ai/haystack/)
+
+ Get in touch:
+ [Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)
+
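The scores in the Performance section of the README added above can be cross-checked against each other. Since each value is a percentage over its own subset, the overall `exact` and `f1` should equal the average of the `HasAns` and `NoAns` values weighted by their example counts. The sketch below runs that check using only the numbers reported in the README; the helper name `weighted_score` is ours for illustration and is not part of the official eval script.

```python
import math

# Scores copied from the Performance block of the README.
metrics = {
    "exact": 79.97136359807968,
    "f1": 83.00449234495325,
    "total": 11873,
    "HasAns_exact": 78.03643724696356,
    "HasAns_f1": 84.11139298441825,
    "HasAns_total": 5928,
    "NoAns_exact": 81.90075693860386,
    "NoAns_f1": 81.90075693860386,
    "NoAns_total": 5945,
}


def weighted_score(m, key):
    """Recombine the HasAns/NoAns percentages into one overall percentage."""
    has_ans = m[f"HasAns_{key}"] * m["HasAns_total"]
    no_ans = m[f"NoAns_{key}"] * m["NoAns_total"]
    return (has_ans + no_ans) / m["total"]


# The two subsets should partition the dev set.
assert metrics["HasAns_total"] + metrics["NoAns_total"] == metrics["total"]

for key in ("exact", "f1"):
    recombined = weighted_score(metrics, key)
    # Allow a tiny tolerance for floating-point rounding.
    assert math.isclose(recombined, metrics[key], abs_tol=1e-6)
    print(key, recombined)
```

`NoAns_exact` and `NoAns_f1` are identical in the reported numbers, which is consistent with unanswerable questions being scored as either fully right or fully wrong.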
config.json CHANGED
@@ -5,6 +5,7 @@
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
+ "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
@@ -17,7 +18,6 @@
  "name": "Roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
- "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6f38b2e5443a0aa0aab23c665a8c033f00ba9d85424a0f5c1acd8bbdf36df6ef
- size 498637366
+ oid sha256:e0b64ccefc1bcb569b604baea27eb873e5482fdf6eb3ceff1fb5368397db5aed
+ size 496313727
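The `pytorch_model.bin` change above only touches a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch (the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of any Git LFS tooling), the two pointers can be parsed and compared, and downloaded weights can be checked against a pointer like this:

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check raw file bytes against the size and SHA-256 digest of a pointer."""
    return (
        pointer["algo"] == "sha256"
        and len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer contents copied from the two sides of the diff above.
OLD_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6f38b2e5443a0aa0aab23c665a8c033f00ba9d85424a0f5c1acd8bbdf36df6ef
size 498637366
"""
NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e0b64ccefc1bcb569b604baea27eb873e5482fdf6eb3ceff1fb5368397db5aed
size 496313727
"""

old_ptr = parse_lfs_pointer(OLD_POINTER)
new_ptr = parse_lfs_pointer(NEW_POINTER)
print(new_ptr["size"] - old_ptr["size"])  # -2323639 (the new file is smaller)
```

`matches_pointer` takes the whole file as bytes for brevity; for a file of roughly 500 MB, hashing it in chunks would be the more practical choice.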
special_tokens_map.json CHANGED
@@ -1 +1 @@
- {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}
+ {"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
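The `special_tokens_map.json` change above looks large, but it mostly swaps plain strings for objects that carry the same string under `content` plus a few flags. A small sketch, assuming nothing beyond the two JSON lines in the diff (the helper name `token_text` is ours), confirms that the token strings themselves are unchanged and shows which token has a non-default `lstrip` flag:

```python
import json

# Old and new file contents, copied from the diff above.
OLD_JSON = (
    '{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", '
    '"sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", '
    '"mask_token": "<mask>"}'
)
NEW_JSON = (
    '{"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, '
    '"eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, '
    '"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, '
    '"sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, '
    '"pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, '
    '"cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, '
    '"mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}'
)

old_map = json.loads(OLD_JSON)
new_map = json.loads(NEW_JSON)


def token_text(value):
    """Return the token string for either serialization format."""
    return value["content"] if isinstance(value, dict) else value


# Flatten the new format back to plain strings and compare with the old map.
flattened = {name: token_text(value) for name, value in new_map.items()}
print(flattened == old_map)  # True

# Tokens whose `lstrip` flag is set in the new format.
lstrip_tokens = [name for name, value in new_map.items() if value["lstrip"]]
print(lstrip_tokens)  # ['mask_token']
```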
tokenizer_config.json CHANGED
@@ -1 +1 @@
- {"do_lower_case": true, "max_len": 512, "bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}
+ {"do_lower_case": false, "model_max_length": 512, "full_tokenizer_file": null}
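To see at a glance what changed in `tokenizer_config.json`, the two one-line JSON objects above can be compared key by key. The sketch below uses a hypothetical helper, `diff_keys`, written for this illustration. It reports that the special-token entries and `max_len` were dropped, that `model_max_length` and `full_tokenizer_file` were added, and that `do_lower_case` flipped from `true` to `false`. Since `max_len` and `model_max_length` both carry 512, this looks like a rename rather than a change in sequence length, though the diff alone does not confirm that.

```python
import json


def diff_keys(old, new):
    """Return (removed, added, changed) key lists for two flat dicts."""
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return removed, added, changed


# Old and new file contents, copied from the diff above.
old_cfg = json.loads(
    '{"do_lower_case": true, "max_len": 512, "bos_token": "<s>", '
    '"eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", '
    '"pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}'
)
new_cfg = json.loads(
    '{"do_lower_case": false, "model_max_length": 512, "full_tokenizer_file": null}'
)

removed, added, changed = diff_keys(old_cfg, new_cfg)
print("removed:", removed)
print("added:  ", added)    # ['full_tokenizer_file', 'model_max_length']
print("changed:", changed)  # ['do_lower_case']
```

The same helper can be applied to the key/value pairs visible in the `config.json` hunks, where it would report `output_past` as removed and `gradient_checkpointing` as added.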