TristanThrush committed
Commit ab8f50e
Parent: 12727ad

added dec olm roberta
README.md ADDED
@@ -0,0 +1,92 @@
---
language: en

---


# OLM RoBERTa/BERT December 2022

This is a more up-to-date version of the [original BERT](https://huggingface.co/bert-base-cased) and the [original RoBERTa](https://huggingface.co/roberta-base).
In addition to being more up-to-date, it also tends to perform better than the original BERT on standard benchmarks.
We think it is fair to directly compare our model to the original BERT because our model was trained with about the same level of compute as the original BERT, and the architectures of BERT and RoBERTa are basically the same.
The original RoBERTa took an order of magnitude more compute to train, although our model's performance is still not that far from the original RoBERTa's on many standard benchmarks.
Our model was trained on a cleaned December 2022 snapshot of Common Crawl and Wikipedia.

This model was created as part of the OLM project, which has the goal of continuously training and releasing models that are up-to-date and comparable in standard language model performance to their static counterparts.
This is important because we want our models to know about events like COVID or
a presidential election right after they happen.

## Intended uses

You can use the raw model for masked language modeling, but it is mostly intended to
be fine-tuned on a downstream task, such as sequence classification, token classification, or question answering.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='olm/olm-roberta-base-dec-2022')
>>> unmasker("Hello I'm a <mask> model.")
[{'score': 0.04252663999795914,
  'token': 631,
  'token_str': ' new',
  'sequence': "Hello I'm a new model."},
 {'score': 0.034064881503582,
  'token': 4750,
  'token_str': ' female',
  'sequence': "Hello I'm a female model."},
 {'score': 0.03066524863243103,
  'token': 932,
  'token_str': ' business',
  'sequence': "Hello I'm a business model."},
 {'score': 0.029599128291010857,
  'token': 10345,
  'token_str': ' junior',
  'sequence': "Hello I'm a junior model."},
 {'score': 0.025790784507989883,
  'token': 2219,
  'token_str': ' human',
  'sequence': "Hello I'm a human model."}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, RobertaModel
tokenizer = AutoTokenizer.from_pretrained('olm/olm-roberta-base-dec-2022')
model = RobertaModel.from_pretrained("olm/olm-roberta-base-dec-2022")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
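
As noted under intended uses, this checkpoint is mainly meant to be fine-tuned. Below is a minimal sketch, not part of the original card, of fine-tuning it for sequence classification with the `Trainer` API; the SST-2 task, label count, and single epoch are placeholder choices and are unrelated to the evaluation setup described later in this card.

```python
# A minimal fine-tuning sketch, not taken from the model card: the dataset (SST-2),
# num_labels=2, and the single training epoch are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("olm/olm-roberta-base-dec-2022")
# The classification head on top of the pretrained encoder is newly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    "olm/olm-roberta-base-dec-2022", num_labels=2
)

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    # Dynamic padding is handled by the Trainer's default data collator.
    return tokenizer(batch["sentence"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="olm-roberta-sst2", num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy
```

The same pattern applies to token classification or question answering via the corresponding `AutoModelFor...` classes.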

## Dataset

The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-december-2022-tokenized-512).\
The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
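
If you want to inspect the tokenized pretraining data mentioned above, a short sketch like the following should work; the `train` split name and the use of streaming are assumptions rather than something documented in this card:

```python
# A small sketch (not from the model card) for peeking at the tokenized pretraining data.
from datasets import load_dataset

# Streaming avoids downloading the full dataset up front; the "train" split is assumed.
tokenized = load_dataset(
    "olm/olm-december-2022-tokenized-512", split="train", streaming=True
)

example = next(iter(tokenized))
print(example.keys())                     # column names; "input_ids" is expected but not guaranteed
print(len(example.get("input_ids", [])))  # the dataset name suggests 512-token sequences
```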

## Training

The model was trained according to the OLM BERT/RoBERTa instructions at this [repo](https://github.com/huggingface/olm-training).

## Evaluation results

The model achieves the following results after fine-tuning on the GLUE tasks:

| Task | Metric | Original BERT | OLM RoBERTa Dec 2022 (Ours) |
|:-----|:---------|----------------:|----------------------------:|
|cola |mcc |**0.5889** |0.28067 |
|sst2 |acc |0.9181 |**0.9275** |
|mrpc |acc/f1 |**0.9182**/0.8923|0.8662/**0.9033** |
|stsb |pear/spear|0.8822/0.8794 |**0.8870**/**0.8857** |
|qqp |acc/f1 |0.9071/0.8748 |**0.9097**/**0.8791** |
|mnli |acc/acc_mm|0.8400/0.8410 |**0.8576**/**0.8621** |
|qnli |acc |0.9075 |**0.9192** |
|rte |acc |0.6296 |**0.6390** |
|wnli |acc |0.4000 |**0.4648** |

For both the original BERT and our model, we used the Hugging Face run_glue.py script [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
For both models, we used the default fine-tuning hyperparameters and averaged the results over five training seeds. These are results on the GLUE dev sets, which can differ a bit from results on the test sets.
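
To reproduce a row of this table, the loop below sketches the procedure described above: it invokes run_glue.py with its default hyperparameters once per seed for a single task. The seed values and output paths are illustrative; the actual seeds behind the table are not recorded in this card.

```python
# Rough reproduction sketch (not from the model card): run run_glue.py with default
# hyperparameters over several seeds for one GLUE task and one of the two models.
import subprocess

model = "olm/olm-roberta-base-dec-2022"  # or "bert-base-cased" for the baseline column
task = "rte"                             # any GLUE task name from the table above
seeds = [1, 2, 3, 4, 5]                  # illustrative; the actual seeds are not documented

for seed in seeds:
    subprocess.run(
        [
            "python", "run_glue.py",
            "--model_name_or_path", model,
            "--task_name", task,
            "--do_train",
            "--do_eval",
            "--seed", str(seed),
            "--output_dir", f"{task}-seed-{seed}",
        ],
        check=True,
    )

# Each run writes its dev-set metrics under its output_dir; average those numbers
# across seeds to get figures comparable to the table above.
```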
config.json ADDED
@@ -0,0 +1,27 @@
{
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
logs/1672179272.9817288/events.out.tfevents.1672179272.tristan-olm-training-a100-80.4584.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4a6532e37ffb79672235ffdf1bbf2b1fb4f35211e82bfd9b3d9cbb1332f24604
size 5515
logs/events.out.tfevents.1672179272.tristan-olm-training-a100-80.4584.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0f2098e2c57fddbdcdf2a067e122b816daabe5916ad99dc74196e9999e799000
size 163271
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d4a6603680a1cbfc2d21ff0e20cb2932be649e5db2cddf8172c9b777a5f247e
size 498863417
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
{
  "add_prefix_space": false,
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "name_or_path": "Tristan/olm-tokenizer",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "special_tokens_map_file": null,
  "tokenizer_class": "RobertaTokenizer",
  "trim_offsets": true,
  "unk_token": "<unk>"
}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1820be18be126368b97f91ea51f07b8b4bd2d32b3143c8e878d855ed59c34ae5
size 3451
vocab.json ADDED
The diff for this file is too large to render. See raw diff