ondfa committed on
Commit
28bc819
1 Parent(s): f823e28

initial commit

Files changed (6)
  1. README.md +140 -0
  2. config.json +44 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tokenizer_config.json +1 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,140 @@
# CZERT
This repository contains the trained Czert-B-base-cased-long-zero-shot model for the paper [Czert – Czech BERT-like Model for Language Representation](https://arxiv.org/abs/2103.13031).
For more information, see the paper.

This is the long version of Czert-B-base-cased, created without any fine-tuning on long documents. The positional embeddings were created by simply repeating the positional embeddings of the original Czert-B model.

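Purely as an illustration of that construction (the actual conversion script is not part of this commit), repeating the 512 learned position vectors of the base model until the long position table is filled could look like the sketch below. The base repo id is taken from tokenizer_config.json in this commit and the 4096-position target follows config.json; everything else is an assumption.

```python
from transformers import AutoModel

# Base (short) Czert-B model; repo id as referenced in tokenizer_config.json.
base = AutoModel.from_pretrained("UWB-AIR/Czert-B-base-cased")

old_pos = base.embeddings.position_embeddings.weight.data   # shape (512, 768)
target_len = 4096                                            # long-document window

# Tile the 512 learned position embeddings until the long window is filled.
repeats = (target_len + old_pos.size(0) - 1) // old_pos.size(0)
new_pos = old_pos.repeat(repeats, 1)[:target_len]            # shape (4096, 768)

print(new_pos.shape)  # this matrix would become the long model's position table
```
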
## Available Models
You can download **MLM & NSP only** pretrained models
~~[CZERT-A-v1](https://air.kiv.zcu.cz/public/CZERT-A-czert-albert-base-uncased.zip)
[CZERT-B-v1](https://air.kiv.zcu.cz/public/CZERT-B-czert-bert-base-cased.zip)~~

After some additional experiments, we found out that the tokenizer config was exported wrongly. In Czert-B-v1, the tokenizer parameter "do_lower_case" was wrongly set to true. In Czert-A-v1, the parameter "strip_accents" was incorrectly set to true.

Both mistakes are repaired in v2.
[CZERT-A-v2](https://air.kiv.zcu.cz/public/CZERT-A-v2-czert-albert-base-uncased.zip)
[CZERT-B-v2](https://air.kiv.zcu.cz/public/CZERT-B-v2-czert-bert-base-cased.zip)

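If you still work with the v1 checkpoints, the two misconfigured flags can also be overridden explicitly when loading the tokenizer. A minimal sketch, assuming a locally extracted v1 checkpoint (the path is a placeholder):

```python
from transformers import BertTokenizerFast

# Placeholder path to an extracted CZERT-B-v1 checkpoint; the two keyword
# arguments override the values that were exported incorrectly in v1.
tokenizer = BertTokenizerFast.from_pretrained(
    "path/to/CZERT-B-czert-bert-base-cased",
    do_lower_case=False,
    strip_accents=False,
)
```
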
or choose one of the **Finetuned Models**

| | Models |
| - | - |
| Sentiment Classification<br> (Facebook or CSFD) | [CZERT-A-sentiment-FB](https://air.kiv.zcu.cz/public/CZERT-A_fb.zip) <br> [CZERT-B-sentiment-FB](https://air.kiv.zcu.cz/public/CZERT-B_fb.zip) <br> [CZERT-A-sentiment-CSFD](https://air.kiv.zcu.cz/public/CZERT-A_csfd.zip) <br> [CZERT-B-sentiment-CSFD](https://air.kiv.zcu.cz/public/CZERT-B_csfd.zip) |
| Semantic Text Similarity<br> (Czech News Agency) | [CZERT-A-sts-CNA](https://air.kiv.zcu.cz/public/CZERT-A-sts-CNA.zip) <br> [CZERT-B-sts-CNA](https://air.kiv.zcu.cz/public/CZERT-B-sts-CNA.zip) |
| Named Entity Recognition | [CZERT-A-ner-CNEC](https://air.kiv.zcu.cz/public/CZERT-A-ner-CNEC-cased.zip) <br> [CZERT-B-ner-CNEC](https://air.kiv.zcu.cz/public/CZERT-B-ner-CNEC-cased.zip) <br> [PAV-ner-CNEC](https://air.kiv.zcu.cz/public/PAV-ner-CNEC-cased.zip) <br> [CZERT-A-ner-BSNLP](https://air.kiv.zcu.cz/public/CZERT-A-ner-BSNLP-cased.zip) <br> [CZERT-B-ner-BSNLP](https://air.kiv.zcu.cz/public/CZERT-B-ner-BSNLP-cased.zip) <br> [PAV-ner-BSNLP](https://air.kiv.zcu.cz/public/PAV-ner-BSNLP-cased.zip) |
| Morphological Tagging | [CZERT-A-morphtag-126k](https://air.kiv.zcu.cz/public/CZERT-A-morphtag-126k-cased.zip) <br> [CZERT-B-morphtag-126k](https://air.kiv.zcu.cz/public/CZERT-B-morphtag-126k-cased.zip) |
| Semantic Role Labelling | [CZERT-A-srl](https://air.kiv.zcu.cz/public/CZERT-A-srl-cased.zip) <br> [CZERT-B-srl](https://air.kiv.zcu.cz/public/CZERT-B-srl-cased.zip) |


## How to Use CZERT?

### Sentence Level Tasks
We evaluate our model on two sentence-level tasks:
* Sentiment Classification,
* Semantic Text Similarity.

<!--
from transformers import AutoModelForSequenceClassification, BertTokenizerFast, TFAlbertForSequenceClassification

# CZERT_MODEL_PATH points to a downloaded CZERT checkpoint directory.
tokenizer = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)
model = TFAlbertForSequenceClassification.from_pretrained(CZERT_MODEL_PATH, num_labels=1)

# or, loading the TensorFlow checkpoint into a PyTorch classification model:
self.tokenizer = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)
self.model_encoder = AutoModelForSequenceClassification.from_pretrained(CZERT_MODEL_PATH, from_tf=True)
-->
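
For the long zero-shot model in this repository (a Longformer-style checkpoint, see config.json below), loading it through the standard transformers classes might look like the following sketch. The hub id is assumed from the model name; a local path to this repository works the same way.

```python
import torch
from transformers import AutoModel, BertTokenizerFast

MODEL_ID = "UWB-AIR/Czert-B-base-cased-long-zero-shot"  # assumed hub id

# The committed vocab.txt / tokenizer_config.json are BERT-style (WordPiece),
# so the BERT tokenizer class is used here, as in the commented snippet above.
tokenizer = BertTokenizerFast.from_pretrained(MODEL_ID, strip_accents=False)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a long Czech document; config.json allows roughly 4096 positions.
inputs = tokenizer(
    "Dlouhý český dokument ...",
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs[0].shape)  # contextual token embeddings: (batch, seq_len, hidden)
```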

### Document Level Tasks
We evaluate our model on one document-level task:
* Multi-label Document Classification.

### Token Level Tasks
We evaluate our model on three token-level tasks:
* Named Entity Recognition,
* Morphological Tagging,
* Semantic Role Labelling.

## Downstream Tasks Fine-tuning Results

### Sentiment Classification
| | mBERT | SlavicBERT | ALBERT-r | Czert-A | Czert-B |
|:----:|:------------------------:|:------------------------:|:------------------------:|:-----------------------:|:--------------------------------:|
| FB | 71.72 ± 0.91 | 73.87 ± 0.50 | 59.50 ± 0.47 | 72.47 ± 0.72 | **76.55** ± **0.14** |
| CSFD | 82.80 ± 0.14 | 82.51 ± 0.14 | 75.40 ± 0.18 | 79.58 ± 0.46 | **84.79** ± **0.26** |

Average F1 results for the Sentiment Classification task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Semantic Text Similarity

| | **mBERT** | **Pavlov** | **Albert-random** | **Czert-A** | **Czert-B** |
|:-------------|:--------------:|:--------------:|:-----------------:|:--------------:|:----------------------:|
| STS-CNA | 83.335 ± 0.063 | 83.593 ± 0.050 | 43.184 ± 0.125 | 82.942 ± 0.106 | **84.345** ± **0.028** |
| STS-SVOB-img | 79.367 ± 0.486 | 79.900 ± 0.810 | 15.739 ± 2.992 | 79.444 ± 0.338 | **83.744** ± **0.395** |
| STS-SVOB-hl | 78.833 ± 0.296 | 76.996 ± 0.305 | 33.949 ± 1.807 | 75.089 ± 0.806 | **79.827 ± 0.469** |

Comparison of Pearson correlation achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on semantic text similarity. For more information, see [the paper](https://arxiv.org/abs/2103.13031).


### Multi-label Document Classification
| | mBERT | SlavicBERT | ALBERT-r | Czert-A | Czert-B |
|:-----:|:------------:|:------------:|:------------:|:------------:|:-------------------:|
| AUROC | 97.62 ± 0.08 | 97.80 ± 0.06 | 94.35 ± 0.13 | 97.49 ± 0.07 | **98.00** ± **0.04** |
| F1 | 83.04 ± 0.16 | 84.08 ± 0.14 | 72.44 ± 0.22 | 82.27 ± 0.17 | **85.06** ± **0.11** |

Comparison of F1 and AUROC scores achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on multi-label document classification. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Morphological Tagging
| | mBERT | Pavlov | Albert-random | Czert-A | Czert-B |
|:-----------------------|:---------------|:---------------|:---------------|:---------------|:---------------|
| Universal Dependencies | 99.176 ± 0.006 | 99.211 ± 0.008 | 96.590 ± 0.096 | 98.713 ± 0.008 | **99.300 ± 0.009** |

Comparison of F1 score achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on the morphological tagging task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Semantic Role Labelling

<div id="tab:SRL">

| | mBERT | Pavlov | Albert-random | Czert-A | Czert-B | dep-based | gold-dep |
|:------:|:----------:|:----------:|:-------------:|:----------:|:----------:|:---------:|:--------:|
| span | 78.547 ± 0.110 | 79.333 ± 0.080 | 51.365 ± 0.423 | 72.254 ± 0.172 | **81.861 ± 0.102** | \- | \- |
| syntax | 90.226 ± 0.224 | 90.492 ± 0.040 | 80.747 ± 0.131 | 80.319 ± 0.054 | **91.462 ± 0.062** | 85.19 | 89.52 |

SRL results – the dep columns are evaluated with labelled F1 from the CoNLL 2009 evaluation script; the other columns are evaluated with the same span F1 score as used for the NER evaluation. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

</div>
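
For readers unfamiliar with the span-based metric mentioned above, a minimal, generic span-F1 computation (not the exact evaluation script used in the paper) could look like this:

```python
def span_f1(gold_spans, pred_spans):
    """F1 over labelled spans given as (start, end, label) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: two gold arguments, one predicted with the correct boundaries.
gold = [(0, 2, "ARG0"), (5, 9, "ARG1")]
pred = [(0, 2, "ARG0"), (5, 8, "ARG1")]
print(round(span_f1(gold, pred), 3))  # 0.5
```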

### Named Entity Recognition
| | mBERT | Pavlov | Albert-random | Czert-A | Czert-B |
|:-----------|:---------------|:---------------|:---------------|:---------------|:---------------|
| CNEC | **86.225 ± 0.208** | **86.565 ± 0.198** | 34.635 ± 0.343 | 72.945 ± 0.227 | 86.274 ± 0.116 |
| BSNLP 2019 | 84.006 ± 1.248 | **86.699 ± 0.370** | 19.773 ± 0.938 | 48.859 ± 0.605 | **86.729 ± 0.344** |

Comparison of F1 score achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on the named entity recognition task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).


## Licence
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/

## How should I cite CZERT?
For now, please cite [the arXiv paper](https://arxiv.org/abs/2103.13031):
```
@article{sido2021czert,
      title={Czert -- Czech BERT-like Model for Language Representation},
      author={Jakub Sido and Ondřej Pražák and Pavel Přibáň and Jan Pašek and Michal Seják and Miloslav Konopík},
      year={2021},
      eprint={2103.13031},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      journal={arXiv preprint arXiv:2103.13031},
}
```
config.json ADDED
@@ -0,0 +1,44 @@
{
  "_name_or_path": "allenai/longformer-base-4096",
  "architectures": [
    "LongformerModel"
  ],
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "ignore_attention_mask": false,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "return_dict": false,
  "sep_token_id": 2,
  "transformers_version": "4.2.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
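
These settings can be sanity-checked after download; a minimal sketch, assuming the same hub id used in the examples above (a local path to this repository works as well):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("UWB-AIR/Czert-B-base-cased-long-zero-shot")

print(config.model_type)               # "longformer"
print(config.max_position_embeddings)  # 4098: the 4096-token window plus the offset of 2
                                       # inherited from allenai/longformer-base-4096
print(config.attention_window)         # [512, ..., 512] – one local attention window per layer
```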
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0dfb83b333dedcf410912ad14e8558a51dd8d6713a55b54087e3d83ba79c52aa
size 534127422
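
This blob is only a Git LFS pointer; the roughly 534 MB of actual weights are fetched by Git LFS, or programmatically via the Hugging Face hub client as in the sketch below (the repository id is assumed).

```python
from huggingface_hub import hf_hub_download

# Downloads (and caches) the real weight file that the LFS pointer above refers to.
weights_path = hf_hub_download(
    repo_id="UWB-AIR/Czert-B-base-cased-long-zero-shot",
    filename="pytorch_model.bin",
)
print(weights_path)
```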
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": false, "special_tokens_map_file": "/home/prazak/.cache/huggingface/transformers/0f58b06b684586c2df51e8c72c2bc7363131bc26d6696823fbe13a3a7ea9ac29.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d", "name_or_path": "UWB-AIR/Czert-B-base-cased", "do_basic_tokenize": true, "never_split": null}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff