ondfa committed
Commit a84bfa3 • 1 Parent(s): 7f5064b

initial commit

Files changed (6)
  1. README.md +109 -0
  2. config.json +22 -0
  3. special_tokens_map.json +1 -0
  4. tf_model.h5 +3 -0
  5. tokenizer_config.json +1 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,109 @@
# CZERT
This repository contains the trained Czert-B model for the paper [Czert – Czech BERT-like Model for Language Representation](https://arxiv.org/abs/2103.13031).
For more information, see the paper.

## How to Use CZERT?

### Sentence Level Tasks
We evaluate our model on two sentence-level tasks:
* Sentiment Classification,
* Semantic Text Similarity.

The tokenizer and a fine-tuning model can be loaded as follows (`CZERT_MODEL_PATH` is a placeholder for the local path to the downloaded model files):

```python
from transformers import BertTokenizerFast, TFAlbertForSequenceClassification, AutoModelForSequenceClassification

CZERT_MODEL_PATH = "path/to/czert"  # placeholder: local path to the model files in this repository

# TensorFlow sequence-classification head (num_labels=1 for regression tasks such as STS):
tokenizer = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)
model = TFAlbertForSequenceClassification.from_pretrained(CZERT_MODEL_PATH, num_labels=1)

# or, in PyTorch, loading the TensorFlow checkpoint:
tokenizer_pt = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)
model_pt = AutoModelForSequenceClassification.from_pretrained(CZERT_MODEL_PATH, from_tf=True)
```
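
As a quick illustration (not taken from the paper), the TensorFlow model loaded above can be run on a Czech sentence pair:

```python
# Illustrative inference with the TensorFlow model loaded above.
inputs = tokenizer(
    "Dnes je krásný den.",
    "Počasí je dnes pěkné.",
    return_tensors="tf",
    truncation=True,
    padding=True,
)
outputs = model(inputs)  # TFSequenceClassifierOutput; logits has shape (1, num_labels)
print(outputs.logits.numpy())
```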

### Document Level Tasks
We evaluate our model on one document-level task:
* Multi-label Document Classification.

### Token Level Tasks
We evaluate our model on three token-level tasks (a loading sketch follows the list):
* Named Entity Recognition,
* Morphological Tagging,
* Semantic Role Labelling.

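For the token-level tasks, a fine-tuning model can be loaded analogously. The snippet below is an illustrative sketch; the `num_labels` value and the `path/to/czert` path are assumptions, not values from the paper:

```python
from transformers import BertTokenizerFast, AutoModelForTokenClassification

CZERT_MODEL_PATH = "path/to/czert"  # placeholder: local path to the model files

tokenizer = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)
# num_labels depends on the downstream tag set (e.g. NER labels or morphological tags).
model = AutoModelForTokenClassification.from_pretrained(
    CZERT_MODEL_PATH, from_tf=True, num_labels=9
)
```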

## Downstream Tasks Fine-tuning Results

### Sentiment Classification
|      |    mBERT     |  SlavicBERT  |   ALBERT-r   |   Czert-A    |     Czert-B      |
|:----:|:------------:|:------------:|:------------:|:------------:|:----------------:|
|  FB  | 71.72 ± 0.91 | 73.87 ± 0.50 | 59.50 ± 0.47 | 72.47 ± 0.72 | **76.55 ± 0.14** |
| CSFD | 82.80 ± 0.14 | 82.51 ± 0.14 | 75.40 ± 0.18 | 79.58 ± 0.46 | **84.79 ± 0.26** |

Average F1 results for the Sentiment Classification task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Semantic Text Similarity

|              |   **mBERT**    |   **Pavlov**   | **Albert-random** |  **Czert-A**   |    **Czert-B**     |
|:-------------|:--------------:|:--------------:|:-----------------:|:--------------:|:------------------:|
| STA-CNA      | 83.335 ± 0.063 | 83.593 ± 0.050 |  43.184 ± 0.125   | 82.942 ± 0.106 | **84.345 ± 0.028** |
| STS-SVOB-img | 79.367 ± 0.486 | 79.900 ± 0.810 |  15.739 ± 2.992   | 79.444 ± 0.338 | **83.744 ± 0.395** |
| STS-SVOB-hl  | 78.833 ± 0.296 | 76.996 ± 0.305 |  33.949 ± 1.807   | 75.089 ± 0.806 | **79.827 ± 0.469** |

Comparison of Pearson correlation achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on semantic text similarity. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Multi-label Document Classification
|       |    mBERT     |  SlavicBERT  |   ALBERT-r   |   Czert-A    |     Czert-B      |
|:-----:|:------------:|:------------:|:------------:|:------------:|:----------------:|
| AUROC | 97.62 ± 0.08 | 97.80 ± 0.06 | 94.35 ± 0.13 | 97.49 ± 0.07 | **98.00 ± 0.04** |
| F1    | 83.04 ± 0.16 | 84.08 ± 0.14 | 72.44 ± 0.22 | 82.27 ± 0.17 | **85.06 ± 0.11** |

Comparison of F1 and AUROC scores achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on multi-label document classification. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Morphological Tagging
|                        | mBERT          | Pavlov         | Albert-random  | Czert-A        | Czert-B            |
|:-----------------------|:---------------|:---------------|:---------------|:---------------|:-------------------|
| Universal Dependencies | 99.176 ± 0.006 | 99.211 ± 0.008 | 96.590 ± 0.096 | 98.713 ± 0.008 | **99.300 ± 0.009** |

Comparison of F1 score achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on the morphological tagging task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Semantic Role Labelling

<div id="tab:SRL">

|        | mBERT          | Pavlov             | Albert-random  | Czert-A        | Czert-B            | dep-based | gold-dep |
|:------:|:--------------:|:------------------:|:--------------:|:--------------:|:------------------:|:---------:|:--------:|
| span   | 78.547 ± 0.110 | **79.333 ± 0.080** | 51.365 ± 0.423 | 72.254 ± 0.172 | **79.112 ± 0.141** |     -     |    -     |
| syntax | 90.226 ± 0.224 | **90.492 ± 0.040** | 80.747 ± 0.131 | 80.319 ± 0.054 | **90.516 ± 0.047** |   85.19   |  89.52   |

SRL results – the dep columns are evaluated with labelled F1 from the CoNLL 2009 evaluation script; the other columns are evaluated with the span F1 score, the same as used for the NER evaluation. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

</div>
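
The span F1 reported for the span column (and for NER below) is exact-match F1 over labelled spans. As an illustration only, not the evaluation script used in the paper, the metric amounts to:

```python
def span_f1(gold_spans, pred_spans):
    """Exact-match F1 over labelled spans, i.e. (start, end, label) triples."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # predicted spans with exactly matching boundaries and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```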

### Named Entity Recognition
|            | mBERT              | Pavlov             | Albert-random  | Czert-A        | Czert-B            |
|:-----------|:-------------------|:-------------------|:---------------|:---------------|:-------------------|
| CNEC       | **86.225 ± 0.208** | **86.565 ± 0.198** | 34.635 ± 0.343 | 72.945 ± 0.227 | 86.274 ± 0.116     |
| BSNLP 2019 | 84.006 ± 1.248     | **86.699 ± 0.370** | 19.773 ± 0.938 | 48.859 ± 0.605 | **86.729 ± 0.344** |

Comparison of F1 score achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on the named entity recognition task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

## How should I cite CZERT?
For now, please cite [the arXiv paper](https://arxiv.org/abs/2103.13031):
```
@article{sido2021czert,
    title={Czert -- Czech BERT-like Model for Language Representation},
    author={Jakub Sido and Ondřej Pražák and Pavel Přibáň and Jan Pašek and Michal Seják and Miloslav Konopík},
    year={2021},
    eprint={2103.13031},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    journal={arXiv preprint arXiv:2103.13031},
}
```
config.json ADDED
@@ -0,0 +1,22 @@
{
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30000
}
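
For reference, this is an ALBERT-style configuration and can be inspected with Hugging Face Transformers. A minimal sketch, assuming a local copy of this repository at the placeholder path `path/to/czert`:

```python
from transformers import AlbertConfig, TFAlbertModel

# Read the architecture hyperparameters from the config.json shown above.
config = AlbertConfig.from_json_file("path/to/czert/config.json")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)  # 12 768 30000

# Instantiate the backbone; the weights are read from tf_model.h5 in the same directory.
model = TFAlbertModel.from_pretrained("path/to/czert", config=config)
```
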
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e2215f633e3508126bbf793a01f8508f4fa17597d4022cffde9438b29f58c89b
size 63059416
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "strip_accents": false}
vocab.txt ADDED
The diff for this file is too large to render.