---
tags:
- cs
---

# CZERT
This repository hosts the Czert-A model for the paper [Czert – Czech BERT-like Model for Language Representation](https://arxiv.org/abs/2103.13031).
For more information, see the paper.


## Available Models
You can download **MLM & NSP only** pretrained models
~~[CZERT-A-v1](https://air.kiv.zcu.cz/public/CZERT-A-czert-albert-base-uncased.zip)
[CZERT-B-v1](https://air.kiv.zcu.cz/public/CZERT-B-czert-bert-base-cased.zip)~~

After some additional experiments, we found that the tokenizer configs had been exported incorrectly. In Czert-B-v1, the tokenizer parameter "do_lower_case" was wrongly set to true. In Czert-A-v1, the parameter "strip_accents" was wrongly set to true.

Both mistakes are fixed in v2.
[CZERT-A-v2](https://air.kiv.zcu.cz/public/CZERT-A-v2-czert-albert-base-uncased.zip)
[CZERT-B-v2](https://air.kiv.zcu.cz/public/CZERT-B-v2-czert-bert-base-cased.zip)
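
If you still work with a v1 archive, the two flags described above can be overridden when the tokenizer is loaded. The following is a minimal sketch, assuming the Hugging Face `transformers` library and an unzipped model directory. The helper names `tokenizer_overrides` and `load_tokenizer` and the variant labels `"A"` and `"B"` are hypothetical and not part of the CZERT release.

```python
from typing import Dict


def tokenizer_overrides(variant: str) -> Dict[str, bool]:
    """Return keyword overrides that undo the v1 export mistakes.

    `variant` is "A" (Czert-A) or "B" (Czert-B); these labels are illustrative.
    """
    overrides = {
        "A": {"strip_accents": False},  # v1 wrongly exported strip_accents=true
        "B": {"do_lower_case": False},  # v1 wrongly exported do_lower_case=true
    }
    if variant not in overrides:
        raise ValueError(f"unknown CZERT variant: {variant!r}")
    return dict(overrides[variant])


def load_tokenizer(model_dir: str, variant: str):
    """Load a BERT-style fast tokenizer with the corrected flags (assumed setup)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import BertTokenizerFast

    return BertTokenizerFast.from_pretrained(model_dir, **tokenizer_overrides(variant))
```

With the v2 archives the overrides should be unnecessary, since the exported configs already carry the corrected values.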



Alternatively, choose one of the **fine-tuned models**:

| Task | Models |
| - | - |
| Sentiment Classification<br> (Facebook or CSFD) | [CZERT-A-sentiment-FB](https://air.kiv.zcu.cz/public/CZERT-A_fb.zip) <br> [CZERT-B-sentiment-FB](https://air.kiv.zcu.cz/public/CZERT-B_fb.zip) <br> [CZERT-A-sentiment-CSFD](https://air.kiv.zcu.cz/public/CZERT-A_csfd.zip) <br> [CZERT-B-sentiment-CSFD](https://air.kiv.zcu.cz/public/CZERT-B_csfd.zip) |
| Semantic Text Similarity <br> (Czech News Agency) | [CZERT-A-sts-CNA](https://air.kiv.zcu.cz/public/CZERT-A-sts-CNA.zip) <br> [CZERT-B-sts-CNA](https://air.kiv.zcu.cz/public/CZERT-B-sts-CNA.zip) |
| Named Entity Recognition | [CZERT-A-ner-CNEC](https://air.kiv.zcu.cz/public/CZERT-A-ner-CNEC-cased.zip) <br> [CZERT-B-ner-CNEC](https://air.kiv.zcu.cz/public/CZERT-B-ner-CNEC-cased.zip) <br> [PAV-ner-CNEC](https://air.kiv.zcu.cz/public/PAV-ner-CNEC-cased.zip) <br> [CZERT-A-ner-BSNLP](https://air.kiv.zcu.cz/public/CZERT-A-ner-BSNLP-cased.zip) <br> [CZERT-B-ner-BSNLP](https://air.kiv.zcu.cz/public/CZERT-B-ner-BSNLP-cased.zip) <br> [PAV-ner-BSNLP](https://air.kiv.zcu.cz/public/PAV-ner-BSNLP-cased.zip) |
| Morphological Tagging | [CZERT-A-morphtag-126k](https://air.kiv.zcu.cz/public/CZERT-A-morphtag-126k-cased.zip) <br> [CZERT-B-morphtag-126k](https://air.kiv.zcu.cz/public/CZERT-B-morphtag-126k-cased.zip) |
| Semantic Role Labelling | [CZERT-A-srl](https://air.kiv.zcu.cz/public/CZERT-A-srl-cased.zip) <br> [CZERT-B-srl](https://air.kiv.zcu.cz/public/CZERT-B-srl-cased.zip) |





## How to Use CZERT?

### Sentence Level Tasks
We evaluate our model on two sentence level tasks:
* Sentiment Classification,
* Semantic Text Similarity.
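
A fine-tuned sentence-level model can be loaded roughly as follows. This is a minimal sketch, not a definitive recipe: it assumes the Hugging Face `transformers` library, an unzipped archive at a path of your choice, and a TensorFlow checkpoint inside that archive (hence `from_tf=True`, which also requires a `transformers` version that supports this flag). The function names `load_sentence_model` and `score_sentences` are hypothetical.

```python
from typing import List


def load_sentence_model(model_dir: str, from_tf: bool = True):
    """Load the tokenizer and a sequence-classification head from `model_dir`."""
    # Imported lazily so the scoring helper below has no hard dependency.
    from transformers import AutoModelForSequenceClassification, BertTokenizerFast

    # strip_accents=False keeps Czech diacritics intact during tokenization.
    tokenizer = BertTokenizerFast.from_pretrained(model_dir, strip_accents=False)
    # Assumption: the archive holds TensorFlow weights; use from_tf=False otherwise.
    model = AutoModelForSequenceClassification.from_pretrained(model_dir, from_tf=from_tf)
    return tokenizer, model


def score_sentences(tokenizer, model, sentences: List[str]):
    """Tokenize a batch of sentences and return the raw model logits."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # For inference with a real model, wrap this call in torch.no_grad().
    return model(**batch).logits
```

For the sentiment models the logits hold one score per class, while a semantic text similarity model trained with a single output returns one similarity score per input.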



<!--     tokenizer = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)  
    model = TFAlbertForSequenceClassification.from_pretrained(CZERT_MODEL_PATH, num_labels=1)
    
or
    
    self.tokenizer = BertTokenizerFast.from_pretrained(CZERT_MODEL_PATH, strip_accents=False)
    self.model_encoder = AutoModelForSequenceClassification.from_pretrained(CZERT_MODEL_PATH, from_tf=True)
     -->
### Document Level Tasks
We evaluate our model on one document level task:
* Multi-label Document Classification.

### Token Level Tasks
We evaluate our model on three token level tasks:
* Named Entity Recognition,
* Morphological Tagging,
* Semantic Role Labelling. 


## Downstream Tasks Fine-tuning Results

### Sentiment Classification
|      |          mBERT           |        SlavicBERT        |         ALBERT-r         |         Czert-A         |             Czert-B              |
|:----:|:------------------------:|:------------------------:|:------------------------:|:-----------------------:|:--------------------------------:|
|  FB  | 71.72 ± 0.91 | 73.87 ± 0.50 | 59.50 ± 0.47 | 72.47 ± 0.72 | **76.55** ± **0.14** |
| CSFD | 82.80 ± 0.14 | 82.51 ± 0.14 | 75.40 ± 0.18 | 79.58 ± 0.46 | **84.79** ± **0.26** |

Average F1 results for the Sentiment Classification task. For more information, see [the paper](https://arxiv.org/abs/2103.13031). 
                 

### Semantic Text Similarity

|              |   **mBERT**    |   **Pavlov**   | **Albert-random** |  **Czert-A**   |      **Czert-B**       |
|:-------------|:--------------:|:--------------:|:-----------------:|:--------------:|:----------------------:|
| STS-CNA      | 83.335 ± 0.063 | 83.593 ± 0.050 |  43.184 ± 0.125   | 82.942 ± 0.106 | **84.345** ± **0.028** |
| STS-SVOB-img | 79.367 ± 0.486 | 79.900 ± 0.810 |  15.739 ± 2.992   | 79.444 ± 0.338 | **83.744** ± **0.395** |
| STS-SVOB-hl  | 78.833 ± 0.296 | 76.996 ± 0.305 |  33.949 ± 1.807   | 75.089 ± 0.806 | **79.827 ± 0.469** |

Comparison of Pearson correlation achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on semantic text similarity. For more information see [the paper](https://arxiv.org/abs/2103.13031).
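
For reference, the Pearson correlation used as the metric above can be computed as in the following sketch. It is a plain-Python illustration of the standard formula, not the evaluation script used for the reported numbers; the function name `pearson` is hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between gold and predicted similarity scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value of 1 means the predicted scores rank and scale exactly like the gold annotations, and 0 means no linear relationship.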




### Multi-label Document Classification
|       |    mBERT     |  SlavicBERT  |   ALBERT-r   |   Czert-A    |      Czert-B        |
|:-----:|:------------:|:------------:|:------------:|:------------:|:-------------------:|
| AUROC | 97.62 ± 0.08 | 97.80 ± 0.06 | 94.35 ± 0.13 | 97.49 ± 0.07 | **98.00** ± **0.04** |
|  F1   | 83.04 ± 0.16 | 84.08 ± 0.14 | 72.44 ± 0.22 | 82.27 ± 0.17 | **85.06** ± **0.11** |

Comparison of F1 and AUROC scores achieved using pre-trained CZERT-A, CZERT-B, mBERT, SlavicBERT and randomly initialised ALBERT (ALBERT-r) on multi-label document classification. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Morphological Tagging
|                        | mBERT          | Pavlov         | Albert-random  | Czert-A        | Czert-B        |
|:-----------------------|:---------------|:---------------|:---------------|:---------------|:---------------|
| Universal Dependencies | 99.176 ± 0.006 | 99.211 ± 0.008 | 96.590 ± 0.096 | 98.713 ± 0.008 | **99.300 ± 0.009** |

Comparison of F1 score achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on the morphological tagging task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

### Semantic Role Labelling

<div id="tab:SRL">

|        |   mBERT    |   Pavlov   | Albert-random |  Czert-A   |  Czert-B   | dep-based | gold-dep |
|:------:|:----------:|:----------:|:-------------:|:----------:|:----------:|:---------:|:--------:|
|  span  | 78.547 ± 0.110 | 79.333 ± 0.080 |  51.365 ± 0.423   | 72.254 ± 0.172 | **81.861 ± 0.102** |     -     |    -     |
| syntax | 90.226 ± 0.224 | 90.492 ± 0.040 |  80.747 ± 0.131   | 80.319 ± 0.054 | **91.462 ± 0.062** |   85.19   |  89.52   |

SRL results. The dep columns (dep-based and gold-dep) are evaluated with labelled F1 from the CoNLL 2009 evaluation script; the other columns are evaluated with the same span F1 score that was used for the NER evaluation. For more information, see [the paper](https://arxiv.org/abs/2103.13031).

</div>
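
The span F1 score mentioned in the SRL caption can be illustrated with the sketch below. It assumes exact-match scoring over `(start, end, label)` triples, which is a common convention; the actual evaluation scripts behind the reported numbers may differ in detail, and the function name `span_f1` is hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label)


def span_f1(gold: Iterable[Span], predicted: Iterable[Span]) -> float:
    """Exact-match span F1: a span counts only if boundaries and label agree."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 2, "PER"), (5, 7, "LOC")]  # second span has a wrong right boundary
print(span_f1(gold, pred))  # 0.5
```

Because a span with a wrong boundary earns no partial credit, this metric is stricter than token-level accuracy.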


### Named Entity Recognition
|            | mBERT          | Pavlov         | Albert-random  | Czert-A        | Czert-B        |
|:-----------|:---------------|:---------------|:---------------|:---------------|:---------------|
| CNEC       | **86.225 ± 0.208** | **86.565 ± 0.198** | 34.635 ± 0.343 | 72.945 ± 0.227 | 86.274 ± 0.116 |
| BSNLP 2019 | 84.006 ± 1.248 | **86.699 ± 0.370** | 19.773 ± 0.938 | 48.859 ± 0.605 | **86.729 ± 0.344** |

Comparison of F1 score achieved using pre-trained CZERT-A, CZERT-B, mBERT, Pavlov and randomly initialised Albert on the named entity recognition task. For more information, see [the paper](https://arxiv.org/abs/2103.13031).


## Licence
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/

## How should I cite CZERT? 
For now, please cite [the arXiv paper](https://arxiv.org/abs/2103.13031):
```
@article{sido2021czert,
      title={Czert -- Czech BERT-like Model for Language Representation}, 
      author={Jakub Sido and Ondřej Pražák and Pavel Přibáň and Jan Pašek and Michal Seják and Miloslav Konopík},
      year={2021},
      eprint={2103.13031},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      journal={arXiv preprint arXiv:2103.13031},
}
```