---
language: cs
license: MIT
---

# Small-E-Czech

Small-E-Czech is an [Electra](https://arxiv.org/abs/2003.10555)-small model pretrained on a Czech corpus created at Seznam.cz. Like other pretrained models, it should be finetuned on a downstream task of interest before use.

### How to use the discriminator in transformers
```python
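import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("seznam/small-e-czech")
# strip_accents=False keeps Czech diacritics instead of stripping them.
tokenizer = ElectraTokenizerFast.from_pretrained(
    "seznam/small-e-czech", strip_accents=False
)

# Original sentence, shown for comparison only (not used below).
sentence = "Za hory, za doly, mé zlaté parohy"
# Same sentence with "mé" replaced by "kočka" to imitate a generator substitution.
fake_sentence = "Za hory, za doly, kočka zlaté parohy"

# encode() adds [CLS] and [SEP], so add them here too to keep tokens aligned with predictions.
fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    outputs = discriminator(fake_inputs)
# outputs[0] holds one logit per token; the sigmoid maps each logit to a probability.
predictions = torch.sigmoid(outputs[0]).cpu().numpy()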

for token in fake_sentence_tokens:
    print("{:>7s}".format(token), end="")
print()

for prediction in predictions.squeeze():
    print("{:7.1f}".format(prediction), end="")
print()
```

The output shows, for each token, the discriminator's estimated probability that the token does not belong in the sentence (i.e. that it was substituted by the generator):

```
  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
```
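To turn these probabilities into a decision, one option is to flag every token whose probability exceeds a cutoff. The following is a minimal sketch that works on the values printed above, so it runs without loading the model. The helper name `flag_replaced_tokens` and the `0.5` threshold are assumptions made for illustration, not part of the model or the transformers API.

```python
def flag_replaced_tokens(tokens, probabilities, threshold=0.5):
    """Return (token, probability) pairs whose probability exceeds the threshold.

    The helper name and the default threshold are illustrative assumptions.
    """
    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")
    return [(t, p) for t, p in zip(tokens, probabilities) if p > threshold]


# Tokens and probabilities copied from the example output above.
tokens = ["[CLS]", "za", "hory", ",", "za", "dol", "##y", ",",
          "kočka", "zlaté", "paro", "##hy", "[SEP]"]
probabilities = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                 0.8, 0.3, 0.2, 0.1, 0.0]

flagged = flag_replaced_tokens(tokens, probabilities)
print(flagged)  # [('kočka', 0.8)]
```

With this cutoff only `kočka` is flagged, which matches the substitution made in `fake_sentence`. A different threshold may suit other data better.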

### Finetuning

For instructions on finetuning the model on a new task, see the official Hugging Face Transformers [tutorial](https://huggingface.co/transformers/training.html).
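As a rough illustration of what finetuning can look like, the sketch below loads the checkpoint into `ElectraForSequenceClassification` and performs a single gradient step on a batch of labelled texts. It assumes a sentence classification task; the function name `finetune_step`, the number of labels, and the learning rate are hypothetical choices, and a real setup would loop over a dataset for several epochs and evaluate on held-out data.

```python
MODEL_NAME = "seznam/small-e-czech"


def finetune_step(texts, labels, num_labels=2, learning_rate=5e-5):
    """Run one gradient step of sequence classification and return the loss.

    Hypothetical helper for illustration; hyperparameters are assumptions.
    """
    # Imported here so that defining the function does not load the libraries.
    import torch
    from transformers import ElectraForSequenceClassification, ElectraTokenizerFast

    tokenizer = ElectraTokenizerFast.from_pretrained(MODEL_NAME, strip_accents=False)
    # Loads the pretrained encoder and adds a newly initialised classification head.
    model = ElectraForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # Pad and truncate so the texts form one rectangular batch of tensors.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    model.train()
    # Passing labels makes the model compute the classification loss itself.
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

Calling `finetune_step(["To je skvělé.", "To je hrozné."], [1, 0])` with these made-up example sentences and labels would download the checkpoint and return the loss of that single step.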