Small-E-Czech
Small-E-Czech is an Electra-small model pretrained on a Czech web corpus created at Seznam.cz and introduced in an IAAI 2022 paper. Like other pretrained models, it should be finetuned on a downstream task of interest before use. At Seznam.cz, it has helped improve web search ranking, query typo correction, and clickbait title detection. We release it under the CC BY 4.0 license (i.e. allowing commercial use). To raise an issue, please visit our GitHub.
How to use the discriminator in transformers
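from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch

# Load the pretrained discriminator and its tokenizer from the Hugging Face Hub.
discriminator = ElectraForPreTraining.from_pretrained("Seznam/small-e-czech")
tokenizer = ElectraTokenizerFast.from_pretrained("Seznam/small-e-czech")

# Original sentence, and a corrupted copy in which "mé" is replaced by "kočka".
sentence = "Za hory, za doly, mé zlaté parohy"
fake_sentence = "Za hory, za doly, kočka zlaté parohy"

# Tokens for display; encode() adds [CLS] and [SEP], so add them here too.
fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# outputs[0] holds one logit per token; the sigmoid turns it into a probability.
outputs = discriminator(fake_inputs)
predictions = torch.nn.Sigmoid()(outputs[0]).cpu().detach().numpy()

# Print each token above its predicted probability of having been replaced.
for token in fake_sentence_tokens:
    print("{:>7s}".format(token), end="")
print()
for prediction in predictions.squeeze():
    print("{:7.1f}".format(prediction), end="")
print()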
In the output we can see the probabilities of particular tokens not belonging in the sentence (i.e. having been faked by the generator) according to the discriminator:
  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
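To act on these probabilities, one option is to flag every token whose score reaches a cutoff. The sketch below is only an illustration: the helper `flag_replaced` and the 0.5 threshold are assumptions made here, not part of the model or the transformers library, and the token and probability lists are copied by hand from the output above.

```python
# Tokens and probabilities copied from the example output above.
tokens = ["[CLS]", "za", "hory", ",", "za", "dol", "##y", ",",
          "kočka", "zlaté", "paro", "##hy", "[SEP]"]
probs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.3, 0.2, 0.1, 0.0]


def flag_replaced(tokens, probs, threshold=0.5):
    """Hypothetical helper: return (index, token, probability) for every
    token whose probability of being replaced is at or above `threshold`."""
    return [
        (i, tok, p)
        for i, (tok, p) in enumerate(zip(tokens, probs))
        if p >= threshold
    ]


flagged = flag_replaced(tokens, probs)
print(flagged)  # [(8, 'kočka', 0.8)]
```

With the assumed threshold of 0.5, only the substituted word "kočka" is flagged. A lower cutoff such as 0.3 would also flag the neighbouring "zlaté", so the threshold is worth tuning on held-out data for a real application.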
Finetuning
For instructions on how to finetune the model on a new task, see the official Hugging Face Transformers tutorial.
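As a rough illustration of what a single finetuning step looks like, the sketch below trains a sequence classifier for one optimizer step. It makes several assumptions that do not come from the model card: a binary classification task, a toy batch of random token ids, a learning rate of 1e-3, and a tiny randomly initialised Electra configuration so that the example runs without downloading anything. For real finetuning you would load the pretrained weights instead, as noted in the comment.

```python
import torch
from transformers import ElectraConfig, ElectraForSequenceClassification

# Tiny, randomly initialised stand-in so the sketch runs offline (assumption).
# For real finetuning, load the pretrained weights instead:
#   model = ElectraForSequenceClassification.from_pretrained(
#       "Seznam/small-e-czech", num_labels=2)
config = ElectraConfig(
    vocab_size=100,
    embedding_size=16,
    hidden_size=16,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=32,
    num_labels=2,
)
torch.manual_seed(0)
model = ElectraForSequenceClassification(config)

# Hypothetical toy batch: 4 sequences of 8 token ids with binary labels.
input_ids = torch.randint(0, config.vocab_size, (4, 8))
labels = torch.tensor([0, 1, 0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One training step: the model computes the loss itself when labels are passed.
model.train()
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

logits_shape = tuple(outputs.logits.shape)
print(logits_shape)  # one row per sequence, one column per label
```

In practice the random batch would be replaced by tokenized Czech text from your task, and the single step by a full training loop or the `Trainer` API described in the tutorial.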