monologg committed
Commit 31a3ea7
1 Parent(s): fc3894f

upload tf v1 ckpt

README.md DELETED
@@ -1,52 +0,0 @@
- ---
- language: ko
- ---
-
- # KoELECTRA (Small Discriminator)
-
- Pretrained ELECTRA Language Model for Korean (`koelectra-small-discriminator`)
-
- For more detail, please see [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).
-
- ## Usage
-
- ### Load model and tokenizer
-
- ```python
- >>> from transformers import ElectraModel, ElectraTokenizer
-
- >>> model = ElectraModel.from_pretrained("monologg/koelectra-small-discriminator")
- >>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-discriminator")
- ```
-
- ### Tokenizer example
-
- ```python
- >>> from transformers import ElectraTokenizer
- >>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-discriminator")
- >>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
- ['[CLS]', '한국어', 'E', '##L', '##EC', '##T', '##RA', '##를', '공유', '##합니다', '.', '[SEP]']
- >>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'E', '##L', '##EC', '##T', '##RA', '##를', '공유', '##합니다', '.', '[SEP]'])
- [2, 18429, 41, 6240, 15229, 6204, 20894, 5689, 12622, 10690, 18, 3]
- ```
-
- ## Example using ElectraForPreTraining
-
- ```python
- import torch
- from transformers import ElectraForPreTraining, ElectraTokenizer
-
- discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-small-discriminator")
- tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-discriminator")
-
- sentence = "나는 방금 밥을 먹었다."
- fake_sentence = "나는 내일 밥을 먹었다."
-
- fake_tokens = tokenizer.tokenize(fake_sentence)
- fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
-
- discriminator_outputs = discriminator(fake_inputs)
- predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
-
- print(list(zip(fake_tokens, predictions.tolist()[1:-1])))
- ```
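One note on the last example in the removed card: `discriminator_outputs[0]` has shape `(1, seq_len)`, so `predictions.tolist()[1:-1]` slices the length-1 batch list rather than the token list and the final `print` comes out empty. A minimal sketch of the same example with the batch dimension squeezed out first; model name and formula are taken from the deleted README above:

```python
# Sketch only: KoELECTRA small discriminator spotting a replaced token.
# Mirrors the deleted README example, but squeezes the batch dimension
# before dropping the [CLS]/[SEP] positions.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-small-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-discriminator")

fake_sentence = "나는 내일 밥을 먹었다."  # "방금" (just now) swapped for "내일" (tomorrow)

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

logits = discriminator(fake_inputs)[0]                   # shape: (1, seq_len)
predictions = torch.round((torch.sign(logits) + 1) / 2)  # 1 = flagged as replaced
print(list(zip(fake_tokens, predictions.squeeze(0).tolist()[1:-1])))
```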
 
config.json DELETED
@@ -1,20 +0,0 @@
- {
-   "architectures": [
-     "ElectraForPreTraining"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "embedding_size": 128,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 256,
-   "initializer_range": 0.02,
-   "intermediate_size": 1024,
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 512,
-   "model_type": "electra",
-   "num_attention_heads": 4,
-   "num_hidden_layers": 12,
-   "pad_token_id": 0,
-   "type_vocab_size": 2,
-   "vocab_size": 32200
- }
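For context, the removed config.json above pins down the small discriminator's shape: 128-dimensional embeddings projected into a 256-dimensional hidden size, 12 layers, 4 attention heads, and a 32,200-token vocabulary. A minimal sketch, assuming the Hugging Face `transformers` library, of rebuilding an equivalent (randomly initialized) model from those values:

```python
# Sketch only: reconstruct the deleted config with ElectraConfig.
# Every value mirrors the removed config.json above.
from transformers import ElectraConfig, ElectraForPreTraining

config = ElectraConfig(
    vocab_size=32200,
    embedding_size=128,
    hidden_size=256,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    pad_token_id=0,
)

# Untrained model with the same architecture as koelectra-small-discriminator.
model = ElectraForPreTraining(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```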
 
pytorch_model.bin → koelectra-small-v1.tar.gz RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9397dd1796572c590e2a5b39a49e0b031ddf295af46a6ad61c8ff96b6a3a8757
- size 55102217
+ oid sha256:be0413ef3a3c3ac83d8f973cbdc598567d36dd71d0c818ecbbfa587ca0402440
+ size 258579342
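Both sides of this rename are Git LFS pointer files: the repository itself stores only the object's SHA-256 (`oid`) and byte size, while the ~258 MB `koelectra-small-v1.tar.gz` (the TF v1 checkpoint archive) lives in LFS storage. A minimal sketch, assuming a locally downloaded copy at a hypothetical path, of checking a file against those pointer fields:

```python
# Sketch only: verify a downloaded file against the Git LFS pointer above.
# The local path is an assumption; oid/size come from the new pointer file.
import hashlib
import os

path = "koelectra-small-v1.tar.gz"  # hypothetical local download
expected_oid = "be0413ef3a3c3ac83d8f973cbdc598567d36dd71d0c818ecbbfa587ca0402440"
expected_size = 258579342

digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        digest.update(chunk)

assert os.path.getsize(path) == expected_size, "size does not match the pointer"
assert digest.hexdigest() == expected_oid, "sha256 does not match the pointer"
print("local file matches the LFS pointer")
```
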
tokenizer_config.json DELETED
@@ -1,4 +0,0 @@
- {
-   "do_lower_case": false,
-   "model_max_length": 512
- }
 
vocab.txt DELETED
The diff for this file is too large to render.