monologg committed
Commit fcafbd4
1 Parent(s): 68b30cd

upload tf v1 ckpt

README.md DELETED
@@ -1,55 +0,0 @@
- ---
- language: ko
- license: apache-2.0
- tags:
- - korean
- ---
-
- # KoELECTRA v3 (Base Discriminator)
-
- Pretrained ELECTRA Language Model for Korean (`koelectra-base-v3-discriminator`)
-
- For more detail, please see [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).
-
- ## Usage
-
- ### Load model and tokenizer
-
- ```python
- >>> from transformers import ElectraModel, ElectraTokenizer
-
- >>> model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
- >>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
- ```
-
- ### Tokenizer example
-
- ```python
- >>> from transformers import ElectraTokenizer
- >>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
- >>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
- ['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
- >>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
- [2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
- ```
-
- ## Example using ElectraForPreTraining
-
- ```python
- import torch
- from transformers import ElectraForPreTraining, ElectraTokenizer
-
- discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
- tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
-
- sentence = "나는 방금 밥을 먹었다."
- fake_sentence = "나는 내일 밥을 먹었다."
-
- fake_tokens = tokenizer.tokenize(fake_sentence)
- fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
-
- discriminator_outputs = discriminator(fake_inputs)
- predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
-
- print(list(zip(fake_tokens, predictions.squeeze().tolist()[1:-1])))
- ```
 
config.json DELETED
@@ -1,20 +0,0 @@
- {
-   "architectures": [
-     "ElectraForPreTraining"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "hidden_size": 768,
-   "intermediate_size": 3072,
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
-   "embedding_size": 768,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "initializer_range": 0.02,
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 512,
-   "model_type": "electra",
-   "type_vocab_size": 2,
-   "vocab_size": 35000,
-   "pad_token_id": 0
- }
 
pytorch_model.bin → koelectra-base-v3.tar.gz RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f3f249cc07826bc2964a2c0dbbd0d239df4fe55efc03ba33e1bd1584cf56c566
- size 451741507
+ oid sha256:2c27c8ec7bb7034ecc201292e7d79fab762cbd00795021131c999c0f56232730
+ size 1354355275
tokenizer_config.json DELETED
@@ -1,4 +0,0 @@
- {
-   "do_lower_case": false,
-   "model_max_length": 512
- }
 
vocab.txt DELETED
The diff for this file is too large to render. See raw diff
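
The `koelectra-base-v3.tar.gz` introduced by the rename above is the TensorFlow v1 checkpoint archive this commit uploads. Below is a minimal sketch of fetching and unpacking it, assuming the archive lives in the `monologg/koelectra-base-v3-discriminator` repository named in the deleted README (adjust `repo_id` if this commit belongs to a different repo); the extraction directory name is an arbitrary choice.

```python
# Sketch: download the TF v1 checkpoint tarball from this commit and unpack it locally.
# Assumption: the archive sits in the repo named in the deleted README above;
# change repo_id if this commit belongs to a different repository.
import tarfile

from huggingface_hub import hf_hub_download

archive_path = hf_hub_download(
    repo_id="monologg/koelectra-base-v3-discriminator",  # assumed repo id
    filename="koelectra-base-v3.tar.gz",                 # file name from the rename above
)

# Extract the TensorFlow v1 checkpoint files into a local directory.
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall("koelectra-base-v3-tf")
```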