bongsoo committed on
Commit
26c4687
1 Parent(s): 8d35fc2

Update README.md

Files changed (1)
  1. README.md +156 -0
README.md CHANGED
@@ -1,3 +1,159 @@
  ---
  license: apache-2.0
+ pipeline_tag: fill-mask
+ tags:
+ - fill-mask
+ - transformers
+ - en
+ - ko
+ widget:
+ - text: 대한민국의 수도는 [MASK] 입니다.
  ---
+ # mdistilbertV2.1
+
+ - A model trained from distilbert-base-multilingual-cased with an extended vocabulary, using the [moco-corpus-kowiki2022 corpus](https://huggingface.co/datasets/bongsoo/moco-corpus-kowiki2022) (kowiki202206 + 3.2M MOCOMSYS-extracted sentences).
+ - **vocab: 159,552 entries** (the original BERT model vocab of 119,548 entries plus 40,004 added entries: 30,000 Korean words + 10,000 English words + 4 added manually)
+
+ ## Usage (HuggingFace Transformers)
+
+ ### 1. MASK example
+ ```python
+ from transformers import AutoTokenizer, DistilBertForMaskedLM
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1', do_lower_case=False)
+ model = DistilBertForMaskedLM.from_pretrained('bongsoo/mdistilbertV3.1')
+
+ text = ['한국의 수도는 [MASK] 이다', '에펠탑은 [MASK]에 있다', '충무공 이순신은 [MASK]에 최고의 장수였다']
+ tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')
+
+ # Inference only, so gradients are not needed
+ with torch.no_grad():
+     outputs = model(**tokenized_input)
+ logits = outputs.logits
+
+ mask_idx_list = []
+ for tokens in tokenized_input['input_ids'].tolist():
+     token_str = tokenizer.convert_ids_to_tokens(tokens)
+
+     # Find the index of [MASK] in the token_str list;
+     # mask_idx is used below to pick the prediction at that position
+     mask_idx = token_str.index('[MASK]')
+     mask_idx_list.append(mask_idx)
+
+ for idx, mask_idx in enumerate(mask_idx_list):
+     # Most likely token id at every position
+     # (softmax is monotonic, so argmax over the raw logits gives the same result)
+     logits_pred = torch.argmax(logits[idx], dim=1)
+     mask_logits_idx = int(logits_pred[mask_idx])
+     # Token predicted for [MASK]
+     mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
+     # Print the result
+     print('\n')
+     print('*Input: {}'.format(text[idx]))
+     print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
+ ```
+ - Result
+ ```
+ *Input: 한국의 수도는 [MASK] 이다
+ *[MASK] : 서울 (48253)
+
+
+ *Input: 에펠탑은 [MASK]에 있다
+ *[MASK] : 프랑스 (47364)
+
+
+ *Input: 충무공 이순신은 [MASK]에 최고의 장수였다
+ *[MASK] : 임진왜란 (121990)
+ ```
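To make the [MASK] lookup above concrete without downloading the model, here is a minimal pure-Python sketch on made-up data. The four-entry vocabulary, the token list, and every logit value are hypothetical and chosen only for illustration. It finds the [MASK] position, takes the argmax at that position, and checks that applying softmax first does not change which token wins.

```python
import math

# Hypothetical toy vocabulary and tokenized sentence (not the real model's vocab)
vocab = ['[PAD]', '[MASK]', 'Seoul', 'Paris']
tokens = ['[CLS]', 'capital', '[MASK]', '[SEP]']

# Hypothetical logits: one row per token position, one column per vocab entry
logits = [
    [0.1, 0.0, 0.2, 0.3],
    [0.5, 0.1, 0.4, 0.2],
    [0.0, -1.0, 3.2, 1.1],  # row for the [MASK] position
    [0.3, 0.2, 0.1, 0.0],
]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


mask_idx = tokens.index('[MASK]')
pred_from_logits = argmax(logits[mask_idx])
pred_from_probs = argmax(softmax(logits[mask_idx]))

print('mask position:', mask_idx)
print('predicted token:', vocab[pred_from_logits])
print('same winner with softmax:', pred_from_logits == pred_from_probs)
```

Because softmax preserves the ordering of the logits, the softmax call is only needed when the probabilities themselves are of interest, for example to report a confidence score.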
+ ### 2. Embedding example
+ - Uses mean pooling (`mean_pooling`). (See also [CLS pooling](https://huggingface.co/sentence-transformers/bert-base-nli-cls-token), [max pooling](https://huggingface.co/sentence-transformers/bert-base-nli-max-tokens))
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ from sklearn.metrics.pairwise import paired_cosine_distances
+ import torch
+
+
+ # Mean Pooling - Take attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # Sentences we want sentence embeddings for
+ sentences = ['This is an example sentence', 'Each sentence is converted']
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1')
+ model = AutoModel.from_pretrained('bongsoo/mdistilbertV3.1')
+
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ # Perform pooling. In this case, mean pooling.
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+
+ # Compute cosine_scores with sklearn
+ # => the input embeddings must be 2D, e.g. shape (1, 768)
+ cosine_scores = 1 - (paired_cosine_distances(sentence_embeddings[0].reshape(1,-1), sentence_embeddings[1].reshape(1,-1)))
+
+ print(f'*cosine_score:{cosine_scores[0]}')
+ ```
+ - Result
+ ```
+ Sentence embeddings:
+ tensor([[-0.1137, 0.1491, 0.6711, ..., -0.0217, 0.1839, -0.6143],
+ [ 0.0482, -0.0649, 0.5333, ..., 0.1424, -0.0982, -0.3414]])
+ *cosine_score:0.4784715175628662
+ ```
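The arithmetic behind `mean_pooling` and the cosine score can be followed by hand on a tiny example. The sketch below is a pure-Python re-statement of the same idea, not the tensor code used above: the two-dimensional vectors and masks are hypothetical, and padding positions (mask value 0) are excluded from the average.

```python
import math


def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, m in zip(token_embeddings, attention_mask):
        for j in range(dim):
            summed[j] += vec[j] * m
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy inputs: 3 token positions, 2 dimensions; the last position is padding
sent_a = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
sent_b = [[0.0, 1.0], [2.0, 1.0], [0.0, 0.0]]
mask_b = [1, 1, 0]

emb_a = mean_pooling(sent_a, mask_a)
emb_b = mean_pooling(sent_b, mask_b)
score = cosine_similarity(emb_a, emb_b)

print('emb_a:', emb_a)
print('emb_b:', emb_b)
print('cosine similarity:', score)
```

The padded vector `[9.0, 9.0]` has no effect on `emb_a` because its mask is 0. In the README code, `paired_cosine_distances` returns a distance (1 minus the similarity), which is why the score there is computed as `1 - distance`.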
+ ## Training
+
+ **MLM (Masked Language Model) training**
+ - Input model: distilbert-base-multilingual-cased
+ - Corpus: training: bongsoo/moco-corpus-kowiki2022 (7.6M), evaluation: **bongsoo/moco_eva**
+ - Hyperparameters: **learning rate: 5e-5, epochs: 12, batch size: 32, max_token_len: 128**
+ - vocab: **159,552 entries** (the original BERT model vocab of 119,548 entries plus 40,004 added entries: 30,000 Korean words + 10,000 English words + 4 added manually)
+ - Output model: mdistilbertV3.1 (size: 634MB)
+ - Training time: 90h on 1 GPU (16.5 of 24GB used)
+ - Loss: **training loss: 2.1154, evaluation loss: 2.5275**
+ - Training code: see [here](https://github.com/kobongsoo/BERT/blob/master/distilbert/distilbert-MLM-Trainer-V1.2.ipynb)
+ <br>Perplexity evaluation code: see [here](https://github.com/kobongsoo/BERT/blob/master/distilbert/distilbert-perplexity-eval-V1.2.ipynb)
+
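As a rough guide to reading the reported losses, perplexity is commonly defined as the exponential of the mean cross-entropy loss. The sketch below assumes the losses above are mean per-token cross-entropy values in nats, which is typical for Hugging Face MLM training but is not confirmed by the README; the linked evaluation notebook may compute perplexity differently.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity as exp(loss), assuming the loss is a mean cross-entropy in nats."""
    return math.exp(mean_cross_entropy_loss)


# Loss values reported in the Training section
train_loss = 2.1154
eval_loss = 2.5275

train_ppl = perplexity(train_loss)
eval_ppl = perplexity(eval_loss)

print(f'train perplexity: {train_ppl:.2f}')
print(f'eval perplexity:  {eval_ppl:.2f}')
```

Under that assumption the two losses correspond to perplexities of roughly 8.3 (training) and 12.5 (evaluation).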
+ ## Model Config
+ ```
+ {
+   "_name_or_path": "",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForMaskedLM"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "output_past": true,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.21.2",
+   "vocab_size": 159552
+ }
+ ```
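The config values give a feel for how much of the model the enlarged vocabulary accounts for. The following back-of-the-envelope sketch counts only the token-embedding matrix (`vocab_size` x `dim`) and assumes 4 bytes per `float32` parameter; it also re-checks the vocabulary arithmetic stated in the README. It is an estimate, not a measurement of the published checkpoint.

```python
# Values taken from the model config and the README text
vocab_size = 159_552
dim = 768
bytes_per_param = 4  # float32, per "torch_dtype" in the config

# Vocabulary arithmetic from the README: base vocab + Korean + English + manual entries
base_vocab = 119_548
added_vocab = 30_000 + 10_000 + 4
total_vocab = base_vocab + added_vocab

# Size of the token-embedding matrix only (other layers are not counted)
embedding_params = vocab_size * dim
embedding_mb = embedding_params * bytes_per_param / 1e6

print('vocab check:', total_vocab == vocab_size)
print('embedding parameters:', embedding_params)
print(f'embedding size: {embedding_mb:.1f} MB')
```

With these numbers the token embeddings alone come to about 490 MB, a large share of the 634MB reported for the output model.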
+ ## Citing & Authors
+
+ bongsoo