File size: 5,755 Bytes
4fb12d5 26c4687 4fb12d5 26c4687 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 |
---
license: apache-2.0
pipeline_tag: fill-mask
tags:
- fill-mask
- transformers
- en
- ko
widget:
- text: 대한민국의 수도는 [MASK] 입니다.
---
# mdistilbertV2.1
- distilbert-base-multilingual-cased 모델에 [moco-corpus-kowiki2022 말뭉치](https://huggingface.co/datasets/bongsoo/moco-corpus-kowiki2022)(kowiki202206 + MOCOMSYS 추출 3.2M 문장)로 vocab 추가하여 학습 시킨 모델
- **vocab: 159,552개**(기존 bert 모델 vocab(119,548개)에 40,004개 (한글단어30,000개+영문10,000개+수동 4개)vocab 추가
## Usage (HuggingFace Transformers)
### 1. MASK 예시
```python
from transformers import AutoTokenizer, AutoModel, DistilBertForMaskedLM
import torch
import torch.nn.functional as F
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1', do_lower_case=False)
model = DistilBertForMaskedLM.from_pretrained('bongsoo/mdistilbertV3.1')
text = ['한국의 수도는 [MASK] 이다', '에펠탑은 [MASK]에 있다', '충무공 이순신은 [MASK]에 최고의 장수였다']
tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')
outputs = model(**tokenized_input)
logits = outputs.logits
mask_idx_list = []
for tokens in tokenized_input['input_ids'].tolist():
token_str = [tokenizer.convert_ids_to_tokens(s) for s in tokens]
# **위 token_str리스트에서 [MASK] 인덱스를 구함
# => **해당 [MASK] 안덱스 값 mask_idx 에서는 아래 출력하는데 사용됨
mask_idx = token_str.index('[MASK]')
mask_idx_list.append(mask_idx)
for idx, mask_idx in enumerate(mask_idx_list):
logits_pred=torch.argmax(F.softmax(logits[idx]), dim=1)
mask_logits_idx = int(logits_pred[mask_idx])
# [MASK]에 해당하는 token 구함
mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
# 결과 출력
print('\n')
print('*Input: {}'.format(text[idx]))
print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
```
- 결과
```
*Input: 한국의 수도는 [MASK] 이다
*[MASK] : 서울 (48253)
*Input: 에펠탑은 [MASK]에 있다
*[MASK] : 프랑스 (47364)
*Input: 충무공 이순신은 [MASK]에 최고의 장수였다
*[MASK] : 임진왜란 (121990)
```
### 2. 임베딩 예시
- 평균 폴링(mean_pooling) 방식 사용. ([cls 폴링](https://huggingface.co/sentence-transformers/bert-base-nli-cls-token), [max 폴링](https://huggingface.co/sentence-transformers/bert-base-nli-max-tokens))
```python
from transformers import AutoTokenizer, AutoModel
import torch
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1')
model = AutoModel.from_pretrained('bongsoo/mdistilbertV3.1')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
# sklearn 을 이용하여 cosine_scores를 구함
# => 입력값 embeddings 은 (1,768) 처럼 2D 여야 함.
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
cosine_scores = 1 - (paired_cosine_distances(sentence_embeddings[0].reshape(1,-1), sentence_embeddings[1].reshape(1,-1)))
print(f'*cosine_score:{cosine_scores[0]}')
```
- 결과
```
Sentence embeddings:
tensor([[-0.1137, 0.1491, 0.6711, ..., -0.0217, 0.1839, -0.6143],
[ 0.0482, -0.0649, 0.5333, ..., 0.1424, -0.0982, -0.3414]])
*cosine_score:0.4784715175628662
```
## Training
**MLM(Masked Langeuage Model) 훈련**
- 입력 모델 : distilbert-base-multilingual-cased
- 말뭉치 : 훈련 : bongsoo/moco-corpus-kowiki2022(7.6M) , 평가: ** bongsoo/moco_eva **
- HyperParameter : **LearningRate : 5e-5, ** epochs: 12 **, batchsize: 32, max_token_len : 128**
- vocab : **159,552개** (기존 bert 모델 vocab(119,548개)에 40,004개 (한글단어30,000개+영문10,000개+수동 4개)vocab 추가
- 출력 모델 : mdistilbertV3.1 (size: 634MB)
- 훈련시간 : 90h/1GPU (24GB/16.5 use)
- loss : **훈련loss: 2.1154, 평가loss: 2.5275 **
- 훈련코드 [여기](https://github.com/kobongsoo/BERT/blob/master/distilbert/distilbert-MLM-Trainer-V1.2.ipynb) 참조
<br>perplexity 평가 코드는 [여기](https://github.com/kobongsoo/BERT/blob/master/distilbert/distilbert-perplexity-eval-V1.2.ipynb) 참조
## Model Config
```
{
"_name_or_path": "",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"torch_dtype": "float32",
"transformers_version": "4.21.2",
"vocab_size": 159552
}
```
## Citing & Authors
bongsoo
|