---
license: apache-2.0
pipeline_tag: fill-mask
tags:
- fill-mask
- transformers
- en
- ko
widget:
- text: 대한민국의 수도는 [MASK] 입니다.
---

# mdistilbertV2.1

- A model based on distilbert-base-multilingual-cased, with an extended vocabulary and further training on the [moco-corpus-kowiki2022 corpus](https://huggingface.co/datasets/bongsoo/moco-corpus-kowiki2022) (kowiki202206 + MOCOMSYS-extracted data, 3.2M sentences).
- **vocab: 159,552 entries** (40,004 entries added to the original BERT vocab of 119,548: 30,000 Korean words + 10,000 English words + 4 added manually)

## Usage (HuggingFace Transformers)

### 1. MASK example

```python
from transformers import AutoTokenizer, DistilBertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1', do_lower_case=False)
model = DistilBertForMaskedLM.from_pretrained('bongsoo/mdistilbertV3.1')

text = ['한국의 수도는 [MASK] 이다', '에펠탑은 [MASK]에 있다', '충무공 이순신은 [MASK]에 최고의 장수였다']
tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**tokenized_input)
logits = outputs.logits  # shape: (batch, seq_len, vocab_size)

# Find the position of the [MASK] token in each input sequence
mask_idx_list = []
for tokens in tokenized_input['input_ids'].tolist():
    token_str = tokenizer.convert_ids_to_tokens(tokens)
    # This index is used below to read the prediction at the [MASK] position
    mask_idx = token_str.index('[MASK]')
    mask_idx_list.append(mask_idx)

for idx, mask_idx in enumerate(mask_idx_list):
    # argmax over the vocabulary axis; softmax is monotonic, so it is not needed here
    logits_pred = torch.argmax(logits[idx], dim=1)
    mask_logits_idx = int(logits_pred[mask_idx])
    # Token predicted for [MASK]
    mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
    # Print the result
    print('\n')
    print('*Input: {}'.format(text[idx]))
    print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
```

- Result

```
*Input: 한국의 수도는 [MASK] 이다
*[MASK] : 서울 (48253)


*Input: 에펠탑은 [MASK]에 있다
*[MASK] : 프랑스 (47364)


*Input: 충무공 이순신은 [MASK]에 최고의 장수였다
*[MASK] : 임진왜란 (121990)
```

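The token IDs in the output above hint at which predictions come from the extended vocabulary. The following is a minimal sketch, assuming the 40,004 new entries were appended after the original 119,548 entries (so they would occupy IDs 119,548 to 159,551). That layout is an assumption, not something this model card states, and `is_added_token` is a hypothetical helper.

```python
# Vocabulary sizes taken from the model card.
BASE_VOCAB_SIZE = 119_548
ADDED_KOREAN, ADDED_ENGLISH, ADDED_MANUAL = 30_000, 10_000, 4
TOTAL_VOCAB_SIZE = BASE_VOCAB_SIZE + ADDED_KOREAN + ADDED_ENGLISH + ADDED_MANUAL


def is_added_token(token_id: int) -> bool:
    """Hypothetical helper: True if the ID lies in the appended range.

    Assumes new tokens were appended after the original vocabulary.
    """
    if not 0 <= token_id < TOTAL_VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside the vocabulary")
    return token_id >= BASE_VOCAB_SIZE


# Token IDs from the example output above.
predictions = {"서울": 48253, "프랑스": 47364, "임진왜란": 121990}
for token, token_id in predictions.items():
    source = "added vocab" if is_added_token(token_id) else "original vocab"
    print(f"{token} ({token_id}): {source}")
```

Under that assumption, only 임진왜란 (121990) would come from the added entries, while 서울 and 프랑스 would already exist in the base vocabulary.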
### 2. Embedding example

- Uses mean pooling (`mean_pooling`). Other options: [CLS pooling](https://huggingface.co/sentence-transformers/bert-base-nli-cls-token), [max pooling](https://huggingface.co/sentence-transformers/bert-base-nli-max-tokens).

```python
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import paired_cosine_distances


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1')
model = AutoModel.from_pretrained('bongsoo/mdistilbertV3.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Compute the cosine score with sklearn
# => each input embedding must be 2D, e.g. shape (1, 768)
cosine_scores = 1 - paired_cosine_distances(
    sentence_embeddings[0].reshape(1, -1).numpy(),
    sentence_embeddings[1].reshape(1, -1).numpy(),
)

print(f'*cosine_score:{cosine_scores[0]}')
```

- Result

```
Sentence embeddings:
tensor([[-0.1137,  0.1491,  0.6711,  ..., -0.0217,  0.1839, -0.6143],
        [ 0.0482, -0.0649,  0.5333,  ...,  0.1424, -0.0982, -0.3414]])
*cosine_score:0.4784715175628662
```

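To see why the attention mask matters in `mean_pooling`, here is a framework-free sketch on toy numbers. The 2-dimensional embeddings and the helper names `masked_mean` and `cosine_similarity` are illustrative assumptions, not part of the model's API.

```python
import math


def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against division by zero, like torch.clamp in mean_pooling
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = masked_mean(tokens, [1, 1, 1])  # padding wrongly included
print(pooled)  # [2.0, 3.0]
print(naive)
print(cosine_similarity(pooled, naive))
```

With the mask applied, the padding vector does not affect the sentence embedding. Without it, the padding values dominate the average and the resulting vector points in a very different direction.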
## Training

**MLM (Masked Language Model) training**

- Input model: distilbert-base-multilingual-cased
- Corpus: training: bongsoo/moco-corpus-kowiki2022 (7.6M), evaluation: **bongsoo/moco_eva**
- Hyperparameters: **learning rate: 5e-5, epochs: 12, batch size: 32, max_token_len: 128**
- vocab: **159,552 entries** (40,004 entries added to the original BERT vocab of 119,548: 30,000 Korean words + 10,000 English words + 4 added manually)
- Output model: mdistilbertV3.1 (size: 634MB)
- Training time: 90h on 1 GPU (24GB memory, 16.5GB used)
- loss: **training loss: 2.1154, evaluation loss: 2.5275**
- For the training code, see [here](https://github.com/kobongsoo/BERT/blob/master/distilbert/distilbert-MLM-Trainer-V1.2.ipynb).

<br>For the perplexity evaluation code, see [here](https://github.com/kobongsoo/BERT/blob/master/distilbert/distilbert-perplexity-eval-V1.2.ipynb).

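The reported losses can be converted to perplexity. This is a minimal sketch, assuming the losses are mean per-token cross-entropy values in nats (the usual form of a masked-LM loss), in which case perplexity is `exp(loss)`. The linked evaluation notebook may compute it differently.

```python
import math

# Loss values reported in the training summary.
train_loss = 2.1154
eval_loss = 2.5275

# Assumption: loss is mean cross-entropy per predicted token, in nats.
train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"train perplexity: {train_ppl:.2f}")
print(f"eval perplexity:  {eval_ppl:.2f}")  # about 12.5
```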
## Model Config

```
{
  "_name_or_path": "",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "vocab_size": 159552
}
```

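The config also shows where the extended vocabulary costs memory. Below is a back-of-the-envelope sketch, assuming the word-embedding matrix has shape `vocab_size × dim` and is stored as float32 (4 bytes per value, per `torch_dtype`). It ignores every other weight in the model.

```python
# Values from the config above.
vocab_size = 159_552
dim = 768
bytes_per_param = 4  # float32

# Original vocabulary size of the base model, from the model card.
base_vocab_size = 119_548

embedding_params = vocab_size * dim
added_params = (vocab_size - base_vocab_size) * dim
embedding_mib = embedding_params * bytes_per_param / 2**20

print(f"word-embedding parameters: {embedding_params:,}")  # 122,535,936
print(f"parameters for added tokens: {added_params:,}")    # 30,723,072
print(f"word-embedding size: {embedding_mib:.1f} MiB")     # 467.4 MiB
```

Under these assumptions, the embedding table alone accounts for a large share of the reported 634MB model size.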
## Citing & Authors

bongsoo