---
license: apache-2.0
tags:
- fill-mask
- transformers
- en
- ko
datasets: bongsoo/bongeval
pipeline_tag: fill-mask
widget:
- text: 대한민국의 수도는 [MASK] 입니다.
---
# mbertV2.0
- A model trained by adding vocabulary to bert-base-multilingual-cased using the [moco-corpus-kowiki2022 corpus](https://huggingface.co/datasets/bongsoo/moco-corpus-kowiki2022) (3.2M sentences extracted from kowiki202206 + MOCOMSYS)
- **vocab: 152,537 tokens** (32,989 tokens added to the original BERT vocab of 119,548)
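
The extended vocabulary can be verified directly from the published tokenizer; the snippet below is a minimal check, not part of the original card.

```python
from transformers import AutoTokenizer

# Load the published tokenizer and report its vocabulary size;
# this should reflect the extended vocab of 152,537 tokens.
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mbertV2.0')
print(len(tokenizer))  # expected: 152537
```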
## Usage (HuggingFace Transformers)
### 1. MASK example
```python
from transformers import AutoTokenizer, BertForMaskedLM
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('bongsoo/mbertV2.0', do_lower_case=False)
model = BertForMaskedLM.from_pretrained('bongsoo/mbertV2.0')

text = ['한국의 수도는 [MASK] 이다', '에펠탑은 [MASK]에 있다', '충무공 이순신은 [MASK]에 최고의 장수였다']
tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')

outputs = model(**tokenized_input)
logits = outputs.logits

mask_idx_list = []
for tokens in tokenized_input['input_ids'].tolist():
    token_str = [tokenizer.convert_ids_to_tokens(s) for s in tokens]
    # Find the position of [MASK] in the token list;
    # mask_idx is used below when printing the prediction.
    mask_idx = token_str.index('[MASK]')
    mask_idx_list.append(mask_idx)

for idx, mask_idx in enumerate(mask_idx_list):
    logits_pred = torch.argmax(F.softmax(logits[idx], dim=-1), dim=-1)
    mask_logits_idx = int(logits_pred[mask_idx])
    # Look up the token predicted for the [MASK] position
    mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
    # Print the result
    print('\n')
    print('*Input: {}'.format(text[idx]))
    print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
```
- Output
```
*Input: 한국의 수도는 [MASK] 이다
*[MASK] : 서울 (48253)
*Input: 에펠탑은 [MASK]에 있다
*[MASK] : 런던 (120350)
*Input: 충무공 이순신은 [MASK]에 최고의 장수였다
*[MASK] : 조선 (59906)
```
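
For quick experiments, the same predictions can also be obtained with the fill-mask pipeline; this is a minimal sketch rather than the card's original example.

```python
from transformers import pipeline

# Fill-mask pipeline built on the same checkpoint (sketch, not from the original card)
fill_mask = pipeline('fill-mask', model='bongsoo/mbertV2.0')
for pred in fill_mask('한국의 수도는 [MASK] 이다'):
    print(pred['token_str'], round(pred['score'], 4))
```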
### 2. Embedding example
- Uses mean pooling; [CLS pooling](https://huggingface.co/sentence-transformers/bert-base-nli-cls-token) and [max pooling](https://huggingface.co/sentence-transformers/bert-base-nli-max-tokens) are alternatives.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mbertV2.0')
model = AutoModel.from_pretrained('bongsoo/mbertV2.0')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Compute the cosine score with sklearn
# => the input embeddings must be 2D, e.g. shape (1, 768)
from sklearn.metrics.pairwise import paired_cosine_distances
cosine_scores = 1 - (paired_cosine_distances(sentence_embeddings[0].reshape(1,-1), sentence_embeddings[1].reshape(1,-1)))
print(f'*cosine_score:{cosine_scores[0]}')
```
- Output
```
*cosine_score:0.5596463680267334
```
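
The same score can be computed without scikit-learn using PyTorch's built-in cosine similarity; this is a small optional variation on the example above, reusing its `sentence_embeddings`.

```python
import torch.nn.functional as F

# Cosine similarity directly in PyTorch (equivalent to the sklearn computation above)
cos = F.cosine_similarity(sentence_embeddings[0].unsqueeze(0), sentence_embeddings[1].unsqueeze(0))
print(f'*cosine_score:{cos.item()}')
```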
## Training
**MLM (Masked Language Model) training**
- Input model: bert-base-multilingual-cased (vocab: 119,548)
- Corpus: training: bongsoo/moco-corpus-kowiki2022 (7.6M), evaluation: bongsoo/bongevalsmall (200)
- Hyperparameters: **learning rate: 5e-5, epochs: 8, batch size: 32, max token length: 128**
- vocab: **152,537 tokens** (32,989 new tokens added to the original 119,548)
- Output model: mbertV2.0 (size: 813MB)
- Training time: 90h on 1 GPU (24GB, 19.6GB used)
- Loss: **training loss: 2.258400, evaluation loss: 3.102096, perplexity: 19.78158** ([bongsoo/bongeval](https://huggingface.co/datasets/bongsoo/bongeval): 1,500 samples)
- The training code is [here](https://github.com/kobongsoo/BERT/blob/master/bert/bert-MLM-Trainer-V1.2.ipynb); a rough setup sketch is shown after this list.
<br>The perplexity evaluation code is [here](https://github.com/kobongsoo/BERT/blob/master/bert/bert-perplexity-eval-V1.2.ipynb).
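
The snippet below is a minimal, hypothetical sketch of an MLM fine-tuning setup with the hyperparameters listed above, using the Hugging Face Trainer. It is not the linked training notebook: it assumes the corpus exposes a `text` column, and the 32,989 new vocab entries (`new_tokens`) are a placeholder.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Start from the multilingual base model and extend its vocabulary
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
# new_tokens would hold the 32,989 additional vocab entries (not shown here)
# tokenizer.add_tokens(new_tokens)
model = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')
model.resize_token_embeddings(len(tokenizer))

# Tokenize the training corpus (assumes a "text" column)
dataset = load_dataset('bongsoo/moco-corpus-kowiki2022', split='train')
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=128, padding='max_length')
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Standard MLM collator with 15% masking
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='mbertV2.0',
    learning_rate=5e-5,
    num_train_epochs=8,
    per_device_train_batch_size=32,
)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```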
## Model Config
```
{
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 152537
}
```
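
The same configuration can also be inspected programmatically; this is a small optional check, not part of the original card.

```python
from transformers import AutoConfig

# Load the model configuration from the Hub and print a few key fields
config = AutoConfig.from_pretrained('bongsoo/mbertV2.0')
print(config.vocab_size, config.hidden_size, config.num_hidden_layers)
```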
## Citing & Authors
bongsoo