
mdistilbertV3.1

  • A model built by adding vocab to distilbert-base-multilingual-cased and continuing training on the moco-corpus-kowiki2022 corpus (kowiki202206 + 3.2M sentences extracted from MOCOMSYS).
  • vocab: 159,552 entries (40,004 entries added to the original BERT vocab of 119,548: 30,000 Korean words + 10,000 English words + 4 manual entries); a sketch of how such vocab expansion is typically done appears after this list.
  • Contains about 7,000 more words than mdistilbertV2.1; the Korean words were extracted with mecab.
  • Trained for 12 epochs (mdistilbertV2.1 was trained for 8).
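
The vocab-expansion step itself is not included in this card. Below is a minimal sketch, assuming the extracted words are available in a plain-text file (new_vocab.txt is a hypothetical filename), of how extra tokens are typically added to a Hugging Face tokenizer and the embedding matrix resized to match:

from transformers import AutoTokenizer, DistilBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-multilingual-cased')

# Hypothetical file holding the extracted Korean/English words, one per line
with open('new_vocab.txt', encoding='utf-8') as f:
    new_words = [line.strip() for line in f if line.strip()]

# add_tokens skips words the tokenizer already knows and returns the number actually added
num_added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))
print(f'added {num_added} tokens, new vocab size: {len(tokenizer)}')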

Usage (HuggingFace Transformers)

1. MASK example

from transformers import AutoTokenizer, AutoModel, DistilBertForMaskedLM
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1', do_lower_case=False)
model = DistilBertForMaskedLM.from_pretrained('bongsoo/mdistilbertV3.1')

text = ['한국의 수도는 [MASK] 이다', '에펠탑은 [MASK]에 있다', '충무공 이순신은 [MASK]에 최고의 장수였다']
tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')

with torch.no_grad():
    outputs = model(**tokenized_input)
logits = outputs.logits

mask_idx_list = []
for tokens in tokenized_input['input_ids'].tolist():
    token_str = [tokenizer.convert_ids_to_tokens(s) for s in tokens]
    
    # Find the index of the [MASK] token in token_str;
    # this mask_idx is used below when printing the prediction.
    mask_idx = token_str.index('[MASK]')
    mask_idx_list.append(mask_idx)
    
for idx, mask_idx in enumerate(mask_idx_list):
    
    logits_pred = torch.argmax(F.softmax(logits[idx], dim=-1), dim=1)
    mask_logits_idx = int(logits_pred[mask_idx])
    # Convert the predicted token id at the [MASK] position back to a token string
    mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
    # Print the result
    print('\n')
    print('*Input: {}'.format(text[idx]))
    print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
  • Results
*Input: 한국의 수도는 [MASK] 이다
*[MASK] : 서울 (48253)


*Input: 에펠탑은 [MASK]에 있다
*[MASK] : 프랑스 (47364)


*Input: 충무공 이순신은 [MASK]에 최고의 장수였다
*[MASK] : 임진왜란 (121990)
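
The same check can be done more compactly with the Transformers fill-mask pipeline; this is an alternative sketch not taken from the original card (top_k only controls how many candidates are printed):

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bongsoo/mdistilbertV3.1')

for result in fill_mask('한국의 수도는 [MASK] 이다', top_k=3):
    # each result holds the filled sequence, the predicted token string, and its score
    print(result['token_str'], round(result['score'], 4))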

2. Embedding example

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1')
model = AutoModel.from_pretrained('bongsoo/mdistilbertV3.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Compute the cosine score using sklearn
# => the input embeddings must be 2D, e.g. shape (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
cosine_scores = 1 - (paired_cosine_distances(sentence_embeddings[0].reshape(1,-1), sentence_embeddings[1].reshape(1,-1)))

print(f'*cosine_score:{cosine_scores[0]}')
  • Results
Sentence embeddings:
tensor([[-0.1137,  0.1491,  0.6711,  ..., -0.0217,  0.1839, -0.6143],
        [ 0.0482, -0.0649,  0.5333,  ...,  0.1424, -0.0982, -0.3414]])
*cosine_score:0.4784715175628662
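
Equivalently, the cosine score can be computed directly in PyTorch without reshaping to 2D; a short alternative sketch using torch.nn.functional.cosine_similarity on the two pooled vectors:

import torch.nn.functional as F

# with 1D inputs, dim=0 makes cosine_similarity return a single scalar tensor
cosine_score = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f'*cosine_score:{cosine_score.item()}')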

Training

MLM (Masked Language Model) training

  • Input model: distilbert-base-multilingual-cased
  • Corpus: training: bongsoo/moco-corpus-kowiki2022 (7.6M), evaluation: bongsoo/moco_eval
  • HyperParameters: LearningRate: 5e-5, epochs: 12, batchsize: 32, max_token_len: 128
  • vocab: 159,552 entries (40,004 entries added to the original BERT vocab of 119,548: 30,000 Korean words + 10,000 English words + 4 manual entries)
  • Output model: mdistilbertV3.1 (size: 634MB)
  • Training time: 90h on 1 GPU (16.5GB used of 24GB)
  • Training loss: 2.1154, evaluation loss: 2.5275
  • See the training code here; see the perplexity evaluation code here. (A minimal training sketch based on these settings is shown below.)
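
The linked training script is not reproduced in this card. The following is a minimal sketch of how such MLM training is typically set up with a standard Transformers Trainer and DataCollatorForLanguageModeling, wired to the hyperparameters listed above; the 'text' column name, the 'train' split name, the masking ratio, and the output directory are assumptions, not details from the card:

from datasets import load_dataset
from transformers import (AutoTokenizer, DistilBertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# the expanded tokenizer is paired with the original base model, whose embeddings are resized
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mdistilbertV3.1')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-multilingual-cased')
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset('bongsoo/moco-corpus-kowiki2022')

def tokenize(batch):
    # max_token_len: 128, as listed in the hyperparameters
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset['train'].column_names)

# 15% masking is the common MLM default; the card does not state the ratio used
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='mdistilbertV3.1',       # placeholder output path
    learning_rate=5e-5,                 # LearningRate: 5e-5
    num_train_epochs=12,                # epochs: 12
    per_device_train_batch_size=32,     # batchsize: 32
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized['train'], data_collator=collator)
trainer.train()

If perplexity is computed as the exponential of the evaluation loss, the quoted evaluation loss of 2.5275 corresponds to a perplexity of roughly 12.5.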

Model Config

{
  "_name_or_path": "",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "vocab_size": 159552
}
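
The config can also be inspected programmatically, for example to confirm the expanded vocabulary size:

from transformers import AutoConfig

config = AutoConfig.from_pretrained('bongsoo/mdistilbertV3.1')
print(config.vocab_size, config.n_layers, config.dim)   # 159552, 6, 768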

Citing & Authors

bongsoo
