
# Fine-Tuning DialoGPT3 on your Telegram chat

Here is ready-to-run code for fine-tuning a RuDialoGPT3 model on **your Telegram chat** using HuggingFace and PyTorch.

As a starting point I used RuDialoGPT-3, a model trained by [@Grossmend](https://github.com/Grossmend) on Russian forums. Training took 12 days on 4x RTX 2080 Ti (2 epochs over a 32 GB text corpus). The training procedure for this dialogue model is described in Grossmend's [blog post](https://habr.com/ru/company/icl_services/blog/548244/) (in Russian).

I built a simple pipeline and fine-tuned that model on my own exported Telegram chat (~30 MB of JSON). Getting the data out of Telegram and fine-tuning a model on it is in fact very easy, so I made this notebook!

If you just want to try talking to my fine-tuned model, go **straight to the Inference section**.

## Uploading your data for fine-tuning

In [None]:
# installing huggingface datasets and accelerate 
! pip install datasets transformers[sentencepiece]
! pip install accelerate

# [optional] Login to google drive to save models
from google.colab import drive
drive.mount('/content/drive')

# [optional] Login to wandb to track model's behaviour
'''! pip install wandb
! wandb login
import wandb
wandb.init(project="fine tune RuDialoGPT2 on KirArChat")'''

In [None]:
#@title Imports
import sys
import re
import json

from sklearn.model_selection import train_test_split
from tqdm import tqdm

import torch
from transformers import TextDataset, DataCollatorForLanguageModeling
from torch.utils.data import DataLoader

from accelerate import Accelerator
# AdamW is taken from PyTorch: the copy in transformers is deprecated
from torch.optim import AdamW
from transformers import get_scheduler

The next cell downloads the model and tokenizer from HuggingFace.

You can start from my version or from @Grossmend's: "Grossmend/rudialogpt3_medium_based_on_gpt2". You can even start from a different DialoGPT trained on your language, as long as it uses the same |x|y|text notation.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Kirili4ik/ruDialoGpt3-medium-finetuned-telegram" 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

In [None]:
#@title Utility functions
def get_length_param(text: str, tokenizer) -> str:
    """Maps text to 1 of 4 buckets based on length after encoding.

    Parameters
    ----------
    text: str
        The text to be given 1 of 4 length parameters.

    tokenizer: HuggingFace tokenizer
        Tokenizer used to compute the length of the text after encoding.
        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html

    Returns
    -------
    len_param: str
        One of four buckets:
        '1' for short, '2' for medium, '3' for long texts and '-' for all others.
    """
    tokens_count = len(tokenizer.encode(text))
    if tokens_count <= 15:
        len_param = '1'
    elif tokens_count <= 50:
        len_param = '2'
    elif tokens_count <= 256:
        len_param = '3'
    else:
        len_param = '-'
    return len_param


def get_user_param(text: dict, machine_name_in_chat: str) -> str:
    """Maps text by 1/0 for it to be the person or the machine in the dialogue

    Parameters
    ----------
    text: Dict[..., 'from', ...]
        Dict containing field 'from' with the name of the user who sent the message

    machine_name_in_chat: str
        Str with the name of the machine - it will be predicted
    """
    if text['from'] == machine_name_in_chat:
        return '1'  # machine
    else:
        return '0'  # human


def build_text_file(data_json: list, dest_path: str,
                    tokenizer, machine_name_in_chat='Кирилл Гельван'):
    """Create a text file for training in special format for ruDialoGPT-3.

    Parameters
    ----------
    data_json: list
        List of message dicts, each containing 'text' (message) and 'from' (user who sent the message)

    dest_path: str
        String containing path to write data there

    tokenizer: HuggingFace tokenizer
        Tokenizer used to compute the length of the text after encoding.
        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html

    machine_name_in_chat: str
        Name of the user whose messages the model should learn to predict
    """
    new_data = ''
    for i in range(len(data_json) - 1):
        message, next_message = data_json[i], data_json[i+1]
        # skip the pair if either message is empty or is not a plain string
        if message['text'] == '' or type(message['text']) != str:
            continue
        if next_message['text'] == '' or type(next_message['text']) != str:
            continue

        user = get_user_param(message, machine_name_in_chat=machine_name_in_chat)
        # the length parameter describes the next message, i.e. the expected answer
        length = get_length_param(next_message['text'], tokenizer)
        message_text = re.sub(r"\n", ". ", message['text'])
        new_data += f"|{user}|{length}|{message_text}{tokenizer.eos_token}" + "\n"

    # the context manager makes sure the file is flushed and closed
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(new_data)


def load_dataset(train_path, test_path, tokenizer):
    """Creates train and test PyTorch datasets and collate_fn using HuggingFace.

    Parameters
    ----------
    train_path: str
        String containing path to train data

    test_path: str
        String containing path to test data

    tokenizer: HuggingFace tokenizer
        Tokenizer used to compute the length of the text after encoding.
        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html
    """
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=256)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=256)

    # mlm=False selects causal (next-token) language modelling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False
    )
    return train_dataset, test_dataset, data_collator

1) Export your Telegram chat

![](https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/how-to-export-chat.jpg)

2) Upload it to Colab

![](https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/how-to-upload-json.jpg)

3) The next cell creates the train and test sets from it

4) :tada:
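
To make step 3 more concrete, here is a minimal sketch of what the exported file is assumed to look like and how plain-text messages can be picked out of it. The sample is hypothetical and heavily trimmed; the only fields this notebook relies on are `messages`, `from` and `text`. The `type(...) != str` check in `build_text_file` suggests that `text` is not always a plain string, so such messages are skipped here too. `plain_text_messages` is an illustrative helper, not part of the notebook, and it is a simplified filter: `build_text_file` itself works on pairs of neighbouring messages.

```python
import json

# Hypothetical, trimmed-down sample of an exported chat (real exports carry more fields).
sample_export = json.loads("""
{
  "name": "Example chat",
  "messages": [
    {"id": 1, "from": "Alice", "text": "Hi! How are you?"},
    {"id": 2, "from": "Bob", "text": ""},
    {"id": 3, "from": "Bob", "text": ["Look: ", {"type": "link", "text": "https://example.com"}]},
    {"id": 4, "from": "Bob", "text": "Fine, thanks."}
  ]
}
""")


def plain_text_messages(messages):
    """Keep only messages whose 'text' is a non-empty string."""
    return [m for m in messages if isinstance(m.get("text"), str) and m["text"] != ""]


usable = plain_text_messages(sample_export["messages"])
for m in usable:
    print(m["from"], "->", m["text"])
```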

In [None]:
#@markdown Your telegram chat json path 'ChatExport.../YourChatName.json':
path_to_telegram_chat_json = 'example: /content/drive/MyDrive/char27.json' #@param {type : "string"}
#@markdown Name of the user to predict by GPT-3:
machine_name_in_chat = 'example: Kirill Gelvan' #@param {type : "string"}


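with open(path_to_telegram_chat_json, encoding='utf-8') as f:
    data = json.load(f)['messages']

# test data is the first 10% of the chat, train data is the last 90%
split_idx = int(len(data) * 0.1)
train, test = data[split_idx:], data[:split_idx]

# pass the chosen name explicitly, otherwise the function's default name is used
build_text_file(train, 'train_dataset.txt', tokenizer, machine_name_in_chat=machine_name_in_chat)
build_text_file(test, 'test_dataset.txt', tokenizer, machine_name_in_chat=machine_name_in_chat)

print("Train dataset length: " + str(len(train)) + " samples")
print("Test dataset length: " + str(len(test)) + " samples")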

In [None]:
# let's look at our data
! head -n 10 train_dataset.txt

Here the first number is the speaker: '1' for GPT and '0' for the person.

The second number is the length bucket of the expected answer (the message that follows): '1' for short, '2' for medium, '3' for long texts and '-' for all others.
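
As an illustration of this format, the sketch below builds one training line and reads the two parameters back. `format_line` and `parse_prefix` are hypothetical helpers that are not part of the notebook, and `<eos>` is only a stand-in for `tokenizer.eos_token`.

```python
def format_line(speaker: str, length: str, text: str, eos_token: str = "<eos>") -> str:
    """Build one line in the |speaker|length|text<eos> layout used for training."""
    return f"|{speaker}|{length}|{text}{eos_token}"


def parse_prefix(line: str):
    """Split a formatted line into (speaker, length, rest of the line)."""
    # the line starts with '|', so the first piece of the split is empty
    _, speaker, length, rest = line.split("|", 3)
    return speaker, length, rest


line = format_line("0", "2", "How was your weekend?")
print(line)
print(parse_prefix(line))
```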


In [None]:
# Create PyTorch Datasets
train_dataset, test_dataset, data_collator = load_dataset('train_dataset.txt', 'test_dataset.txt', tokenizer)

# Create PyTorch Dataloaders
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=2, collate_fn=data_collator)
test_loader = DataLoader(test_dataset, batch_size=2, collate_fn=data_collator)

In [None]:
# this cell checks 1 forward pass
try:
    # take a single batch from the loader
    for batch in train_loader:
        break
    print({k: v.shape for k, v in batch.items()})

    outputs = model(**batch)
except Exception:
    print("Unexpected error:", sys.exc_info()[0])
    raise

## Fine-tuning

In [None]:
#@title Fine-tuning params
num_epochs = 3 #@param {type:"integer"}
optimizer = AdamW(model.parameters(), lr=3e-5) #@param
save_checkpoint_path = 'example: drive/MyDrive/GPT2_checkpoint-more-data-2ep.pt' #@param {type:"string"}


# the scheduler is stepped once per batch, so count batches rather than samples
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=num_training_steps
)

accelerator = Accelerator()
train_dl, test_dl, model, optimizer = accelerator.prepare(
    train_loader, test_loader, model, optimizer
)
# wandb.watch(model, log="all")
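
For intuition about the `"linear"` schedule with warmup requested above, here is a small pure-Python sketch of the learning-rate multiplier such a schedule is generally understood to apply: a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0 at the last step. `linear_lr_factor` is a hypothetical helper written for illustration, and the step counts are made up.

```python
def linear_lr_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # warmup: grow linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # decay: shrink linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_lr_factor(step, 100, 1000))
```

The multiplier only reaches zero at the end of training if `num_training_steps` matches the number of times the scheduler is actually stepped.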

In [None]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):

    ### TRAIN EPOCH
    model.train()
    for batch in train_dl:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        # wandb.log({'train_loss':loss.item()})
        optimizer.step()
        lr_scheduler.step()
        progress_bar.update(1)

    ### SAVE
    torch.save({
        'model_state_dict': model.state_dict(),
    }, save_checkpoint_path)

    ### VALIDATE ONCE
    cum_loss = 0
    model.eval()
    with torch.inference_mode():
        for batch in test_dl:
            outputs = model(**batch)
            cum_loss += float(outputs.loss.item())

    # mean validation loss per batch
    print(cum_loss/len(test_loader))
    # wandb.log({'val_mean_loss':cum_loss/len(test_loader)})
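
The value printed after each epoch is the mean cross-entropy loss per test batch. If a more interpretable number is wanted, it can be turned into a perplexity by exponentiating it. The sketch below assumes the loss is a mean negative log-likelihood in nats, which is what PyTorch's cross-entropy produces; `perplexity` is a hypothetical helper and the loss values are invented.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


# invented validation losses, one per epoch
for epoch, val_loss in enumerate([3.2, 2.9, 2.8], start=1):
    print(f"epoch {epoch}: loss={val_loss:.2f}, perplexity={perplexity(val_loss):.1f}")
```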

## Inference

In [None]:
#@title Installs and Utility functions

%%capture
# installing huggingface datasets and accelerate 
! pip install datasets transformers[sentencepiece]
! pip install accelerate

def get_length_param(text: str, tokenizer) -> str:
    """Maps text to 1 of 4 buckets based on length after encoding.

    Parameters
    ----------
    text: str
        The text to be given 1 of 4 length parameters.

    tokenizer: HuggingFace tokenizer
        Tokenizer used to compute the length of the text after encoding.
        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html

    Returns
    -------
    len_param: str
        One of four buckets:
        '1' for short, '2' for medium, '3' for long texts and '-' for all others.
    """
    tokens_count = len(tokenizer.encode(text))
    if tokens_count <= 15:
        len_param = '1'
    elif tokens_count <= 50:
        len_param = '2'
    elif tokens_count <= 256:
        len_param = '3'
    else:
        len_param = '-'
    return len_param


def get_user_param(text: dict, machine_name_in_chat: str) -> str:
    """Maps text by 1/0 for it to be the person or the machine in the dialogue

    Parameters
    ----------
    text: Dict[..., 'from', ...]
        Dict containing field 'from' with the name of the user who sent the message

    machine_name_in_chat: str
        Str with the name of the machine - it will be predicted
    """
    if text['from'] == machine_name_in_chat:
        return '1'  # machine
    else:
        return '0'  # human


def build_text_file(data_json: list, dest_path: str,
                    tokenizer, machine_name_in_chat='Кирилл Гельван'):
    """Create a text file for training in special format for ruDialoGPT-3.

    Parameters
    ----------
    data_json: list
        List of message dicts, each containing 'text' (message) and 'from' (user who sent the message)

    dest_path: str
        String containing path to write data there

    tokenizer: HuggingFace tokenizer
        Tokenizer used to compute the length of the text after encoding.
        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html

    machine_name_in_chat: str
        Name of the user whose messages the model should learn to predict
    """
    new_data = ''
    for i in range(len(data_json) - 1):
        message, next_message = data_json[i], data_json[i+1]
        # skip the pair if either message is empty or is not a plain string
        if message['text'] == '' or type(message['text']) != str:
            continue
        if next_message['text'] == '' or type(next_message['text']) != str:
            continue

        user = get_user_param(message, machine_name_in_chat=machine_name_in_chat)
        # the length parameter describes the next message, i.e. the expected answer
        length = get_length_param(next_message['text'], tokenizer)
        message_text = re.sub(r"\n", ". ", message['text'])
        new_data += f"|{user}|{length}|{message_text}{tokenizer.eos_token}" + "\n"

    # the context manager makes sure the file is flushed and closed
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(new_data)


def load_dataset(train_path, test_path, tokenizer):
    """Creates train and test PyTorch datasets and collate_fn using HuggingFace.

    Parameters
    ----------
    train_path: str
        String containing path to train data

    test_path: str
        String containing path to test data

    tokenizer: HuggingFace tokenizer
        Tokenizer used to compute the length of the text after encoding.
        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html
    """
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=256)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=256)

    # mlm=False selects causal (next-token) language modelling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False
    )
    return train_dataset, test_dataset, data_collator

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download checkpoint:
checkpoint = "Kirili4ik/ruDialoGpt3-medium-finetuned-telegram" 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# [optional] Insert your checkpoint if needed:
'''from google.colab import drive
drive.mount('/content/drive')
checkpoint = torch.load('drive/MyDrive/GPT2_checkpoint.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])'''

model = model.to('cpu')
model.eval()
print()

In [None]:
### INFERENCE

# empty history; token ids are stored as 64-bit integers, like the tokenizer output
chat_history_ids = torch.zeros((1, 0), dtype=torch.long)

while True:

    next_who = input("Whose phrase? H/G\t")  # Human or GPT

    # In case Human
    if next_who == "H":
        input_user = input("===> Human: ")

        # encode the new user input, add parameters and return a tensor in PyTorch
        new_user_input_ids = tokenizer.encode(f"|0|{get_length_param(input_user, tokenizer)}|"
                                              + input_user + tokenizer.eos_token, return_tensors="pt")
        # append the new user input tokens to the chat history
        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)

    # In case GPT
    if next_who == "G":

        next_len = input("Phrase len? 1/2/3/-\t")
        # encode the speaker and length parameters of the reply and return a tensor in PyTorch
        new_user_input_ids = tokenizer.encode(f"|1|{next_len}|", return_tensors="pt")
        # append them to the chat history so the model continues from there
        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)

        # print(tokenizer.decode(chat_history_ids[-1])) # uncomment to see full gpt input

        # save previous len
        input_len = chat_history_ids.shape[-1]
        # generate a response; PS you can read about the parameters at hf.co/blog/how-to-generate
        chat_history_ids = model.generate(
            chat_history_ids,
            num_return_sequences=1,  # use for more variants, but have to print [i]
            max_length=512,
            no_repeat_ngram_size=3,
            do_sample=True,
            top_k=50,
            top_p=0.9,
            temperature=0.6,  # lower values are closer to greedy decoding
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

        # pretty print last output tokens from bot
        print(f"===> GPT-3: {tokenizer.decode(chat_history_ids[:, input_len:][0], skip_special_tokens=True)}")
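
Because the whole conversation is fed back into `generate` and `max_length` is 512, a long session will eventually leave no room for the reply. One possible way around this, sketched here under the assumption that the token ids of each turn are kept in a separate list, is to drop the oldest turns until the history fits a budget. `trim_history` and the budget value are hypothetical and not part of the notebook.

```python
from typing import List


def trim_history(turns: List[List[int]], max_tokens: int) -> List[List[int]]:
    """Drop the oldest turns until the total number of tokens fits the budget.

    The most recent turn is always kept, even if it alone exceeds the budget.
    """
    kept = list(turns)
    while len(kept) > 1 and sum(len(t) for t in kept) > max_tokens:
        kept.pop(0)
    return kept


# four made-up turns with 200, 150, 120 and 30 tokens
turns = [[1] * 200, [2] * 150, [3] * 120, [4] * 30]
trimmed = trim_history(turns, max_tokens=400)
print([len(t) for t in trimmed])
```

Dropping whole turns rather than slicing the token tensor keeps every remaining message intact, including its `|speaker|length|` prefix.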