
Spanish GPT-2 as backbone

Model fine-tuned for Spanish conversation on the OpenSubtitles dataset. The backbone is a GPT-2 model trained from scratch on the Spanish portion of the OSCAR dataset by the HuggingFace Flax/JAX community.

Model description and fine-tuning

The model used as a backbone is OpenAI's GPT-2, introduced in the paper "Language Models are Unsupervised Multitask Learners" by Alec Radford et al. A transfer-learning approach with a large Spanish dataset was then applied to adapt the text-generation model to conversational tasks. Special tokens play a key role in fine-tuning, marking the boundaries of each exchange and separating the user turn from the bot turn:

# Conversation-boundary tokens added to the backbone tokenizer
tokenizer.add_special_tokens({"pad_token": "<pad>",
                              "bos_token": "<startofstring>",
                              "eos_token": "<endofstring>"})
# Marker that separates the user turn from the bot turn
tokenizer.add_tokens(["<bot>:"])
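
Adding tokens enlarges the vocabulary, so the backbone's embedding matrix must be resized to match before any fine-tuning runs; resize_token_embeddings is the standard Transformers call for this. The sketch below continues from the snippet above, and the sample line is only an illustration, assuming training examples were laid out with the same tokens the inference code expects:

# The embedding matrix must grow to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))

# Illustrative shape of one training example (placeholder text, not from the card)
sample = "<startofstring> hola, como estas? <bot>: muy bien, gracias <endofstring>"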

How to use

You can use this model directly with the Auto classes for causal language modeling:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("erikycd/chatbot_hadita")
model = AutoModelForCausalLM.from_pretrained("erikycd/chatbot_hadita")

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)

def infer(inp):
    # Wrap the user turn in the same special tokens used during fine-tuning
    inp = "<startofstring> " + inp + " <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    attn = inp["attention_mask"].to(device)
    output = model.generate(X, attention_mask=attn, pad_token_id=tokenizer.eos_token_id)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    return output

exit_commands = ('bye', 'quit')
while True:
    text = input('\nUser: ')
    # Stop before sending an exit command to the model
    if text in exit_commands:
        break
    output = infer(text)
    print('Bot: ', output)
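
By default, generate uses greedy decoding and the model's default length limit, so replies can come out short or repetitive. The generate call inside infer accepts the standard Transformers generation arguments; the sketch below shows a sampling setup, where every value is an illustrative assumption rather than a setting from this card:

# Inside infer: optional sampling-based generation (illustrative values)
output = model.generate(
    X,
    attention_mask=attn,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,   # cap the length of the reply
    do_sample=True,      # sample instead of greedy decoding
    top_p=0.9,           # nucleus sampling
    temperature=0.8,     # soften the next-token distribution
)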
    

Dataset used to train erikycd/chatbot_hadita

OpenSubtitles (Spanish)