---
license: gpl-3.0
tags:
- conversational
- gpt2
language:
- es
datasets:
- open_subtitles
widget:
- text: Me gusta el deporte
  example_title: Interacción
- text: Hola
  example_title: Saludo
- text: ¿Como estas?
  example_title: Pregunta

---

# Spanish GPT-2 as backbone

Model fine-tuned for Spanish conversation on the [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles-v2018.php) dataset. The backbone is a GPT-2 model trained from scratch on the Spanish portion of the OSCAR dataset by the HuggingFace [Flax/Jax community](https://huggingface.co/flax-community/gpt-2-spanish).

## Model description and fine-tuning

The backbone is OpenAI's GPT-2 architecture, introduced in the paper "Language Models are Unsupervised Multitask Learners" 
by Alec Radford et al. A transfer-learning approach with a large Spanish dataset was then used to adapt the text-generation model to 
conversational tasks. Special tokens play a key role in this fine-tuning: they mark where a dialogue turn starts, where the bot's reply begins, and where it ends.

```python
from transformers import AutoTokenizer

# Load the backbone tokenizer and register the conversational special tokens
tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt-2-spanish")
tokenizer.add_special_tokens({"pad_token": "<pad>",
                              "bos_token": "<startofstring>",
                              "eos_token": "<endofstring>"})
tokenizer.add_tokens(["<bot>:"])
```
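During fine-tuning, each question/answer pair is wrapped with these special tokens so the model learns where the user's turn ends and the bot's reply begins. A minimal sketch of that formatting step (the helper `format_pair` is illustrative, not part of the original training script):

```python
def format_pair(question: str, answer: str) -> str:
    # Wrap one question/answer pair with the special tokens registered above
    return f"<startofstring> {question} <bot>: {answer} <endofstring>"

print(format_pair("Hola", "Hola, ¿como estas?"))
# <startofstring> Hola <bot>: Hola, ¿como estas? <endofstring>
```

Because `<pad>`, `<startofstring>`, `<endofstring>`, and `<bot>:` are new tokens, the backbone's embedding matrix must also be resized with `model.resize_token_embeddings(len(tokenizer))` before training.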

## How to use

You can use this model directly with the `transformers` auto classes for causal language modeling:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("erikycd/chatbot_hadita")
model = AutoModelForCausalLM.from_pretrained("erikycd/chatbot_hadita")

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)

def infer(inp):
    # Wrap the user input with the special tokens used during fine-tuning
    inp = "<startofstring> " + inp + " <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    attn = inp["attention_mask"].to(device)
    output = model.generate(X, attention_mask=attn, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

exit_commands = ("bye", "quit")
while True:
    text = input("\nUser: ")
    if text in exit_commands:
        break
    print("Bot: ", infer(text))
```