---
license: apache-2.0
datasets:
- ayoolaolafenwa/sft-data
language:
- en
---

## ChatLM 
ChatLM is a chat Large Language Model finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b)
on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data).
The training data consists of normal day-to-day human conversations; because of this limited data, the model does not
generalize well to tasks such as coding or current affairs, and hallucinations may occur.

# GitHub Repo: https://github.com/ayoolaolafenwa/ChatLM

# Chat with ChatLM live on its Hugging Face Space: https://huggingface.co/spaces/ayoolaolafenwa/ChatLM

# Install Required Packages
```
pip install transformers
pip install accelerate
pip install einops
pip install bitsandbytes
```

## Load Model in bfloat16
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the model in bfloat16 and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16).to("cuda")

prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

# Tokenize the prompt and move the tensors to the model's device
tokens = tokenizer(prompt, return_tensors="pt")

token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Generate a response
outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask, max_length=2048, do_sample=True,
    num_return_sequences=1, top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)

output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```
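
Continuing from the snippet above, you can optionally strip the prompt and keep only the chatbot's reply. This is a minimal post-processing sketch that assumes the generated text follows the `<user>: ... <chatbot>: ...` format used in the prompt.

``` python
# Keep only the text after the <chatbot> tag (assumes the prompt format above)
reply = output_text.split("<chatbot>:", 1)[-1].strip()
print(reply)
```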

## Load Model in bfloat16 and int8
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the model with 8-bit quantization (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16, load_in_8bit=True)

prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

# Tokenize the prompt and move the tensors to the model's device
tokens = tokenizer(prompt, return_tensors="pt")

token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Generate a response
outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask, max_length=2048, do_sample=True,
    num_return_sequences=1, top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)

output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```
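
Loading in 8-bit roughly halves memory use compared to bfloat16 (1 byte per parameter instead of 2, plus some quantization overhead). If you want to check the actual footprint on your machine, the snippet below continues from the block above and uses the standard `get_memory_footprint()` method on the loaded model.

``` python
# Report the loaded model's memory footprint in gigabytes
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```
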
# Training procedure for Supervised Finetuning

## Dataset Preparation

The Chatbot Instruction Prompts dataset from https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts
was processed into a supervised finetuning format, pairing each user prompt with its corresponding response.

##### Download Data
``` python
from datasets import load_dataset

dataset = load_dataset("alespalla/chatbot_instruction_prompts", split = "train")
dataset.save_to_disk('ChatBotInsP')
dataset.to_csv('CIPtrain.csv')
```
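
As a quick sanity check, you can inspect the raw columns before processing; the `prompt` and `response` field names below are the ones used by the processing code that follows.

``` python
from datasets import load_dataset

dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")
print(dataset.column_names)  # expected: ['prompt', 'response']
print(dataset[0])            # one raw prompt/response pair
```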

##### Code to process dataset into Supervised finetuning format
``` python
# Import pandas library
import pandas as pd

# Read the text dataset from csv file
text_data = pd.read_csv("CIPtrain.csv")

# Create empty lists for prompts and responses
prompts = []
responses = []

# Loop through the text data
for i in range(len(text_data)):
    # Get the prompt and response of the current row
    prompt = str(text_data["prompt"][i])
    response = str(text_data["response"][i])
    
    # Add the prompt to the prompts list with the <user> tag
    prompts.append("<user>: " + prompt)

    # Add the response to the responses list with the <chatbot> tag
    responses.append("<chatbot>: " + response)

# Create a new dataframe with prompts and responses columns
new_data = pd.DataFrame({"prompt": prompts, "response": responses})

# Write the new dataframe to a csv file
new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
```
The user prompts in the dataset are prefixed with the `<user>` tag and the corresponding responses with the `<chatbot>` tag.
Check the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data.
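
For completeness, the processed CSV can be loaded back as a Hugging Face dataset and published to the Hub, which is presumably how the sft-data repository above was created. This is only a sketch: `push_to_hub` requires a logged-in account with write access to the target repository.

``` python
from datasets import load_dataset

# Load the processed CSV written by the script above
sft_dataset = load_dataset("csv", data_files="MyData/chatbot_instruction_prompts_train.csv", split="train")
print(sft_dataset[0])  # {'prompt': '<user>: ...', 'response': '<chatbot>: ...'}

# Publish to the Hub (uncomment after `huggingface-cli login`)
# sft_dataset.push_to_hub("ayoolaolafenwa/sft-data")
```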

### Training 

ChatLM was supervised-finetuned from the pretrained [Falcon 1-billion-parameter model](https://huggingface.co/tiiuae/falcon-rw-1b), which was trained on 350 billion tokens 
of RefinedWeb. Finetuning ran on a single H100 GPU for 1 epoch, and the model achieves a perplexity of *1.738*. Check the full supervised 
finetuning code in its GitHub repository: https://github.com/ayoolaolafenwa/ChatLM/tree/main
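
For reference, below is a minimal sketch of what such a supervised finetuning run could look like with the `transformers` Trainer API. This is not the author's training script (see the GitHub repository above for the full code); the hyperparameters, sequence length, and the way each prompt is joined to its response are illustrative assumptions.

``` python
# Illustrative SFT sketch, NOT the original training script.
# Assumptions: the sft-data columns are "prompt" and "response" (as produced by the
# processing code above), and the hyperparameters below are placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no pad token by default

model = AutoModelForCausalLM.from_pretrained(base_model, trust_remote_code=True,
    torch_dtype=torch.bfloat16)

dataset = load_dataset("ayoolaolafenwa/sft-data", split="train")

def tokenize(batch):
    # Join each tagged prompt with its tagged response into one training text
    texts = [p + " " + r + tokenizer.eos_token
             for p, r in zip(batch["prompt"], batch["response"])]
    return tokenizer(texts, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="chatlm-sft",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that perplexity is simply `exp` of the mean token-level cross-entropy loss, so an evaluation loss of roughly 0.55 corresponds to a perplexity of about `exp(0.55) ≈ 1.73`.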