Commit ec6bf39 by ayoolaolafenwa (1 parent: 76b9366)
Update README.md

README.md
---
license: apache-2.0
---
## ChatLM
ChatLM is a chat large language model finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b)
on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data).
The training data consists of ordinary day-to-day human conversations. Because that data is limited,
ChatLM is not suitable for tasks such as coding or answering questions about current affairs.

## Load Model in bfloat16
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the weights in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# The prompt wraps the user message in the <user> and <chatbot> tags used in training
prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")

# Move the inputs to the same device as the model
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Sample a single completion
outputs = model.generate(
    input_ids=token_ids,
    attention_mask=attention_mask,
    max_length=2048,
    do_sample=True,
    num_return_sequences=1,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode the generated tokens and drop the end-of-text marker
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```
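
The prompt above follows a `<user>: ... <chatbot>: ` pattern, and the decoded output contains the prompt followed by the reply. Here is a minimal sketch of two helpers for that pattern. The names `build_prompt` and `extract_reply` are hypothetical and not part of the ChatLM repository, and the sketch assumes the reply is whatever follows the last `<chatbot>:` tag in the decoded text.

``` python
USER_TAG = "<user>: "
CHATBOT_TAG = "<chatbot>: "
EOS_TOKEN = "<|endoftext|>"


def build_prompt(message: str) -> str:
    """Wrap a user message in the tags used by the example prompt."""
    return f"{USER_TAG}{message.strip()} {CHATBOT_TAG}"


def extract_reply(output_text: str) -> str:
    """Return the text after the last <chatbot>: tag, without the end-of-text marker."""
    # Assumption: everything after the last tag is the model's reply
    reply = output_text.split(CHATBOT_TAG.strip())[-1]
    return reply.replace(EOS_TOKEN, "").strip()


print(build_prompt("Give me financial advice on investing in stocks."))
print(extract_reply("<user>: Hello <chatbot>: Hi there!<|endoftext|>"))
```

With these helpers, `build_prompt(...)` could replace the hand-written `prompt` string, and `extract_reply(output_text)` could be applied to the decoded output to keep only the chatbot's part.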

## Load Model in bfloat16 and int8
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# load_in_8bit=True loads the weights in int8; it relies on the
# bitsandbytes and accelerate packages being installed
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16, load_in_8bit=True
)

# The prompt wraps the user message in the <user> and <chatbot> tags used in training
prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")

# Move the inputs to the same device as the model
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Sample a single completion
outputs = model.generate(
    input_ids=token_ids,
    attention_mask=attention_mask,
    max_length=2048,
    do_sample=True,
    num_return_sequences=1,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode the generated tokens and drop the end-of-text marker
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```
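
To get a rough feel for what int8 loading saves, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes a round figure of one billion parameters (an assumption read off the "1B" in the base model's name, not a measured count). It ignores activations, buffers, and any layers that stay in higher precision, so the numbers are a rough lower bound rather than a measurement.

``` python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

# Assumption: a round parameter count for a "1B" model, not an exact figure
ASSUMED_PARAMS = 1_000_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.2f} GiB")
```

Under these assumptions, int8 weights take about half the memory of bfloat16 weights, which is the main reason to choose the second loading option on a memory-constrained GPU.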

## Training procedure for Supervised Finetuning

The Chatbot Instruction Prompts dataset from https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts
was processed into a supervised finetuning format in which each example pairs a user prompt with its corresponding response.

##### Download Data
``` python
from datasets import load_dataset

# Download the training split of the dataset
dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")

# Keep a copy on disk and export it to CSV for processing
dataset.save_to_disk("ChatBotInsP")
dataset.to_csv("CIPtrain.csv")
```

##### Code to process dataset into Supervised finetuning format
``` python
import os

import pandas as pd

# Read the downloaded dataset from the CSV file
text_data = pd.read_csv("CIPtrain.csv")

# Create empty lists for prompts and responses
prompts = []
responses = []

# Loop through the rows of the dataset
for i in range(len(text_data)):
    # Get the prompt and response of the current row as strings
    prompt = str(text_data["prompt"][i])
    response = str(text_data["response"][i])

    # Add the prompt to the prompts list with the <user> tag
    prompts.append("<user>: " + prompt)
    # Add the response to the responses list with the <chatbot> tag
    responses.append("<chatbot>: " + response)

# Create a new dataframe with prompts and responses columns
new_data = pd.DataFrame({"prompt": prompts, "response": responses})

# Make sure the output directory exists, then write the new dataframe to a CSV file
os.makedirs("MyData", exist_ok=True)
new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
```
I prefixed each user prompt in the dataset with the tag `<user>` and each response with the tag `<chatbot>`.
See the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data.
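
The tagging step can also be tried without downloading the dataset. The following sketch applies the same prefixes to a single made-up row using only the standard library. The helper `tag_row` and the sample row are illustrative and do not come from the dataset or the ChatLM repository.

``` python
import csv
import io


def tag_row(row: dict) -> dict:
    """Prefix the prompt with '<user>: ' and the response with '<chatbot>: '."""
    return {
        "prompt": "<user>: " + str(row["prompt"]),
        "response": "<chatbot>: " + str(row["response"]),
    }


# Hypothetical sample standing in for CIPtrain.csv
sample_csv = io.StringIO(
    "prompt,response\n"
    "What is the capital of France?,The capital of France is Paris.\n"
)

tagged_rows = [tag_row(row) for row in csv.DictReader(sample_csv)]

print(tagged_rows[0]["prompt"])
print(tagged_rows[0]["response"])
```

The tagged prompt has the same shape as the `<user>: ...` prefix used in the inference examples, which is why the prompt passed to the model at inference time ends with `<chatbot>: `.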

ChatLM starts from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b) and was finetuned on the prepared supervised
dataset on a single H100 GPU. The full training code is in the GitHub repository https://github.com/ayoolaolafenwa/ChatLM/tree/main