---
inference: false
license: apache-2.0
language:
- en
pipeline_tag: conversational
---
## ChatLM
ChatLM is a chat Large Language Model finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b)
on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data).
ChatLM was trained on a dataset of everyday human conversations; because of the limited training data,
it does not generalize well to tasks such as coding or current affairs.
# Have a live chat with ChatLM on its Hugging Face Space: https://huggingface.co/spaces/ayoolaolafenwa/ChatLM
## Load Model in bfloat16
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16)

# Prompts follow the training format: a <user> tag for the question and a
# <chatbot> tag marking where the model should respond.
prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask,
                         max_length=2048, do_sample=True, num_return_sequences=1,
                         top_k=10, temperature=0.7,
                         eos_token_id=tokenizer.eos_token_id)

# Decode and strip the end-of-text token before printing.
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")
print(output_text)
```
## Load Model in bfloat16 and int8
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# load_in_8bit requires the bitsandbytes package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_8bit=True)

prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask,
                         max_length=2048, do_sample=True, num_return_sequences=1,
                         top_k=10, temperature=0.7,
                         eos_token_id=tokenizer.eos_token_id)

# Decode and strip the end-of-text token before printing.
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")
print(output_text)
```
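Since the decoded output still contains the original prompt and the conversation tags, a small helper can isolate just the model's reply. Note that `extract_response` below is a hypothetical helper written for illustration, not part of the released code:

``` python
# Hypothetical helper (not part of the released code): keep only the text
# after the "<chatbot>:" tag, i.e. the model's reply.
def extract_response(output_text: str) -> str:
    return output_text.split("<chatbot>:", 1)[-1].strip()

print(extract_response(output_text))
```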
# Training procedure for Supervised Finetuning
## Dataset Preparation
The Chatbot Instruction Prompts dataset from https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts
was processed into a supervised finetuning format that pairs each user prompt with its corresponding response.
##### Download Data
``` python
from datasets import load_dataset

# Download the training split, save it to disk, and export it to CSV for processing.
dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")
dataset.save_to_disk('ChatBotInsP')
dataset.to_csv('CIPtrain.csv')
```
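As an optional, illustrative sanity check, you can confirm that the downloaded split exposes the `prompt` and `response` columns the processing script below expects:

``` python
# Illustrative check: the processing script below reads these two columns.
print(dataset.column_names)   # expected: ['prompt', 'response']
print(dataset[0]["prompt"])
print(dataset[0]["response"])
```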
##### Code to process dataset into Supervised finetuning format
``` python
import os

# Import pandas library
import pandas as pd

# Read the text dataset from the csv file
text_data = pd.read_csv("CIPtrain.csv")

# Create empty lists for prompts and responses
prompts = []
responses = []

# Loop through the text data
for i in range(len(text_data)):
    # Get the prompt and response of the current row
    prompt = str(text_data["prompt"][i])
    response = str(text_data["response"][i])

    # Add the prompt to the prompts list with the <user> tag
    prompts.append("<user>: " + prompt)

    # Add the response to the responses list with the <chatbot> tag
    responses.append("<chatbot>: " + response)

# Create a new dataframe with prompt and response columns
new_data = pd.DataFrame({"prompt": prompts, "response": responses})

# Write the new dataframe to a csv file
os.makedirs("MyData", exist_ok=True)  # make sure the output directory exists
new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
```
The users' prompts in the dataset are prefixed with the `<user>` tag and the corresponding responses with the `<chatbot>` tag.
Check the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data .
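For illustration, a single tagged training example might be assembled as below. The exact concatenation and end-of-text handling used during training are defined in the ChatLM GitHub repository, so treat this layout as an assumption:

``` python
# Assumed layout of one supervised finetuning example built from the tagged
# columns; the exact formatting used in training may differ.
prompt = "<user>: What is the capital of France?"
response = "<chatbot>: The capital of France is Paris."
training_text = prompt + " " + response + "<|endoftext|>"  # EOS ends the reply
print(training_text)
```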
### Training
ChatLM was supervised finetuned from the pretrained [Falcon 1-Billion parameters model](https://huggingface.co/tiiuae/falcon-rw-1b), which was trained on 350 billion tokens
of RefinedWeb. Finetuning ran on a single H100 GPU for 1 epoch, reaching a perplexity of *1.738*. Check the full supervised finetuning
code in its GitHub repository: https://github.com/ayoolaolafenwa/ChatLM/tree/main
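For reference, the sketch below shows one plausible way to reproduce the supervised finetuning loop with the Hugging Face `Trainer`. It is a minimal illustration under assumed hyperparameters (batch size, sequence length, logging interval), not the exact training script, which lives in the GitHub repository above:

``` python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from the pretrained Falcon-1B base model.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-rw-1b")
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-rw-1b",
                                             torch_dtype=torch.bfloat16)

# Load the processed SFT dataset with tagged prompt/response columns.
dataset = load_dataset("ayoolaolafenwa/sft-data", split="train")

def tokenize(batch):
    # Concatenate the tagged prompt and response into one training sequence.
    texts = [p + " " + r + tokenizer.eos_token
             for p, r in zip(batch["prompt"], batch["response"])]
    return tokenizer(texts, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chatlm-sft", num_train_epochs=1,
                           per_device_train_batch_size=4,  # assumed value
                           bf16=True, logging_steps=50),
    train_dataset=tokenized,
    # Causal-LM collator: pads batches and derives labels from input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# The reported perplexity is exp(cross-entropy loss): exp(0.553) ≈ 1.738.
```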