---
license: apache-2.0
datasets:
- ayoolaolafenwa/sft-data
language:
- en
---

## ChatLM

ChatLM is a chat Large Language Model finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b) and trained on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data). ChatLM was trained on a dataset containing normal day-to-day human conversations; due to the limited data used in training, it does not generalize well to tasks like coding or current affairs, and hallucinations may occur.

# Github Repo: https://github.com/ayoolaolafenwa/ChatLM

# Have a live chat with ChatLM on space
https://huggingface.co/spaces/ayoolaolafenwa/ChatLM

# Install Required Packages
```
pip install transformers
pip install accelerate
pip install einops
pip install bitsandbytes
```

## Load Model in bfloat16
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16).to("cuda")

prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask,
    max_length=2048, do_sample=True, num_return_sequences=1, top_k=10,
    temperature=0.7, eos_token_id=tokenizer.eos_token_id)

output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```

## Load Model in bfloat16 and int8
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16, load_in_8bit=True)

prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask,
    max_length=2048, do_sample=True, num_return_sequences=1, top_k=10,
    temperature=0.7, eos_token_id=tokenizer.eos_token_id)

output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```
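Since the model expects the `<user>`/`<chatbot>` tag format shown above, inference can be wrapped in a small helper that builds the tagged prompt and returns only the reply. Below is a minimal sketch; the `chat` function is an illustrative convenience, not part of the ChatLM API:

``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16).to("cuda")

def chat(message: str) -> str:
    # Build the prompt in the <user>/<chatbot> tag format the model was trained on
    prompt = "<user>: " + message + " <chatbot>: "
    tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**tokens, max_length=2048, do_sample=True,
        top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(outputs[0]).replace("<|endoftext|>", "")
    # Keep only the text generated after the <chatbot> tag
    return text.split("<chatbot>:")[-1].strip()

print(chat("Give me financial advice on investing in stocks."))
```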
# Training procedure for Supervised Finetuning

## Dataset Preparation
The Chatbot Instruction Prompts dataset from https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts was processed into a supervised finetuning format, pairing each user prompt with its corresponding response.

##### Download Data
``` python
from datasets import load_dataset

dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")
dataset.save_to_disk('ChatBotInsP')
dataset.to_csv('CIPtrain.csv')
```

##### Code to process dataset into Supervised finetuning format
``` python
# Import pandas library
import pandas as pd

# Read the text dataset from the csv file
text_data = pd.read_csv("CIPtrain.csv")

# Create empty lists for prompts and responses
prompts = []
responses = []

# Loop through the text data
for i in range(len(text_data)):
    # Get the prompt and response of the current row
    prompt = str(text_data["prompt"][i])
    response = str(text_data["response"][i])

    # Add the prompt to the prompts list with the <user> tag
    prompts.append("<user>: " + prompt)

    # Add the response to the responses list with the <chatbot> tag
    responses.append("<chatbot>: " + response)

# Create a new dataframe with prompt and response columns
new_data = pd.DataFrame({"prompt": prompts, "response": responses})

# Write the new dataframe to a csv file
new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
```

The users' prompts in the dataset are prepended with the `<user>` tag and the corresponding responses with the `<chatbot>` tag. Check the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data.

### Training
ChatLM was supervised-finetuned from the pretrained [Falcon 1-Billion parameters model](https://huggingface.co/tiiuae/falcon-rw-1b), which was trained on 350 billion tokens of RefinedWeb. It was trained with a single H100 GPU for 1 epoch and achieves a perplexity of *1.738*. Check the full code for supervised finetuning training in its GitHub repository: https://github.com/ayoolaolafenwa/ChatLM/tree/main
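For orientation, the finetuning can be reproduced in outline with the Hugging Face `Trainer`: concatenate each tagged prompt with its response, tokenize, and train the Falcon-1B base model for 1 epoch with the standard causal language modeling objective. The sketch below is an assumed outline with illustrative hyperparameters (batch size, sequence length), not the author's actual training script; the reported perplexity corresponds to exp of the cross-entropy loss:

``` python
import math
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-rw-1b")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-rw-1b",
                                             torch_dtype=torch.bfloat16)

# Load the tagged prompt/response pairs and join each pair into one training text
dataset = load_dataset("ayoolaolafenwa/sft-data", split="train")

def tokenize(batch):
    texts = [p + " " + r for p, r in zip(batch["prompt"], batch["response"])]
    return tokenizer(texts, truncation=True, max_length=1024)  # assumed length

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chatlm-sft", num_train_epochs=1,
                           per_device_train_batch_size=8,  # illustrative value
                           bf16=True),
    train_dataset=tokenized,
    # mlm=False gives the causal language modeling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Perplexity is exp(mean cross-entropy loss); a loss of ~0.553 gives ~1.738
eval_loss = trainer.evaluate(tokenized)["eval_loss"]
print("perplexity:", math.exp(eval_loss))
```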