How to load and quantize a fine-tuned model in Google Colab or Kaggle?

#5
by EviIgenius - opened

Hey Abhishek, first of all, thank you so much for your tutorial.
Noob Alert!!

I have fine-tuned the LLaMA and Mistral sharded models on Google Colab and saved them to the Hugging Face Hub. Now I am totally clueless about how to run my fine-tuned model in Google Colab, and also how to convert it into GGML/GGUF format and quantize it to 4 bits.

I don't expect a full tutorial or answer, but please point me to a resource or reference :)

First run this:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_token = "your_hf_token"

base_model_name = "abhishek/llama-2-7b-hf-small-shards"  # path/to/your/model or name on the Hub
adapter_model_name = "your repo"  # the Hub repo where your fine-tuned adapter was saved

# Load the base model, then attach the fine-tuned adapter on top of it
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name, use_auth_token=hf_token)

# The adapter does not change the vocabulary, so the base tokenizer is reused
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
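
If the full-precision base model does not fit in Colab's GPU memory, one option is to load it in 4-bit with bitsandbytes before attaching the adapter. This is a minimal sketch, assuming bitsandbytes and accelerate are installed; the NF4 settings below are a common choice, not something prescribed by the tutorial:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization settings (NF4 weights, fp16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model in 4-bit and attach the fine-tuned adapter as before
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model_name, use_auth_token=hf_token)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)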

Then run this:

# Example text input
input_text = "your message"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text using keyword arguments
outputs = model.generate(
    input_ids=input_ids,
    max_length=200,  # You can increase this value
    no_repeat_ngram_size=2,
    early_stopping=True,
    num_return_sequences=1,
)

# Decode the generated token IDs back into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
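
As for the GGML/GGUF conversion and 4-bit quantization part of the question: the usual route is to merge the LoRA adapter into the base model, save the merged model locally, and then run it through llama.cpp's conversion and quantization tools. Below is a rough sketch, assuming the base model was loaded in full precision (as in the first snippet, not the 4-bit one); "merged-model" is just a placeholder directory, and the exact llama.cpp script and binary names vary by version.

# Merge the adapter weights into the base model so it becomes a plain
# Transformers model, then save it (plus the tokenizer) to a local directory
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged-model")
tokenizer.save_pretrained("merged-model")

# Then, from a llama.cpp checkout, convert and quantize (names vary by version):
#   python convert.py merged-model --outtype f16 --outfile merged-model-f16.gguf
#   ./quantize merged-model-f16.gguf merged-model-q4_0.gguf q4_0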
