---
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- sft
license: apache-2.0
language:
- en
datasets:
- BAAI/Infinity-Instruct
---
# Fine-tune Llama 3.2 3B Using Unsloth and the BAAI/Infinity-Instruct Dataset
This model was fine-tuned on the "0625" subset of the dataset; a model fine-tuned on the "7M" subset will be released as well.
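If you want to inspect the training data, the subset can be loaded with the `datasets` library. This is a minimal sketch, assuming the Hugging Face config name matches the "0625" subset mentioned above:

```python
from datasets import load_dataset

# Load the "0625" subset of BAAI/Infinity-Instruct
# (config name assumed to match the subset referenced above)
dataset = load_dataset("BAAI/Infinity-Instruct", "0625", split="train")
print(dataset[0])
```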
## Uploaded Model

- Developed by: MateoRov
- License: apache-2.0
- Fine-tuned from model: unsloth/llama-3.2-3b-instruct-bnb-4bit
## Usage

Check out my full repo on GitHub for a better understanding: https://github.com/Mateorovere/FineTuning-LLM-Llama3.2-3b

With the proper dependencies installed, you can run the model with the following code:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,  # adjust as needed
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Generate the output
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Decode the outputs
result = tokenizer.batch_decode(outputs)
print(result)
```
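Note that `batch_decode` returns the prompt together with the completion. If you only want the newly generated text, one option (a small sketch continuing the example above, not part of the original snippet) is to slice off the prompt tokens before decoding:

```python
# Keep only the tokens generated after the prompt, then decode them
generated_tokens = outputs[:, inputs.shape[1]:]
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```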
To stream the output token by token:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,  # adjust as needed
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Initialize the text streamer (prints tokens to stdout as they are generated)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the output token by token
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
```
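If you need to consume the tokens programmatically instead of printing them, transformers also provides `TextIteratorStreamer`. A minimal sketch, reusing `model`, `tokenizer`, and `inputs` from the example above; generation runs in a background thread while the main thread iterates over the streamed text:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Iterator-based streamer: yields decoded text chunks as they are produced
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread so the foreground loop can consume the stream
generation_kwargs = dict(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)  # handle each chunk as it arrives

thread.join()
```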