---
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - trl
  - sft
license: apache-2.0
language:
  - en
datasets:
  - BAAI/Infinity-Instruct
---

# Fine-tune Llama 3.2 3B Using Unsloth and the BAAI/Infinity-Instruct Dataset

This model was trained on the "0625" version of the dataset; a model fine-tuned on the "7M" version will follow.
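
For reference, the training data can be pulled with the `datasets` library. This is a minimal sketch, assuming "0625" is exposed as a dataset configuration name on the Hub (as the version naming above suggests):

```python
from datasets import load_dataset

# Load the "0625" subset of BAAI/Infinity-Instruct
# (configuration name assumed from the version naming above)
dataset = load_dataset("BAAI/Infinity-Instruct", "0625", split="train")
print(dataset[0])
```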

## Uploaded Model

- **Developed by:** MateoRov
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/llama-3.2-3b-instruct-bnb-4bit

## Usage

Check my full repo on GitHub for a better understanding: https://github.com/Mateorovere/FineTuning-LLM-Llama3.2-3b

With the proper dependencies installed, you can run the model with the following code:

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Generate the output
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Decode the outputs
result = tokenizer.batch_decode(outputs)
print(result)
```
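
The decoded output above includes the prompt and any special tokens. An optional refinement, assuming the same `inputs` and `outputs` as in the snippet above, keeps only the newly generated text:

```python
# Slice off the prompt tokens and drop special tokens to keep only the reply
reply = tokenizer.batch_decode(
    outputs[:, inputs.shape[1]:],
    skip_special_tokens=True,
)[0]
print(reply)
```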

To stream the generation token by token:

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Initialize the text streamer (skip_prompt avoids echoing the input)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the output token by token
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
```
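
If you prefer to skip Unsloth at inference time, the checkpoint may also load through the standard `transformers` API. This is a minimal, untested sketch: it assumes the repo ships full merged weights (rather than LoRA adapters alone) and that `accelerate` is installed for device placement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere"

# Assumes standard transformers-compatible weights on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate
)

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```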