Model Information

GRPO-llama3.1-reasoning is a reasoning-tuned LoRA adapter for meta-Llama-3.1-8B-Instruct, trained with GRPO (Group Relative Policy Optimization). The GRPOTrainer supports custom reward functions in place of dense reward models, as sketched after the list below.

  • Base model: meta-llama/meta-Llama-3.1-8B-Instruct
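
The specific reward functions and dataset used to train this adapter are not documented here; the snippet below is only a minimal sketch of the custom-reward mechanism, assuming TRL's GRPOTrainer, a plain-text prompt dataset named train_dataset, and a hypothetical format_reward function.

import re
from trl import GRPOTrainer, GRPOConfig

def format_reward(completions, **kwargs):
    # Hypothetical reward: 1.0 if the completion follows the
    # <reasoning>...</reasoning><answer>...</answer> template, else 0.0.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model = "meta-llama/meta-Llama-3.1-8B-Instruct",
    reward_funcs = [format_reward],          # custom reward functions instead of a reward model
    args = GRPOConfig(output_dir = "grpo-out"),
    train_dataset = train_dataset,           # assumed: a dataset with a "prompt" column
)
trainer.train()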

How to use

Start with Unsloth, which reduces memory usage by up to 80%.

import torch
from huggingface_hub import snapshot_download
from unsloth import FastLanguageModel, PatchFastRL

# Patch Unsloth's FastLanguageModel with GRPO / fast-inference support.
PatchFastRL("GRPO", FastLanguageModel)

max_seq_length = 512
lora_rank = 32

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,            # 4-bit quantization to fit the 8B model in less VRAM
    fast_inference = True,          # enable vLLM-backed fast generation
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,   # fraction of GPU memory reserved for vLLM
)
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
model_id = "nirusanan/GRPO-llama3.1-reasoning"

# Download the LoRA adapter and attach it to the base model.
snapshot_download(repo_id=model_id, local_dir="llama-grpo_saved_lora",
                  local_dir_use_symlinks=False, revision="main")
model.load_adapter("llama-grpo_saved_lora")
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
)[0].outputs[0].text

print(output)
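
Because SYSTEM_PROMPT asks for <reasoning> and <answer> tags, the final answer can be pulled out of output with a small helper. This is only an illustrative sketch; the extract_answer helper is not part of the model's own code, and it assumes the model actually followed the requested format.

import re

def extract_answer(generated: str) -> str:
    # Assumes the model followed the <answer>...</answer> format from SYSTEM_PROMPT;
    # falls back to the full text if the tag is missing.
    match = re.search(r"<answer>(.*?)</answer>", generated, re.DOTALL)
    return match.group(1).strip() if match else generated.strip()

print(extract_answer(output))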