## Model Information
GRPO-llama3.1-reasoning is a reasoning model fine-tuned from meta-Llama-3.1-8B-Instruct using GRPO (Group Relative Policy Optimization). The GRPOTrainer supports custom reward functions in place of dense reward models; an illustrative sketch of wiring one up follows the list below.
- Base model: meta-llama/meta-Llama-3.1-8B-Instruct
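The model card does not include the training script, so the following is only a minimal, hypothetical sketch of how a custom reward function plugs into TRL's `GRPOTrainer`; the dataset and the reward logic here are placeholders, not the ones used to train this model.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Hypothetical reward: 1.0 when a completion follows the
    # <reasoning>/<answer> tag format, 0.0 otherwise.
    return [
        1.0 if "<reasoning>" in c and "<answer>" in c else 0.0
        for c in completions
    ]

# Placeholder prompt dataset; the actual training data is not documented here.
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="meta-llama/meta-Llama-3.1-8B-Instruct",
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-output"),
    train_dataset=dataset,
)
trainer.train()
```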
## How to use
Start by loading the base model with Unsloth, which reduces memory usage by 80%.
```python
import torch
from huggingface_hub import snapshot_download
from unsloth import FastLanguageModel, PatchFastRL

# Patch Unsloth with fast GRPO/RL support before loading the model
PatchFastRL("GRPO", FastLanguageModel)

max_seq_length = 512
lora_rank = 32

# Load the base model in 4-bit with vLLM fast inference enabled
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)
```
Define the system prompt that enforces the reasoning/answer format, then download the GRPO-trained LoRA adapter and attach it to the base model:

```python
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

# Download the trained LoRA adapter and load it onto the base model
model_id = "nirusanan/GRPO-llama3.1-reasoning"
snapshot_download(
    repo_id=model_id,
    local_dir="llama-grpo_saved_lora",
    local_dir_use_symlinks=False,
    revision="main",
)
model.load_adapter("llama-grpo_saved_lora")  # path matches local_dir above
```
Build the chat prompt and generate with vLLM sampling parameters:

```python
from vllm import SamplingParams

# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)

# Generate with the LoRA-adapted model and take the first completion
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
)[0].outputs[0].text

print(output)
```
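Because the system prompt asks the model to wrap its final result in `<answer>` tags, you may want to extract just that part. The helper below is a small hypothetical sketch and is not part of this model card:

```python
import re

def extract_answer(response: str) -> str:
    # Return the text between <answer>...</answer>, or the full response if absent
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response

print(extract_answer(output))
```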