Llama 3.1-8B Fine-tuned with GRPO

Model Name: yuxiang204/llama3-8b-finetuned
Base Model: meta-llama/Meta-Llama-3.1-8B
Fine-tuned with: Unsloth + GRPO (Group Relative Policy Optimization)
Quantization: Available in FP16, Q4_K_M, Q5_K_M, and Q8_0 (GGUF)
License: MIT

📌 Model Overview

This is a fine-tuned version of Meta's Llama 3.1-8B, trained with GRPO using the Unsloth framework. The fine-tuning process focused on enhancing structured reasoning and improving response quality.

It includes:

  • FP16 Safetensors for Hugging Face Transformers (see the loading sketch after this list)
  • GGUF quantized versions for fast inference in llama.cpp, Ollama, and KoboldAI
  • LoRA adapters for further fine-tuning
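The FP16 weights load like any standard Transformers checkpoint. The snippet below is a minimal sketch; the prompt and generation settings are illustrative and not part of the released configuration.

```python
# Minimal sketch: load the FP16 safetensors with Hugging Face Transformers.
# Prompt and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yuxiang204/llama3-8b-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the checkpoint is stored in FP16
    device_map="auto",
)

prompt = "Which is bigger? 9.11 or 9.9?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The GGUF files are intended for llama.cpp-based runtimes (llama.cpp, Ollama, KoboldAI) and do not go through Transformers.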

🛠 Training Details

  • Fine-tuning Method: GRPO (Group Relative Policy Optimization)
  • Training Duration: ~10 hours
  • Dataset: Custom instructional dataset (mainly reasoning-based tasks)
  • GPU Used: A100 (80GB)

The fine-tuning aimed to improve logical reasoning, mathematical accuracy, and the structure of responses.
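For reference, a rough sketch of how such a run can be set up with Unsloth and TRL's GRPOTrainer is shown below. The dataset, reward function, and hyperparameters here are placeholders (the actual training data and reward signals used for this model are not published), so treat it as an outline rather than the exact recipe.

```python
# Rough sketch of a GRPO run with Unsloth + TRL.
# Dataset, reward function, and hyperparameters are placeholders,
# not the actual values used to train this model.
from unsloth import FastLanguageModel  # import unsloth before transformers/trl
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load the base model with Unsloth's optimized loader
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,  # keeps the 8B model within a single GPU's memory
)

# Attach LoRA adapters; only these weights are updated during training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# GRPO samples a group of completions per prompt and scores each one.
# Placeholder reward: favor answers that include a "####"-marked final result.
def format_reward(completions, **kwargs):
    return [1.0 if "####" in c else 0.0 for c in completions]

# Illustrative reasoning dataset; GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward],
    args=GRPOConfig(output_dir="llama3-8b-grpo", max_steps=1000),
    train_dataset=dataset,
)
trainer.train()
```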

📊 Performance Comparison

Before Fine-tuning (Base Model):

“Which is bigger? 9.11 or 9.9?”

→ Inconsistent or incomplete reasoning

After GRPO Fine-tuning:

**“Both are not equal! Since 9.11 has a slightly larger decimal part than 9.9, 9.11 is actually bigger.”**

→ More structured and detailed response
