# Llama 3.1-8B Fine-tuned with GRPO
Model Name: yuxiang204/llama3-8b-finetuned
Base Model: meta-llama/Meta-Llama-3.1-8B
Fine-tuned with: Unsloth + GRPO (Group Relative Policy Optimization)
Quantization: Available in FP16, Q4_K_M, Q5_K_M, and Q8_0 (GGUF)
License: MIT
## Model Overview
This is a fine-tuned version of Meta's Llama 3.1-8B, trained with GRPO using the Unsloth
framework. The fine-tuning process focused on enhancing structured reasoning and improving response quality.
It includes:
- FP16 Safetensors for Hugging Face Transformers (see the loading sketch below)
- GGUF quantized versions for fast inference in llama.cpp, Ollama, and KoboldAI
- LoRA adapters for further fine-tuning
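
The FP16 safetensors load with the standard Transformers API. The snippet below is a minimal sketch, assuming the repo ID listed above and a recent `transformers` release with Llama 3.1 support; adjust dtype and device placement to your hardware.

```python
# Minimal loading sketch for the FP16 safetensors (repo ID taken from this card;
# dtype/device settings are illustrative and may need adjusting).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yuxiang204/llama3-8b-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Which is bigger? 9.11 or 9.9?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```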
## Training Details
- Fine-tuning Method: GRPO (Group Relative Policy Optimization)
- Training Duration: ~10 hours
- Dataset: Custom instructional dataset (mainly reasoning-based tasks)
- GPU Used: A100 (80GB)
The fine-tuning aimed at improving logical reasoning, mathematical accuracy, and structured responses.
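
As a rough illustration of this kind of setup, the sketch below combines Unsloth's `FastLanguageModel` with TRL's `GRPOTrainer`. The reward function, dataset file, LoRA configuration, and hyperparameters are illustrative assumptions, not the exact recipe used to train this model.

```python
# Illustrative GRPO fine-tuning sketch with Unsloth + TRL; the reward function,
# dataset path, and hyperparameters are assumptions, not the exact training recipe.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder reward: favor longer, more structured completions.
def structure_reward(completions, **kwargs):
    return [min(len(c) / 400.0, 1.0) for c in completions]

# Hypothetical dataset with a "prompt" column of reasoning tasks.
dataset = load_dataset("json", data_files="reasoning_tasks.json", split="train")

trainer = GRPOTrainer(
    model=model,
    reward_funcs=structure_reward,
    args=GRPOConfig(output_dir="llama3-8b-grpo", per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```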
## Performance Comparison
**Before Fine-tuning (Base Model):**

Prompt: “Which is bigger? 9.11 or 9.9?”

Response: inconsistent or incomplete reasoning.

**After GRPO Fine-tuning:**

**“Both are not equal! Since 9.11 has a slightly larger decimal part than 9.9, 9.11 is actually bigger.”**

A more structured and detailed response.
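
To try this kind of comparison locally with the GGUF quantizations, llama-cpp-python is one option. The sketch below is an assumption-laden example: the GGUF filename is a placeholder and should be replaced with the actual file shipped in the repo.

```python
# Quantized inference sketch via llama-cpp-python; the GGUF filename is a
# placeholder, not the exact file name published with this model.
from llama_cpp import Llama

llm = Llama(
    model_path="llama3-8b-finetuned.Q4_K_M.gguf",  # assumed filename
    n_ctx=2048,
)

result = llm("Which is bigger? 9.11 or 9.9?", max_tokens=128)
print(result["choices"][0]["text"])
```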