customer-support-grpo-v5

A hierarchical multi-agent reinforcement learning model trained with GRPO for realistic customer support scenarios.

This model was developed as part of the Meta OpenEnv Hackathon Round 2 (April 2026).

Model Description

customer-support-grpo-v5 is a Llama-3.1-8B-Instruct model fine-tuned with Unsloth and GRPO (Group Relative Policy Optimization). It powers a 3-level hierarchical multi-agent system designed to simulate and improve real-world customer support in Indian enterprise environments.
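
The 3-level hierarchy can be pictured as a ticket that moves up one level at a time. The sketch below is illustrative only: the `Level`, `Ticket`, and `escalate` names are hypothetical and are not taken from the project code, and the real environment may decide escalation differently.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three agent levels named in this card."""
    L1_SUPPORT_AGENT = 1
    L2_SUPERVISOR = 2
    L3_MANAGER = 3


@dataclass
class Ticket:
    issue: str
    level: Level = Level.L1_SUPPORT_AGENT  # every ticket starts at L1


def escalate(ticket: Ticket) -> Ticket:
    """Return a copy of the ticket one level up; L3 is the top of the chain."""
    next_level = min(ticket.level + 1, Level.L3_MANAGER)
    return Ticket(issue=ticket.issue, level=Level(next_level))


ticket = Ticket("charged twice for ORD-EC-1202")
ticket = escalate(ticket)
print(ticket.level.name)  # L2_SUPERVISOR
```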

Key Features

Hierarchical Agents: L1 Support Agent, L2 Supervisor, L3 Manager with escalation logic
Progressive Curriculum Learning: 5 stages from basic to nightmare difficulty
Hybrid Reward System:
- Rule-based: VADER sentiment, efficiency, accuracy
- LLM-as-Judge: empathy, policy adherence, resolution quality
Grounded Responses: NoSQL DB integration for user/order context
Realistic Challenges:
- Policy drift mid-conversation
- SLA pressure
- Hinglish users
- Multi-turn coordination
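
As a rough illustration of the hybrid reward listed above, the sketch below blends the three rule-based signals with the three LLM-as-Judge signals. The equal-weight averaging, the `rule_weight` parameter, the [0, 1] score range, and all class and function names are assumptions made for this example; the card does not state how the project actually combines the signals.

```python
from dataclasses import dataclass


@dataclass
class RuleScores:
    sentiment: float   # assumed: e.g. a VADER compound score rescaled to [0, 1]
    efficiency: float
    accuracy: float


@dataclass
class JudgeScores:
    empathy: float
    policy_adherence: float
    resolution_quality: float


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def hybrid_reward(rule: RuleScores, judge: JudgeScores, rule_weight: float = 0.5) -> float:
    """Weighted blend of rule-based and judge scores (hypothetical weighting)."""
    if not 0.0 <= rule_weight <= 1.0:
        raise ValueError("rule_weight must be in [0, 1]")
    rule_part = _mean([rule.sentiment, rule.efficiency, rule.accuracy])
    judge_part = _mean([judge.empathy, judge.policy_adherence, judge.resolution_quality])
    return rule_weight * rule_part + (1.0 - rule_weight) * judge_part


reward = hybrid_reward(RuleScores(0.8, 0.6, 1.0), JudgeScores(0.7, 0.9, 0.5))
print(round(reward, 3))
```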

Training Details

Base Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
Method: GRPO with LoRA
Hardware: Hugging Face Jobs (A100 / L40S)
Steps: 150
Training Time: ~7 hours
Framework: Unsloth + TRL + custom rollout + hybrid reward engine
Date: April 26, 2026
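
For readers new to GRPO, its central idea is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in plain Python. It is not the TRL or Unsloth implementation; the function name, the `eps` value, and the use of the population standard deviation are choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    that beat the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for three completions sampled from the same prompt (made-up values).
advantages = group_relative_advantages([0.2, 0.5, 0.9])
print(advantages)
```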

This was the fifth training attempt, following several failed runs (compute credits, timeouts, infrastructure issues).

Intended Use

Customer support simulation
Multi-agent coordination experiments
Instruction-following research
Long-horizon reasoning

Live Demo:
https://huggingface.co/spaces/lebiraja/customer-support-env

How to Use

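from transformers import pipeline

# Load the fine-tuned checkpoint as a chat-style text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="lebiraja/customer-support-grpo-v5",
    device="cuda",        # assumes a CUDA GPU is available
    torch_dtype="auto"    # use the dtype stored in the checkpoint
)

# Chat-format input: a system prompt followed by the customer's message.
messages = [
    {"role": "system", "content": "You are a professional customer support agent..."},
    {"role": "user", "content": "I was charged twice for my order ORD-EC-1202"}
]

# temperature only takes effect when sampling is enabled.
response = pipe(messages, max_new_tokens=512, do_sample=True, temperature=0.7)
# The pipeline returns the whole conversation; the last message is the model's reply.
print(response[0]["generated_text"][-1]["content"])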

Use the system prompt from the training environment for best results.
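
Key Features mentions grounding through user/order context from a NoSQL database. One possible way to pass such context at inference time is to append it to the system prompt, as sketched below. Everything here is an assumption for illustration: the `ORDERS` dict stands in for the real database, its fields are invented, and `build_messages` is a hypothetical helper, not part of the training environment.

```python
import json

# Hypothetical in-memory stand-in for the NoSQL order store; fields are invented.
ORDERS = {
    "ORD-EC-1202": {"status": "delivered", "charges": 2},
}


def build_messages(user_text: str, order_id: str, base_prompt: str) -> list:
    """Return chat messages with the order record appended to the system prompt."""
    order = ORDERS.get(order_id)
    context = json.dumps(order, sort_keys=True) if order else "no matching order found"
    system = f"{base_prompt}\n\nOrder context for {order_id}: {context}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]


messages = build_messages(
    "I was charged twice for my order ORD-EC-1202",
    "ORD-EC-1202",
    "You are a professional customer support agent...",
)
print(messages[0]["content"])
```

The resulting list has the same shape as the `messages` variable in the example above, so it can be passed to `pipe` unchanged.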

Performance Highlights

Strong reward improvement across curriculum
Effective hierarchical coordination
Reduced hallucinations via grounding + penalties
Good handling of Hinglish + frustrated users
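
To make "reward improvement across curriculum" more concrete, the sketch below shows one common way a 5-stage curriculum can be advanced: move to the next stage once the average reward over a recent window crosses a threshold. The promotion rule, the `promote_at` and `window` parameters, and the `Curriculum` class are assumptions for this example; the card does not describe the project's actual stage-advancement logic.

```python
from collections import deque

NUM_STAGES = 5  # the card lists 5 stages, from basic to nightmare


class Curriculum:
    """Advance one stage when the recent mean reward reaches a threshold."""

    def __init__(self, promote_at: float = 0.7, window: int = 20):
        self.stage = 1
        self.promote_at = promote_at          # hypothetical threshold
        self.recent = deque(maxlen=window)    # rolling window of episode rewards

    def record(self, reward: float) -> int:
        """Store one episode reward and return the (possibly updated) stage."""
        self.recent.append(reward)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and self.stage < NUM_STAGES:
            if sum(self.recent) / len(self.recent) >= self.promote_at:
                self.stage += 1
                self.recent.clear()  # start a fresh window for the new stage
        return self.stage


curriculum = Curriculum(promote_at=0.7, window=3)
for reward in [0.9, 0.8, 0.75]:
    stage = curriculum.record(reward)
print(stage)  # 2
```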

Limitations

Trained on simulated data, not production conversations
Context length is constrained
May produce invalid actions in edge cases
Requires the correct system prompt
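
Because the model may emit invalid actions, callers may want to validate its output before acting on it. The sketch below assumes the model returns a JSON object with an `action` field; that format, the action names in `ALLOWED_ACTIONS`, and the `parse_action` helper are all hypothetical and should be replaced with whatever the training environment actually defines.

```python
import json

# Hypothetical action set; the real environment's action names may differ.
ALLOWED_ACTIONS = {"respond", "escalate", "resolve"}


def parse_action(raw: str):
    """Return the parsed action dict, or None if the output is not a valid action."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    return data


print(parse_action('{"action": "escalate"}'))
print(parse_action("Sure, let me help with that!"))  # None: not JSON
```

A caller can treat `None` as a signal to retry generation or fall back to a safe default reply.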

Repository Links

Live Demo: https://huggingface.co/spaces/lebiraja/customer-support-env
Full Project: https://github.com/lebiraja/meta_hack
Previous Versions: v2, v3, v4

License

Apache-2.0

Citation

@misc{customer-support-grpo-v5,
  author       = {Lebi Raja and team},
  title        = {customer-support-grpo-v5: Hierarchical Multi-Agent RL for Customer Support},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/customer-support-grpo-v5}},
  note         = {Meta OpenEnv Hackathon Round 2}
}

Built with ❤️ using Unsloth, Hugging Face, and a lot of late-night debugging.
