customer-support-grpo-v5

A hierarchical multi-agent reinforcement learning model trained with GRPO for realistic customer support scenarios.

This model was developed as part of the Meta OpenEnv Hackathon Round 2 (April 2026).


Model Description

customer-support-grpo-v5 is a Llama-3.1-8B-Instruct model fine-tuned with Unsloth and GRPO (Group Relative Policy Optimization). It powers a three-level hierarchical multi-agent system designed to simulate and improve real-world customer support in Indian enterprise environments.

Key Features

  • Hierarchical Agents: L1 Support Agent, L2 Supervisor, L3 Manager with escalation logic
  • Progressive Curriculum Learning: 5 stages from basic to nightmare difficulty
  • Hybrid Reward System:
    • Rule-based: VADER sentiment, efficiency, accuracy
    • LLM-as-Judge: empathy, policy adherence, resolution quality
  • Grounded Responses: NoSQL DB integration for user/order context
  • Realistic Challenges:
    • Policy drift mid-conversation
    • SLA pressure
    • Hinglish users
    • Multi-turn coordination
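
The hybrid reward listed above could be combined along the following lines. This is a minimal sketch, not the project's actual reward engine: the names `hybrid_reward` and `rescale_vader`, the equal weighting, and the assumption that every component score is normalized to [0, 1] are all hypothetical. VADER's compound score ranges over [-1, 1], so the sketch rescales it before blending.

```python
from statistics import mean


def rescale_vader(compound: float) -> float:
    """Map a VADER compound score from [-1, 1] onto [0, 1]."""
    return (compound + 1.0) / 2.0


def hybrid_reward(rule_scores: dict, judge_scores: dict, rule_weight: float = 0.5) -> float:
    """Blend rule-based and LLM-judge scores (each in [0, 1]) into one scalar.

    rule_weight is a hypothetical knob; the real weighting is not documented here.
    """
    if not 0.0 <= rule_weight <= 1.0:
        raise ValueError("rule_weight must be in [0, 1]")
    rule_part = mean(rule_scores.values())    # average of the rule-based signals
    judge_part = mean(judge_scores.values())  # average of the judge's rubric scores
    return rule_weight * rule_part + (1.0 - rule_weight) * judge_part


# Illustrative (made-up) scores for a single support conversation.
rule_scores = {"sentiment": rescale_vader(0.5), "efficiency": 0.75, "accuracy": 1.0}
judge_scores = {"empathy": 0.5, "policy_adherence": 1.0, "resolution_quality": 0.75}
reward = hybrid_reward(rule_scores, judge_scores)
```

Averaging within each group before weighting keeps one side from dominating simply because it has more components; a real system might instead weight each signal individually.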

Training Details

  • Base Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
  • Method: GRPO with LoRA
  • Hardware: Hugging Face Jobs (A100 / L40S)
  • Steps: 150
  • Training Time: ~7 hours
  • Framework: Unsloth + TRL + custom rollout + hybrid reward engine
  • Date: April 26, 2026
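
For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that normalization step. It is an illustration of the general technique, not the Unsloth/TRL code used for this run; the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions sampled from one prompt.
group_rewards = [0.25, 0.5, 0.75, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.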

This was the fifth training attempt; earlier runs failed because of exhausted credits, timeouts, and infrastructure issues.


Intended Use

  • Customer support simulation
  • Multi-agent coordination experiments
  • Instruction-following research
  • Long-horizon reasoning

Live Demo:
https://huggingface.co/spaces/lebiraja/customer-support-env


How to Use

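from transformers import pipeline

# Load the fine-tuned model as a chat-capable text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="lebiraja/customer-support-grpo-v5",
    device="cuda",       # requires a CUDA-capable GPU
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)

messages = [
    {"role": "system", "content": "You are a professional customer support agent..."},
    {"role": "user", "content": "I was charged twice for my order ORD-EC-1202"},
]

# do_sample=True ensures the temperature setting actually takes effect.
response = pipe(messages, max_new_tokens=512, do_sample=True, temperature=0.7)

# With chat-style input, generated_text holds the full conversation;
# the last message is the model's reply.
print(response[0]["generated_text"][-1]["content"])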

Use the system prompt from the training environment for best results.


Performance Highlights

  • Strong reward improvement across curriculum
  • Effective hierarchical coordination
  • Reduced hallucinations via grounding + penalties
  • Good handling of Hinglish + frustrated users

Limitations

  • Trained on simulated conversations, not production support data
  • Context length is constrained
  • May produce invalid actions in edge cases
  • Requires the correct system prompt (see How to Use)


License

Apache-2.0


Citation

@misc{customer-support-grpo-v5,
  author       = {Lebi Raja and team},
  title        = {customer-support-grpo-v5: Hierarchical Multi-Agent RL for Customer Support},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/customer-support-grpo-v5}},
  note         = {Meta OpenEnv Hackathon Round 2}
}

Built with ❤️ using Unsloth, Hugging Face, and a lot of late-night debugging.
