Shaer-adapters-grpo-vnext

This repo is the first patched rerun after Shaer-AI/Shaer-adapters-grpo was reclassified as reward-hacked.

Current Status As Of 2026-04-13

This repo is still an important transition result, but it is no longer the current direction.

The best completed GRPO stage is Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT baseline
  2. Shaer-AI/Shaer-adapters-grpo: historically strong-looking but reward-hacked GRPO run
  3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template and artifact-filtering GRPO rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: weighted short-subset rerun

What Data It Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: dropped-trio curated subset, cap 3000 per surviving meter
  • eval bank: full 13-meter eval bank, 104 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_104406
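
The train-subset rule above (drop some meters, then cap the rest at 3000 rows each) can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the `meter` field name, the `cap_per_meter` helper, and the seeded shuffle are all assumptions, and the set of dropped meters is passed in because it is not listed here.

```python
import random
from collections import defaultdict


def cap_per_meter(rows, dropped_meters, cap=3000, seed=0):
    """Drop excluded meters, then keep at most `cap` rows per surviving meter.

    `rows` is an iterable of dicts with a hypothetical "meter" key.
    """
    rng = random.Random(seed)

    # Group rows by meter, skipping the meters that were dropped.
    by_meter = defaultdict(list)
    for row in rows:
        if row["meter"] not in dropped_meters:
            by_meter[row["meter"]].append(row)

    # Shuffle within each meter so the cap takes a random sample,
    # then keep the first `cap` rows of each group.
    subset = []
    for meter in sorted(by_meter):
        group = by_meter[meter]
        rng.shuffle(group)
        subset.extend(group[:cap])
    return subset
```

Sorting the meter names and fixing the seed keeps the subset reproducible across runs.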

Reward Used Here

This run introduced the stricter structure-side reward patch, which was designed to eliminate the old reward-hacked behavior:

reward_total = meter * count_adherence * arabic_clean * repeat_penalty

with much stronger internals for:

  • artifact-free Arabic filtering
  • lexical plausibility
  • near-duplicate detection
  • opening diversity
  • distinct-2 phrase diversity
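
As an illustration of the last item, distinct-2 is commonly defined as the number of unique bigrams divided by the total number of bigrams in a text. The sketch below uses that common definition with whitespace tokenization; the `distinct_n` helper is hypothetical, and how this run tokenizes text or folds the score into `repeat_penalty` is not specified here.

```python
def distinct_n(tokens, n=2):
    """Return the share of unique n-grams among all n-grams in `tokens`.

    Values near 1.0 mean little repetition; values near 0.0 mean the
    text repeats the same phrases over and over.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Too short to form any n-gram; treated as zero diversity here.
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# A templated line that repeats itself scores low.
repetitive = "a b a b a b".split()
varied = "a b c d".split()
print(distinct_n(repetitive), distinct_n(varied))
```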

This stage still did not use a semantic judge inside the optimized reward. It was mainly a structure-side cleanup stage.
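
Because the reward is a product, any single component close to zero pulls the total close to zero, which is what lets the filters act as gates rather than small deductions. The sketch below shows only that composition. It assumes each component is already a score in [0, 1]; the `reward_total` function is a hypothetical stand-in, and the internals of each component are not reproduced.

```python
def reward_total(meter, count_adherence, arabic_clean, repeat_penalty):
    """Multiply four component scores, each assumed to lie in [0, 1]."""
    components = {
        "meter": meter,
        "count_adherence": count_adherence,
        "arabic_clean": arabic_clean,
        "repeat_penalty": repeat_penalty,
    }
    total = 1.0
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total *= value
    return total


# A metrically perfect poem that fails the cleanliness gate earns nothing.
print(reward_total(1.0, 1.0, 0.0, 1.0))
```

With an additive reward, the same poem would still collect credit for meter and count, which is one way a templated output can score well.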

Best Tracked Checkpoint

  • step: 500
  • eval total: 0.1937
  • eval meter: 0.5652
  • eval count adherence: 0.9099
  • eval judge diagnostic (not part of the optimized reward): 0.3774
  • eval repeat penalty: 0.5577
  • eval arabic clean: 0.8750
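
The eval total is lower than the product of the four component averages, which is about 0.251. That gap would be expected if, as assumed here, the total is computed per sample and then averaged over the eval bank: the mean of a product generally differs from the product of means when the components are correlated. The check below only reproduces the arithmetic from the numbers listed above.

```python
# Component averages reported for the step-500 checkpoint.
reported = {
    "meter": 0.5652,
    "count_adherence": 0.9099,
    "arabic_clean": 0.8750,
    "repeat_penalty": 0.5577,
}
reported_total = 0.1937

# Product of the averaged components (not the same as the averaged product).
product_of_means = 1.0
for value in reported.values():
    product_of_means *= value

print(round(product_of_means, 4))                   # 0.251
print(round(product_of_means - reported_total, 4))  # gap vs. the reported total
```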

What This Run Proved

This stage was important because it showed that the patched anti-template reward was much better than the original reward at rejecting the old hacked outputs.

But it still was not the final answer:

  • tracked reward was much lower than in the old hacked run
  • generation quality was still not strong enough
  • semantic quality still needed to be modeled more directly

Current Interpretation

For the paper story, this repo is the first serious repair stage after the hacked run. It is useful because it separates two claims:

  • yes, better anti-template and contamination logic matters
  • no, structure-only reward repair still does not solve meaning and relevance

Why We Moved On

This repo motivated the next shift: bring in a focused Arabic semantic judge that scores whether the poem:

  • has meaning
  • is not garbage
  • is relevant to the description

That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1.

Recommended Use

Use this repo as the first serious post-hack reward patch, not as the final recommended GRPO model.
