Shaer-adapters-grpo-vnext

This repo is the first patched rerun after Shaer-AI/Shaer-adapters-grpo was reclassified as reward-hacked.

Current Status As Of 2026-04-13

This repo is still an important transition result, but it is no longer the current direction.

The best completed GRPO stage is Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

Place In The Story

Project sequence:

1. Shaer-AI/Shaer-adapters: clean SFT baseline
2. Shaer-AI/Shaer-adapters-grpo: historically strong-looking but reward-hacked GRPO run
3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template and artifact-filtering GRPO rerun
4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun
6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: weighted short-subset rerun

What Data It Used

base starting adapter: Shaer-AI/Shaer-adapters
GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
train subset: dropped-trio curated subset, cap 3000 per surviving meter
eval bank: full 13-meter eval bank, 104 rows total
local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_104406

Reward Used Here

This run introduced a stricter structure-side reward patch, designed to eliminate the old hacked behavior:

reward_total = meter * count_adherence * arabic_clean * repeat_penalty

with much stronger internals for:

artifact-free Arabic filtering
lexical plausibility
near-duplicate detection
opening diversity
distinct-2 phrase diversity
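To make the last item concrete, distinct-2 is commonly computed as the share of bigrams in a text that are unique. The sketch below shows only that general metric; the function name `distinct_n` and the whitespace tokenization are assumptions for illustration, not this repo's actual reward code.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of n-grams in `text` that are unique (1.0 means no repeats)."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)


# A templated, looping line scores low; a line with no repeated bigram scores 1.0.
templated = distinct_n("ya layl ya layl ya layl")
varied = distinct_n("a b c d")
```

A low value flags the kind of repeated phrasing that a template-exploiting policy tends to produce, which is why a metric like this is a plausible input to a repeat penalty.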

This stage still did not use a semantic judge inside the optimized reward. It was mainly a structure-side cleanup stage.
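The multiplicative form of the reward above means any single failing component can zero out the whole score. A minimal sketch of that gating behavior follows, assuming each component is a score in [0, 1] (the card does not state the ranges); the function is illustrative and not the repo's implementation.

```python
def reward_total(meter: float, count_adherence: float,
                 arabic_clean: float, repeat_penalty: float) -> float:
    """Product of the four structure-side components named in this card."""
    components = (meter, count_adherence, arabic_clean, repeat_penalty)
    for value in components:
        # Assumption: every component is normalized to [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component out of range: {value}")
    total = 1.0
    for value in components:
        total *= value
    return total


# Perfect meter, count, and cleanliness, but a fully repeated (templated) poem:
gated = reward_total(1.0, 1.0, 1.0, 0.0)
# An additive average of the same scores would still pay out most of the reward.
additive = (1.0 + 1.0 + 1.0 + 0.0) / 4
```

Under a product, a templated output cannot compensate for repetition by scoring well on meter, whereas an additive mix would still reward it.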

Best Tracked Checkpoint

step: 500
eval total: 0.1937
eval meter: 0.5652
eval count adherence: 0.9099
eval judge diagnostic: 0.3774
eval repeat penalty: 0.5577
eval arabic clean: 0.8750
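Note that the eval total is not the product of the four averaged components listed above. One possible reading, stated here as an assumption rather than something this card confirms, is that the reward is multiplied per row and only then averaged over the 104 eval rows, in which case the mean of products need not equal the product of means. The arithmetic can be checked directly:

```python
import math

# Values copied from the step-500 checkpoint in this card.
reported_total = 0.1937
component_means = {
    "meter": 0.5652,
    "count_adherence": 0.9099,
    "arabic_clean": 0.8750,
    "repeat_penalty": 0.5577,
}

# Product of the averaged components (roughly 0.251).
product_of_means = math.prod(component_means.values())

# Positive gap: the averaged components multiply to more than the reported total.
gap = product_of_means - reported_total
```

The judge diagnostic (0.3774) is left out of the product because this stage tracked it for evaluation only and did not optimize it.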

What This Run Proved

This stage was important because it showed the patched anti-template reward was much better at rejecting the old hacked outputs.

But it still was not the final answer:

tracked reward was much lower than the old hacked run
generation quality was still not strong enough
semantic quality still needed to be modeled more directly

Current Interpretation

For the paper story, this repo is the first serious repair stage after the hacked run. It is useful because it separates two claims:

yes, better anti-template and contamination logic matters
no, structure-only reward repair still does not solve meaning and relevance

Why We Moved On

This repo motivated the next shift: bring in a focused Arabic semantic judge that scores whether the poem:

has meaning
is not garbage
is relevant to the description

That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1.

Recommended Use

Use this repo as the first serious post-hack reward patch, not as the final recommended GRPO model.

Model tree for Shaer-AI-2/Shaer-adapters-grpo-vnext

base model: humain-ai/ALLaM-7B-Instruct-preview
finetuned: Navid-AI/Yehia-7B-preview
adapter: Shaer-AI-2/Shaer-adapters-grpo-vnext (this model)