Shaer-adapters-grpo-vnext
This repo is the first patched rerun after Shaer-AI/Shaer-adapters-grpo was reclassified as reward-hacked.
Current Status As Of 2026-04-13
This repo is still an important transition result, but it is no longer the current direction.
The best completed GRPO stage is Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- Shaer-AI/Shaer-adapters: clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: historically strong-looking but reward-hacked GRPO run
- Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template and artifact-filtering GRPO rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: weighted short-subset rerun
What Data It Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: dropped-trio curated subset, cap 3000 per surviving meter
- eval bank: full 13-meter eval bank, 104 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_104406
Reward Used Here
This run introduced the stricter structure-side reward patch that was designed to kill the old hacked behavior:
reward_total = meter * count_adherence * arabic_clean * repeat_penalty
with much stronger internals for:
- artifact-free Arabic filtering
- lexical plausibility
- near-duplicate detection
- opening diversity
- distinct-2 phrase diversity
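As an illustration of the last item, distinct-2 is commonly computed as the share of unique bigrams among all bigrams in a text. The sketch below assumes plain whitespace tokenization and a hypothetical helper name `distinct_n`; the run's actual tokenizer, thresholds, and implementation are not described in this card.

```python
def distinct_n(tokens, n=2):
    """Return the ratio of unique n-grams to total n-grams (0.0 if there are none).

    Hypothetical helper for illustration only; not the run's actual code.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# A templated, repetitive line scores low; a varied line scores high.
repetitive = "a b a b a b".split()
varied = "a b c d e f".split()
print(distinct_n(repetitive))  # 2 unique bigrams out of 5
print(distinct_n(varied))      # 5 unique bigrams out of 5
```

A low score on this kind of measure is one signal that a completion is looping over the same phrase, which is the template behavior this stage was meant to penalize.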
This stage still did not use a semantic judge inside the optimized reward. It was mainly a structure-side cleanup stage.
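The multiplicative form of the reward can be sketched as follows. This is a minimal illustration, assuming each component is already a score in [0, 1]; the function name and the range check are hypothetical, and the real component scorers are not shown in this card.

```python
def reward_total(meter, count_adherence, arabic_clean, repeat_penalty):
    """Combine the four structure-side components by multiplication.

    Assumption: every component is a score in [0, 1].
    """
    components = (meter, count_adherence, arabic_clean, repeat_penalty)
    for value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component out of range: {value}")
    total = 1.0
    for value in components:
        total *= value
    return total


# A completion that fails any single check gets no reward,
# no matter how well it does on the others.
print(reward_total(0.9, 1.0, 0.0, 1.0))
```

Because the components are multiplied rather than summed, a strong score on one component cannot compensate for a near-zero score on another.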
Best Tracked Checkpoint
- step: 500
- eval total: 0.1937
- eval meter: 0.5652
- eval count adherence: 0.9099
- eval judge diagnostic: 0.3774
- eval repeat penalty: 0.5577
- eval arabic clean: 0.8750
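When reading these numbers, note that multiplying the four reported component averages does not reproduce the reported eval total. One plausible reading, which this card does not confirm, is that the total is computed per row and then averaged; the mean of per-row products generally differs from the product of per-column means. The sketch below checks the arithmetic and shows the effect on a toy two-row example.

```python
import math

# Reported eval averages for the reward components (from the list above).
eval_means = {
    "meter": 0.5652,
    "count_adherence": 0.9099,
    "arabic_clean": 0.8750,
    "repeat_penalty": 0.5577,
}
reported_total = 0.1937

product_of_means = math.prod(eval_means.values())
print(round(product_of_means, 4), "vs reported", reported_total)

# Toy illustration: two rows, two components each.
toy_rows = [(1.0, 0.0), (0.0, 1.0)]
mean_of_products = sum(a * b for a, b in toy_rows) / len(toy_rows)
toy_product_of_means = (
    sum(a for a, _ in toy_rows) / len(toy_rows)
) * (sum(b for _, b in toy_rows) / len(toy_rows))
print(mean_of_products, toy_product_of_means)
```

The judge diagnostic is left out of the product because it was not part of the optimized reward at this stage.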
What This Run Proved
This stage was important because it showed the patched anti-template reward was much better at rejecting the old hacked outputs.
But it still was not the final answer:
- tracked reward was much lower than in the old hacked run
- generation quality was still not strong enough
- semantic quality still needed to be modeled more directly
Current Interpretation
For the paper story, this repo is the first serious repair stage after the hacked run. It is useful because it separates two claims:
- yes, better anti-template and contamination logic matters
- no, structure-only reward repair still does not solve meaning and relevance
Why We Moved On
This repo motivated the next shift: bring in a focused Arabic semantic judge that scores whether the poem:
- has meaning
- is not garbage
- is relevant to the description
That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1.
Recommended Use
Use this repo as the first serious post-hack reward patch, not as the final recommended GRPO model.
Model tree for Shaer-AI-2/Shaer-adapters-grpo-vnext
Base model
humain-ai/ALLaM-7B-Instruct-preview
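For inspecting this checkpoint, a loading sketch is given below. It assumes the repo stores PEFT-format adapter weights that attach to the base model listed above, which the repo name suggests but this card does not confirm. The function name is hypothetical, and the `dtype="auto"` argument requires a recent Transformers release (older releases use `torch_dtype`).

```python
BASE_MODEL_ID = "humain-ai/ALLaM-7B-Instruct-preview"
ADAPTER_ID = "Shaer-AI-2/Shaer-adapters-grpo-vnext"


def load_adapter_model(base_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model and attach the adapter.

    Assumption: the adapter repo is in PEFT format. Imports are local so that
    defining this function does not require the libraries to be installed.
    """
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id, dtype="auto")
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()  # inference mode; this repo is not the recommended final model
    return tokenizer, model
```

Calling `load_adapter_model()` downloads the 7B base weights, so it is best run on a machine with enough disk space and memory for a model of that size.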