
payelb/HHRLHF_TinyLlama-1.1B_aligned_with_semantic_MARS_deberta_RM

Base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Alignment dataset: Anthropic/hh-rlhf

Reward model: payelb/HHRLHF_reward-model-deberta-v3-base_1k_fixed_MARS_semantic_distance_synth

Method: PPO alignment with LoRA adapters.

Reward model type: semantic-distance-aware MARS DeBERTa-v3-base reward model.

Training details:

  • NUM_TRAIN_SAMPLES: 1000
  • MAX_PROMPT_TOKENS: 256
  • MIN_NEW_TOKENS: 32
  • MAX_NEW_TOKENS: 64
  • TOTAL_PPO_STEPS: 250
  • PPO_EPOCHS: 2
  • LR: 5e-06
  • Batch size: 16
  • Mini-batch size: 4
  • Gradient accumulation: 4
  • INIT_KL_COEF: 0.02
  • TARGET_KL: 6.0
  • ADAP_KL_CTRL: True
  • Reward normalization and clipping enabled, clip=5.0
  • LoRA r=16, alpha=32, dropout=0.05
  • Generation during PPO: do_sample=True, top_p=0.9, temperature=0.7
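
The KL and reward settings above can be illustrated with a small sketch. It is not the training code for this model: the card does not name the PPO library, the KL horizon, or the order in which normalization and clipping are applied. The `horizon` value, the ±0.2 error clip, the normalize-then-clip order, and the names `AdaptiveKLController` and `normalize_and_clip` are therefore assumptions, modeled on the adaptive KL scheme commonly used in PPO-based RLHF.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass
class AdaptiveKLController:
    """Adaptive KL penalty coefficient (proportional controller)."""
    value: float = 0.02     # INIT_KL_COEF from the card
    target: float = 6.0     # TARGET_KL from the card
    horizon: int = 10_000   # ASSUMPTION: not stated in the card

    def update(self, observed_kl: float, n_samples: int) -> float:
        # Relative error against the target, clipped to +/-0.2 (ASSUMPTION)
        # so that a single noisy batch cannot swing the coefficient.
        error = max(-0.2, min(0.2, observed_kl / self.target - 1.0))
        # Raise the penalty when KL is above target, lower it when below.
        self.value *= 1.0 + error * n_samples / self.horizon
        return self.value


def normalize_and_clip(rewards, clip=5.0, eps=1e-8):
    """Standardize a batch of reward-model scores, then clip to [-clip, clip].

    ASSUMPTION: normalization is applied before clipping; the card only
    says that both are enabled with clip=5.0.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [max(-clip, min(clip, (r - mean) / (std + eps))) for r in rewards]


if __name__ == "__main__":
    controller = AdaptiveKLController()
    # One PPO step with batch size 16 and an observed KL above the target.
    print(controller.update(observed_kl=12.0, n_samples=16))
    print(normalize_and_clip([0.3, -1.2, 2.5, 0.9]))
```

Under these assumptions, a batch of 16 with an observed KL of 12.0 raises the coefficient only slightly, from 0.02 to about 0.0200064, so the penalty adapts slowly over the 250 PPO steps. The listed batch settings are also mutually consistent: a mini-batch size of 4 with 4 gradient-accumulation steps gives the batch size of 16.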