---
library_name: peft
---

First iteration of the default generator LoRA for [MiniHF](https://github.com/JD-P/minihf). This model still functions as a base model while writing more coherent text.

## Training procedure

This model was trained starting from the [MiniHF Mistral SFT evaluator](https://huggingface.co/jdpressman/minihf_evaluator_mistral_7b_v0.1/blob/main/README.md). It was created using the MiniHF Reinforcement Learning From AI Feedback pipeline:

`accelerate launch rlaif_generator.py --resume minihf_evaluator_mistral_7b_v0.1 --output-path mistral_h_eval --kl-weight 1.0 --constitution hermes/hermes_constitution.txt --prompts hermes/hermes_prompts.txt --length 256 --batch-size 4 --grad-accum-steps 8`

The tuning script was modified to use the AdamW optimizer with weight decay:

`opt = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2, betas=(0.9, 0.98))`

This weight decay is based on the observation that [RL tuning mode collapse](https://www.greaterwrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse) can be undone by interpolating the weights of the base model with those of the RL-tuned model. Here the specific recipe was to start from the MiniHF SFT evaluator, then apply weight decay and the KL penalty towards the base model weights to inject entropy back into the policy.

### Prompt Bank and Constitution

The prompt bank used during tuning is in the `hermes_prompts.txt` file found in this repo, and the constitution is in `hermes_constitution.txt`.

### Configuration

The following `bitsandbytes` quantization config was used during training:

- quant_method: bitsandbytes
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: float16

### Framework versions

- PEFT 0.5.0
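
The training procedure above refers to interpolating the weights of the base model with those of the RL-tuned model. As a rough illustration of that idea only, here is a minimal sketch of linear weight interpolation. It is not part of the MiniHF code, and this model was not produced this way (it relied on weight decay and the KL penalty instead). The function name `interpolate_weights`, the `alpha` parameter, and the toy parameter dictionaries are all hypothetical; the values only need to support `+` and `*`, so plain floats stand in for tensors here.

```python
from typing import Dict


def interpolate_weights(
    base: Dict[str, float], tuned: Dict[str, float], alpha: float
) -> Dict[str, float]:
    """Blend two parameter dictionaries: (1 - alpha) * base + alpha * tuned.

    alpha = 0.0 returns the base weights, alpha = 1.0 the tuned weights.
    Hypothetical helper for illustration, not a MiniHF or PEFT API.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned must have the same parameter names")
    return {
        name: (1.0 - alpha) * base[name] + alpha * tuned[name]
        for name in base
    }


# Toy stand-ins for real state dicts (assumed values, for illustration only).
base = {"w": 1.0, "b": 0.0}
tuned = {"w": 3.0, "b": -2.0}

print(interpolate_weights(base, tuned, 0.5))  # {'w': 2.0, 'b': -1.0}
```

Lower values of `alpha` pull the result closer to the base model, which is the direction in which the linked post suggests mode collapse can be undone.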