This is a concise version of Salesforce/SFR-Iterative-DPO-LLaMA-3-8B-R. During training, a conciseness penalty was applied to discourage unnecessarily long responses.
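
This card does not say what form the penalty takes. As an illustration only, here is a minimal sketch of one common approach: subtract a length-proportional term from a reward score before choosing the chosen/rejected pair for DPO. The names `Candidate`, `penalized_reward`, `pick_preference_pair`, and the coefficient `alpha` are all hypothetical, and the whitespace token count stands in for a real tokenizer; none of this is confirmed to match the actual training recipe.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    reward: float  # raw reward-model score (hypothetical input)


def penalized_reward(reward: float, num_tokens: int, alpha: float = 0.001) -> float:
    """Assumed penalty form: raw reward minus alpha times response length."""
    return reward - alpha * num_tokens


def pick_preference_pair(
    candidates: List[Candidate], alpha: float = 0.001
) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected) ranked by length-penalized reward.

    Length is approximated by whitespace-separated words; a real pipeline
    would count tokens with the model's tokenizer.
    """
    ranked = sorted(
        candidates,
        key=lambda c: penalized_reward(c.reward, len(c.text.split()), alpha),
    )
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    samples = [
        Candidate("Paris.", reward=1.00),
        Candidate(
            "The capital of France is, as many people know, the city of Paris.",
            reward=1.05,
        ),
    ]
    chosen, rejected = pick_preference_pair(samples, alpha=0.01)
    print("chosen:", chosen.text)
    print("rejected:", rejected.text)
```

With `alpha=0` the longer answer wins on raw reward; with a positive `alpha` the shorter answer becomes the chosen sample, which is the effect a conciseness penalty is meant to have.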