Iterative Reasoning Preference Optimization
Paper: arXiv:2404.19733
Note: LoRA fine-tunes a judge LM on Prometheus's 10K feedback dataset. The trick is turning the LLM into a classifier to encourage 'overfitting' to the scoring task, which yields a slightly better-performing model based on Phi-3 (which arguably already has stronger performance than Mistral). Not that surprising, and fine-tuning on a large human-preference dataset is boring. They did release code for the experiment, which is nice to have. The real gem is efficient alignment.
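For concreteness, here is a minimal sketch of the setup the note describes: LoRA-adapting a base LM with a classification head so the judge predicts a discrete quality score instead of generating free-form feedback. The model name, LoRA rank, score range (1-5), and field names (`prompt`, `response`, `score`) are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: LoRA fine-tuning an LLM as a score classifier
# on Prometheus-style feedback data. Names and hyperparameters are assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "microsoft/Phi-3-mini-4k-instruct"  # illustrative backbone choice
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fix for causal-LM tokenizers

# Treat judging as 5-way classification over discrete quality scores,
# constraining the output space instead of generating free-form text.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)
model.config.pad_token_id = tokenizer.pad_token_id

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                                 # low-rank dimension; illustrative
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter + head weights train

def encode(example):
    # Each example concatenates prompt and response; the label is the
    # human feedback score, shifted from 1-5 to class ids 0-4.
    text = f"Instruction: {example['prompt']}\nResponse: {example['response']}"
    enc = tokenizer(text, truncation=True, max_length=2048)
    enc["labels"] = example["score"] - 1
    return enc
```

The classification head is what makes the "overfitting" framing work: cross-entropy over five fixed classes is a much tighter objective than next-token prediction over the full vocabulary, so the small LoRA adapter can specialize hard on the scoring task.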