Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2

Published on Nov 17, 2023
· Featured in Daily Papers on Nov 20, 2023


Since the release of TÜLU [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. Concretely, we release: (1) TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2) TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple perspectives shows that the TÜLU 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data, training and evaluation code to facilitate future open efforts on adapting large language models.
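For readers unfamiliar with DPO, the per-pair objective (from Rafailov et al., 2023) can be sketched in a few lines. This is a minimal illustration, not the authors' training code; the function and parameter names are our own:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO loss for one preference pair:
    #   -log sigmoid(beta * ((logpi_w - logpref_w) - (logpi_l - logpref_l)))
    # where logpi_* are policy log-probs and logpref_* are frozen
    # reference-model log-probs for the chosen (w) and rejected (l) responses.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference model the margin is zero and the loss is log 2; pushing probability toward the chosen response (relative to the reference) shrinks the loss toward zero.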


I really love this kind of research! I also agree with the paper that future work should analyze what lies behind the various datasets, training methods, and base models. I hope more research like this follows!

Great to see the Zephyr recipe being battle tested on larger models!


Hello! @Muennighoff pointed out to me that your Zephyr (and Xwin) results on AlpacaEval differ from those on the public leaderboard. In particular, we reported a win rate of 90.60% for Zephyr, but your table has 86.3%.

Is this simply due to a different choice of generation parameters, i.e., did you use a different config from the one we added in the AlpacaEval repo?

Thanks! cc @hamishivi @yizhongw do you know why?

@lewtun Yeah, this is with greedy decoding, rather than 0.7 temperature - we used this for all the models we tested. I'll also note that I think there can be some variation (around 1-2 points) just from rerunning eval, probably due to GPT annotation non-determinism.
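For readers following along: greedy decoding and temperature sampling can produce measurably different outputs, and thus different GPT-judged win rates. A minimal pure-Python sketch of the two strategies over a single hypothetical next-token distribution (function names are ours, not AlpacaEval's):

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize into a probability distribution.
    # Lower temperature sharpens the distribution toward the top token.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Greedy decoding: always pick the highest-logit token -> deterministic.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=0.7, rng=None):
    # Temperature sampling: draw from the softmax distribution -> stochastic.
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
print(greedy(logits))  # always 0
print(sample(logits))  # usually 0, sometimes 1 or 2
```

On top of this decoding difference, the GPT-based judge itself is not fully deterministic, which accounts for the extra 1–2 points of run-to-run variation mentioned above.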
