Here is a thought: instead of telling LLMs what to do, show them!
Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular.
DITTO from Stanford University proposes that LLMs can be tuned with fewer than 10 demonstrations!
What's DITTO? Demonstration ITerated Task Optimization (definitely came up with the acronym first!)
Here is the step-by-step implementation (a rough code sketch follows the list):
Initialization: Start with a reference language model (LM), a set of expert demonstrations, a sample size, and a frequency of sampling.
Supervised Fine-Tuning (SFT): Begin by fine-tuning the reference LM on the set of expert demonstrations to create an initial policy P0.
Iterative Comparison Sampling: For each iteration t:
Sample multiple completions from the policy Pt for each demonstration to create a new dataset Dt.
Construct a batch of comparisons where the demonstrations are ranked higher than all sampled model outputs from the current and previous iterations.
Policy Update:
Update the policy Pt using a Direct Preference Optimization (DPO) algorithm, which incorporates feedback from the batch of comparisons.
Increment the iteration and repeat the sampling and updating process until convergence.
Result: The final policy P after sufficient iterations aligns more closely with the expert demonstrations, effectively tuning the LM to reflect user-specific preferences and behaviors.
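Putting the steps together, here is a minimal Python sketch of the DITTO loop as described above. The helpers `sft_finetune`, `sample_completions`, and `dpo_update` are hypothetical placeholders for your own SFT, sampling, and DPO training code (e.g., via a library like TRL), and the `{"prompt", "chosen", "rejected"}` dict is just an assumed preference-pair format; only the comparison-batch construction, which is the DITTO-specific part, is spelled out.

```python
def ditto(reference_lm, demonstrations, num_iterations=5, samples_per_demo=4):
    """Sketch of the DITTO loop. `demonstrations` is a small list of
    {"prompt": ..., "response": ...} expert examples (fewer than 10)."""
    # SFT: fine-tune the reference LM on the expert demonstrations -> policy P0
    # (sft_finetune is a hypothetical helper standing in for your SFT code)
    policy = sft_finetune(reference_lm, demonstrations)

    history = []  # samples from the current and all previous policies
    for t in range(num_iterations):
        # Sample several completions from the current policy Pt per demonstration prompt
        sampled = [
            (demo["prompt"], completion)
            for demo in demonstrations
            for completion in sample_completions(policy, demo["prompt"], n=samples_per_demo)
        ]
        history.append(sampled)

        # Comparison batch: every expert demonstration is ranked above every
        # model sample (from this iteration and all earlier ones) for its prompt.
        comparisons = [
            {"prompt": demo["prompt"], "chosen": demo["response"], "rejected": completion}
            for demo in demonstrations
            for batch in history
            for prompt, completion in batch
            if prompt == demo["prompt"]
        ]

        # DPO update on the comparison batch -> policy P(t+1)
        # (dpo_update is a hypothetical helper wrapping a DPO trainer)
        policy = dpo_update(policy, comparisons)

    return policy
```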
DITTO outperforms few-shot prompting.
Paper: Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (2406.00888)