m-ric posted an update 12 days ago
๐๐ž๐ฐ ๐๐ž๐œ๐จ๐๐ข๐ง๐  ๐ญ๐ž๐œ๐ก๐ง๐ข๐ช๐ฎ๐ž ๐ข๐ง ๐ญ๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐ฌ๐ข๐ ๐ง๐ข๐Ÿ๐ข๐œ๐š๐ง๐ญ๐ฅ๐ฒ ๐ซ๐ž๐๐ฎ๐œ๐ž๐ฌ ๐ก๐š๐ฅ๐ฅ๐ฎ๐œ๐ข๐ง๐š๐ญ๐ข๐จ๐ง๐ฌ ๐Ÿ‘

DoLa decoding, which was presented as a conference paper at ICLR 2024, has just been merged into Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: decoder LLMs (GPT-style models, the most common kind) generate their output one token at a time: at each step, given the current text, they compute a logit for every token in their vocabulary, representing the probability of that token coming next.

Then they either pick the token with the highest logit (greedy decoding) or sample one with probabilities derived from the logits (sampling).
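To make this concrete, here is a minimal sketch of a single decoding step under both strategies (an illustration against the generic `model(input_ids).logits` interface, not the actual transformers generation loop; `decode_next_token` is a made-up helper name):

```python
import torch

def decode_next_token(model, input_ids, greedy=True, temperature=1.0):
    """One decoding step: turn next-token logits into a chosen token id."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1, :]  # logits for the next position only
    if greedy:
        return logits.argmax(dim=-1)  # greedy decoding: highest-logit token
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)  # sampling
```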

The authors of DoLa wanted to improve that simple method.

They built on the established finding that transformer LMs encode low-level information (like basic syntax) in their early layers, and higher-level information, like factual knowledge, in their later layers.

💡 This gave them their key idea: during decoding, rather than picking the token with the highest logit, why not pick the token whose logit increases the most across layers?
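As a toy illustration of that contrast (NOT the full DoLa algorithm: the paper also picks the "premature" layer dynamically via Jensen-Shannon divergence and masks out low-probability tokens before contrasting; `premature_layer=16` is an arbitrary choice here):

```python
import torch
import torch.nn.functional as F

def contrastive_next_token(model, input_ids, premature_layer=16):
    """Pick the token whose log-probability grows most from an early layer to the last."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    head = model.get_output_embeddings()  # the LM head maps hidden states to logits
    # Early-exit logits from an intermediate layer (the real implementation also
    # applies the model's final layer norm before the head).
    early = head(out.hidden_states[premature_layer][:, -1, :])
    final = out.logits[:, -1, :]  # logits from the last layer, as usual
    # Contrast: how much did each token's log-probability increase across layers?
    diff = F.log_softmax(final, dim=-1) - F.log_softmax(early, dim=-1)
    return diff.argmax(dim=-1)
```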

This gives impressive results:
๐Ÿš€ ๐Ÿฑ% - ๐Ÿฎ๐Ÿฌ% ๐—ฏ๐—ฎ๐˜€๐—ฒ ๐—ฝ๐—ผ๐—ถ๐—ป๐˜๐˜€ ๐—ถ๐—ป๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐˜๐—ต๐—ฒ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€
๐Ÿš€ For instance on TruthfulQA / Open-ended, across all model sizes the increase in truthfulness is 14 base points, which is ๐—ฎ๐—ฟ๐—ผ๐˜‚๐—ป๐—ฑ ๐Ÿฐ๐Ÿฌ% ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—ฐ๐—ผ๐—บ๐—ฝ๐—ฎ๐—ฟ๐—ฒ๐—ฑ ๐˜๐—ผ ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—ฟ๐—ฑ ๐—ฑ๐—ฒ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด!

🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 The runtime increase is negligible, only 1 to 8%.

Paper added to my collection 👉 m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0
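And you can already try it yourself: recent transformers versions expose DoLa directly through `generate` via the `dola_layers` argument (the model below is just an example; any causal LM should work):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    dola_layers="high",      # contrast with higher layers; "low" or explicit
                             # layer indices like [16, 18] are also accepted
    repetition_penalty=1.2,  # recommended alongside DoLa to limit repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```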

So how do we actually use this feature? It's always talk, talk, talk about papers, and no implementations!