joaogante
posted an update Jan 17
Up to 3x faster LLM generation with no extra resources/requirements - ngram speculation has landed in 🤗 transformers! 🏎️💨

All you need to do is add `prompt_lookup_num_tokens=10` to your `generate` call, and you'll get faster LLMs 🔥
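As a minimal sketch of what that looks like in practice (the checkpoint name and prompt are just illustrative; any causal LM on a recent transformers version works):

```python
# Minimal sketch of prompt lookup decoding via transformers' generate().
# Assumes transformers >= 4.37; "gpt2" is only an example checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # swap in your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input-grounded tasks (summarization, QA over a document, code editing)
# benefit most, since the output tends to repeat ngrams from the input.
prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
inputs = tokenizer(prompt, return_tensors="pt")

# The only change vs. regular generation: prompt_lookup_num_tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    prompt_lookup_num_tokens=10,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The outputs are identical to what plain decoding would produce; only the speed changes.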


How does it work? 🤔

Start with assisted generation, where a smaller model generates candidate sequences and the main model verifies them in a single forward pass. The net result is a significant speedup whenever the main model agrees with the candidates! However, it requires a separate smaller model trained similarly (and sharing the same tokenizer) 😕
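The verification half can be sketched in plain Python (an illustrative helper under greedy decoding, not the transformers internals): given the drafted tokens and the main model's greedy prediction at each drafted position, accept the longest matching prefix plus the model's own token at the first mismatch.

```python
def verify_candidates(candidates, greedy_preds):
    """Greedy speculative verification (illustrative sketch).
    `greedy_preds` has len(candidates) + 1 entries: the main model's argmax
    token after consuming the context plus each draft prefix, all obtained
    in ONE forward pass. The accepted tokens always match what plain greedy
    decoding would have emitted, so there is no quality change."""
    n = 0
    while n < len(candidates) and candidates[n] == greedy_preds[n]:
        n += 1
    # The prediction at the first mismatch (or past the last match) is the
    # model's own next token, so at least one token is gained per pass.
    return candidates[:n] + [greedy_preds[n]]

# Draft [5, 7, 9]; the model agrees on 5 and 7 but predicts 2 instead of 9.
print(verify_candidates([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
```

When all candidates are accepted, several tokens land for the cost of one forward pass; when none are, you still get one token, same as regular decoding.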

The idea introduced (and implemented) by Apoorv Saxena consists of gathering the candidate sequences from the input text itself: if the latest generated ngram also occurs earlier in the sequence, use the continuation found there as a candidate! No smaller model is required, and you still get significant speedups 🔥
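The candidate-gathering step can be sketched like this (a deliberately simplified version of the idea; names and defaults here are illustrative, not the transformers implementation):

```python
def prompt_lookup_candidates(tokens, ngram_size=2, num_candidates=10):
    """Look for an earlier occurrence of the latest `ngram_size` tokens in
    the sequence so far, and propose the tokens that followed it as draft
    candidates. Returns [] when no match exists, in which case generation
    simply falls back to regular decoding for this step."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    for i in range(len(tokens) - ngram_size):  # earliest match wins here
        if tokens[i:i + ngram_size] == tail:
            start = i + ngram_size
            return tokens[start:start + num_candidates]
    return []

# "the cat" occurred earlier, so its continuation becomes the draft:
seq = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "cat"]
print(prompt_lookup_candidates(seq, ngram_size=2, num_candidates=3))
# ['sat', 'on', 'the']
```

Since the drafted tokens are still verified by the main model, a bad match only costs one wasted forward pass; a good match yields several tokens at once.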

In fact, the overhead of gathering and testing the candidates is so small that you should use this technique whenever possible!

Here is the code example that produces the outputs shown in the video: https://pastebin.com/bms6XtR4

Have fun 🤗

Hi @joaogante
Is there any accuracy penalty when using ngram speculation? And is a higher/lower value of n recommended? I am going to test this in TGI today.


@MaziyarPanahi no accuracy penalty at all :) The only catch on the transformers side is that you are limited to a batch size of one (and even that is not a technical limitation of the technique -- we simply haven't built that code path yet)