Condense/contextualize description
README.md
CHANGED
@@ -5,12 +5,11 @@ license: llama2
 ## Description
 
 This model is intended to be used as an accelerator for [llama 13B (chat)](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and takes inspiration
-from the Medusa architecture
-a single token in the draft
-from the prior stage (the base model can be considered stage 0).
-
-
-We sample multiple tokens at each stage, and emit a tree of candidate suffixes to evaluate in parallel.
+from the Medusa speculative decoding architecture. This accelerator modifies the MLP into a multi-stage MLP, where each stage predicts
+a single token in the draft based on both a state vector and sampled token
+from the prior stage (the base model can be considered stage 0).
+The state vector from the base model provides contextual information to the accelerator,
+while conditioning on prior sampled tokens allows it to produce higher-quality draft n-grams.
 
 ## Code
 
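The multi-stage MLP design the new description refers to can be sketched roughly as follows. This is a minimal illustration only, not the released model's actual code: all class, module, and parameter names here are hypothetical, and the real accelerator samples several tokens per stage and evaluates a tree of candidate suffixes rather than the single greedy path shown.

```python
import torch
import torch.nn as nn

class SpeculatorStage(nn.Module):
    """One draft stage: fuses the prior stage's state vector with the
    embedding of the prior sampled token to predict the next draft token."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(2 * d_model, d_model)  # fuse state + token embedding
        self.act = nn.GELU()
        self.head = nn.Linear(d_model, vocab_size)   # per-stage draft-token logits

    def forward(self, state, prev_token):
        # Condition on both the contextual state vector and the token
        # sampled at the previous stage (stage 0 = the base model).
        fused = torch.cat([state, self.token_emb(prev_token)], dim=-1)
        new_state = self.act(self.proj(fused))
        return new_state, self.head(new_state)

class MLPSpeculator(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(
            SpeculatorStage(d_model, vocab_size) for _ in range(n_stages)
        )

    def forward(self, base_state, last_token):
        # Greedy single-path draft for illustration; a real speculator would
        # keep top-k tokens per stage and emit a tree of candidate suffixes.
        draft = []
        state, token = base_state, last_token
        for stage in self.stages:
            state, logits = stage(state, token)
            token = logits.argmax(dim=-1)
            draft.append(token)
        return torch.stack(draft, dim=-1)  # (batch, n_stages) draft n-gram
```

The base model's hidden state for the last accepted token would be passed in as `base_state`, so each stage's prediction stays grounded in the full context while still conditioning on the draft tokens produced so far.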