Nafnlaus/Wide-Sheared-LLaMA-290M-GGUF


Nafnlaus committed on Jul 4, 2024

Commit 95c58aa (1 parent: 5ca4056)

Update README.md

Files changed (1)

README.md +12 -3


@@ -1,3 +1,12 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+This is a GGUF conversion of https://huggingface.co/minghaoyan/Wide-Sheared-LLaMA-290M, based on the paper "Decoding Speculative Decoding" by Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman.
+https://arxiv.org/pdf/2402.01528
+For those not familiar with speculative decoding, it is a technique for accelerating inference of a larger model by pairing it with a smaller draft model. The draft model rapidly generates a run of likely tokens, which the large model then verifies simultaneously. At the first position where the drafted token differs from what the large model would have generated, the large model's token is used instead, and the draft model then drafts new tokens from that point, with the process repeating. As a result, the same sequence is generated as with the large model alone, but at a significantly accelerated rate.
+The wide sheared LLaMA models by minghaoyan are optimized for use as speculative decoding draft models. To use this one with llama.cpp, pass it as the draft model with the "-md <GGUF_filename>" option alongside your main model, and consider tuning the --draft parameter, which sets how many tokens are drafted per step.
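
The draft-then-verify loop described in the README can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how llama.cpp implements it: decoding is greedy, the two "models" (`toy_target`, `toy_draft`) are hypothetical stand-in functions over integer tokens, and verification is done one position at a time, whereas a real implementation checks all drafted positions in a single batched forward pass. The `n_draft` argument plays roughly the role of the `--draft` setting.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token
# (greedy decoding). This is an illustrative assumption, not a real API.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: the large model generates every token by itself."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, n_draft: int = 4) -> List[int]:
    """Greedy speculative decoding: draft n_draft tokens, keep the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The small model cheaply proposes a run of tokens.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))

        # 2. The large model checks each proposed position in order.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the large model's token
            if guess != expected:
                break  # first mismatch: discard the rest of the draft

        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical stand-in for the draft model: agrees with the target most of the time."""
    guess = toy_target(prefix)
    return (guess + 1) % 50 if len(prefix) % 3 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=20)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

Because every kept token is the one the large model would have produced for that prefix, the output matches plain greedy decoding regardless of how often the draft model is wrong; draft quality only affects how many tokens are accepted per verification step, and therefore the speedup.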