---
license: apache-2.0
datasets:
- togethercomputer/RedPajama-Data-1T-Sample
language:
- en
tags:
- llama
- llama 2
- smol_llama
---

# smol_llama-220M-GQA-32k-theta

An experimental model intended to serve as a long-context draft model for speculative decoding.

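As a rough illustration of where a small draft model fits in speculative decoding, the sketch below implements the greedy variant with toy stand-in functions rather than real models. The names `speculative_greedy`, `greedy`, `target`, and `draft` are hypothetical and not part of any library. A real system would verify all drafted tokens in one batched forward pass of the large model, and the optional bonus token after a fully accepted draft is left out here for brevity.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new
    while len(tokens) < limit:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # First mismatch: keep the agreed prefix plus the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token matched the target's greedy choice.
            tokens.extend(proposal)
    return tokens


# Toy stand-ins: the target counts upward modulo 10; the draft errs right after a 5.
def target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


out = speculative_greedy(target, draft, [0], max_new=12)
print(out == greedy(target, [0], max_new=12))
```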
Created from [BEE-spoke-data/smol_llama-220M-GQA](https://huggingface.co/BEE-spoke-data/smol_llama-220M-GQA) by further pretraining at a context length of 32768 tokens on [togethercomputer/RedPajama-Data-1T-Sample](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample).

This variant uses the RoPE theta (rope frequency base) method for context extension, with theta set to 1000000.0.

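To make the difference between raising the frequency base and linear scaling concrete, here is a minimal sketch of the standard RoPE frequency formula, `theta ** (-2i / head_dim)`. The head size of 64 and the base value of 10000.0 are assumptions chosen for illustration (10000.0 is the usual Llama default, but this card does not state the base model's setting), and the helper names are hypothetical.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Rotation frequency of each dimension pair: theta ** (-2i / head_dim)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: int, head_dim: int, theta: float,
                linear_scale: float = 1.0) -> list[float]:
    """Rotation angle of each pair at a position; linear scaling divides the position."""
    return [(position / linear_scale) * f for f in rope_inv_freq(head_dim, theta)]


HEAD_DIM = 64           # hypothetical head size, for illustration only
BASE_THETA = 10000.0    # assumed base value (common Llama default)
NEW_THETA = 1000000.0   # value named for the theta variant

base = rope_inv_freq(HEAD_DIM, BASE_THETA)
extended = rope_inv_freq(HEAD_DIM, NEW_THETA)

# Raising theta leaves the fastest pair untouched and slows the slowest pair the most.
slowdown = [b / e for b, e in zip(base, extended)]
print(f"fastest pair slowdown: {slowdown[0]:.2f}x")
print(f"slowest pair slowdown: {slowdown[-1]:.2f}x")

# Linear scaling by 16 instead maps position 32768 onto the angles of position 2048,
# so every pair is slowed by the same factor.
linear = rope_angles(32768, HEAD_DIM, BASE_THETA, linear_scale=16.0)
plain = rope_angles(2048, HEAD_DIM, BASE_THETA)
print(all(math.isclose(a, b) for a, b in zip(linear, plain)))
```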
Wikitext perplexity (64 rows), as evaluated by [exllamav2](https://github.com/turboderp/exllamav2):

```
Base Model
2048: 20.2193
4096: 102.6928
8192: 235.5210
16384: 390.7198
32768: 515.8053

32k - Linear Rope Scale 16.0
2048: 25.7148
4096: 23.4461
8192: 22.3326
16384: 21.6744
32768: 21.4317

32k - Rope Theta 1000000.0
2048: 20.2158
4096: 18.3868
8192: 17.5976
16384: 17.1462
32768: 16.6989
```
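
For a quick side-by-side reading of these numbers, the following sketch computes how much lower the theta variant's perplexity is than the linear-scale variant's at each context length. It uses only the values listed above; the dictionary keys and the helper name `relative_change` are hypothetical labels introduced for this example.

```python
# Perplexity values copied from the results above, keyed by context length.
results = {
    "base": {2048: 20.2193, 4096: 102.6928, 8192: 235.5210,
             16384: 390.7198, 32768: 515.8053},
    "linear_16": {2048: 25.7148, 4096: 23.4461, 8192: 22.3326,
                  16384: 21.6744, 32768: 21.4317},
    "theta_1e6": {2048: 20.2158, 4096: 18.3868, 8192: 17.5976,
                  16384: 17.1462, 32768: 16.6989},
}


def relative_change(new: float, old: float) -> float:
    """Fractional change from old to new; negative means lower perplexity."""
    return (new - old) / old


for length in sorted(results["base"]):
    theta_ppl = results["theta_1e6"][length]
    linear_ppl = results["linear_16"][length]
    change = relative_change(theta_ppl, linear_ppl)
    print(f"{length:>6}: theta {theta_ppl:.4f} vs linear {linear_ppl:.4f} ({change:+.1%})")
```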