---
license: apache-2.0
---

# ggml versions of LLongMA-3b

- Announcement: [Tweet by @EnricoShippole](https://twitter.com/EnricoShippole/status/1677346578720256000)
- Model: [conceptofmind/LLongMA-3b](https://huggingface.co/conceptofmind/LLongMA-3b) (license not specified)
- Base model: [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b), project: [OpenLLaMA: An Open Reproduction of LLaMA](https://github.com/openlm-research/open_llama) (Apache 2.0)
- Tuning dataset: [togethercomputer/RedPajama-Data-1T](https://huggingface.co/togethercomputer/RedPajama-Data-1T) (various licenses)
- [llama.cpp](https://github.com/ggerganov/llama.cpp): the 3B model size requires build 607 (ffb06a3) or later; extended context: not yet supported (see below)
- Context length: 8192 tokens (extended-context model)
- Type: foundational

## Extended context

This model extends the context window by [scaling the position index](https://kaiokendev.github.io/context) in the RoPE algorithm by a factor of 1/4, going from the 2048 tokens of the original LLaMA models to 8192 tokens. For best results the model needs an additional fine-tuning step, which here was done with 1 billion tokens of the RedPajama-1T dataset (the full OpenLLaMA 3B training used 1 trillion tokens).

Support for this in llama.cpp is an ongoing development effort; you can track it in PR [#2054](https://github.com/ggerganov/llama.cpp/pull/2054). Once available, it should be enabled with the flags `-c 8192 --rope-freq-scale 0.25`, assuming those flags stay as currently proposed.

## K-quants

llama.cpp now offers additional quantization types, some below 4 bits. These are currently not well supported for the 3B model size because its tensor sizes do not match the default k-quant block layout. If you want to use them, you have to build llama.cpp yourself (from build 829 (ff5d58f)) with the `LLAMA_QKK_64` Make or CMake variable enabled (see PR [#2001](https://github.com/ggerganov/llama.cpp/pull/2001)), and then quantize the F16 (or possibly the Q8_0) file to the type you want.
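
A minimal sketch of that k-quant workflow, assuming a local llama.cpp checkout at build 829 or newer and an F16 ggml file of this model; the file names are placeholders:

```sh
# Rebuild llama.cpp with 64-wide k-quant super-blocks (needed for the 3B tensor sizes)
make clean
LLAMA_QKK_64=1 make main quantize

# Quantize the F16 ggml file to a k-quant type, e.g. Q4_K_M
./quantize llongma-3b-f16.bin llongma-3b-q4_k_m.bin Q4_K_M
```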
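
Once the RoPE scaling support tracked in PR #2054 is available in your build, running with the extended context should look roughly like this, using the flags given above; the model file and prompt are again placeholders:

```sh
# 8192-token context with the RoPE position index scaled by 1/4
./main -m llongma-3b-q4_k_m.bin -c 8192 --rope-freq-scale 0.25 -p "Once upon a time"
```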