coverblew committed
Commit eaf5549 · verified · 1 Parent(s): ba1bd66

Add model card for llamita.cpp

---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---

# llamita.cpp

> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that enables [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) to compile and run with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).

## Results

| Model | Size on disk | RAM used | Prompt | Generation | Board |
|-------|--------------|----------|--------|------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 tok/s | 1.1 tok/s | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 tok/s | 1.6 tok/s | Jetson Nano 4GB |

That is an 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.
33
+
34
+ ## What Was Changed
35
+
36
+ 27 files modified, ~3,200 lines of patches across 7 categories:
37
+
38
+ 1. **C++17 to C++14** β€” `if constexpr`, `std::is_same_v`, structured bindings, fold expressions
39
+ 2. **CUDA 10.2 API stubs** β€” `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
40
+ 3. **SM 5.3 Maxwell** β€” Warp size macros, MMQ params, flash attention disabled with stubs
41
+ 4. **ARM NEON GCC 8** β€” Custom struct types for broken `vld1q_*_x*` intrinsics
42
+ 5. **Linker** β€” `-lstdc++fs` for `std::filesystem`
43
+ 6. **Critical correctness fix** β€” `binbcast.cu` fold expression silently computing nothing
44
+ 7. **Build system** β€” `CUDA_STANDARD 14`, flash attention template exclusion
45
+

## The Bug That Broke Everything

During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke ALL binary operations (add, multiply, subtract, divide). The model loaded, allocated memory, ran inference, and produced complete garbage. The fix was one line.

## Links

- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)

## Credits

- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) — original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) — Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) — 1-bit LLMs (Apache 2.0)