---
license: mit
tags:
- rockchip
- rk3588
- rkllm
- text-generation-inference
pipeline_tag: text-generation
---
# ezrkllm-collection
Collection of LLMs compatible with Rockchip's chips using their rkllm-toolkit.
This repo contains the converted models for running on the RK3588 NPU found in SBCs like Orange Pi 5, NanoPi R6 and Radxa Rock 5.
See the main repo on GitHub for installation and usage instructions: https://github.com/Pelochus/ezrknpu
## Available LLMs
Before running any LLM, keep in mind that the required RAM is roughly 1.5 to 3 times the model size (this is an estimate; I haven't done extensive testing yet).
So far, only the following models have been converted:
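The rule of thumb above can be turned into a quick calculation. This is a minimal sketch: `estimate_ram_mb` is a hypothetical helper (not part of ezrknpu), it takes the model file size in MB, and the 4000 MB figure in the example is made up for illustration.

```shell
# Print the estimated RAM range (in MB) for a model file of the given size,
# using the rough 1.5x to 3x rule of thumb from this README.
# estimate_ram_mb is a hypothetical helper name.
estimate_ram_mb() {
    size_mb="$1"
    low=$(( size_mb * 3 / 2 ))   # 1.5 times the model size
    high=$(( size_mb * 3 ))      # 3 times the model size
    echo "$low $high"
}

# Example with a made-up 4000 MB model file:
estimate_ram_mb 4000   # prints: 6000 12000
```

If the upper bound is above the RAM on your board, expect the model to be slow or fail to load.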
| LLM | Parameters | Link |
| --------------------- | ----------- | ---------------------------------------------------------- |
| Qwen Chat | 1.8B | https://huggingface.co/Pelochus/qwen-1_8B-rk3588 |
| Microsoft Phi-2 | 2.7B | https://huggingface.co/Pelochus/phi-2-rk3588 |
| Llama 2 7B | 7B | https://huggingface.co/Pelochus/llama2-chat-7b-hf-rk3588 |
| Llama 2 13B | 13B | https://huggingface.co/Pelochus/llama2-chat-13b-hf-rk3588 |
| Qwen 1.5 Chat | 4B | https://huggingface.co/Pelochus/qwen1.5-chat-4B-rk3588 |
| TinyLlama v1 (broken) | 1.1B | https://huggingface.co/Pelochus/tinyllama-v1-rk3588 |
However, RKLLM also supports Qwen 2 (supposedly). Llama 2 was converted using Azure servers.
For reference, converting Phi-2 peaked at about 15 GB of RAM + 25 GB of swap (including the OS, which used about 2 GB at most).
Converting Llama 2 7B peaked at about 32 GB of RAM + 50 GB of swap.
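Taken together, those peaks come to about 40 GB (Phi-2) and about 82 GB (Llama 2 7B) of RAM plus swap. Before attempting a conversion, you may want to check what your own machine has. The following is a sketch for Linux only: `total_mem_gb` is a hypothetical helper, and it reads `/proc/meminfo` by default (the optional argument lets you point it at a saved copy).

```shell
# Sum MemTotal and SwapTotal (reported in kB) and print the total in whole GB.
# total_mem_gb is a hypothetical helper name; Linux only.
total_mem_gb() {
    awk '/^(MemTotal|SwapTotal):/ { kb += $2 }
         END { print int(kb / 1024 / 1024) }' "${1:-/proc/meminfo}"
}
```

Running `total_mem_gb` with no arguments prints the combined RAM and swap of the current machine, which you can compare against the figures above.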
## Downloading a model
Use:
`git clone LINK_FROM_PREVIOUS_TABLE_HERE`
And then (may not be necessary):
`git lfs pull`
If the first clone gives you problems (for example, it takes too long), you can instead run:
`GIT_LFS_SKIP_SMUDGE=1 git clone LINK_FROM_PREVIOUS_TABLE_HERE`
And then run `git lfs pull` inside the cloned folder to download the full model.
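The two-step variant can be wrapped in a small shell function. This is a minimal sketch, assuming `git` and `git-lfs` are installed: `fetch_model` is a hypothetical helper name, not something provided by ezrknpu.

```shell
# Clone a model repo without downloading the LFS files, then fetch them
# in a second step. fetch_model is a hypothetical helper name.
fetch_model() {
    repo_url="$1"
    # Target folder: second argument, or the last path component of the URL.
    dest="${2:-$(basename "$repo_url")}"
    # Skip the LFS smudge filter so the clone only fetches small pointer files.
    GIT_LFS_SKIP_SMUDGE=1 git clone "$repo_url" "$dest" || return 1
    # Download the actual model files inside the cloned folder.
    (cd "$dest" && git lfs pull)
}
```

For example, `fetch_model https://huggingface.co/Pelochus/phi-2-rk3588` would clone into `phi-2-rk3588` and then pull the full model there.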
## RKLLM parameters used
RK3588 **only supports w8a8 quantization**, so that is the quantization used for ALL models.
Aside from that, the RKLLM toolkit offers two optimization levels: 0 (no optimization) and 1 (optimized).
All models here were converted with optimization enabled (1).
## Future additions
- [x] Converting Llama 2 (70B currently in conversion, but that won't run even with 32GB RAM)
- [x] Converting Qwen 1.5 (from 0.5B to 7B; 4B is already converted)
- [ ] Adding other compatible Rockchip SoCs
## More info
- My fork of rknn-llm: https://github.com/Pelochus/ezrknn-llm
- Rockchip's original LLM repo: https://github.com/airockchip/rknn-llm