💻 macOS Compatible 💻

Llama 2 ko 7B - GGUF

Model creator: Meta
Original model: Llama 2 7B Chat
Reference: Llama 2 7B GGUF

Download

pip3 install "huggingface-hub>=0.17.1"

Then you can download any individual model file to the current directory, at high speed, with a command like this:

huggingface-cli download 24bean/Llama-2-ko-7B-Chat-GGUF llama-2-ko-7b-chat-q8-0.gguf --local-dir . --local-dir-use-symlinks False

Or you can download the non-quantized model, llama-2-ko-7b-chat.gguf, with:

huggingface-cli download 24bean/Llama-2-ko-7B-Chat-GGUF llama-2-ko-7b-chat.gguf --local-dir . --local-dir-use-symlinks False
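The same download can also be done from Python. The following is a minimal sketch, assuming the `huggingface_hub` package installed above; the helper name `fetch_gguf` and the skip-if-present check are illustrative additions, not part of this repository.

```python
from pathlib import Path

REPO_ID = "24bean/Llama-2-ko-7B-Chat-GGUF"
DEFAULT_FILE = "llama-2-ko-7b-chat-q8-0.gguf"


def fetch_gguf(filename: str = DEFAULT_FILE, local_dir: str = ".") -> Path:
    """Return a local path to `filename`, downloading it only if it is missing.

    `fetch_gguf` is a hypothetical helper written for this example.
    """
    target = Path(local_dir) / filename
    if target.is_file():
        # Already present locally, so skip the network round trip.
        return target
    # Imported lazily so the local-file check works without the package.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=REPO_ID, filename=filename, local_dir=local_dir
    )
    return Path(downloaded)
```

Calling `fetch_gguf()` fetches the Q8_0 file into the current directory. Note that the presence check only looks at the file name, so an interrupted earlier download would be treated as complete.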

Example `llama.cpp` command

Make sure you are using llama.cpp from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later.

./main -ngl 32 -m llama-2-ko-7b-chat-q8-0.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"
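In the command above, `-ngl` sets how many layers are offloaded to the GPU, `-c` the context size in tokens, `--temp` and `--repeat_penalty` the sampling settings, and `-n -1` leaves the number of generated tokens unlimited. As a rough sketch, the same invocation can be assembled and launched from Python; the helper names `build_main_args` and `run_llama_cpp` are hypothetical, and the flag spellings simply mirror the command shown here, so they may need adjusting for other llama.cpp builds.

```python
import subprocess
from typing import List


def build_main_args(
    model_path: str,
    prompt: str,
    gpu_layers: int = 32,
    ctx_size: int = 4096,
    temp: float = 0.7,
    repeat_penalty: float = 1.1,
    n_predict: int = -1,
    binary: str = "./main",
) -> List[str]:
    """Build the argument list for the llama.cpp command shown above."""
    return [
        binary,
        "-ngl", str(gpu_layers),        # layers offloaded to the GPU (0 = CPU only)
        "-m", model_path,               # path to the GGUF file
        "--color",                      # colorize the output
        "-c", str(ctx_size),            # context size in tokens
        "--temp", str(temp),            # sampling temperature
        "--repeat_penalty", str(repeat_penalty),
        "-n", str(n_predict),           # -1 = no limit on generated tokens
        "-p", prompt,                   # the prompt text
    ]


def run_llama_cpp(model_path: str, prompt: str, **kwargs) -> int:
    """Run llama.cpp with the given prompt and return its exit code."""
    return subprocess.run(build_main_args(model_path, prompt, **kwargs)).returncode
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes. On a machine without GPU acceleration, `gpu_layers=0` keeps everything on the CPU.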

How to run from Python code

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
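For llama-cpp-python, a minimal sketch might look like the following. It assumes the package is installed (`pip install llama-cpp-python`) and that the Q8_0 file has already been downloaded to the current directory; the helper names `load_llm` and `complete` are hypothetical, and the sampling values are copied from the `llama.cpp` command above rather than tuned for this model.

```python
from typing import Any, Callable


def load_llm(
    model_path: str = "llama-2-ko-7b-chat-q8-0.gguf",
    n_gpu_layers: int = 32,
    n_ctx: int = 4096,
):
    """Load the GGUF file with llama-cpp-python (imported lazily)."""
    from llama_cpp import Llama

    # Set n_gpu_layers to 0 if no GPU acceleration is available.
    return Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, n_ctx=n_ctx)


def complete(llm: Callable[..., Any], prompt: str, max_tokens: int = 128) -> str:
    """Run one completion and return only the generated text."""
    result = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=0.7,
        repeat_penalty=1.1,
    )
    # llama-cpp-python returns an OpenAI-style completion dictionary.
    return result["choices"][0]["text"]
```

With the model file in place, `print(complete(load_llm(), "인공지능은"))` prints the continuation of the prompt.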

How to load this model from Python using ctransformers

First install the package

# Base ctransformers with no GPU acceleration
pip install "ctransformers>=0.2.24"
# Or with CUDA GPU acceleration
pip install "ctransformers[cuda]>=0.2.24"
# Or with ROCm GPU acceleration
CT_HIPBLAS=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers
# Or with Metal GPU acceleration for macOS systems
CT_METAL=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers

Simple example code to load one of these GGUF models

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("24bean/Llama-2-ko-7B-Chat-GGUF", model_file="llama-2-ko-7b-chat-q8-0.gguf", model_type="llama", gpu_layers=50)

print(llm("인공지능은"))

How to use with LangChain

Here are guides on using llama-cpp-python or ctransformers with LangChain: