---
license: llama2
---

Sample repository

Development Status :: 2 - Pre-Alpha
Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: parkminwoo1991@gmail.com.


What is GGML?

GGML is a tensor library for machine learning that enables large models and high performance on commodity hardware. ggml is still under active development toward a more efficient format and a new k-quant method, so it is not yet stable. Read more in the GGUF documentation.

Model Weights Offered

| Model | Size (GB) | Description | Performance |
| --- | --- | --- | --- |
| jindo-7b-instruct | 12.6 | original model weights | |
| jindo-7b-instruct.ggmlv3.f16.bin | 12.5 | model weights converted to GGML f16 format | |
| jindo-7b-instruct.ggmlv3.q4_0.bin | 3.73 | original 4-bit quantization in blocks of 32 weights (no super-blocks) | Legacy |
| jindo-7b-instruct.ggmlv3.q4_k_m.bin | 3.98 | 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights; scales and mins are quantized with 6 bits | Medium, balanced quality |
| jindo-7b-instruct.ggmlv3.q5_k_m.bin | 4.67 | 5-bit quantization with the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw | Large, very low quality loss |
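
To fetch a single weight file directly from the Hugging Face Hub, a minimal sketch using huggingface_hub is shown below (the q4_k_m file is chosen here only as an example):

# Sketch: download one GGML weight file from the Hub (requires `pip install huggingface_hub`).
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="danielpark/ko-llama-2-jindo-7b-instruct-ggml",
    filename="jindo-7b-instruct.ggmlv3.q4_k_m.bin",
)
print(model_path)  # local path to the downloaded file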

Prompt template: None

{prompt}



Inference

To run inference with the danielpark/ko-llama-2-jindo-7b-instruct-ggml weights, fine-tuned from Llama 2, on CPU or GPU, you need the appropriate installation and configuration for your system. Please refer to the llama.cpp repository and the LangChain documentation, and follow the guides for the various dependency software as needed.

Using the LlamaCpp module in LangChain

$ pip install langchain ctransformers llama-cpp-python

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Stream generated tokens to stdout; this callback manager is reused by all examples below.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

CPU

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,
)
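
The PromptTemplate and LLMChain imports above can then wrap this llm. A minimal usage sketch follows; the instruction-style wording in the template is only an illustrative assumption, since the prompt template for this model is None.

# Minimal usage sketch; the template wording is an assumption for illustration only.
template = "### Instruction: {question}\n### Response:"
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.run("Write a story about llamas"))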

GPU

If the installation with the BLAS backend was correct, you will see a BLAS = 1 indicator in the model properties.

Two of the most important parameters for use with GPU are:

  • n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
  • n_batch - how many tokens are processed in parallel.
n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)

Metal

n_gpu_layers = 1  # For Metal, 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM of your Apple Silicon chip.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

Using the CTransformers module in LangChain

from langchain.llms import CTransformers

llm = CTransformers(model="./models/jindo-7b-instruct-ggml-model-f16.bin", model_type='llama')
print(llm('LLM Jindo is going to'))
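
Generation parameters can also be passed through the wrapper's config dictionary; the values below are illustrative assumptions, not tuned settings.

# Sketch: sampling parameters passed via the CTransformers config dict (values are assumptions).
llm = CTransformers(
    model="./models/jindo-7b-instruct-ggml-model-f16.bin",
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.7, "repetition_penalty": 1.1},
)
print(llm("LLM Jindo is going to"))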

Web Demo

I implemented web demos using several popular tools that make it easy to build a web UI quickly.

| Model | Web UI | Quantized |
| --- | --- | --- |
| danielpark/ko-llama-2-jindo-7b-instruct | gradio on Colab | - |
| danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | text-generation-webui on Colab | gptq |
| danielpark/ko-llama-2-jindo-7b-instruct-ggml | koboldcpp-v1.38 | ggml |

Tools

| Name | Description |
| --- | --- |
| KoboldCpp | A powerful GGML web UI with full GPU acceleration out of the box. Especially good for storytelling. |
| LoLLMS Web UI | A great web UI with GPU acceleration via the c_transformers backend. |
| LM Studio | A fully featured local GUI. Supports full GPU acceleration on macOS. Also supports Windows, without GPU acceleration. |
| text-generation-webui | The most popular web UI. Requires extra steps to enable GPU acceleration via the llama.cpp backend. |
| ctransformers | A Python library with LangChain support and an OpenAI-compatible API server. |
| llama-cpp-python | A Python library with an OpenAI-compatible API server. |

CLI Inference Using Quantized Weights

To use the program with the desired settings, execute the following command:

./main -t <number_of_cpu_cores> -ngl <number_of_layers_to_offload> -m ko-llama-2-jindo-7b-instruct-ggml.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"

Please make the following changes (a filled-in example is shown after the list):

  • Replace <number_of_cpu_cores> with the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use -t 8.
  • Replace <number_of_layers_to_offload> with the number of layers to offload to the GPU. If you don't have GPU acceleration, you can remove the -ngl argument.
  • If you want to have a chat-style conversation, replace the -p "<PROMPT>" argument with -i -ins.
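
For instance, on a machine with 8 physical cores and a GPU with enough VRAM to offload 32 layers (both numbers are assumptions for illustration), the command becomes:

./main -t 8 -ngl 32 -m ko-llama-2-jindo-7b-instruct-ggml.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"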

For more details, see llama.cpp, llama-cpp-python, and llama2.c.


Quant Types

| Quantization Type | Description | Bits per Weight (bpw) |
| --- | --- | --- |
| GGML_TYPE_Q2_K | "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. | 2.5625 |
| GGML_TYPE_Q3_K | "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. | 3.4375 |
| GGML_TYPE_Q4_K | "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. | 4.5 |
| GGML_TYPE_Q5_K | "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. | 5.5 |
| GGML_TYPE_Q6_K | "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. | 6.5625 |
| GGML_TYPE_Q8_K | "type-0" 8-bit quantization. Only used for quantizing intermediate results. Block size is 256. All 2-6 bit dot products are implemented for this quantization type. | Not specified |
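
As a sanity check on the bpw column, the 4.5 bpw figure for GGML_TYPE_Q4_K can be reproduced from its layout: 8 blocks of 32 weights per super-block, 4-bit quants, and 6-bit scales and mins per block, plus one fp16 scale and one fp16 min per super-block (the fp16 super-block fields are an assumption based on the ggml reference layout).

# Sketch: reproduce the 4.5 bpw figure for GGML_TYPE_Q4_K from its super-block layout.
weights = 8 * 32                 # 8 blocks x 32 weights per super-block
quant_bits = weights * 4         # 4-bit quantized weights
scale_min_bits = 8 * (6 + 6)     # 6-bit scale and 6-bit min for each of the 8 blocks
superblock_bits = 2 * 16         # fp16 super-block scale and fp16 super-block min (assumed)
print((quant_bits + scale_min_bits + superblock_bits) / weights)  # 4.5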
The following recommendations apply when choosing among the quantized files:

| Model | Description | Recommendation |
| --- | --- | --- |
| Q4_0 | Small, very high quality loss | Legacy, prefer Q3_K_M |
| Q4_1 | Small, substantial quality loss | Legacy, prefer Q3_K_L |
| Q5_0 | Medium, balanced quality | Legacy, prefer Q4_K_M |
| Q5_1 | Medium, low quality loss | Legacy, prefer Q5_K_M |
| Q2_K | Smallest, extreme quality loss | Not recommended |
| Q3_K | Alias for Q3_K_M | |
| Q3_K_S | Very small, very high quality loss | |
| Q3_K_M | Very small, very high quality loss | |
| Q3_K_L | Small, substantial quality loss | |
| Q4_K | Alias for Q4_K_M | |
| Q4_K_S | Small, significant quality loss | |
| Q4_K_M | Medium, balanced quality | Recommended |
| Q5_K | Alias for Q5_K_M | |
| Q5_K_S | Large, low quality loss | Recommended |
| Q5_K_M | Large, very low quality loss | Recommended |
| Q6_K | Very large, extremely low quality loss | |
| Q8_0 | Very large, extremely low quality loss | Not recommended |
| F16 | Extremely large, virtually no quality loss | Not recommended |
| F32 | Absolutely huge, lossless | Not recommended |

Performance

LLaMA 2 / 7B

| Name | +ppl | +ppl (% of the 13B-to-7B ppl gap) | Size | Size (% of 16-bit) | +ppl per -1 GB |
| --- | --- | --- | --- | --- | --- |
| q2_k | 0.8698 | 133.344% | 2.67 GB | 20.54% | 0.084201 |
| q3_ks | 0.5505 | 84.394% | 2.75 GB | 21.15% | 0.053707 |
| q3_km | 0.2437 | 37.360% | 3.06 GB | 23.54% | 0.024517 |
| q3_kl | 0.1803 | 27.641% | 3.35 GB | 25.77% | 0.018684 |
| q4_0 | 0.2499 | 38.311% | 3.50 GB | 26.92% | 0.026305 |
| q4_1 | 0.1846 | 28.300% | 3.90 GB | 30.00% | 0.020286 |
| q4_ks | 0.1149 | 17.615% | 3.56 GB | 27.38% | 0.012172 |
| q4_km | 0.0535 | 8.202% | 3.80 GB | 29.23% | 0.005815 |
| q5_0 | 0.0796 | 12.203% | 4.30 GB | 33.08% | 0.009149 |
| q5_1 | 0.0415 | 6.362% | 4.70 GB | 36.15% | 0.005000 |
| q5_ks | 0.0353 | 5.412% | 4.33 GB | 33.31% | 0.004072 |
| q5_km | 0.0142 | 2.177% | 4.45 GB | 34.23% | 0.001661 |
| q6_k | 0.0044 | 0.675% | 5.15 GB | 39.62% | 0.000561 |
| q8_0 | 0.0004 | 0.061% | 6.70 GB | 51.54% | 0.000063 |

LLaMA 2 / 13B

| Name | +ppl | +ppl (% of the 13B-to-7B ppl gap) | Size | Size (% of 16-bit) | +ppl per -1 GB |
| --- | --- | --- | --- | --- | --- |
| q2_k | 0.6002 | 92.013% | 5.13 GB | 20.52% | 0.030206 |
| q3_ks | 0.3490 | 53.503% | 5.27 GB | 21.08% | 0.017689 |
| q3_km | 0.1955 | 29.971% | 5.88 GB | 23.52% | 0.010225 |
| q3_kl | 0.1520 | 23.302% | 6.45 GB | 25.80% | 0.008194 |
| q4_0 | 0.1317 | 20.190% | 6.80 GB | 27.20% | 0.007236 |
| q4_1 | 0.1065 | 16.327% | 7.60 GB | 30.40% | 0.006121 |
| q4_ks | 0.0861 | 13.199% | 6.80 GB | 27.20% | 0.004731 |
| q4_km | 0.0459 | 7.037% | 7.32 GB | 29.28% | 0.002596 |
| q5_0 | 0.0313 | 4.798% | 8.30 GB | 33.20% | 0.001874 |
| q5_1 | 0.0163 | 2.499% | 9.10 GB | 36.40% | 0.001025 |
| q5_ks | 0.0242 | 3.710% | 8.36 GB | 33.44% | 0.001454 |
| q5_km | 0.0095 | 1.456% | 8.60 GB | 34.40% | 0.000579 |
| q6_k | 0.0025 | 0.383% | 9.95 GB | 39.80% | 0.000166 |
| q8_0 | 0.0005 | 0.077% | 13.00 GB | 52.00% | 0.000042 |
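
The derived columns in both tables can be reproduced from +ppl and the quantized size alone, assuming f16 baseline sizes of roughly 13.0 GB (7B) and 25.0 GB (13B) and a ppl gap of about 0.6523 between the 7B and 13B f16 models (all three baselines are assumptions implied by the table values, not figures stated in this card). A minimal sketch:

# Sketch: recompute the derived columns for the q4_km row of the 7B table.
F16_SIZE_GB = 13.0        # assumed f16 size of the 7B model
PPL_GAP = 0.6523          # assumed ppl gap between the 7B and 13B f16 models
ppl_increase = 0.0535     # q4_km: +ppl
size_gb = 3.80            # q4_km: quantized size

print(size_gb / F16_SIZE_GB * 100)             # ~29.23  -> size as % of 16-bit
print(ppl_increase / PPL_GAP * 100)            # ~8.20   -> +ppl as % of the ppl gap
print(ppl_increase / (F16_SIZE_GB - size_gb))  # ~0.005815 -> +ppl per -1 GB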

Reference Model Cards

See the model card of the repository TheBloke/Llama-2-13B-GGML, where Llama 2 has been converted to GGML, and llama.cpp pull request #1687 for quantized-weight performance.

Note

  • Simply download the single GGML-format weight file you need; the other files are for reference purposes only during development. After conducting several experiments, we will provide the final GGML weight file separately.