Buggy GGUF Output

#38
by mattjcly - opened

TL;DR

I believe there may be an issue with the gemma-7b-it GGUF file: gemma-7b-it.gguf produces buggy output in llama.cpp, while gemma-2b-it.gguf works fine and gemma-7b-it works fine in the HF Inference API.

Overview

gemma-7b-it.gguf produces odd misspellings and formatting glitches when run with llama.cpp (https://github.com/ggerganov/llama.cpp/commit/4cb4d8b22d4fda971621a68c570ce84d66897c37):

~\Desktop\lmstudio\llama.cpp\build\bin\Release
.\main.exe -p "<start_of_turn>user\nGive me a solid plan to write a production application in c++<end_of_turn>\n<start_of_turn>model\n" -m "C:\Users\User\.cache\models\google\gemma-7b-it-GGUF\gemma-7b-it.gguf"
Log start
main: build = 2240 (4cb4d8b2)
main: built with MSVC 19.29.30153.0 for x64
main: seed  = 1708625481
llama_model_loader: loaded meta data with 19 key-value pairs and 254 tensors from C:\Users\User\.cache\models\google\gemma-7b-it-GGUF\gemma-7b-it.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 28
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - type  f32:  254 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 31.81 GiB (32.00 BPW)
llm_load_print_meta: general.name     = gemma-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MiB
llm_load_tensors:        CPU buffer size = 32570.17 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     8.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   506.25 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 1


<start_of_turn>user\nGive me a solid plan to write a production application in c++<end_of_turn>\n<start_of_turn>model\n**Step 1: Define Requirements and Goals:**

- Identify the purpose of your Production Application (PA).
  * Determine its target audience, platform(s), hardware requirements.


<strong> Step2 : Design Data Structures And Algorithms**:   </strong>    




 - Choose data structures like arrays , linked lists or trees based on application needs . 



**Step3: Plan Out The Main Structure:**

- Create a layered architecture with main components (e g., UI layer, business logic  layer)
 Define interfaces for each component to ensure decoupling.


<strong> Step4 : Write Modular Components**:   </strong>




 - Break down the PA into smaller modules like data access , user

Note artifacts like the stray <strong> tags and the extra space between "needs" and the period in "needs .".

This does not occur in the HF Inference API:

<start_of_turn>user
Give me a solid plan to write a production application in c++<end_of_turn>
<start_of_turn>model
*Step 1: Define Requirements and Scope**

* Identify the purpose of the production application and its target audience.
* Determine the key features and functionalities.
* Establish performance and scalability requirements.
* Define data models and data structures.

**Step 2: Design and Architecture**

* Choose a suitable software development methodology (e.g., Agile, Waterfall).
* Design the overall architecture and data flow.
* Select appropriate technologies and tools for development.
* Create a class hierarchy and define interfaces for modularity.

**Step 3: Coding**

* Write clear and concise code using C++ best practices.
* Use object-oriented principles (OOP) to encapsulate functionality and data.
* Implement algorithms and data structures efficiently.
* Use modern C++ features such as smart pointers and lambda expressions.

**Step 4: Testing and Debugging**

* Create unit tests to verify functionality and performance.
* Use debugging tools to identify

The 2B model works fine in llama.cpp (note that this is a q8_0 quantization; I am downloading the full-precision version now, but q8_0 quantization does not fix the 7B model):

.\main.exe -p "<start_of_turn>user\nGive me a solid plan to write a production application in c++<end_of_turn>\n<start_of_turn>model\n" -m "C:\Users\User\.cache\models\lmstudio-ai\gemma-2b-it-GGUF\gemma-2b-it-q8_0.gguf"
Log start
main: build = 2240 (4cb4d8b2)
main: built with MSVC 19.29.30153.0 for x64
main: seed  = 1708625873
llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from C:\Users\User\.cache\models\lmstudio-ai\gemma-2b-it-GGUF\gemma-2b-it-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 7
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type q8_0:  127 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 18
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 2.51 B
llm_load_print_meta: model size       = 2.48 GiB (8.50 BPW)
llm_load_print_meta: general.name     = gemma-2b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.06 MiB
llm_load_tensors:        CPU buffer size =  2539.93 MiB
.............................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model:        CPU input buffer size   =     6.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   504.25 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 1


<start_of_turn>user\nGive me a solid plan to write a production application in c++<end_of_turn>\n<start_of_turn>model\n**Step 1: Define the Application Requirements**

* Determine the functionalities and features of the production application.
* Identify the target audience and their needs.
* Gather any existing specifications or design documents.

**Step 2: Choose a Development Framework**

* Consider using established frameworks like Qt, wxWidgets, or OpenGL for cross-platform development.
* Choose a framework based on the project requirements, development expertise, and future maintenance.

**Step 3: Design the User Interface (UI)**

* Create a wireframe of the application's layout and components.
* Use clear and consistent labeling to guide users.
* Design for accessibility and responsiveness.

**Step 4: Develop the Core Logic**

* Write the main application loop that handles user interactions, data flow, and overall execution.
* Implement business rules and error handling mechanisms.
* Connect to external systems or databases for data access.

This leads me to believe there may be an issue with this 7B GGUF file.

A few more simple examples of buggy output with the GGUF:

what are the benefits of python 3?
Benefits Python-2 and -4:
...
How many days are in August?
There is a total of 31 Days in the month Of Augest.

Hi, my name is George
HelloGeorge! 👋 It's a pleasure to hear from you. What would like me do today?

Why would Google release this bad model? It is too bad.

Try to add the following arguments to your main command: -e --temp 0 --repeat-penalty 1.0 --no-penalize-nl

  • -e - escape newlines (\n)
  • --temp 0 - pick most probable tokens
  • --repeat-penalty 1.0 - disable repetition penalty (it's never a good idea to have this with instruction-tuned models)
  • --no-penalize-nl - do not penalize repeating newlines

Example on M2 Ultra:

./main \
  -m ~/Downloads/gemma-7b-it.gguf \
  -p "<start_of_turn>user\nGive me a solid plan to write a production application in c++<end_of_turn>\n<start_of_turn>model\n" \
  -e --temp 0 --repeat-penalty 1.0 --no-penalize-nl
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 1


<start_of_turn>user
Give me a solid plan to write a production application in c++<end_of_turn>
<start_of_turn>model
**Step 1: Define Requirements**

* Identify the purpose of the application and its target audience.
* Determine the specific features and functionality required.
* Establish performance and scalability goals.
* Define data structures and algorithms needed.

**Step 2: Design the Application**

* Create a high-level design outlining the overall architecture and components.
* Choose appropriate data structures and algorithms for each component.
* Design the user interface (UI) and user experience (UX).
* Determine the necessary dependencies and libraries.

**Step 3: Write the Code**

* Implement the application using C++ language standards.
* Follow best practices for coding style, modularity, and reusability.
* Use appropriate tools and frameworks to simplify development.
* Write unit tests to ensure code quality and functionality.

**Step 4: Integrate and Test**

* Integrate the application with necessary dependencies and libraries.
* Test the application thoroughly to identify and fix bugs.
* Perform performance testing to evaluate its scalability and responsiveness.

**Step 5: Deployment and Maintenance**

* Deploy the application to the target environment.
* Monitor the application for performance and stability.
* Implement a maintenance plan for bug fixes and updates.

**Additional Tips:**

* Use a version control system (VCS) to track changes and collaborate.
* Follow a structured coding process, such as Agile or Waterfall.
* Use documentation tools to create clear and concise documentation.
* Seek feedback from peers and mentors to identify areas for improvement.
* Stay up-to-date with C++ best practices and new technologies.

**Timeline:**

* The timeline for writing a production application in C++ will vary based on the complexity of the project and the team's experience.
* A typical timeline might range from a few weeks to several months.

**Tools and Technologies:**

* Visual Studio or other IDE
* C++ Compiler and Linker
* Unit Testing Frameworks (e.g., Google Test)
* Version Control Systems (e.g., Git)
* Frameworks and Libraries (e.g., Qt, Boost) [end of text]

llama_print_timings:        load time =    1296.52 ms
llama_print_timings:      sample time =      76.40 ms /   443 runs   (    0.17 ms per token,  5798.35 tokens per second)
llama_print_timings: prompt eval time =      98.35 ms /    22 tokens (    4.47 ms per token,   223.70 tokens per second)
llama_print_timings:        eval time =   23091.99 ms /   442 runs   (   52.24 ms per token,    19.14 tokens per second)
llama_print_timings:       total time =   23527.31 ms /   464 tokens

I concur, the 7B GGUF is completely unusable: random words, random spaces, and it is much worse in non-English languages. I got loops and random UTF emoji spam on my first one-shot prompt.

@ggerganov Thanks for the pointers! Those command-line arguments seem to work pretty well. For my simple examples, the key seems to be --repeat-penalty; with that set to 1, the output is pretty good.

Google org

Should we surface this in READMEs or in other documentation?

@ggerganov Thanks for the CLI arguments; your example works well.

Is there any way to make gemma-7b-it work with llama.cpp in interactive mode?

@zbruceli One way to do it is with:

./main -m models/gemma-7b-it/ggml-model-f16.gguf -e --in-prefix "<start_of_turn>user\n" --in-suffix "<end_of_turn>\n<start_of_turn>model\n" --temp 0 --repeat-penalty 1.0 --no-penalize-nl -ngl 99 -ins -c 4096 --verbose-prompt
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 

main: prompt: ''
main: number of tokens in prompt = 1
     2 -> ''

main: interactive mode on.
Reverse prompt: '### Instruction:

'
 43774 -> ' ###'
 36142 -> ' Instruction'
235292 -> ':'
   109 -> '

'
Input prefix: '<start_of_turn>user
'
     2 -> ''
   106 -> '<start_of_turn>'
  1645 -> 'user'
   108 -> '
'
Input suffix: '<end_of_turn>
<start_of_turn>model
'
   107 -> '<end_of_turn>'
   108 -> '
'
   106 -> '<start_of_turn>'
  2516 -> 'model'
   108 -> '
'
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> <start_of_turn>user
Which are the 3 biggest countries in Europe?
<end_of_turn>
<start_of_turn>model
The 3 biggest countries in Europe are:

1. Russia
2. Ukraine
3. France

> <start_of_turn>user
What are their capitals?
<end_of_turn>
<start_of_turn>model
1. Russia - Moscow
2. Ukraine - Kiev
3. France - Paris

> <start_of_turn>user
What is their population?
<end_of_turn>
<start_of_turn>model
1. Russia - 1.4 billion
2. Ukraine - 43 million
3. France - 54 million

> <start_of_turn>user


llama_print_timings:        load time =     426.82 ms
llama_print_timings:      sample time =      13.23 ms /    68 runs   (    0.19 ms per token,  5137.89 tokens per second)
llama_print_timings: prompt eval time =     522.78 ms /    80 tokens (    6.53 ms per token,   153.03 tokens per second)
llama_print_timings:        eval time =    2561.63 ms /    68 runs   (   37.67 ms per token,    26.55 tokens per second)
llama_print_timings:       total time =   34928.60 ms /   148 tokens

@ggerganov Thank you. But I want to use a quantized version with the server (through api_like_OAI.py). How do I pass --in-prefix and --in-suffix in Python code, or however else I can use them with the server (the server usually runs at 0.0.0.0)? My question is general (for all models): how do I give them a chat template in Python code when api_like_OAI.py is running?

In general, you should add the prefixes and suffixes that @ggerganov adds: "<start_of_turn>user\n" before your prompt and "<end_of_turn>\n<start_of_turn>model\n" after your prompt. This is the instruction formatting we've trained our model with (similar to ChatML) -- does that help?
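
For reference, here is a minimal Python sketch of that formatting (the build_gemma_prompt helper and the message-list shape are illustrative assumptions, not part of llama.cpp or the model repo):

def build_gemma_prompt(messages):
    """Render a list of {"role": ..., "content": ...} dicts into Gemma's turn format."""
    prompt = ""
    for message in messages:
        # Gemma's template has no separate system role, so this sketch folds
        # any system text into a user turn.
        role = "model" if message["role"] == "assistant" else "user"
        prompt += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    # Leave the model turn open so generation continues from this point.
    return prompt + "<start_of_turn>model\n"

print(build_gemma_prompt([
    {"role": "user", "content": "Give me a solid plan to write a production application in c++"},
]))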

@suryabhupa greg wrote that example for main, but how is it done with the server? (Especially when using the OpenAI client?) Like:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)

Just replace the system content and the user content (and, if required for gemma-7b-it, also add assistant content). Thank you.

Thank you for sharing the settings, @ggerganov. I am attempting to run the llama.cpp HTTP server with the settings you provided; however, I am unable to locate the -e (escape newlines, \n) option for the HTTP server. Could you kindly tell me how to enable this setting for the server?

In Chinese Q&A, the choice of torch_dtype can have an impact on the results.
For example:
(1) dtype = torch.bfloat16: the answers are good.
(2) dtype = torch.float16: almost all answer tokens are <pad>.
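
For context, a minimal transformers sketch of the dtype choice being described (the Chinese test prompt is a placeholder; the snippet is not taken from this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"
dtype = torch.bfloat16  # reportedly fine; torch.float16 was reported to produce mostly <pad> tokens

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map="auto")

# Placeholder Chinese question ("What are the advantages of Python 3?")
prompt = "<start_of_turn>user\nPython 3 有哪些优点？<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))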

@suryabhupa greg wrote that example for main, but how is it done with the server? (Especially when using the OpenAI client, as in the example above?)

You will need to create a payload, something like this:

data = {
    "mode": "chat-instruct",
    "character": character,
    "messages": history,
    "user_bio": 'Ab',
    "user_name": 'Ab',
    "temperature": 0.0,
    "frequency_penalty": 1.0
}

resp_json = post_message(config["URL"], headers, data)
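
Alternatively, if you are running llama.cpp's built-in server rather than a webui-style API, here is a minimal sketch that posts the Gemma-formatted prompt straight to its /completion endpoint with the sampling settings suggested earlier (port, endpoint path, and field names are assumptions based on the llama.cpp server example and may differ for your build):

import requests

prompt = (
    "<start_of_turn>user\n"
    "Write a limerick about python exceptions<end_of_turn>\n"
    "<start_of_turn>model\n"
)

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": prompt,
        "temperature": 0.0,     # --temp 0
        "repeat_penalty": 1.0,  # --repeat-penalty 1.0
        "n_predict": 256,
    },
    timeout=120,
)
print(resp.json()["content"])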
