Update README.md
Finally, our models retain the **excellent English performance** inherited from the original Google Gemma 2 models upon which they are based.

# Use in 🤗 Transformers
First install the latest version of the transformers library:
```
pip install -U 'transformers[torch]'
```
Then load the model in transformers:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
)
```
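The generation example in the instruction-format section below also relies on a `tokenizer`; as a minimal sketch (the exact loading call is not shown in this excerpt), one can be obtained with the standard `AutoTokenizer` API and the same model ID:
```python
from transformers import AutoTokenizer

# Minimal sketch: load the tokenizer that pairs with the model above.
tokenizer = AutoTokenizer.from_pretrained("INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0")
```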

# Recommended Parameters

For optimal performance, we recommend the following parameters for text generation, as we have extensively tested our model with them:

```python
from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,    # choose the maximum number of generated tokens
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],  # stop on <eos> (1) and <end_of_turn> (107)
)
```
In principle, increasing the temperature should also work adequately.
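For example, a higher value can be set directly on the configuration above before generating; the value below is only an illustration, not a tested recommendation:
```python
# Illustration only: raise the temperature for more varied output.
generation_params.temperature = 0.7
```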

# Instruction format

In order to leverage instruction fine-tuning, your prompt should begin with a beginning-of-sequence token `<bos>` and be formatted in the Gemma 2 chat template. `<bos>` should only be the first token in a chat sequence.

E.g.
```
<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model
```
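If you assemble this string by hand instead of using the chat template, keep in mind that the tokenizer already prepends `<bos>` by default; a minimal sketch that avoids a duplicated `<bos>` (illustrative only, reusing the same example question) is:
```python
# Minimal sketch: tokenize a manually formatted Gemma 2 chat prompt.
# add_special_tokens=False prevents the tokenizer from adding a second <bos>.
prompt = (
    "<bos><start_of_turn>user\n"
    "Кога е основан Софийският университет?<end_of_turn>\n"
    "<start_of_turn>model\n"
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
```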
This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:

```python
messages = [
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(
    **input_ids,
    generation_config=generation_params
)
print(tokenizer.decode(outputs[0]))
```
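To print only the model's reply rather than the full sequence, one option (a sketch, not part of the snippet above) is to slice off the prompt tokens and skip special tokens when decoding:
```python
# Sketch: decode only the newly generated tokens, dropping special tokens such as <end_of_turn>.
prompt_length = input_ids["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```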
**Important Note:** Models based on Gemma 2 such as BgGPT-Gemma-2-9B-IT-v1.0 do not support flash attention. Using it results in degraded performance.
# Use with GGML / llama.cpp