lordjimen committed on
Commit 3e0bab9
1 Parent(s): f15b66b

Update README.md

Files changed (1)
  1. README.md +44 -31
README.md CHANGED
@@ -56,15 +56,52 @@ including Alibaba’s Qwen 2.5 72B and Meta’s Llama3.1 70B. Further, both BgGP
  Finally, our models retain the **excellent English performance** inherited from the original Google Gemma 2 models upon which they are based.
 
 
+ # Use in 🤗 Transformers
+ First install the latest version of the transformers library:
+ ```
+ pip install -U 'transformers[torch]'
+ ```
+ Then load the model in transformers:
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="eager",
+     device_map="auto",
+ )
+ ```
+
+ # Recommended Parameters
+
+ For optimal performance, we recommend the following parameters for text generation, as we have extensively tested our model with them:
+
+ ```python
+ from transformers import GenerationConfig
+
+ generation_params = GenerationConfig(
+     max_new_tokens=2048,    # choose the maximum number of tokens to generate
+     temperature=0.1,
+     top_k=25,
+     top_p=1,
+     repetition_penalty=1.1,
+     eos_token_id=[1, 107],  # stop at <eos> and <end_of_turn>
+ )
+ ```
+
+ In principle, increasing temperature should work adequately as well.
+
  # Instruction format
 
  In order to leverage instruction fine-tuning, your prompt should begin with a beginning-of-sequence token `<bos>` and be formatted in the Gemma 2 chat template. `<bos>` should only be the first token in a chat sequence.
 
  E.g.
  ```
- <bos><start_of_turn>user\n
- Кога е основан Софийският университет?<end_of_turn>\n
- <start_of_turn>model\n
+ <bos><start_of_turn>user
+ Кога е основан Софийският университет?<end_of_turn>
+ <start_of_turn>model
+
  ```
 
  This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:
@@ -80,39 +117,15 @@ messages = [
  ]
  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")
 
- outputs = model.generate(**input_ids, max_new_tokens=256)
+ outputs = model.generate(
+     **input_ids,
+     generation_config=generation_params
+ )
  print(tokenizer.decode(outputs[0]))
 
  ```
 
- # Recommended Parameters
-
- For optimal performance, we recommend the following parameters for text generation, as we have extensively tested our model with them:
-
- ```python
- generation_params = {
- "temperature": 0.1
- "top_k": 20,
- "repetition_penalty": 1.1
- }
- ```
-
- In principle, increasing temperature should work adequately as well.
 
- # Use in 🤗 Transformers
- First install the latest version of the transformers library:
- ```
- pip install -U 'transformers[torch]'
- ```
- Then load the model in transformers:
- ```python
- model = AutoModelForCausalLM.from_pretrained(
- "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0",
- torch_dtype=torch.bfloat16,
- attn_implementation="eager",
- device_map="auto",
- )
- ```
  **Important Note:** Models based on Gemma 2 such as BgGPT-Gemma-2-9B-IT-v1.0 do not support flash attention. Using it results in degraded performance.
 
  # Use with GGML / llama.cpp
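
For instance, a GGUF build of the model can be queried from Python via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). The snippet below is only a minimal sketch: the model path is a placeholder for a downloaded GGUF file, the sampling values simply reuse the recommended parameters above, and it assumes the Gemma 2 chat template is embedded in the GGUF metadata.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/BgGPT-Gemma-2-9B-IT-v1.0.gguf",  # placeholder path to a GGUF quantisation
    n_ctx=8192,  # context window; adjust to your memory budget
)

# create_chat_completion() applies the chat template stored in the GGUF metadata,
# which for this model should match the Gemma 2 format shown above.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Кога е основан Софийският университет?"}],
    max_tokens=2048,
    temperature=0.1,
    top_k=25,
    top_p=1.0,
    repeat_penalty=1.1,
)
print(response["choices"][0]["message"]["content"])
```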