---
library_name: transformers
tags:
- gemma2
- instruct
- bggpt
- insait
license: gemma
language:
- bg
- en
base_model:
- google/gemma-2-2b-it
- google/gemma-2-2b
pipeline_tag: text-generation
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF

This is a quantized version of [INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0), created using llama.cpp.

# Original Model Card

# INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637e1f8cf7e01589cc17bf7e/p6d0YFHjWCQ3S12jWqO1m.png)

INSAIT introduces **BgGPT-Gemma-2-2.6B-IT-v1.0**, a state-of-the-art Bulgarian language model based on **google/gemma-2-2b** and **google/gemma-2-2b-it**.
BgGPT-Gemma-2-2.6B-IT-v1.0 is **free to use** and distributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
This model was created by [`INSAIT`](https://insait.ai/), part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

# Model description

The model was built on top of Google’s Gemma 2 2B open models.
It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at [EMNLP’24](https://aclanthology.org/2024.findings-emnlp.1000/),
allowing the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance.
During the pre-training stage, we used various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute,
and machine translations of popular English datasets.
The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations.
For more information, check our [blog post](https://models.bggpt.ai/blog/).

# Benchmarks and Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/9pp8aD1yvoW-cJWzhbHXk.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/33CjjtmCeAcw5qq8DEtJj.png)

We evaluate our models on a set of standard English benchmarks, a translated version of them in Bulgarian, as well as Bulgarian-specific benchmarks we collected:

- **Winogrande challenge**: testing world knowledge and understanding
- **Hellaswag**: testing sentence completion
- **ARC Easy/Challenge**: testing logical reasoning
- **TriviaQA**: testing trivia knowledge
- **GSM-8k**: solving grade-school math word problems
- **Exams**: solving high school problems from natural and social sciences
- **MON**: contains exams across various subjects for grades 4 to 12

These benchmarks test logical reasoning, mathematics, knowledge, language understanding and other skills of the models and are provided at [insait-institute/lm-evaluation-harness-bg](https://github.com/insait-institute/lm-evaluation-harness-bg).
The graphs above show the performance of BgGPT 2.6B compared to other small open language models such as Microsoft's Phi 3.5 and Alibaba's Qwen 2.5 3B.
The BgGPT model not only surpasses them, but also **retains English performance** inherited from the original Google Gemma 2 models upon which it is based.

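As a rough sketch of how such an evaluation can be run, assuming the fork keeps the upstream [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) Python API; the task name `hellaswag_bg` below is a hypothetical placeholder, so check the repository for the actual Bulgarian task names:

```python
import lm_eval

# Evaluate the model on a Bulgarian task via the harness's Python entry point.
# NOTE: "hellaswag_bg" is a placeholder task name; see the repository above
# for the tasks actually shipped with lm-evaluation-harness-bg.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0,dtype=bfloat16",
    tasks=["hellaswag_bg"],
)
print(results["results"])
```
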
# Use in 🤗 Transformers

First install the latest version of the transformers library:

```
pip install -U 'transformers[torch]'
```

Then load the model in transformers:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    torch_dtype=torch.bfloat16,   # load weights in bfloat16
    attn_implementation="eager",  # Gemma 2 does not support flash attention
    device_map="auto",
)
```

# Recommended Parameters

For optimal performance, we recommend the following text-generation parameters, which we have tested extensively with our model:

```python
from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,      # choose the maximum number of generated tokens
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],    # <eos> and <end_of_turn> in the Gemma 2 vocabulary
)
```

In principle, increasing the temperature should work adequately as well.

# Instruction format

In order to leverage instruction fine-tuning, your prompt should begin with a beginning-of-sequence token `<bos>` and be formatted in the Gemma 2 chat template. `<bos>` should only be the first token in a chat sequence.

E.g. (the Bulgarian prompt asks "When was Sofia University founded?"):
```
<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model

```

This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    use_default_system_prompt=False,
)

messages = [
    # "When was Sofia University founded?"
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,  # append "<start_of_turn>model" so the model answers
    return_dict=True,
)

outputs = model.generate(
    **input_ids.to(model.device),
    generation_config=generation_params,
)
print(tokenizer.decode(outputs[0]))
```
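
Note that `tokenizer.decode(outputs[0])` prints the prompt together with the model's reply. To print only the reply, you can slice off the prompt tokens first; a small sketch reusing the variables defined above:

```python
# outputs[0] contains the prompt tokens followed by the generated tokens;
# skip the prompt portion and decode only what the model produced.
prompt_length = input_ids["input_ids"].shape[-1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(reply)
```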

**Important Note:** Models based on Gemma 2, such as BgGPT-Gemma-2-2.6B-IT-v1.0, do not support flash attention. Using it results in degraded performance.

# Use with GGML / llama.cpp

The model and instructions for usage in GGUF format are available at [INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF).

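For quick orientation, here is a minimal sketch of loading one of the GGUF files with the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) bindings; the filename below is hypothetical (substitute an actual quantization from this repository), and the sampling values mirror the recommended parameters above:

```python
from llama_cpp import Llama

# Hypothetical filename; pick one of the GGUF files from this repo.
llm = Llama(model_path="BgGPT-Gemma-2-2.6B-IT-v1.0.Q4_K_M.gguf", n_ctx=4096)

# Gemma 2 chat template; llama.cpp prepends <bos> automatically by default.
prompt = (
    "<start_of_turn>user\n"
    "Кога е основан Софийският университет?<end_of_turn>\n"  # "When was Sofia University founded?"
    "<start_of_turn>model\n"
)

out = llm(
    prompt,
    max_tokens=512,
    temperature=0.1,
    top_k=25,
    top_p=1.0,
    repeat_penalty=1.1,
    stop=["<end_of_turn>"],
)
print(out["choices"][0]["text"])
```
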
# Community Feedback

We welcome feedback from the community to help improve BgGPT. If you have suggestions, encounter any issues, or have ideas for improvements, please:

- Share your experience using the model through Hugging Face's community discussion feature, or
- Contact us at [bggpt@insait.ai](mailto:bggpt@insait.ai)

Your real-world usage and insights are valuable in helping us optimize the model's performance and behaviour for various use cases.

# Summary

- **Finetuned from:** [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it); [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** Bulgarian and English
- **Contact:** [bggpt@insait.ai](mailto:bggpt@insait.ai)
- **License:** BgGPT is distributed under the [Gemma Terms of Use](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0/raw/main/LICENSE)