draganjovanovich committed
Commit • 9b63a5f
Parent(s): 1887ca1

Upload folder using huggingface_hub

Browse files:
- .gitattributes +4 -0
- README.md +66 -0
- prodigy-sm-base-v0.1-Q4_K_M.gguf +3 -0
- prodigy-sm-base-v0.1-Q5_K_M.gguf +3 -0
- prodigy-sm-base-v0.1-Q8_K_M.gguf +3 -0
- prodigy-sm-base-v0.1.gguf +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+prodigy-sm-base-v0.1-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+prodigy-sm-base-v0.1-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+prodigy-sm-base-v0.1-Q8_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+prodigy-sm-base-v0.1.gguf filter=lfs diff=lfs merge=lfs -text
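For context, entries like the four added above are what `git lfs track` writes into `.gitattributes`; a hedged illustration only, since the upload here was done via huggingface_hub, which manages these entries itself:

```sh
# Appends "prodigy-sm-base-v0.1-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text"
# to .gitattributes so the file is stored via Git LFS.
git lfs track "prodigy-sm-base-v0.1-Q4_K_M.gguf"
```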
README.md
ADDED
@@ -0,0 +1,66 @@
---
license: apache-2.0
language:
- en
- sr
- hr
- bs
---
# Prodigy SM Base v0.1

<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/4p2zaOWu6kTS3fcbevHef.png" width="70%" height="70%">

In our latest endeavour, we performed continued pre-training of a large language model (Mistral-7B-v0.1) to understand and generate text in new languages, including **Serbian**, **Bosnian**, and **Croatian**, using an innovative approach.

Rather than depending only on extensive datasets in the target language, our method uses a more compact set of both synthetic and human-curated data, along with a mixture of CC web data, applied in two strategic phases:

1. Establishing a comprehensive demonstration of all grammatical and orthographic rules pertinent to the language.
2. Supplying a diverse array of examples that not only reinforce these rules but also integrate a wide range of linguistic nuances.

While our approach is uniquely tailored to our objectives, we have drawn some inspiration from recent advancements in language model training. Specifically, the conceptual strategies discussed in the paper [Adapting Large Language Models via Reading Comprehension](https://arxiv.org/pdf/2309.09530.pdf) provided valuable insights, though our methods diverge significantly in practice. By adopting this inspired approach, we aim to teach the model new languages efficiently, with a balanced blend of accuracy and linguistic diversity.

So... did it work?!

# **Yes!**
See the benchmark results, or even better, download the model and try it yourself. As you know by now, there's no better benchmark than a quick "try it yourself" vibe check. :)
<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/C9m_OjnYEpQo43VCrwz4A.png" width="100%" height="100%">

Here we demonstrate the results of a benchmark that is not frequently performed, yet is equally important: how adapting the model to a new language affected its original English-only performance.
<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/IPY0myfQI-Ne5x6b11glz.png" width="100%" height="100%">

*All evals are performed in a zero-shot manner.
*Also bear in mind that the llama-2-7b, llama-3-8b, and mistral-7b models compared against Prodigy SM Base are not trained on extensive Serbian-language datasets; these benchmarks demonstrate that primarily English models can be adapted to other languages.
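For reference, a zero-shot run with lm-evaluation-harness looks roughly like the sketch below. The model id and task name are placeholders, not the exact setup used here; the Serbian tasks come from the serbian-llm-eval adaptation credited in the Thanks section.

```python
# Sketch only: zero-shot evaluation via lm-evaluation-harness's Python API.
# "arc_challenge" is a standard English task; the Serbian task names in the
# serbian-llm-eval fork may differ, so treat both ids below as assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # placeholder model id
    tasks=["arc_challenge"],
    num_fewshot=0,  # zero-shot, matching the reported benchmarks
)
print(results["results"])
```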
So, as you can see, we successfully improved the original model's performance on Serbian-language use cases while retaining, and even slightly improving, its English-language performance.

### Training results
Training results of continued pre-training of [mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1):

<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/5xeJ-vfWk4RhJNC7t5I0g.png" width="70%" height="70%">
<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/R4R8ai8LaN3WlYCOenUyb.png" width="70%" height="70%">

As a last experimental step, we merged the resulting model with **Mistral-7B-v0.1** and two earlier checkpoints of **prodigy-sm-base** using the [Model Stock](https://arxiv.org/abs/2403.19522) method.
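A merge like this can be described with a mergekit config. The sketch below is illustrative only; the checkpoint paths are placeholders, since the intermediate checkpoints were not published.

```yaml
# Illustrative mergekit config for a Model Stock merge; the local
# checkpoint paths are placeholders, not the authors' actual files.
models:
  - model: ./prodigy-sm-base-final   # final continued-pretraining checkpoint
  - model: ./prodigy-sm-base-ckpt-a  # earlier checkpoint (placeholder)
  - model: ./prodigy-sm-base-ckpt-b  # earlier checkpoint (placeholder)
merge_method: model_stock
base_model: mistralai/Mistral-7B-v0.1
dtype: bfloat16
```

Running `mergekit-yaml config.yml ./merged` would produce the merged weights; Model Stock uses the base model as an anchor when averaging the fine-tuned checkpoints.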
43 |
+
|
44 |
+
# Notes
|
45 |
+
As this is base model, there is no chat template or strict chat following capabilities, this model is best candidate for further pre-train on Serbian language as there is a lot more room for improvement (you can hit sweet spot), or next step in the pipeline, such as some form of chat or instruct tuning.
|
46 |
+
|
47 |
+
If you want model that is already instruction tuned we did that too, check **Prodigy SM Instruct v0.1**
|
48 |
+
# Prodigy SM Instruct v0.1
|
49 |
+
๐[prodigy-sm-instruct]() **COMING SOON**
|
50 |
+
|
51 |
+
And stay tuned for:
|
52 |
+
[prodigy-sm-base (llama-3)]() **COMING SOON**
|
53 |
+
[prodigy-sm-instruct (llama-3)]() **COMING SOON**
|
54 |
+
|
55 |
+
๐ข Also we are excited to announce that [iskon.ai](https://Iskon.ai) will soon launch an API platform featuring advanced **Prodigy** series of models, advanced AI tools and much more! ๐
|
56 |
+
|
57 |
+
|
58 |
+
# Thanks
|
59 |
+
- [gordicaleksa/serbian-llm-eval](https://github.com/gordicaleksa/serbian-llm-eval) and his community for curating translations and adaptation of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
|
60 |
+
that we used to perform benchmarks.
|
61 |
+
- [jondurbin](https://huggingface.co/jondurbin) for amazing airoboros framework
|
62 |
+
- [teknium](https://huggingface.co/teknium) for various insights shared on discord and twitter aka x.com
|
63 |
+
- [Eric](https://twitter.com/erhartford) for various insights shared on discord and twitter aka x.com
|
64 |
+
- [mergekit](https://github.com/arcee-ai/mergekit) for model merging tools
|
65 |
+
|
66 |
+
*Huge thanks to Redmond.ai for generous DGX cloud credits* [redmond.ai]( https://redmond.ai)
|
prodigy-sm-base-v0.1-Q4_K_M.gguf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1986511afe78b7087fb7a185bc947926b5a0dee84019d1e9f8c513c07439346
size 4368439008

prodigy-sm-base-v0.1-Q5_K_M.gguf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8340e398b03c37f5822c014d006ce0501e20b4d96fbefffe3a4a6aec46dce75e
size 5131409120

prodigy-sm-base-v0.1-Q8_K_M.gguf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f986dae1498f242b7b569d5cf1327be826d1217a681e8f7dd2d9c64c005d614f
size 7695857376

prodigy-sm-base-v0.1.gguf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eddb08d54eb92cc91cf075d0edf837b167dd8afa9d1d86c5f8456bd466f592c7
size 14484731584
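The three-line stubs above are Git LFS pointer files; the actual GGUF weights resolve through LFS on download. Once a quantized file is fetched, it can be loaded with llama-cpp-python, for example. This is a minimal sketch; the prompt and parameters are illustrative, and since this is a base model it expects plain-text completion rather than chat.

```python
# Minimal sketch: load the Q4_K_M quantization with llama-cpp-python
# (pip install llama-cpp-python). The path, prompt, and sampling settings
# are illustrative, not recommendations from the model authors.
from llama_cpp import Llama

llm = Llama(model_path="prodigy-sm-base-v0.1-Q4_K_M.gguf", n_ctx=4096)
out = llm("Glavni grad Srbije je", max_tokens=32, temperature=0.7)
print(out["choices"][0]["text"])
```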