chrisociepa committed
Commit cadc6fe
1 parent: 6fb9f0c

Update README.md

Files changed (1):
  1. README.md +51 -41

README.md CHANGED
 
# APT-1B-Base

## Introduction

At [Azurro](https://azurro.pl), we consistently place importance on using Open Source technologies, both in our projects and in our everyday lives. We have decided to share a base language model that we trained ourselves. We are confident that smaller language models have great potential, and that direct access to them for everyone interested in such models further democratizes this significant and rapidly changing field.

## Statements

Training large language models requires a lot of computing power and is usually reserved for the major players on the market. However, does this mean that individuals or small companies cannot train language models capable of performing specific tasks? We decided to answer this question and train our own language model from scratch.
We have made the following statements:
 
It is important to remember that models are only as good as the data they are trained on. Given the small size of the model, we trained it on carefully selected texts. This is why we did not use corpora such as Common Crawl, which contain a lot of poor-quality data. Our team prepared a set of sources that were then processed and used to train the model.

## Model

APT-1B-Base is a base model that introduces a new series of APT (Azurro Pretrained Transformer) models. It has been trained with an original open-source framework called [ALLaMo](https://github.com/chrisociepa/allamo). This framework allows the user to train language models similar to Meta AI's LLaMA models quickly and efficiently.

A special tokenizer has been prepared and trained for the purpose of training the model.
 
### Model description

* **Developed by:** [Azurro](https://azurro.pl)
* **Language:** Polish
* **Model type:** causal decoder-only
* **License:** CC BY NC 4.0 (non-commercial use)
 
 
### Model details

| **Hyperparameter**   | **Value** |
|----------------------|-----------|
| Model Parameters     | 1060M     |
| Sequence Length      | 2048      |
| Vocabulary Size      | 8000      |
| Layers               | 20        |
| Heads                | 16        |
| d_head               | 128       |
| d_model              | 2048      |
| Dropout              | 0.0       |
| Bias                 | No        |
| Positional Encoding  | RoPE      |
| Activation Function  | SwiGLU    |
| Normalizing Function | RMSNorm   |
| Intermediate Size    | 5632      |
| Norm Epsilon         | 1e-06     |
### Tokenizer details

* type: BPE
* special tokens: 7
* alphabet size: 112
* vocabulary size: 8000
 
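The values above can be cross-checked against the files hosted in the repository. The snippet below is a minimal sketch, assuming the checkpoint exposes a standard `transformers` configuration and tokenizer; the attribute names follow common `transformers` conventions and are an assumption, not something stated in this card. The example sentence is arbitrary.

```python
import transformers

# Minimal sketch: read back the hosted config and tokenizer and compare them
# with the tables above. Attribute names follow common transformers
# conventions and are an assumption about this repository.
config = transformers.AutoConfig.from_pretrained('Azurro/APT-1B-Base')
print(config.num_hidden_layers)    # expected: 20 layers
print(config.hidden_size)          # expected: 2048 (d_model)
print(config.num_attention_heads)  # expected: 16 heads
print(config.intermediate_size)    # expected: 5632
print(config.vocab_size)           # expected: 8000

tokenizer = transformers.AutoTokenizer.from_pretrained('Azurro/APT-1B-Base')
print(len(tokenizer))              # expected: 8000 (BPE vocabulary)
print(tokenizer.tokenize('Dolina Krzemowa to centrum nowych technologii.'))
```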
## Training

### Training hyperparameters

| **Hyperparameter**          | **Value**               |
|-----------------------------|-------------------------|
| Micro Batch Size            | 1                       |
| Gradient Accumulation Steps | 264                     |
| Batch Size (tokens)         | 540672 (1 × 264 × 2048) |
| Learning Rate               | 3e-04                   |
| Optimizer                   | AdamW                   |
| β1, β2                      | 0.9, 0.95               |
| Adam eps                    | 1e-8                    |
| Weight Decay                | 0.1                     |
| Grad Clip                   | 1.0                     |

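
The optimizer settings above translate directly into PyTorch. The following is a minimal sketch of the update step only, assuming gradients have been accumulated over 264 micro-batches; `model` is a hypothetical placeholder module, and the actual ALLaMo training loop is not reproduced here.

```python
import torch

# Placeholder module standing in for the APT-1B-Base network (illustration only).
model = torch.nn.Linear(2048, 8000)

# AdamW configured as in the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)

# After accumulating gradients over 264 micro-batches of 2048 tokens each,
# clip the global gradient norm to 1.0 and take one optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```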
### Dataset

Collecting a large amount of high-quality training data is a great challenge. Over the past years at Azurro, we have carried out many projects involving Big Data processing, and with this extensive experience we were able to prepare a carefully selected training dataset quickly and efficiently.

Our training dataset contains:

* Polish Wikipedia: 970 million tokens
* web crawl data: 813 million tokens
 
## How to Use

Our model is fully compatible with the Hugging Face ecosystem, so you can use it right away.

```python
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('Azurro/APT-1B-Base', torch_dtype=torch.bfloat16)
```
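
Loading the tokenizer and generating a short continuation can be sketched as follows. This is an illustration rather than the authors' recommended settings: the prompt is arbitrary, the sampling parameters are examples, and a `transformers`-compatible tokenizer in the repository is assumed.

```python
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('Azurro/APT-1B-Base')
model = transformers.AutoModelForCausalLM.from_pretrained('Azurro/APT-1B-Base', torch_dtype=torch.bfloat16)

# Encode an arbitrary Polish prompt and sample a short continuation.
inputs = tokenizer('Najpopularniejszym polskim daniem jest', return_tensors='pt')
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_k=50,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

As a base (non-instruction-tuned) model, APT-1B-Base is best treated as a text completer: it continues the prompt rather than answering it.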
 
## Limitations and Biases

APT-1B-Base is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.

APT-1B-Base can produce factually incorrect output and should not be relied on to produce factually accurate information. APT-1B-Base was trained on various public datasets. While great efforts have been made to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

## License

Because of an unclear legal situation, we have decided to publish the model under the CC BY NC 4.0 license, which allows for non-commercial use. The model can be used for scientific purposes and for private use, as long as the license conditions are met.

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model.

## Citation
Please cite this model using the following format:

```