JingweiZuo committed on
Commit 124f971
1 Parent(s): a011b2a

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -154,12 +154,13 @@ print(tokenizer.decode(outputs[0]))
  ## Training Data
 
  Falcon-Mamba has been trained with ~5,500 GT (gigatokens), mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-volume web-only dataset that has been filtered and deduplicated.
- Similar to the other [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the training context length from 2,048 up to 8,192.
+ Similar to the other [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the context length (from 2,048 to 8,192).
+ Moreover, inspired by the concept of Curriculum Learning, we carefully chose the data mixtures across the training stages, considering both data diversity and complexity.
  Note that at inference the context length is not relevant, as the Mamba architecture has no limit on long-range dependencies.
  At the last training stage, a small portion of high-quality curated data was used to further enhance performance.
 
- Overall, the data sources included RefinedWeb-English, high-quality technical data, code data and conversational data extracted from public sources.
- In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
+ Overall, the data sources included RefinedWeb-English, high-quality technical data, code data and math data extracted from public sources.
+ In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) during our last training stage.
 
  The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
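
As a quick illustration of that tokenizer, the sketch below loads it through `transformers` and tokenizes a sample string. The repository id `tiiuae/falcon-mamba-7b` is an assumption (this diff does not name the repository); any Falcon repository shipping the same tokenizer would behave equivalently.

```python
# Minimal sketch: load the Falcon tokenizer and inspect how a sample is tokenized.
# The repository id is an assumption, not taken from this diff.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")

sample = "Falcon-Mamba has been trained with ~5,500 GT of web, technical, code and math data."
token_ids = tokenizer(sample)["input_ids"]

print(len(token_ids))                                   # number of tokens the sample maps to
print(tokenizer.convert_ids_to_tokens(token_ids[:10]))  # first few token pieces
print(tokenizer.decode(token_ids))                      # round-trip back to text
```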
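
The note above, that the context length is not a hard limit at inference, can be exercised with a generation sketch like the following: because Mamba maintains a fixed-size recurrent state rather than a growing attention cache, prompts longer than the 8,192-token training context can still be processed. Repository id, dtype, and generation settings are illustrative assumptions.

```python
# Hedged sketch: generate from a prompt longer than the training context length.
# Repository id, dtype and generation settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a deliberately long prompt; the recurrent state means its length is not
# capped by the 2,048 -> 8,192 schedule used during training.
prompt = "Summarize the following notes.\n\n" + ("Falcon-Mamba training data notes. " * 2000)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```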