Update README.md
README.md
@@ -99,7 +99,7 @@ gen_tokens = model.generate(input_ids, do_sample=True, max_length=400)
 print("-"*20 + "Output for model" + 20 * '-')
 print(tokenizer.batch_decode(gen_tokens)[0])
 ```
-## CrystalChat DataMix
+<!-- ## CrystalChat DataMix
 | Subset | Tokens (Billion) |
 | ----------- | ----------- |
 | OASST1-guanaco | 4.46 |
@@ -114,13 +114,12 @@ print(tokenizer.batch_decode(gen_tokens)[0])
 | HTML Instruction | 43.67 |
 | General Textbooks | 85.59 |
 | Programming Books | 395.63 |
-| Total | 1102.52 |
+| Total | 1102.52 | -->
 
 # Evaluation
 
 Coming Soon!
 
-
 # Bias, Risks, and Limitations
 CrystalChat has not been aligned to human preferences for safety within the RLHF phase or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). The training data is known and made available [here](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets). It primarily consists of SlimPajama, StarCoder, and WebCrawl dataset.
 