Clarified 550 Million token corpus size for CPT methodology verification
README.md
CHANGED
```diff
@@ -42,9 +42,10 @@ Chat2Find-CPT is a specialized version of the Qwen 3.5 4B model, enhanced via **
 - **Batch Size:** 2 (local) / 8 (global with Gradient Accumulation)
 
 ### Dataset
-The model
-- **
-- **
+The model underwent true Continued Pre-Training on a massive 1.38 GB unstructured text corpus. The data was densely packed into:
+- **Size:** ~270,000 packed sequences of 2048 tokens each (**~550 Million total tokens**).
+- **Epochs:** 1 Epoch (Standard pre-training practice to prevent overfitting).
+- **Content:** Sri Lankan News & Media, Cultural Context, and domain-specific raw web data.
 
 ## Capabilities
```
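
For reviewers verifying the new numbers, here is a minimal, self-contained Python sketch. It is not code from the Chat2Find-CPT repository; the `pack()` helper and all variable names are illustrative assumptions. It reproduces the quoted figures: 270,000 packed sequences of 2048 tokens come to 552,960,000 tokens, rounded in the README to "~550 Million", and a global batch of 8 built from a local batch of 2 implies 4 gradient-accumulation steps.

```python
# Sanity check of the corpus and batch figures quoted in the diff above.
# Illustrative only: pack() and these variable names are assumptions,
# not code from the Chat2Find-CPT repository.

def pack(token_ids, seq_len=2048):
    """Densely pack a flat stream of token ids into fixed-length
    sequences, dropping the trailing partial chunk (standard CPT packing)."""
    return [token_ids[i:i + seq_len]
            for i in range(0, len(token_ids) - seq_len + 1, seq_len)]

num_sequences = 270_000             # "~270,000 packed sequences"
seq_len = 2048                      # tokens per packed sequence
total_tokens = num_sequences * seq_len
print(f"{total_tokens:,} tokens")   # 552,960,000 -> "~550 Million"

local_batch, global_batch = 2, 8    # "Batch Size: 2 (local) / 8 (global)"
accum_steps = global_batch // local_batch
print(f"{accum_steps} gradient accumulation steps")  # 4

# Smoke test for pack(): 5000 dummy ids yield 2 full 2048-token sequences.
assert len(pack(list(range(5000)))) == 2
```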