Chat2Find committed on
Commit f073936 · verified · 1 Parent(s): f97feb5

Clarified 550 Million token corpus size for CPT methodology verification

Files changed (1): README.md +4 -3
README.md CHANGED

@@ -42,9 +42,10 @@ Chat2Find-CPT is a specialized version of the Qwen 3.5 4B model, enhanced via **
 - **Batch Size:** 2 (local) / 8 (global with Gradient Accumulation)
 
 ### Dataset
-The model was trained on a curated corpus of ~270,000 sequences focusing on:
-- **Sri Lankan News & Media:** Current events and reporting styles.
-- **Cultural Context:** General web-scraped data reflecting local nuances.
+The model underwent true Continued Pre-Training on a massive 1.38 GB unstructured text corpus. The data was densely packed into:
+- **Size:** ~270,000 packed sequences of 2048 tokens each (**~550 Million total tokens**).
+- **Epochs:** 1 Epoch (Standard pre-training practice to prevent overfitting).
+- **Content:** Sri Lankan News & Media, Cultural Context, and domain-specific raw web data.
 
 ## Capabilities
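The corpus figure the commit adds (~270,000 packed sequences of 2048 tokens ≈ 550 Million tokens) follows from dense sequence packing, which the sketch below illustrates. This is a hypothetical example, not the repository's actual preprocessing pipeline; the `pack_sequences` helper and its inputs are assumed names for illustration.

```python
# Hypothetical sketch of dense sequence packing for Continued Pre-Training.
# Tokenized documents are concatenated and sliced into fixed-length training
# sequences, so no tokens are lost to padding.

SEQ_LEN = 2048  # sequence length stated in the README

def pack_sequences(token_streams, seq_len=SEQ_LEN):
    """Concatenate lists of token IDs and slice into fixed-length sequences."""
    buffer, packed = [], []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed  # any trailing partial sequence is dropped

# Sanity check on the commit's corpus arithmetic:
total_tokens = 270_000 * SEQ_LEN
print(f"{total_tokens:,} tokens")  # 552,960,000 ≈ 550 Million total tokens
```

This also shows why the token count is approximate: 270,000 × 2048 = 552,960,000, which the commit rounds to ~550 Million.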