DarwinAnim8or committed on
Commit 50477d9 · verified · 1 Parent(s): 7ec021b
Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -56,7 +56,7 @@ A histogram of the token counts per sample in the dataset reveals some interesti
 
  1) Bimodal Distribution: The graph shows a bimodal distribution, with a primary peak around 11-12 tokens per sample and a secondary peak around 15-16 tokens per sample. This suggests there are two distinct groups of samples with different token count characteristics.
  2) Long Tail: The distribution has a long tail, with a number of samples containing 25-50 tokens. This indicates the presence of some outlier samples with significantly longer text lengths compared to the bulk of the data.
- 3) Vocab Size Limitations: Given the small vocabulary size of 5,200, the tokenizer may struggle to efficiently encode longer or more complex text samples. This could lead to higher token counts for some inputs, as the tokenizer needs to use more tokens to represent the same content.
+ 3) Vocab Size Limitations: Given the small vocabulary size of 5,100, the tokenizer may struggle to efficiently encode longer or more complex text samples. This could lead to higher token counts for some inputs, as the tokenizer needs to use more tokens to represent the same content.
  4) Potential Data Heterogeneity: The bimodal nature of the distribution suggests the dataset may be comprised of different types of text content, with some samples being more concise and others being more verbose. This could be an artifact of the data curation process or the nature of the source material.
 
  ## Comparison to Other Datasets
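
For readers who want to reproduce the kind of analysis the changed README passage describes (a per-sample token-count histogram, its peaks, and its long tail), the following is a minimal sketch only. The diff does not show the actual tokenizer or the dataset, so whitespace splitting stands in for the real tokenizer, and every function name here (`token_counts`, `histogram`, `local_peaks`, `tail_fraction`) is hypothetical.

```python
from collections import Counter


def token_counts(samples, tokenize=str.split):
    """Return the number of tokens in each sample.

    `tokenize` is a stand-in: whitespace splitting by default. Pass the
    real tokenizer's encode function to get counts comparable to the README.
    """
    return [len(tokenize(sample)) for sample in samples]


def histogram(counts):
    """Map each token count to how many samples have it, sorted by count."""
    return dict(sorted(Counter(counts).items()))


def local_peaks(hist):
    """Token counts whose frequency strictly exceeds both neighbours.

    A missing neighbour is treated as frequency 0. Two or more peaks
    hint at a bimodal (or multimodal) distribution.
    """
    return [
        n
        for n, freq in hist.items()
        if freq > hist.get(n - 1, 0) and freq > hist.get(n + 1, 0)
    ]


def tail_fraction(counts, threshold=25):
    """Fraction of samples with at least `threshold` tokens."""
    if not counts:
        return 0.0
    return sum(c >= threshold for c in counts) / len(counts)


if __name__ == "__main__":
    # Toy data, not the dataset from the README.
    toy_samples = ["a b c", "a b", "a b c", "a b c d e", "a b c d e"]
    counts = token_counts(toy_samples)
    hist = histogram(counts)
    print(hist, local_peaks(hist), tail_fraction(counts, threshold=5))
```

Passing the real tokenizer's encode function as `tokenize` would make the counts sensitive to vocabulary size, which is the effect item 3 of the README describes.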