Jian-Gang committed on
Commit
1958488
·
verified ·
1 Parent(s): 1b9a809

Update README.md

Files changed (1)
  1. README.md +1 -0
README.md CHANGED
@@ -97,6 +97,7 @@ Note:
  - All token counts are counted using Llama 3.1 8B Instruct tokenizer
  - SEA-LION Pile v1 is processed from Common Crawl WET, which is published [here](https://huggingface.co/datasets/aisingapore/sea-lion-pile). The cutoff date of this version is September 2020.
  - SEA-LION Pile v2 is processed from Common Crawl WARC from October 2020 to April 2024.
+ - Tamil data from Sangraha is published [here](https://huggingface.co/datasets/ai4bharat/sangraha). The paper can be found [here](https://arxiv.org/abs/2403.06350).
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
 
  ## Call for Contributions