Update README.md
README.md
CHANGED
@@ -97,6 +97,7 @@ Note:
 - All token counts are counted using Llama 3.1 8B Instruct tokenizer
 - SEA-LION Pile v1 is processed from Common Crawl WET, which is published [here](https://huggingface.co/datasets/aisingapore/sea-lion-pile). The cutoff date of this version is September 2020.
 - SEA-LION Pile v2 is processed from Common Crawl WARC from October 2020 to April 2024.
+- Tamil data from Sangraha is published [here](https://huggingface.co/datasets/ai4bharat/sangraha). The paper can be found [here](https://arxiv.org/abs/2403.06350).
 - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
 
 ## Call for Contributions