Update README.md
README.md
CHANGED
@@ -66,7 +66,7 @@ pipe("আমাদের দেশের নাম")
 
 ## Training Data
 
-**Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text,
+**Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-sources raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size roughly around 268 GB. We separated __22GB__ data from that using a ratio of the actual data size. Total trained tokens are __6B__ tokens.
 
 Data sources summary:
 - Web documents: Extracted, clean, and filtered common crawl data
@@ -96,7 +96,7 @@ We evaluated the models on the following datasets:
 
 #### English Benchmark datasets
 - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
-- [CommonseQa](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question-answering dataset that requires different types of commonsense knowledge to predict the correct answers
+- [CommonseQa](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question-answering dataset that requires different types of commonsense knowledge to predict the correct answers.
 - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
 - [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
 - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question-answer dataset for yes/no questions containing 15942 examples. These questions are naturally occurring. They are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.