isrouush committed on
Commit
5f97d7f
1 Parent(s): c150f48

Update README.md

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -50,6 +50,18 @@ print(response)
 
  ### **Dataset**
 
+ To ensure diverse data points and to serve our goal of instruction tuning, we collected, labeled, filtered, and reviewed a set of datasets, each tailored to a specific instruction type.
+ All of the datasets are in Arabic. They comprise:
+
+ - [Second version of the Alpaca dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM), generated using GPT-4.
+ - Self-instruct records, split between samples we generated with the [self-instruct](https://github.com/yizhongw/self-instruct) framework and translated samples.
+ - The instruction dataset released by [Databricks](https://github.com/databrickslabs/dolly), which comprises high-quality, human-generated instructions and responses.
+ - The [TruthfulQA](https://huggingface.co/datasets/truthful_qa) dataset, to further guide the model toward truthful responses to factoid questions.
+ - The [Grade School Math](https://huggingface.co/datasets/gsm8k) dataset, to improve the model's performance on chain-of-thought mathematical problems.
+ - Arabic arithmetic problems that we generated using ChatGPT, to further improve the model's ability to solve mathematical problems.
+
+ The full dataset adds up to over **110K** records.
+
  ### **Disclaimer**
 
  The generated responses from this AI model are purely algorithmic and should be interpreted with caution. The model's outputs may occasionally exhibit bias, offensive language, or potentially harmful content. It is important to note that these responses do not reflect the personal preferences or viewpoints of the authors or the organization of Naseej.
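
The Dataset section added in this commit describes combining several instruction sources into one filtered collection. The README does not say how that was done, so the following is only a minimal sketch of one way such a mixture could be assembled. The field names (`instruction`, `response`, `source`), the helper names, and the toy records are all assumptions made for illustration, not the authors' pipeline.

```python
from collections import Counter

# Hypothetical unified schema: each record has an instruction, a response,
# and the name of the source it came from. These field names are assumed.
def normalize(record, source):
    return {
        "instruction": record["instruction"].strip(),
        "response": record["response"].strip(),
        "source": source,
    }

def build_mixture(sources):
    """Merge per-source record lists, dropping empty and duplicate entries."""
    seen = set()
    merged = []
    for name, records in sources.items():
        for raw in records:
            rec = normalize(raw, name)
            key = (rec["instruction"], rec["response"])
            # Skip records with an empty field or an exact duplicate pair.
            if not rec["instruction"] or not rec["response"] or key in seen:
                continue
            seen.add(key)
            merged.append(rec)
    return merged

# Toy stand-ins for the real Arabic datasets (placeholder strings only).
sources = {
    "alpaca_gpt4": [{"instruction": "q1", "response": "a1"}],
    "dolly": [
        {"instruction": "q2", "response": "a2"},
        {"instruction": "q1", "response": "a1"},  # duplicate of the first record
    ],
    "gsm8k": [{"instruction": "q3 ", "response": ""}],  # empty response
}

mixture = build_mixture(sources)
counts = Counter(rec["source"] for rec in mixture)
print(len(mixture), dict(counts))
```

Tagging each record with its source keeps the per-dataset share easy to inspect after merging, which is useful when checking that no single instruction type dominates the mixture.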