---
license: bigscience-bloom-rail-1.0
language:
- ar
- en
pipeline_tag: text-generation
tags:
- instructional
- question-answering
- arabic
widget:
- text: اكتب مقال عن الذكاء الصناعي وتطوراته.
  example_title: Instruction 1
- text: اعط بعض النصائح عن كيفية الحفاظ على حياة صحية.
  example_title: Instruction 2
- text: ماذا تعرف عن فوائد الصيام؟
  example_title: Question 1
- text: قطف إسماعيل 5 تفاحات، وأعطى 2 منها لأخيه، فكم بقي عند إسماعيل من تفاحة؟
  example_title: Question 2
---

<img src="https://i.ibb.co/3NzxfFQ/noon-banner.png" alt="noon-banner" border="0" width="85%" height="85%" style="margin:auto; display:block">

## **Noon - a 7-billion parameter Arabic Large Language Model**

We present the 7-billion-parameter variant of **Noon**, an Arabic large language model based on **BLOOM**, a foundation model released by the [bigscience](https://huggingface.co/bigscience) workshop.

Noon was trained to respond to a wide range of instructions and questions (text generation, code generation, mathematical problems, closed/open-book questions, etc.).

We trained the model using the ColossalAI framework, which fully supports HuggingFace library models and implements various optimization and quantization techniques for billion-scale LLMs.

The training data is a combination of Arabic datasets covering multiple tasks; more details are provided in the Dataset section.


Welcome to the "Noon" model card!

Noon has more than 7 billion parameters, making it the largest Arabic language model released so far. The model was trained on more than 110,000 Arabic data records, covering more than 11 million words and spanning text generation, code generation, mathematical problem solving, and closed/open questions. It was trained using advanced techniques such as distributed training across multiple GPUs, LoRA (Low Rank Adaptation), and ZeRO (Zero Redundancy Optimization).

We are proud to present this model, which represents a qualitative leap in Arabic language processing technology. The following sections give more details on how to use Noon and on the technical characteristics of the training process.

We hope this model will be of service to developers and researchers in this field, and to all Arabic speakers.

### **Usage**

Using the model requires only the Transformers library. It can be loaded as follows:

```python
from transformers import BloomTokenizerFast, BloomForCausalLM, pipeline

# Wrap the instruction in the Instruction/Response prompt format
text = "اكتب مقالا من عدة أسطر عن الذكاء الصناعي وتطوراته"
prompt = f'Instruction:\n{text}\n\nResponse:'

# Load the model weights and the matching tokenizer from the Hugging Face Hub
model = BloomForCausalLM.from_pretrained('Naseej/noon-7b')
tokenizer = BloomTokenizerFast.from_pretrained('Naseej/noon-7b')

generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# We recommend the provided hyperparameters for generation
# But encourage you to try different values
response = generation_pipeline(
    prompt,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False,   # deterministic beam search
    num_beams=4,
    max_length=500,    # counts prompt tokens plus generated tokens
    top_p=0.1,         # top_p and top_k only take effect when do_sample=True
    top_k=20,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3,
)[0]['generated_text']

# The returned text contains the prompt followed by the model's answer
print(response)
```
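Because the text-generation pipeline returns the prompt together with the completion by default, it can be convenient to build the prompt and strip it from the output in one place. The following is a minimal sketch under that assumption; `build_prompt` and `extract_response` are hypothetical helper names introduced here for illustration and are not part of the model's code or of the Transformers library.

```python
# Hypothetical helpers (not part of Noon or Transformers) around the
# Instruction/Response prompt format shown in the usage example.
PROMPT_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:"


def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the Instruction/Response template."""
    # Stripping surrounding whitespace is a choice made for this sketch.
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(prompt: str, generated_text: str) -> str:
    """Return only the model's answer, dropping the echoed prompt if present."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()
```

With these helpers, the prompt can be created with `build_prompt(text)`, and the answer alone can be recovered by passing that prompt and the pipeline's `generated_text` to `extract_response`.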

### **Training computational requirements**

Noon-7b was trained on 8 A100 GPUs using distributed multi-GPU training via the [ColossalAI](https://github.com/hpcaitech/ColossalAI) framework.

### **Dataset**

To ensure data diversity and serve our instruction-tuning purpose, we collected, labeled, filtered, and reviewed a set of datasets, each tailored to specific instruction types.
All of the datasets are in Arabic. They comprise:

- [Second version of the Alpaca dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM), generated using GPT-4.
- Self-instruct records, split between samples we generated using the [self-instruct](https://github.com/yizhongw/self-instruct) framework and translated samples.
- The instructional dataset released by [Databricks](https://github.com/databrickslabs/dolly), which comprises high quality human-generated instructions and responses.
- [TruthfulQA](https://huggingface.co/datasets/truthful_qa) dataset, to further guide the model on how to truthfully respond to factoid-based questions.
- [Grade School Math](https://huggingface.co/datasets/gsm8k) dataset, to enhance the model's performance using chain-of-thought mathematical problems.
- Arabic arithmetic problems, generated by us using ChatGPT for further improvement of the model's ability to solve mathematical problems.

The full dataset adds up to over **110K** records.

### **Evaluation**

Noon-7b was automatically evaluated on a set of over 4,000 Arabic data samples using **OpenAI's [GPT-3.5 Turbo](https://platform.openai.com/docs/models)** model.

Provided with clear and carefully crafted evaluation criteria (aligned with the model's training objective as well as the syntactic and grammatical rules of the Arabic language), GPT-3.5 Turbo was prompted to rate each of Noon's responses to an input instruction on a scale of **1 - 5**.

We concluded the evaluation by averaging the provided scores, which yielded a final score of **4.07/5**.

**NOTE:** We acknowledge that this evaluation framework is not an exact solution and that automatic evaluation remains an ongoing area of research. Still, we believe it can approximate human assessments to a reasonably satisfactory extent.
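The aggregation step of this evaluation is a plain arithmetic mean over per-response ratings. The sketch below illustrates only that step, assuming the ratings have already been collected as numbers between 1 and 5; `average_score` is a hypothetical helper name, the sample ratings are made up for illustration, and the actual evaluation prompts and scripts are not shown in this card.

```python
def average_score(scores):
    """Average a list of 1-5 ratings, rounded to two decimals.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if not scores:
        raise ValueError("no scores to average")
    for s in scores:
        # Reject ratings outside the 1-5 scale used in the evaluation
        if not 1 <= s <= 5:
            raise ValueError(f"score out of range: {s}")
    return round(sum(scores) / len(scores), 2)


# Made-up ratings, for illustration only
example_ratings = [5, 4, 4]
print(average_score(example_ratings))
```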


### **Disclaimer**

The responses generated by this AI model are purely algorithmic and should be interpreted with caution. The model's outputs may occasionally exhibit bias, offensive language, or potentially harmful content. These responses do not reflect the personal preferences or viewpoints of the authors or of Naseej.

While every effort is made to mitigate the harmfulness of the model's outputs, it is impossible to guarantee complete elimination of biases or offensive content. The model learns from vast amounts of data and may inadvertently replicate or amplify existing societal biases present in the training data.

Users are advised to critically evaluate and verify the information provided by the model, and to exercise discretion when using its responses, particularly on sensitive or controversial topics.

We are committed to ongoing research and development to improve the model's performance, minimize biases, and reduce harmful outputs. Your feedback and insights are valuable in helping us achieve these goals.