alsubari committed on
Commit f323792
1 Parent(s): a8ea08d

Update README.md

Files changed (1):
  1. README.md +83 -151

README.md CHANGED
@@ -5,9 +5,6 @@ pipeline_tag: text-generation
  ---
  # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->
-
-
  ## Model Details

  ### Model Description
@@ -21,174 +18,109 @@ pipeline_tag: text-generation
  ## Uses


- 1. The model can be helpful for Arabic-language students and researchers, since it provides full sentence analysis (اعراب الجملة, Arabic grammatical parsing) in Arabic.
- 2.
-
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- 1. This model can't be used for grammar checking, since it expects grammatically correct, formal Arabic sentences as input.
- 2. Don't use Arabic dialects in the input sentence.
- 3.
- 4.
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.

  ## How to Get Started with the Model

  ```python
  from transformers import GPT2Tokenizer
- from arabert.preprocess import ArabertPreprocessor
  from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
- from pyarabic.araby import strip_tashkeel
- import pyarabic.trans
  model_name='alsubari/aragpt2-mega-pos-msa'

  tokenizer = GPT2Tokenizer.from_pretrained('alsubari/aragpt2-mega-pos-msa')
  model = GPT2LMHeadModel.from_pretrained('alsubari/aragpt2-mega-pos-msa').to("cuda")

- arabert_prep = ArabertPreprocessor(model_name='aubmindlab/aragpt2-mega')
- # instruction prompts: "parse the sentence:" / "classify the words of the sentence:"
- prml=['اعراب الجملة :', ' صنف الكلمات من الجملة :']
  text='تعلَّمْ من أخطائِكَ'
- text=arabert_prep.preprocess(strip_tashkeel(text))
- generation_args = {
-     'pad_token_id': tokenizer.eos_token_id,
-     'max_length': 256,
-     'num_beams': 20,
-     'no_repeat_ngram_size': 3,
-     'top_k': 20,
-     'top_p': 0.1,  # nucleus sampling: keep only the smallest token set whose cumulative probability reaches 0.1
-     'do_sample': True,
-     'repetition_penalty': 2.0
- }
-
- ## POS tagging
- input_text = f'<|startoftext|>Instruction: {prml[1]} {text}<|pad|>Answer:'
- input_ids = tokenizer.encode(input_text, return_tensors='pt').to("cuda")
- output_ids = model.generate(input_ids=input_ids, **generation_args)
- output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True).split('Answer:')[1]
- answer_pose = pyarabic.trans.delimite_language(output_text, start="<token>", end="</token>")
-
- print(answer_pose)
- # <token>تعلم : تعلم</token> : Verb <token>من : من</token> : Relative pronoun <token>أخطائك : اخطا</token> : Noun <token>ك</token> : Personal pronoun
-
- ## Arabic sentence analysis
- input_text = f'<|startoftext|>Instruction: {prml[0]} {text}<|pad|>Answer:'
- input_ids = tokenizer.encode(input_text, return_tensors='pt').to("cuda")
- output_ids = model.generate(input_ids=input_ids, **generation_args)
- output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True).split('Answer:')[1]
-
- print(output_text)
- # تعلم : تعلم : فعل ، مفرد المخاطب للمذكر ، فعل مضارع ، مرفوع من : من : حرف جر أخطائك : اخطا : اسم ، جمع المذكر ، مجرور ك : ضمير ، مفرد المتكلم
- # i.e. "تعلم: verb, 2nd-person masc. singular, imperfect, indicative; من: preposition; أخطائك: noun, masc. plural, genitive; ك: pronoun, 1st-person singular"
  ```
 
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Data Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]

  ### Results

- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]

  ## Model Card Contact
  ---
  # Model Card for Model ID

  ## Model Details

  ### Model Description

  ## Uses

+ 1. POS tagging for the Arabic language; it may also work for other languages.
+ 2. The model can be helpful for Arabic-language students and researchers, since it provides sentence analysis (اعراب الجملة, Arabic grammatical parsing) in context.
+ 3. Arabic word tokenization.
+ 4. It may be usable for translating Arabic dialects to MSA (Modern Standard Arabic). A prompt-construction sketch follows this list.
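+
+ All of the tasks above are driven by the instruction prompt. A minimal sketch of prompt construction, reusing the `<|startoftext|>Instruction: ... <|pad|>Answer:` template and the two instruction strings from the earlier revision of this card (the helper name `build_prompt` is illustrative, not part of the model's API):
+
+ ```python
+ # Instruction strings from the earlier revision's `prml` list:
+ # 'اعراب الجملة :' = "parse the sentence:" (full grammatical analysis)
+ # 'صنف الكلمات من الجملة :' = "classify the words of the sentence:" (POS tagging)
+ PROMPTS = {
+     'analysis': 'اعراب الجملة :',
+     'pos': 'صنف الكلمات من الجملة :',
+ }
+
+ def build_prompt(task: str, sentence: str) -> str:
+     """Wrap a sentence in the instruction template this model expects."""
+     return f'<|startoftext|>Instruction: {PROMPTS[task]} {sentence}<|pad|>Answer:'
+
+ print(build_prompt('pos', 'تعلم من أخطائك'))
+ ```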
+
+ ## Main Labels
+
+ The model's tag set, with English glosses (a lookup sketch follows the table):
+
+ | Arabic label | English |
+ |---|---|
+ | حرف جر | preposition |
+ | اسم | noun |
+ | اسم علم | proper noun |
+ | لام التعريف | determiner |
+ | صفة | adjective |
+ | ضمير | personal pronoun |
+ | فعل | verb |
+ | حرف عطف | conjunction |
+ | اسم موصول | relative pronoun |
+ | حرف نفي | negative particle |
+ | حروف مقطعة | Quranic initials |
+ | اسم اشارة | demonstrative pronoun |
+ | حرف استئنافية | resumption |
+ | حرف نصب | accusative particle |
+ | حرف تسوية | equalization particle |
+ | حرف حال | circumstantial particle |
+ | أداة حصر | restriction particle |
+ | ظرف زمان | time adverb |
+ | حرف نهي | prohibition particle |
+ | حرف كاف | preventive particle |
+ | حرف ابتداء | inceptive particle |
+ | حرف زائد | supplemental particle |
+ | حرف استدراك | amendment particle |
+ | حرف مصدري | subordinating conjunction |
+ | حرف استفهام | interrogative particle |
+ | ظرف مكان | location adverb |
+ | حرف شرط | conditional particle |
+ | لام التوكيد | emphatic |
+ | حرف نداء | vocative particle |
+ | حرف واقع في جواب الشرط | result particle |
+ | حرف تفصيل | explanation particle |
+ | أداة استثناء | exceptive particle |
+ | حرف سببية | particle of cause |
+ | التوكيد - النون الثقيلة | heavy noon emphasis |
+ | حرف استقبال | future particle |
+ | حرف تحقيق | particle of certainty |
+ | لام التعليل | purpose |
+ | حرف جواب | answer particle |
+ | حرف اضراب | retraction particle |
+ | حرف تحضيض | exhortation particle |
+ | حرف تفسير | particle of interpretation |
+ | لام الامر | imperative |
+ | واو المعية | comitative particle |
+ | حرف فجاءة | surprise particle |
+ | حرف ردع | aversion particle |
+ | اسم فعل أمر | imperative verbal noun |
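+
+ Since the model emits tags in Arabic, downstream code may want the English glosses. A minimal lookup sketch, assuming the mapping above is kept as a Python dict (only a few entries shown here; extend it with the full table):
+
+ ```python
+ # Arabic tag -> English gloss (subset of the table above).
+ LABELS = {
+     'حرف جر': 'preposition',
+     'اسم': 'noun',
+     'فعل': 'verb',
+     'ضمير': 'personal pronoun',
+     'صفة': 'adjective',
+ }
+
+ def gloss(tag: str) -> str:
+     """Return the English gloss for an Arabic tag, or the tag itself if unknown."""
+     return LABELS.get(tag.strip(), tag)
+
+ print(gloss('فعل'))  # -> 'verb'
+ ```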

  ## How to Get Started with the Model

  ```python
  from transformers import GPT2Tokenizer
+ import re  # needed for the Answer-extraction below
+ from pyarabic.araby import strip_diacritics, strip_tatweel
  from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
+ from transformers import pipeline
+
  model_name='alsubari/aragpt2-mega-pos-msa'

  tokenizer = GPT2Tokenizer.from_pretrained('alsubari/aragpt2-mega-pos-msa')
  model = GPT2LMHeadModel.from_pretrained('alsubari/aragpt2-mega-pos-msa').to("cuda")

+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
+
+ def generate(text):
+     prompt = f'<|startoftext|>Instruction: {text}<|pad|>Answer:'
+     pred_text = generator(prompt,
+                           pad_token_id=tokenizer.eos_token_id,
+                           num_beams=20,
+                           max_length=256,
+                           # min_length=200,
+                           do_sample=False,  # the kwarg is do_sample, not do_sampling
+                           top_p=0.5,
+                           top_k=1,
+                           repetition_penalty=3.0,
+                           # temperature=0.8,
+                           no_repeat_ngram_size=3)[0]['generated_text']
+     try:
+         # keep everything after the last 'Answer:' marker
+         pred_answer = re.findall("Answer:(.*)", pred_text, re.S)[-1]
+     except IndexError:
+         pred_answer = "None"
+     return pred_answer
+
  text='تعلَّمْ من أخطائِكَ'
+ # strip diacritics and tatweel before prompting the model
+ print(generate(strip_tatweel(strip_diacritics(text))))
+ #' تعلم ( تعلم : فعل ) من ( من : حرف جر ) أخطائك ( اخطاء : اسم ، ك : ضمير )'
+ # i.e. تعلم (verb) من (preposition) أخطائك (noun + attached pronoun ك)
  ```
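+
+ The generated answer is a flat string of `word ( segment : tag ، segment : tag )` spans, as in the example output above. A minimal parsing sketch, assuming the output keeps exactly that parenthesized format (observed from one example, not a documented contract):
+
+ ```python
+ import re
+
+ def parse_analysis(answer: str):
+     """Split 'word ( seg : tag ، seg : tag )' spans into (word, [(seg, tag), ...]) pairs."""
+     parsed = []
+     for word, body in re.findall(r'(\S+)\s*\(([^)]*)\)', answer):
+         segments = []
+         for seg in body.split('،'):  # segments are separated by the Arabic comma
+             if ':' in seg:
+                 form, tag = seg.split(':', 1)
+                 segments.append((form.strip(), tag.strip()))
+         parsed.append((word, segments))
+     return parsed
+
+ answer = ' تعلم ( تعلم : فعل ) من ( من : حرف جر ) أخطائك ( اخطاء : اسم ، ك : ضمير )'
+ print(parse_analysis(answer))
+ # [('تعلم', [('تعلم', 'فعل')]), ('من', [('من', 'حرف جر')]), ('أخطائك', [('اخطاء', 'اسم'), ('ك', 'ضمير')])]
+ ```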

  ### Results

+ | Epoch | Training Loss | Validation Loss |
+ |-------|---------------|-----------------|
+ | 1     | 0.108500      | 0.082612        |
  ## Model Card Contact