Esmail Atta Gumaan committed on
Commit
bdab5f6
Parent: 7c8fba4

Update README.md

Files changed (1): README.md (+45 -0)
README.md CHANGED
@@ -16,4 +16,49 @@ In the rapidly evolving landscape of natural language processing (NLP) and machi
Goal:
Develop a specialized language-to-language transformer model that accurately translates from the Arabic language to the English language, ensuring semantic fidelity, contextual awareness, cross-lingual adaptability, and the retention of grammar and style. The model should provide efficient training and inference processes to make it practical and accessible for a wide range of applications, ultimately contributing to the advancement of Arabic-to-English language translation capabilities.

---

Dataset used:
From Hugging Face: huggingface/opus_infopankki
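
A minimal sketch of loading these sentence pairs with the `datasets` library, assuming the `ar-en` configuration and the standard OPUS `translation` field (the exact Hub identifier may also be `Helsinki-NLP/opus_infopankki`):

```python
from datasets import load_dataset

# Arabic-English pairs from the OPUS infopankki corpus; the corpus
# ships a single "train" split, so any validation split has to be
# carved out manually.
raw_dataset = load_dataset("opus_infopankki", "ar-en", split="train")

sample = raw_dataset[0]["translation"]
print(sample["ar"])  # Arabic source sentence
print(sample["en"])  # English target sentence
```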

---

Configuration:
These are the settings of the model. You can customize the source and target languages, the sequence length, the number of epochs, the batch size, and more.

```python
def Get_configuration():
    return {
        "batch_size": 8,
        "num_epochs": 30,
        "lr": 10**-4,
        "sequence_length": 100,
        "d_model": 512,
        "datasource": "opus_infopankki",
        "source_language": "ar",
        "target_language": "en",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": "latest",
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel"
    }
```
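
As a usage illustration (not code from this repository), the returned dictionary can be used to derive the per-language tokenizer files and a checkpoint path; the `.pt` naming below is a hypothetical convention:

```python
config = Get_configuration()

# Per-language tokenizer files, e.g. "tokenizer_ar.json" and "tokenizer_en.json".
source_tokenizer_file = config["tokenizer_file"].format(config["source_language"])
target_tokenizer_file = config["tokenizer_file"].format(config["target_language"])

# Hypothetical checkpoint path built from the folder and basename settings.
checkpoint_path = f"{config['model_folder']}/{config['model_basename']}09.pt"
```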

---

Training:
I uploaded the project to my Google Drive and then connected it to Google Colab to train the model there; a typical Drive-mounting step is sketched below.
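
The mount call is the standard Colab API; the project path inside Drive is a hypothetical example, not a path from this repository:

```python
import os
from google.colab import drive

# Make the Drive contents visible inside the Colab runtime.
drive.mount("/content/drive")

# Hypothetical location of the uploaded project folder.
os.chdir("/content/drive/MyDrive/Arabic-to-English-transformer")
```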

Training run summary:

- hours of training: 4 hours
- epochs: 20
- number of dataset rows: 2,934,399
- size of the dataset: 95 MB
- size of the auto-converted Parquet files: 153 MB
- Arabic tokens: 29,999
- English tokens: 15,697
- pre-trained model in Colab
- BLEU score from Arabic to English: 19.7 (see the evaluation sketch below)
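
The evaluation code itself is not shown in this README; as a rough sketch, a corpus-level BLEU score such as the 19.7 above could be computed from the model's decoded outputs with `sacrebleu` (the repository may use a different metric implementation):

```python
import sacrebleu

# Decoded model translations and their English references,
# one sentence per entry (placeholder strings for illustration).
hypotheses = ["the health services are available to everyone"]
references = ["health services are available to all residents"]

# sacrebleu expects a list of reference streams, hence the nested list.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")
```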
  ---