ryfye181 committed on
Commit
d56d8e4
1 Parent(s): 8b3725b

Create README.md

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
+ # Ancient Language Translation with Meta AI's No Language Left Behind (NLLB) Model
+ ## Abstract
+
+ Hittite is one of the oldest written languages, spoken by the ancient Hittites in what is now modern-day Turkey, with records dating as far back as the 17th century B.C.E. The language died out around the 13th century B.C.E., and relatively few records of it have been uncovered, making Hittite a low-data language. The task is to train a language model to translate written Hittite into written English. This is difficult for two main reasons. First, as mentioned, labeled Hittite-to-English translations are scarce. Second, and possibly the bigger issue, few language models support fine-tuning for new languages.
+
+ ## Project Overview
+
+ This project aims to bridge the gap between the ancient and the modern world by translating the Hittite language into English. It applies Natural Language Processing (NLP) and machine learning techniques, fine-tuning a transformer-based model (Meta AI's NLLB) that is open to the community for improvements and contributions.
+
+ ### Key Features
+
+ - **Transformer-Based Model Translation:** Fine-tunes a transformer-based model (NLLB) to translate Hittite into English.
+
+ - **Custom Supervised Dataset:** Using scraped data and a purpose-built dataset builder tool, this project has curated a specialized dataset of paired English and Hittite translations for training the translation model.
+
+ - **Google Colab Integration:** The project is available as a Google Colab notebook that guides users through tokenization, model fine-tuning, and evaluation, providing an interactive platform for exploring ancient Hittite translations.
+
+ Hittite to English Colab: [https://colab.research.google.com/drive/1fmJe9EuumIo-uwfW4Pp3hgyz3SviomaQ?usp=sharing](https://colab.research.google.com/drive/1fmJe9EuumIo-uwfW4Pp3hgyz3SviomaQ?usp=sharing)
+
+ - **Performance Metrics:** Training loss and chrF2++ scores are collected and analyzed to assess the translation model's accuracy and reliability.
+
+ More details can be found in the report document HitToEng_Report.pdf.
+ The implementation is in nllb_hittite_to_english_finetune.ipynb.
+
+ ## Usage
+
+ **Must run on a GPU! CPU usage is not supported!**
+
+ **Load the model and tokenizer from Hugging Face (Python):**
+ - `from transformers import AutoModelForSeq2SeqLM, NllbTokenizer`
+ - `model_load_name = "ryfye181/hittite_saved_model"`
+ - `model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()`
+ - `tokenizer = NllbTokenizer.from_pretrained(model_load_name)`
+
+ Using the model for translation is demonstrated in section 8 of the Google Colab notebook.
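
For readers who want to try the loading steps above outside the notebook, here is a minimal sketch of how loading and translation might be wired together with the Hugging Face `transformers` API. It is not the notebook's code. The helper names `load_model` and `translate`, the generation settings, and in particular the source language tag are assumptions: `"eng_Latn"` is the standard NLLB code for English, but the tag registered for Hittite during fine-tuning is not stated in this README, so `src_lang` must be supplied by the caller.

```python
def load_model(name="ryfye181/hittite_saved_model"):
    """Load the fine-tuned model and tokenizer onto the GPU.

    Imports are deferred so this file can be imported without
    torch/transformers installed. Helper name is hypothetical.
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

    # The README states that CPU usage is not supported.
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA-capable GPU is required.")
    model = AutoModelForSeq2SeqLM.from_pretrained(name).cuda()
    tokenizer = NllbTokenizer.from_pretrained(name)
    return model, tokenizer


def translate(text, model, tokenizer, src_lang, tgt_lang,
              max_new_tokens=128, num_beams=4):
    """Translate one string with an NLLB-style seq2seq model.

    src_lang / tgt_lang are NLLB language tags. The Hittite tag is an
    assumption the caller must fill in; generation settings are
    illustrative defaults, not the notebook's values.
    """
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=128)
    # Move every input tensor to the same device as the model.
    inputs = {key: value.to(model.device) for key, value in inputs.items()}
    output_ids = model.generate(
        **inputs,
        # NLLB selects the output language via the first decoder token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A call would then look like `model, tokenizer = load_model()` followed by `translate(hittite_text, model, tokenizer, src_lang=HITTITE_TAG, tgt_lang="eng_Latn")`, where `HITTITE_TAG` stands for whatever language tag the notebook added to the tokenizer. Section 8 of the Colab remains the authoritative reference.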
+
+ ## Metrics
+
+ ### Loss over Time During Training
+ ![Loss over time during training](https://github.com/rfeinberg3/Hittite_English_Translation_w-NLLB/assets/95943957/b2101ba5-36f3-4d9a-a3bf-bad2a0d06471)
+
+ ### chrF2++ Score
+ ![chrF2++ score](https://github.com/rfeinberg3/Hittite_English_Translation_w-NLLB/assets/95943957/1b3e6bdf-932d-4a8c-ab49-223bde3be381)
+
+ chrF and chrF++ are documented in the sacreBLEU README: https://github.com/mjpost/sacrebleu#chrf--chrf
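
To give a feel for what the chrF family of metrics measures, the sketch below computes a simplified character n-gram F-score in plain Python. It is an illustration only, not sacreBLEU's implementation and not necessarily how the scores plotted above were produced. It covers character n-grams up to order 6 with beta = 2 (the "2" in chrF2), so recall counts more than precision, but it omits the word n-grams that the "++" variant adds. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score on a 0-100 scale.

    Averages the per-order F-beta scores over every n-gram order for
    which both strings are long enough to contain at least one n-gram.
    """
    f_scores = []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # one of the strings is too short for this order
        # Clipped overlap: each n-gram matches at most as often as it
        # appears in both strings.
        matches = sum((hyp & ref).values())
        precision = matches / hyp_total
        recall = matches / ref_total
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta ** 2) * precision * recall
                        / (beta ** 2 * precision + recall))
    return 100.0 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Because the score works on characters rather than whole words, a translation that gets a word stem right but an ending wrong still earns partial credit, which is one reason character-level metrics are often preferred for low-data translation tasks.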
+
+ ## References
+
+ Hittite Base Form Dictionary:
+ * https://lrc.la.utexas.edu/eieol_base_form_dictionary/hitol/11
+
+ Hittite Lexicons:
+ * https://www.assyrianlanguages.org/hittite/en_lexique_hittite.htm#l
+ * https://hittitetexts.com/en (where we get eCMD from)
+
+ No Language Left Behind:
+ * https://github.com/facebookresearch/fairseq/tree/nllb (GitHub repository)
+ * https://huggingface.co/facebook/nllb-200-1.3B (their model on Hugging Face)
+ * https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/README.md (info on fine-tuning their model)
+
+ NLLB New Language Fine-Tuning Original Example:
+ * https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865
+ * https://colab.research.google.com/drive/1bayEaw2fz_9Mhg9jFFZhrmDlQlBj1YZf?usp=sharing (their original Colab)