---
tags:
- model
- checkpoints
- translation
- latin
- english
- mt5
- mistral
- multilingual
- NLP
language:
- en
- la
license: "cc-by-4.0"
models:
- mistralai/Mistral-7B-Instruct-v0.3
- google/mt5-small
model_type: "mt5-small"
training_epochs: "6 (initial pipeline), 30 (final pipeline with optimizations), 100 (fine-tuning on 4750 summaries)"
task_categories:
- translation
- summarization
- multilingual-nlp
task_ids:
- en-la-translation
- la-en-translation
- text-generation
pretty_name: "mT5-LatinSummarizerModel"
storage:
- git-lfs
- huggingface-models
size_categories:
- 5GB<n<10GB
---
# **mT5-LatinSummarizerModel: Fine-Tuned Model for Latin NLP**

[![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/AxelDlv00/LatinSummarizer)
[![Hugging Face Model](https://img.shields.io/badge/Hugging%20Face-Model-blue?logo=huggingface)](https://huggingface.co/LatinNLP/LatinSummarizerModel)
[![Hugging Face Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-orange?logo=huggingface)](https://huggingface.co/datasets/LatinNLP/LatinSummarizerDataset)

## **Overview**
This repository contains the **trained checkpoints and tokenizer files** for the `mT5-LatinSummarizerModel`, which was fine-tuned to improve **Latin summarization and translation**. It is designed to:
- Translate between **English and Latin**.
- Summarize Latin texts effectively.
- Combine extractive and abstractive summarization techniques.
- Use **curriculum learning** for improved training.

## **Installation & Usage**
To download and set up the models (mT5-small and Mistral-7B-Instruct), run:
```bash
bash install_large_models.sh
```

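Once the checkpoints are downloaded, the model loads like any mT5 checkpoint via `transformers`. The snippet below is a minimal sketch: the task-prefix strings in `build_prompt` and the checkpoint path are assumptions (the exact prompt format used in training is not documented here), so adjust both to match the training code and your local layout.

```python
def build_prompt(text: str, direction: str = "en-la") -> str:
    """Prepend a task prefix. The prefix strings here are assumptions;
    adjust them to whatever format the training code actually used."""
    prefixes = {
        "en-la": "translate English to Latin: ",
        "la-en": "translate Latin to English: ",
        "summarize": "summarize: ",
    }
    return prefixes[direction] + text


def translate(text: str, checkpoint: str, direction: str = "en-la") -> str:
    """Run one beam-search generation with a fine-tuned mT5 checkpoint."""
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = MT5ForConditionalGeneration.from_pretrained(checkpoint)
    inputs = tokenizer(build_prompt(text, direction), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `translate("The senate convened at dawn.", "final_pipeline/with_stanza")`, where the path is whichever checkpoint directory `install_large_models.sh` fetched locally.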
## **Project Structure**
```
.
├── final_pipeline (trained for 30 light epochs with optimizations, then fine-tuned for 100 epochs on the small high-quality summaries dataset)
│   ├── no_stanza
│   ├── with_stanza
├── initial_pipeline (trained for 6 epochs without optimizations)
│   ├── mt5-small-en-la-translation-epoch5
├── install_large_models.sh
└── README.md
```

67
+ ## **Training Methodology**
68
+ We fine-tuned **mT5-small** in three phases:
69
+ 1. **Initial Training Pipeline (6 epochs)**: Used the full dataset without optimizations.
70
+ 2. **Final Training Pipeline (30 light epochs)**: Used **10% of training data per epoch** for efficiency.
71
+ 3. **Fine-Tuning (100 epochs)**: Focused on the **4750 high-quality summaries** for final optimization.
72
+
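The "light epoch" scheme in phase 2, where each epoch sees a fresh 10% sample of the training data, can be sketched as follows. The sampling function and seed are illustrative, not taken from the actual training code.

```python
import random
from typing import Iterator, List


def light_epoch_indices(dataset_size: int,
                        fraction: float = 0.10,
                        seed: int = 0) -> Iterator[List[int]]:
    """Yield a fresh random subset of example indices for every epoch,
    so 30 light epochs cover different ~10% slices of the corpus."""
    rng = random.Random(seed)
    k = max(1, int(dataset_size * fraction))
    while True:
        yield rng.sample(range(dataset_size), k)


# Example: two consecutive light epochs over a 100k-example corpus
# draw two different ~10k-example slices.
epochs = light_epoch_indices(100_000)
first_epoch = next(epochs)
second_epoch = next(epochs)
```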
### **Training Configurations**
- **Hardware:** 16 GB VRAM GPU (lab machines via SSH).
- **Batch size:** Adaptive, due to GPU memory constraints.
- **Gradient accumulation:** Enabled to reach larger effective batch sizes.
- **LoRA-based fine-tuning:** LoRA rank 8, scaling factor 32.
- **Dynamic sequence length adjustment:** Increased progressively.
- **Learning rate:** `5 × 10^-4` with warm-up steps.
- **Checkpointing:** Frequent saves to mitigate power outages.

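As a rough illustration of how gradient accumulation and warm-up interact, here is a sketch. Only the peak learning rate (`5e-4`), the warm-up, and the accumulation idea come from the configuration above; the per-device batch size, warm-up step count, and post-warm-up constant schedule are made-up assumptions.

```python
def effective_batch_size(per_device: int, accumulation_steps: int) -> int:
    """Gradient accumulation: each optimizer step sees
    per_device * accumulation_steps examples."""
    return per_device * accumulation_steps


def warmup_lr(step: int, peak_lr: float = 5e-4, warmup_steps: int = 500) -> float:
    """Linear warm-up to the peak rate, then constant.
    (The post-warm-up schedule is an assumption.)"""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

For example, per-device batches of 4 with 8 accumulation steps behave like batches of 32, which is how a 16 GB GPU can emulate a larger batch.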
## **Evaluation & Results**
We evaluated the model using **ROUGE, BERTScore, and BLEU/chrF scores**.

| Metric | Before Fine-Tuning | After Fine-Tuning |
|--------|--------------------|-------------------|
| ROUGE-1 | 0.1675 | 0.2541 |
| ROUGE-2 | 0.0427 | 0.0773 |
| ROUGE-L | 0.1459 | 0.2139 |
| BERTScore-F1 | 0.6573 | 0.7140 |

- **Translation (en→la):** chrF 33.60 with Stanza tags, vs. BLEU 18.03 without Stanza.
- **Summarization density:** Maintained at ~6%.

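The ~6% summarization density can be read as the length ratio between summary and source. A minimal sketch, assuming a whitespace-token definition (the actual evaluation may count words, characters, or subword tokens differently):

```python
def summarization_density(source: str, summary: str) -> float:
    """Summary-to-source length ratio over whitespace tokens.
    The token-level definition here is an assumption."""
    src_tokens = source.split()
    if not src_tokens:
        raise ValueError("empty source text")
    return len(summary.split()) / len(src_tokens)


# A 6-token summary of a 100-token source gives a density of 0.06 (6%).
```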
### **Observations**
- Pre-training on **extractive summaries** was crucial.
- The model retained some **excessive extraction** behavior, indicating room for further improvement.

## **License**
This model is released under **CC-BY-4.0**.

## **Citation**
```bibtex
@misc{LatinSummarizerModel,
  author = {Axel Delaval and Elsa Lubek},
  title  = {Latin-English Summarization Model (mT5)},
  year   = {2025},
  url    = {https://huggingface.co/LatinNLP/LatinSummarizerModel}
}
```