eligapris committed 530ce92 (verified) · Parent(s): 652696f

Upload README.md with huggingface_hub

Files changed (1): README.md added (+136 −0)
# Kirundi Tokenizer and LoRA Model

## Model Description

This repository contains two main components:
1. A BPE tokenizer trained specifically for the Kirundi language (ISO code: run)
2. A LoRA adapter trained for Kirundi language processing

### Tokenizer Details
- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based

### LoRA Adapter Details
- **Base Model**: [To be filled with your chosen base model]
- **Rank**: 8
- **Alpha**: 32
- **Target Modules**: Query and Value attention matrices
- **Dropout**: 0.05
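For reference, the adapter settings above correspond to a `peft` configuration along the following lines. This is a minimal sketch, not the exact training script: the base model name is a placeholder, and the module names `"q_proj"`/`"v_proj"` are assumptions that depend on the chosen base architecture.

```python
# Minimal sketch of a LoRA configuration matching the settings listed above.
# The target module names ("q_proj", "v_proj") are an assumption and vary by
# base architecture; check model.named_modules() for the actual names.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base_model_name = "your-base-model"  # hypothetical placeholder

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,           # adjust to the task you fine-tune for
    r=8,                                  # rank
    lora_alpha=32,                        # scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # query and value projections
)

model = AutoModelForSequenceClassification.from_pretrained(base_model_name)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```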
## Intended Uses & Limitations

### Intended Uses
- Text processing for the Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- A foundation for developing Kirundi language applications

### Limitations
- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text

## Training Data

The model components were trained on the Kirundi-English parallel corpus:
- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed, including religious, general, and conversational text
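The corpus can be pulled directly from the Hub with the `datasets` library; the snippet below is a quick way to inspect it. The split and column layout shown are assumptions — print the dataset object to see the actual structure.

```python
# Sketch: inspect the training corpus from the Hub.
# The existence of a "train" split is an assumption; print(ds) shows the
# actual splits and column names.
from datasets import load_dataset

ds = load_dataset("eligapris/kirundi-english")
print(ds)              # available splits and columns
print(ds["train"][0])  # first sentence pair (assuming a "train" split)
```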
## Training Procedure

### Tokenizer Training
- Trained using Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30,000
- Includes special tokens for task-specific usage
- Trained on the Kirundi portion of the parallel corpus
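The full training recipe is not published here, but a tokenizer with the properties listed above can be reproduced roughly as follows. This is a hedged sketch: the corpus file name is a hypothetical placeholder, and the exact trainer options used for this repository may differ.

```python
# Rough sketch of training a BPE tokenizer with the settings described above.
# "kirundi_corpus.txt" is a hypothetical file containing the Kirundi side of
# the parallel corpus, one sentence per line.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train(files=["kirundi_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```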
### LoRA Training
[To be filled with your specific training details]
- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:

## Evaluation Results

[To be filled with your evaluation metrics]
- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:

## Environmental Impact

[To be filled with training compute details]
- Estimated CO2 emissions:
- Hardware used:
- Training duration:

## Technical Specifications

### Model Architecture
- Tokenizer: BPE-based with custom vocabulary
- LoRA Configuration:
  - r=8 (rank)
  - α=32 (scaling)
  - Applied to the query and value attention matrices
  - Dropout rate: 0.05

### Software Requirements
```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0",
}
```
## How to Use

### Loading the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
```
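Once loaded, the tokenizer behaves like any other fast tokenizer. A small illustrative check follows; the sample word is arbitrary Kirundi text, not drawn from this repository.

```python
# Illustrative check of the loaded tokenizer; "Amahoro" is an arbitrary
# Kirundi word used only as sample input.
encoding = tokenizer("Amahoro")
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```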
### Loading the LoRA Model
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

config = PeftConfig.from_pretrained("path_to_lora_model")
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
```
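A minimal inference sketch using the pieces above; it assumes the base model was adapted to the custom Kirundi tokenizer during training, and the meaning of the predicted class index depends on your task.

```python
# Minimal inference sketch using the tokenizer and LoRA model loaded above.
# Assumes the base model's embeddings match the custom tokenizer; the
# interpretation of the predicted class index is task-specific.
import torch

inputs = tokenizer("Amahoro", return_tensors="pt")
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()
print(predicted_class)
```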
## Citation

[To be filled with your preferred citation format]

## License

[Specify your chosen license]

## Contact

[Your contact information or preferred method of contact]

---

## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer and LoRA model
  - Trained on the Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for LoRA implementation