nicholasKluge committed on
Commit
c883381
1 Parent(s): 693c871

Create README.md

Files changed (1)
  1. README.md +177 -0
README.md ADDED
@@ -0,0 +1,177 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - assin2
+ language:
+ - pt
+ metrics:
+ - accuracy
+ library_name: transformers
+ pipeline_tag: text-classification
+ tags:
+ - textual-entailment
+ widget:
+ - text: "<s>Batatas estão sendo fatiadas por um homem<s>O homem está fatiando a batata.</s>"
+   example_title: Exemplo
+ - text: "<s>Uma mulher está misturando ovos.<s>A mulher está bebendo.</s>"
+   example_title: Exemplo
+ ---
+ # TeenyTinyLlama-460m-Assin2
+
+ TeenyTinyLlama is a series of small foundational models trained in Brazilian Portuguese.
+
+ This repository contains a version of [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) (`TeenyTinyLlama-460m-Assin2`) fine-tuned on the [Assin2](https://huggingface.co/datasets/assin2) dataset for textual entailment.
+
+ ## Details
+
+ - **Number of Epochs:** 3
+ - **Batch size:** 16
+ - **Optimizer:** `torch.optim.AdamW` (learning_rate = 4e-5, epsilon = 1e-8)
+ - **GPU:** 1 NVIDIA A100-SXM4-40GB
+
+ ## Usage
+
+ Using `transformers.pipeline`:
+
+ ```python
+ from transformers import pipeline
+
+ text = "<s>Qual a capital do Brasil?<s>A capital do Brasil é Brasília!</s>"
+
+ classifier = pipeline("text-classification", model="nicholasKluge/TeenyTinyLlama-460m-Assin2")
+ classifier(text)
+
+ # >>> [{'label': 'ENTAILED', 'score': 0.9392824769020081}]
+ ```
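The input in the example above joins the premise and the hypothesis with special tokens. A minimal sketch of a helper that builds such a string is shown below. It assumes the BOS and EOS tokens are `<s>` and `</s>`, as they appear in the example; `format_pair` is a hypothetical name for illustration and is not part of the model's or the library's API.

```python
# Hypothetical helper. The default tokens are assumed from the usage example;
# in practice, read them from tokenizer.bos_token and tokenizer.eos_token.
def format_pair(premise: str, hypothesis: str, bos: str = "<s>", eos: str = "</s>") -> str:
    """Build the `<s>premise<s>hypothesis</s>` string used in this README."""
    return f"{bos}{premise}{bos}{hypothesis}{eos}"


text = format_pair("Qual a capital do Brasil?", "A capital do Brasil é Brasília!")
print(text)
```

The resulting `text` can be passed to the `classifier` defined above in place of the hand-written string.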
+
+ ## Reproducing
+
+ To reproduce the fine-tuning process, use the following code snippet (the `! pip install` line assumes a notebook environment such as Jupyter or Colab):
+
+ ```python
+ # Assin2
+ # Notebook cell syntax; in a regular shell, run the same command without the leading `!`
+ ! pip install transformers datasets evaluate accelerate -q
+
+ import evaluate
+ import numpy as np
+ from datasets import load_dataset, Dataset, DatasetDict
+ from transformers import AutoTokenizer, DataCollatorWithPadding
+ from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
+
+ # Load the task
+ dataset = load_dataset("assin2")
+
+ # Create a `ModelForSequenceClassification`
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "nicholasKluge/TeenyTinyLlama-460m",
+     num_labels=2,
+     id2label={0: "UNENTAILED", 1: "ENTAILED"},
+     label2id={"UNENTAILED": 0, "ENTAILED": 1}
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m")
+
+ # Format the dataset: join premise and hypothesis with the BOS/EOS tokens
+ train = dataset['train'].to_pandas()
+ train['text'] = tokenizer.bos_token + train['premise'] + tokenizer.bos_token + train['hypothesis'] + tokenizer.eos_token
+ train = train[["text", "entailment_judgment"]]
+ train.columns = ['text', 'label']
+ # Cast the label column to int (assign to the column, not to a new attribute)
+ train['label'] = train['label'].astype(int)
+ train = Dataset.from_pandas(train)
+
+ test = dataset['test'].to_pandas()
+ test['text'] = tokenizer.bos_token + test['premise'] + tokenizer.bos_token + test['hypothesis'] + tokenizer.eos_token
+ test = test[["text", "entailment_judgment"]]
+ test.columns = ['text', 'label']
+ test['label'] = test['label'].astype(int)
+ test = Dataset.from_pandas(test)
+
+ dataset = DatasetDict({
+     "train": train,
+     "test": test
+ })
+
+ # Preprocess the dataset
+ def preprocess_function(examples):
+     return tokenizer(examples["text"], truncation=True)
+
+ dataset_tokenized = dataset.map(preprocess_function, batched=True)
+
+ # Create a simple data collator
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+ # Use accuracy as evaluation metric
+ accuracy = evaluate.load("accuracy")
+
+ # Function to compute accuracy
+ def compute_metrics(eval_pred):
+     predictions, labels = eval_pred
+     predictions = np.argmax(predictions, axis=1)
+     return accuracy.compute(predictions=predictions, references=labels)
+
+ # Define training arguments
+ training_args = TrainingArguments(
+     output_dir="checkpoints",
+     learning_rate=4e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=3,
+     weight_decay=0.01,
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,
+     push_to_hub=True,
+     hub_token="your_token_here",
+     hub_model_id="username/model-ID",
+ )
+
+ # Define the Trainer
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=dataset_tokenized["train"],
+     eval_dataset=dataset_tokenized["test"],
+     tokenizer=tokenizer,
+     data_collator=data_collator,
+     compute_metrics=compute_metrics,
+ )
+
+ # Train!
+ trainer.train()
+
+ ```
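What `compute_metrics` computes can be illustrated without any dependencies. The sketch below uses invented logits and labels (they are not results from this model): it takes the arg-max of each row and reports the fraction of matches, which mirrors `np.argmax` followed by the `accuracy` metric in the snippet above. `accuracy_from_logits` is a hypothetical name used only for this illustration.

```python
# Toy illustration only: the logits and labels below are invented.
def accuracy_from_logits(logits, labels):
    """Return the fraction of rows whose arg-max class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 0]
print(accuracy_from_logits(logits, labels))  # 3 of 4 predictions match -> 0.75
```

Note that `evaluate`'s accuracy metric returns a dictionary (`{"accuracy": ...}`), whereas this sketch returns a plain float.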
+
+ ## Fine-Tuning Comparisons
+
+ | Models                                                                                       | [Assin2](https://huggingface.co/datasets/assin2) |
+ |----------------------------------------------------------------------------------------------|--------------------------------------------------|
+ | [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.97                                            |
+ | [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased)   | 87.45                                            |
+ | [Teeny Tiny Llama 460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m)            | 86.43                                            |
+ | [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese)          | 86.11                                            |
+ | [Teeny Tiny Llama 160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m)            | 85.78                                            |
+
+ ## Cite as 🤗
+
+ ```latex
+
+ @misc{nicholas22llama,
+   doi = {10.5281/zenodo.6989727},
+   url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m},
+   author = {Nicholas Kluge Corrêa},
+   title = {TeenyTinyLlama},
+   year = {2023},
+   publisher = {HuggingFace},
+   journal = {HuggingFace repository},
+ }
+
+ ```
+
+ ## Funding
+
+ This repository was built as part of the RAIES ([Rede de Inteligência Artificial Ética e Segura](https://www.raies.org/)) initiative, a project supported by FAPERGS ([Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul](https://fapergs.rs.gov.br/inicial)), Brazil.
+
+ ## License
+
+ TeenyTinyLlama-460m-Assin2 is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.