---
license: apache-2.0
datasets:
- ruanchaves/hatebr
language:
- pt
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- hate-speech
widget:
- text: "Não concordo com a sua opinião."
  example_title: Exemplo 1
- text: "Pega a sua opinião e vai a merda com ela!"
  example_title: Exemplo 2
---
# TeenyTinyLlama-460m-HateBR

TeenyTinyLlama is a series of small foundational models trained in Brazilian Portuguese.

This repository contains a version of [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) (`TeenyTinyLlama-460m-HateBR`) fine-tuned on the [HateBR dataset](https://huggingface.co/datasets/ruanchaves/hatebr).

## Details

- **Number of Epochs:** 3
- **Batch size:** 16
- **Optimizer:** `torch.optim.AdamW` (learning_rate = 4e-5, epsilon = 1e-8)
- **GPU:** 1 NVIDIA A100-SXM4-40GB

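For reference, the optimizer settings above correspond to a `torch.optim.AdamW` call like the one below (a sketch only; `model` stands for the classifier being fine-tuned, and the full script in the Reproducing section passes the same values through `TrainingArguments` instead):

```python
import torch

# AdamW with the hyperparameters listed above (sketch; `model` is a placeholder)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, eps=1e-8)
```
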
## Usage

Using `transformers.pipeline`:

```python
from transformers import pipeline

text = "Pega a sua opinião e vai a merda com ela!"

classifier = pipeline("text-classification", model="nicholasKluge/TeenyTinyLlama-460m-HateBR")
classifier(text)

# >>> [{'label': 'TOXIC', 'score': 0.9998729228973389}]
```

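If you need the class probabilities rather than just the top label, here is a minimal sketch using the lower-level `AutoTokenizer`/`AutoModelForSequenceClassification` API (standard `transformers` calls; only the model name comes from this card):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nicholasKluge/TeenyTinyLlama-460m-HateBR"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Pega a sua opinião e vai a merda com ela!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Map logits to probabilities and label names via the config's id2label
probs = torch.softmax(logits, dim=-1).squeeze()
for label_id, prob in enumerate(probs.tolist()):
    print(f"{model.config.id2label[label_id]}: {prob:.4f}")
```

The `id2label` mapping stored in the model config is the same `{0: "NONTOXIC", 1: "TOXIC"}` mapping used during fine-tuning below.
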
## Reproducing

To reproduce the fine-tuning process, use the following code snippet:

```python
# HateBR
!pip install transformers datasets evaluate accelerate -q

import evaluate
import numpy as np
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the task
dataset = load_dataset("ruanchaves/hatebr")

# Format the dataset
train = dataset['train'].to_pandas()
train = train[['instagram_comments', 'offensive_language']]
train.columns = ['text', 'labels']
train.labels = train.labels.astype(int)
train = Dataset.from_pandas(train)

test = dataset['test'].to_pandas()
test = test[['instagram_comments', 'offensive_language']]
test.columns = ['text', 'labels']
test.labels = test.labels.astype(int)
test = Dataset.from_pandas(test)

dataset = DatasetDict({
    "train": train,
    "test": test
})

# Create a `ModelForSequenceClassification`
model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/TeenyTinyLlama-460m",
    num_labels=2,
    id2label={0: "NONTOXIC", 1: "TOXIC"},
    label2id={"NONTOXIC": 0, "TOXIC": 1}
)

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m")

# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

dataset_tokenized = dataset.map(preprocess_function, batched=True)

# Create a simple data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Use accuracy as the evaluation metric
accuracy = evaluate.load("accuracy")

# Function to compute accuracy
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_token="your_token_here",
    hub_model_id="username/model-ID",
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()
```

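After `trainer.train()` finishes, a call like the following (a sketch continuing from the script above) reports accuracy on the held-out test split via the `compute_metrics` function defined earlier:

```python
# Evaluate the best checkpoint (kept by load_best_model_at_end=True) on the test split
results = trainer.evaluate()
print(f"Test accuracy: {results['eval_accuracy']:.4f}")
```
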
## Fine-Tuning Comparisons

| Models                                                                                       | [HateBR](https://huggingface.co/datasets/ruanchaves/hatebr) (accuracy %) |
|----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| [Teeny Tiny Llama 460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m)            | 91.64                                                                    |
| [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 91.57                                                                    |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased)   | 91.28                                                                    |
| [Teeny Tiny Llama 160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m)            | 90.71                                                                    |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese)          | 87.42                                                                    |

## Cite as 🤗

```latex
@misc{nicholas22llama,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m},
  author = {Nicholas Kluge Corrêa},
  title = {TeenyTinyLlama},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```

## Funding

This repository was built as part of the RAIES ([Rede de Inteligência Artificial Ética e Segura](https://www.raies.org/)) initiative, a project supported by FAPERGS ([Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul](https://fapergs.rs.gov.br/inicial)), Brazil.

## License

TeenyTinyLlama-460m-HateBR is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.