---
license: apache-2.0
datasets:
- christykoh/imdb_pt
language:
- pt
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- sentiment-analysis
widget:
- text: "Esqueceram de mim 2 é um dos melhores filmes de natal de todos os tempos."
  example_title: Exemplo positivo
- text: "Esqueceram de mim 2 é o pior filme da franquia inteira."
  example_title: Exemplo negativo
---
# TeenyTinyLlama-162m-IMDB

TeenyTinyLlama is a series of compact foundation models trained on Portuguese text.

This repository contains a version of [TeenyTinyLlama-162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m) fine-tuned on a translated version of the [IMDB dataset](https://huggingface.co/datasets/christykoh/imdb_pt).
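
## Usage

For inference, the fine-tuned checkpoint can be loaded through the `pipeline` API. A minimal sketch, assuming the repository id matches this card's title (`nicholasKluge/TeenyTinyLlama-162m-IMDB`):

```python
from transformers import pipeline

# Repository id inferred from this model card's title
classifier = pipeline(
    "text-classification",
    model="nicholasKluge/TeenyTinyLlama-162m-IMDB",
)

print(classifier("Esqueceram de mim 2 é um dos melhores filmes de natal de todos os tempos."))
# Expected shape: [{'label': 'POSITIVE', 'score': <float>}]
```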

## Reproducing
  
```python
# IMDB
! pip install transformers datasets evaluate accelerate -q

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the task
dataset = load_dataset("christykoh/imdb_pt")

# Create a `ModelForSequenceClassification`
model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/TeenyTinyLlama-162m", 
    num_labels=2, 
    id2label={0: "NEGATIVE", 1: "POSITIVE"}, 
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-162m")

# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

dataset_tokenized = dataset.map(preprocess_function, batched=True)

# Create a data collator that dynamically pads each batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Use accuracy as an evaluation metric
accuracy = evaluate.load("accuracy")

# Function to compute accuracy
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_token="your_token_here",
    hub_model_id="username/model-name-imdb"
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()
```
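
After training, the held-out test split can be scored directly; `eval_accuracy` should approximate the IMDB figure reported below, up to run-to-run variance:

```python
# Continuing from the script above: score the test split
results = trainer.evaluate()
print(f"Test accuracy: {results['eval_accuracy']:.4f}")
```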

## Results

| Models                                                                                     | [IMDB](https://huggingface.co/datasets/christykoh/imdb_pt) (accuracy %) |
|--------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| [TeenyTinyLlama-162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m)            | 91.14                                                                    |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 92.22                                                                    |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese)        | 91.60                                                                    |