---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: results
  results: []
---

# Dataset Used

The model was trained on the IMDB dataset, a standard benchmark for text classification and, in particular, sentiment analysis. The dataset contains 50,000 labeled movie reviews, split evenly between positive and negative reviews, with 25,000 examples for training and 25,000 for testing.

To load the dataset, use the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("imdb")
```
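
As a quick sanity check, you can inspect the splits and a single labeled example; note that `load_dataset("imdb")` also returns an unlabeled `unsupervised` split, which is not used here:

```python
# Show the available splits ('train', 'test', 'unsupervised') and one labeled review.
print(dataset)
print(dataset["train"][0])  # {'text': "...", 'label': 0 or 1}
```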

# How to Train the Model

1. Load the dataset:

   ```python
   from datasets import load_dataset

   dataset = load_dataset("imdb")
   ```

2. Preprocess the data by tokenizing the reviews:

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

   # Tokenize every review, padding/truncating to the model's maximum length.
   tokenized_datasets = dataset.map(
       lambda x: tokenizer(x["text"], padding="max_length", truncation=True),
       batched=True,
   )
   ```

3. Define the model, training arguments, and evaluation metric:

   ```python
   from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
   import numpy as np

   # DistilBERT with a 2-class classification head (negative / positive).
   model = AutoModelForSequenceClassification.from_pretrained(
       "distilbert-base-uncased", num_labels=2
   )

   training_args = TrainingArguments(
       output_dir="./results",
       learning_rate=2e-5,
       per_device_train_batch_size=32,
       per_device_eval_batch_size=32,
       num_train_epochs=1,
       weight_decay=0.01,
       evaluation_strategy="epoch",
       push_to_hub=True,
   )

   def compute_metrics(eval_pred):
       logits, labels = eval_pred
       predictions = np.argmax(logits, axis=-1)
       return {"accuracy": (predictions == labels).mean()}
   ```

4. Train on a small subset (1,000 training and 100 evaluation examples) to keep the run short:

   ```python
   # Shuffle and subsample the tokenized splits.
   small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
   small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

   trainer = Trainer(
       model=model,
       args=training_args,
       train_dataset=small_train_dataset,
       eval_dataset=small_eval_dataset,
       compute_metrics=compute_metrics,
   )

   trainer.train()
   ```
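
After training, the model can be evaluated on the held-out subset and, since `push_to_hub=True` is set in the training arguments above, the checkpoint can be uploaded to the Hugging Face Hub. This is a minimal sketch and assumes you are already authenticated (for example via `huggingface-cli login`):

```python
# Evaluate on small_eval_dataset and report the metrics from compute_metrics.
metrics = trainer.evaluate()
print(metrics)

# Upload the fine-tuned checkpoint to the Hub (requires prior authentication).
trainer.push_to_hub()
```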

# How to Use the Model

Using a pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="pedro123483/results")

result = pipe("I loved this movie! It was fantastic and thrilling.")
print(result)
```

Loading the model directly:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("pedro123483/results")
model = AutoModelForSequenceClassification.from_pretrained("pedro123483/results")

inputs = tokenizer("I loved this movie! It was fantastic and thrilling.", return_tensors="pt")
outputs = model(**inputs)

# Pick the highest-scoring class index (0 or 1).
predictions = np.argmax(outputs.logits.detach().numpy(), axis=-1)
print(predictions)
```
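
The printed value is the raw class index. Because the training code above does not set `id2label`, mapping it to a readable label is left to the caller; a minimal sketch, assuming the standard IMDB convention of 0 = negative and 1 = positive:

```python
# Assumed mapping: IMDB uses 0 = negative, 1 = positive.
label_map = {0: "negative", 1: "positive"}
print(label_map[int(predictions[0])])
```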


# results

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on a small subset of the IMDB dataset (see the training steps above).

## Model description

A fine-tuned `distilbert-base-uncased` checkpoint for binary sentiment classification (positive vs. negative) of English movie reviews, trained as described in the sections above.

## Intended uses & limitations

The model is intended for sentiment analysis of English movie reviews. It was fine-tuned on only 1,000 examples for a single epoch, so its accuracy is limited (about 0.70 on a 100-example evaluation subset) and it may not generalize well to text from other domains.

## Training and evaluation data

A 1,000-example training subset and a 100-example evaluation subset, drawn from the IMDB dataset with `shuffle(seed=42)` as shown in the training steps above.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| No log        | 1.0   | 32   | 0.6623          | 0.7      |
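
The 32 steps correspond to one epoch over the 1,000-example training subset with a per-device batch size of 32 (1000 / 32, rounded up, gives 32 optimizer steps), assuming training on a single device.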


### Framework versions

- Transformers 4.41.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1