## Dataset Used

The model was trained on the IMDB dataset, which is widely used for text classification tasks, especially sentiment analysis. It contains 50,000 labeled movie reviews, split evenly between positive and negative, with 25,000 examples for training and 25,000 for testing.

The dataset can be loaded with the Hugging Face `datasets` library:
```python
from datasets import load_dataset

dataset = load_dataset("imdb")
```
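A quick way to confirm what was loaded is to print the splits and one sample record (a minimal check, assuming the snippet above has run; note that `load_dataset("imdb")` also returns an extra `unsupervised` split that is not used here):

```python
# DatasetDict with 'train' (25,000), 'test' (25,000) and 'unsupervised' (50,000) splits
print(dataset)

sample = dataset["train"][0]
print(sample["label"])       # 0 = negative, 1 = positive
print(sample["text"][:200])  # first 200 characters of the review
```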
## How to Train the Model

Load the dataset:
```python
from datasets import load_dataset

dataset = load_dataset("imdb")
```
Preprocessing:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize every split, padding/truncating each review to the model's
# 512-token maximum length
tokenized_datasets = dataset.map(
    lambda x: tokenizer(x["text"], padding="max_length", truncation=True),
    batched=True,
)
```
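To sanity-check the preprocessing, it helps to look at one tokenized example; `input_ids` and `attention_mask` are the fields the model actually consumes (a minimal check, assuming the block above has run):

```python
example = tokenized_datasets["train"][0]
print(list(example.keys()))       # ['text', 'label', 'input_ids', 'attention_mask']
print(len(example["input_ids"]))  # 512, the padded sequence length
```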
Define the model and training arguments:

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    push_to_hub=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
```
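The `compute_metrics` function can be exercised on its own before launching a run; the logits and labels below are hypothetical values, purely for illustration:

```python
# Hypothetical logits/labels to check compute_metrics in isolation
dummy_logits = np.array([[0.1, 0.9], [2.0, -1.0]])  # argmax predicts classes 1 and 0
dummy_labels = np.array([1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))  # {'accuracy': 0.5}
```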
Training:

```python
# Subsample the data so this example run finishes quickly
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
```
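Once training finishes, the held-out sample can be evaluated, and because `push_to_hub=True` was set in the training arguments, the checkpoint can be uploaded to the Hub (a minimal sketch; the upload assumes you are logged in, e.g. via `huggingface-cli login`):

```python
# Evaluate on the 100-example held-out sample
metrics = trainer.evaluate()
print(metrics)

# Upload the fine-tuned model and tokenizer to your Hub account
trainer.push_to_hub()
```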
## How to Use the Model

### Using a Pipeline

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="pedro123483/results")

result = pipe("I loved this movie! It was fantastic and thrilling.")
print(result)
```

Because no `id2label` mapping was set during fine-tuning, the pipeline reports the generic class names `LABEL_0` (negative) and `LABEL_1` (positive, following the IMDB label convention).
### Loading the Model Directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("pedro123483/results")
model = AutoModelForSequenceClassification.from_pretrained("pedro123483/results")

inputs = tokenizer("I loved this movie! It was fantastic and thrilling.", return_tensors="pt")
outputs = model(**inputs)
predictions = np.argmax(outputs.logits.detach().numpy(), axis=-1)
print(predictions)
```
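To turn the raw class index into a readable result, the logits can be passed through a softmax (a small sketch, assuming the IMDB convention that class 0 is negative and class 1 is positive):

```python
import torch

# Softmax over the two logits gives class probabilities
probs = torch.softmax(outputs.logits, dim=-1).detach()[0]
label = "positive" if predictions[0] == 1 else "negative"
print(f"{label} (confidence: {probs[predictions[0]].item():.3f})")
```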
# results

This model is a fine-tuned version of distilbert-base-uncased on the IMDB dataset.
## Model description

DistilBERT base (uncased) fine-tuned for binary sentiment classification (positive/negative) of English movie reviews.
Intended uses & limitations
More information needed
## Training and evaluation data

Trained on a 1,000-example random sample (seed 42) of the IMDB training split and evaluated on a 100-example random sample of the IMDB test split, as described in the training steps above.
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| No log        | 1.0   | 32   | 0.6623          | 0.7      |
### Framework versions
- Transformers 4.41.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1