---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: results
  results: []
---

# Dataset Used

The model was trained on the IMDB dataset, which is widely used for text classification tasks, especially sentiment analysis. The dataset contains 50,000 labeled movie reviews, split evenly between positive and negative reviews, with 25,000 examples for training and 25,000 for testing.

To load the dataset, use the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("imdb")
```

# How to Train the Model

1. Load the dataset:

```python
from datasets import load_dataset

dataset = load_dataset("imdb")
```

2. Preprocessing:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize every review with fixed-length padding and truncation
tokenized_datasets = dataset.map(
    lambda x: tokenizer(x["text"], padding="max_length", truncation=True),
    batched=True,
)
```

3. Define the model and training arguments:

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    push_to_hub=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
```

4. Train:

```python
# Small subsets keep this example fast; use the full splits for a real run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
```

# How to Use the Model

Using a pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="pedro123483/results")
result = pipe("I loved this movie! It was fantastic and thrilling.")
print(result)
```

Loading the model directly:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("pedro123483/results")
model = AutoModelForSequenceClassification.from_pretrained("pedro123483/results")

inputs = tokenizer("I loved this movie! It was fantastic and thrilling.", return_tensors="pt")
outputs = model(**inputs)
predictions = np.argmax(outputs.logits.detach().numpy(), axis=-1)
print(predictions)
```

This prints the predicted class index; a sketch for mapping it to a readable label is shown after the summary below.

# results

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the IMDB dataset.
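The direct-loading example above returns a bare class index. The following is a minimal sketch (an addition to this card, not part of the original training code) of mapping that index to a sentiment label; the `{0: "negative", 1: "positive"}` mapping is an assumption based on the IMDB label convention, since the fine-tuned config keeps the generic `LABEL_0`/`LABEL_1` names unless they were set explicitly before training.

```python
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("pedro123483/results")
model = AutoModelForSequenceClassification.from_pretrained("pedro123483/results")

# Assumed mapping following the IMDB dataset convention (0 = negative, 1 = positive);
# check the model's config.id2label if it was customized during training.
id2label = {0: "negative", 1: "positive"}

inputs = tokenizer("I loved this movie! It was fantastic and thrilling.", return_tensors="pt")
logits = model(**inputs).logits
pred = int(np.argmax(logits.detach().numpy(), axis=-1)[0])
print(pred, id2label[pred])
```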
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| No log        | 1.0   | 32   | 0.6623          | 0.7      |

### Framework versions

- Transformers 4.41.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
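To reproduce this setup, an environment close to the versions above can be installed with something like `pip install transformers==4.41.1 datasets==2.19.1 tokenizers==0.19.1 torch==2.3.0` (the exact PyTorch build, e.g. the `+cu121` CUDA wheel, depends on your platform). A small sketch, added to this card, for checking that a local environment matches the reported versions:

```python
# Compare local library versions against the framework versions listed above.
import transformers, torch, datasets, tokenizers

reported = {
    "Transformers": ("4.41.1", transformers.__version__),
    "Pytorch": ("2.3.0+cu121", torch.__version__),
    "Datasets": ("2.19.1", datasets.__version__),
    "Tokenizers": ("0.19.1", tokenizers.__version__),
}
for name, (card_version, local_version) in reported.items():
    status = "matches" if local_version == card_version else "differs"
    print(f"{name}: card reports {card_version}, local is {local_version} ({status})")
```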