abhi227070 committed • Commit 9e84c78 • Parent(s): fe442bb
Update README.md
README.md CHANGED
@@ -11,6 +11,8 @@ model-index:
11 |   results: []
12 | library_name: adapter-transformers
13 | pipeline_tag: text-classification
14 | ---
15 |
16 | <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -58,21 +60,17 @@ The evaluation was performed on a separate validation set derived from the IMDB
58 |
59 | ### Training Procedure
60 |
61 | -
62 | -
63 | - Next, the datasets were converted into the `DatasetDict` format, which is compatible with the Hugging Face Transformers library. This step facilitated the seamless integration of the data with the tokenization and model training pipeline.
64 | -
65 | - The tokenization process was handled by the `AutoTokenizer` from the Hugging Face library, specifically using the DistilBERT tokenizer. A preprocessing function was defined to tokenize the text data, truncating where necessary to fit the model's input requirements. This function was applied to the entire dataset, transforming the text into tokenized sequences ready for model ingestion.
66 |
67 | -
68 |
69 | -
70 |
71 | -
72 |
73 | -
74 |
75 | -
76 |
77 | ### Training hyperparameters
78 |
@@ -98,4 +96,4 @@ The following hyperparameters were used during training:
98 | - Transformers 4.42.4
99 | - Pytorch 2.3.0+cu121
100 | - Datasets 2.20.0
101 | - - Tokenizers 0.19.1
11 |   results: []
12 | library_name: adapter-transformers
13 | pipeline_tag: text-classification
14 | + datasets:
15 | + - proj-persona/PersonaHub
16 | ---
17 |
18 | <!-- This model card has been generated automatically according to the information the Trainer had access to. You
60 |
61 | ### Training Procedure
62 |
63 | + ### Training Procedure
64 |
65 | + The IMDB dataset was loaded and preprocessed to include numerical labels for sentiments (positive and negative). It was then split into training (60%), validation (20%), and test (20%) sets, and the indices were reset for proper formatting.
66 |
67 | + The data was converted into the `DatasetDict` format compatible with the Hugging Face Transformers library. The `AutoTokenizer` for DistilBERT was used to tokenize the text data, truncating where necessary. A preprocessing function applied tokenization to the entire dataset, preparing it for model training.
68 |
69 | + A `DataCollatorWithPadding` was used to handle the variability in sequence lengths during batching, ensuring efficiency and standardization. The `AutoModelForSequenceClassification` with DistilBERT as the base model was set up for binary classification, mapping labels to sentiment classes (positive and negative).
70 |
71 | + Training arguments included learning rate, batch size, number of epochs, evaluation strategy, and logging steps, optimized for effective training. Evaluation metrics, including accuracy and F1 score, were defined to assess model performance, with a function to compute these metrics by comparing predictions with true labels.
72 |
73 | + The `Trainer` class from the Transformers library was used to conduct the training over three epochs, integrating the model, training arguments, tokenized datasets, data collator, and evaluation metrics. This comprehensive approach ensured effective fine-tuning for sentiment analysis on the IMDB dataset, achieving high accuracy and F1 scores.
74 |
75 | ### Training hyperparameters
76 |
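The data preparation and tokenization steps described in the added paragraphs above can be sketched roughly as follows. This is a minimal illustration rather than the exact code behind this model: the CSV path, column names, random seed, and the `distilbert-base-uncased` checkpoint are assumptions, while the 60/20/20 split, index reset, `DatasetDict` conversion, and truncating tokenizer follow the card's description.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

# The card only says the IMDB data was loaded with numerical labels; a CSV with
# "text" and "label" columns (0 = negative, 1 = positive) is assumed here.
df = pd.read_csv("imdb.csv")  # hypothetical path

# 60% train, 20% validation, 20% test, with indices reset for clean formatting.
train_df, temp_df = train_test_split(df, test_size=0.4, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
train_df, val_df, test_df = (d.reset_index(drop=True) for d in (train_df, val_df, test_df))

# Wrap the splits in a DatasetDict so they plug into the Transformers pipeline.
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df),
})

# Tokenize with the DistilBERT tokenizer, truncating long reviews to the model's limit.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset = dataset.map(preprocess, batched=True)
```

Truncation keeps every review within DistilBERT's maximum input length, while padding is deferred to the data collator so each batch is only padded as far as it needs to be.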
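The batching and model setup from the third added paragraph might look like the sketch below, using the same assumed checkpoint; the uppercase label names are illustrative, since the card only states that the labels map to positive and negative classes.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # as in the previous sketch

# Pad each batch dynamically to its longest sequence instead of a fixed global length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# DistilBERT with a 2-way classification head; label names are illustrative.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
```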
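A sketch of the training arguments, metric computation, and `Trainer` call described in the last two added paragraphs. Only the three training epochs are stated explicitly in the text; the learning rate, batch sizes, logging steps, and per-epoch evaluation below are placeholders (the actual values appear in the card's Training hyperparameters section), and the use of the `evaluate` library for accuracy and F1 is an assumption.

```python
import numpy as np
import evaluate  # metrics library is an assumption; the card only names accuracy and F1
from transformers import Trainer, TrainingArguments

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    # Compare argmax predictions against the true labels, as described above.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(predictions=predictions, references=labels)["f1"],
    }

training_args = TrainingArguments(
    output_dir="distilbert-imdb-sentiment",  # hypothetical output directory
    learning_rate=2e-5,                      # placeholder; see the Training hyperparameters section
    per_device_train_batch_size=16,          # placeholder
    per_device_eval_batch_size=16,           # placeholder
    num_train_epochs=3,                      # three epochs, as stated in the card
    evaluation_strategy="epoch",             # placeholder evaluation strategy
    logging_steps=100,                       # placeholder
)

# `model`, `tokenized_dataset`, `tokenizer`, and `data_collator` come from the sketches above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```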
96 | - Transformers 4.42.4
97 | - Pytorch 2.3.0+cu121
98 | - Datasets 2.20.0
99 | + - Tokenizers 0.19.1