stanfordnlp/imdb
How to use jongador/bert-imdb-512 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="jongador/bert-imdb-512")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("jongador/bert-imdb-512")
model = AutoModelForSequenceClassification.from_pretrained("jongador/bert-imdb-512")

bert-base-uncased fine-tuned on the IMDB sentiment classification dataset with max_seq_length=512.
Trained as a victim model for adversarial NLP research (TextBugger, TextFooler, and DeepWordBug-style attacks). The 512-token input window (vs. the 128 tokens typical of TextAttack baselines) keeps roughly 95–98% of IMDB reviews untruncated and yields a stronger classifier.
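The truncation figure can be checked by tokenizing each review and counting how many fit within the window. The following is a minimal sketch, assuming the token counts have already been computed (for example with `len(tokenizer(text)["input_ids"])`). The function name `fraction_within_limit` and the sample lengths are hypothetical and only illustrate the calculation; they are not part of the original training pipeline.

```python
from typing import Sequence


def fraction_within_limit(token_lengths: Sequence[int], max_len: int = 512) -> float:
    """Return the share of sequences whose token count fits within max_len."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    fitting = sum(1 for n in token_lengths if n <= max_len)
    return fitting / len(token_lengths)


# Hypothetical token counts (special tokens included), for illustration only.
sample_lengths = [120, 300, 512, 640, 90]

print(fraction_within_limit(sample_lengths, max_len=512))  # 0.8
print(fraction_within_limit(sample_lengths, max_len=128))  # 0.4
```

Running the same function over the real token counts of the IMDB splits at 128 and 512 tokens would show how much text each window discards.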
Base model: bert-base-uncased (12 layers, 768 hidden, 12 heads, ~110M parameters)

Fine-tuned from bert-base-uncased on the IMDB train split (25,000 examples) using TextAttack 0.3.x.

| Hyperparameter | Value |
|---|---|
| Epochs | 5 |
| Per-device batch size | 8 |
| Gradient accumulation | 2 (effective batch 16) |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Random seed | 786 |
| Hardware | NVIDIA RTX 3090 (24 GB) |

Training command:
textattack train --model-name-or-path bert-base-uncased \
--dataset imdb \
--model-max-length 512 \
--epochs 5 \
--per-device-train-batch-size 8 \
--gradient-accumulation-steps 2 \
--learning-rate 2e-5 \
--save-last \
--output-dir ./models/bert-imdb-512
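
The hyperparameters imply a training schedule that can be worked out by hand. The sketch below computes the effective batch size, the number of optimizer steps, and the share of training spent in warmup. It assumes a single GPU, that all 25,000 training examples are used (no held-out validation split), and that a final partial batch counts as a full step. The helper name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Values taken from the hyperparameter table; single-GPU training is an assumption.
effective_batch, steps_per_epoch, total_steps = optimizer_steps(
    num_examples=25_000, per_device_batch=8, grad_accum=2, epochs=5
)
warmup_fraction = 500 / total_steps  # 500 warmup steps from the table

print(effective_batch, steps_per_epoch, total_steps)  # 16 1563 7815
print(f"{warmup_fraction:.1%}")  # 6.4%
```

Under these assumptions, the 500 warmup steps cover about a third of the first epoch.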
Evaluated on the IMDB test split (25,000 examples) at the best epoch checkpoint:

| Metric | Value |
|---|---|
| Accuracy | 94.14% |

For reference, the equivalent TextAttack baseline at 128 tokens (textattack/bert-base-uncased-imdb) reports ~89% on the same test set.
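
To put the two accuracy figures in perspective, it can help to compare error rates rather than accuracies. The sketch below does that arithmetic. The baseline value of 89% is the approximate figure quoted above, so the result is a rough estimate; the function names are hypothetical.

```python
def error_rate(accuracy_pct: float) -> float:
    """Convert an accuracy percentage into an error percentage."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_acc_pct: float, new_acc_pct: float) -> float:
    """Fraction of the baseline's errors that the new model removes."""
    baseline_err = error_rate(baseline_acc_pct)
    new_err = error_rate(new_acc_pct)
    return (baseline_err - new_err) / baseline_err


TEST_SIZE = 25_000  # IMDB test split

# 94.14% is the reported accuracy; 89.0% is the approximate 128-token baseline.
misclassified = round(TEST_SIZE * error_rate(94.14) / 100)
reduction = relative_error_reduction(89.0, 94.14)

print(misclassified)  # 1465
print(f"{reduction:.1%}")  # 46.7%
```

In other words, moving from roughly 89% to 94.14% accuracy removes a little under half of the baseline's test errors.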

Usage:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("jongador/bert-imdb-512")
model = AutoModelForSequenceClassification.from_pretrained("jongador/bert-imdb-512")

# Tokenize with the same 512-token limit used during fine-tuning
inputs = tokenizer("I loved this movie!", return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()  # 0 = negative, 1 = positive
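
Score-based attacks usually need class probabilities rather than a hard label. The sketch below shows one way to turn a pair of logits into a label and a confidence score using a plain softmax. The helper names `softmax` and `decode` and the example logits are hypothetical; with the real model, the logits could be obtained from `outputs.logits[0].tolist()`.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ("negative", "positive")  # index 0 = negative, 1 = positive


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


# Hypothetical logits, for illustration only.
label, confidence = decode([-2.0, 3.0])
print(label, round(confidence, 4))  # positive 0.9933
```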

License: MIT