Attention-GRU for Cross-Disciplinary Abstract Classification
This study has been accepted for publication in Scientific Reports and is currently in press.
A resource-efficient Attention-based Bidirectional GRU with frozen GloVe-300d embeddings, trained on the WOS-46985 benchmark to classify scientific abstracts into 134 fine-grained sub-disciplines (Web of Science Level-2).
The model achieves a Macro-F1 of 0.920, outperforming general and domain-specific Transformer baselines (BERT, BioBERT, SciBERT) while training in ~10 minutes instead of hours and consuming a fraction of the energy.
Model Description
| Component | Configuration |
|---|---|
| Architecture | Bidirectional GRU + Soft Attention |
| Embeddings | Frozen GloVe-300d (Stanford, 6B tokens) |
| Vocabulary size | 14,541 |
| GRU hidden dim | 256 |
| GRU layers | 2 (bidirectional) |
| Classifier | Linear (dropout 0.5) |
| Output classes | 134 (WOS-46985 Level-2 sub-disciplines) |
| Trainable parameters | ~1.06 M |
| Max sequence length | 250 tokens |
The architecture leverages the semantic stability of scientific terminology, sidestepping the quadratic cost of full Transformer attention while preserving long-range dependency modeling through soft-attention over GRU hidden states.
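Written out, the soft-attention pooling can be sketched as below. This assumes the single bias-free scoring vector that appears in the usage code of this card; the symbols $h_t$, $w$, $e_t$, $\alpha_t$, and $c$ are notation introduced here for illustration and are not taken from the paper.

```latex
e_t = w^{\top} h_t, \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T} \exp(e_s)}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, h_t
```

Here $h_t \in \mathbb{R}^{512}$ is the concatenated forward and backward GRU state at position $t$ (2 × 256), $T \le 250$ is the sequence length, and $c$ is the pooled vector passed to the classifier. Scoring takes one dot product per token, so the pooling cost grows linearly in $T$, whereas full self-attention compares every pair of tokens.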
Datasets
| Dataset | Abstracts | Classes | Used for |
|---|---|---|---|
| arXiv | ~ | 3 (AI, Economics, Psychology) | Coarse-grained interdisciplinary baseline |
| WOS-11967 | 11,967 | 35 | Mid-grained sub-disciplines (L2) |
| WOS-46985 (this checkpoint) | 46,985 | 134 | Fine-grained sub-disciplines (L2) |
Performance
State-of-the-Art Comparison on Web of Science
| Model | WOS-11967 (35 classes) F1 | WOS-46985 (134 classes) F1 | Training Time |
|---|---|---|---|
| BERT-Base | 0.903 | 0.850 | ~ Hours |
| BioBERT | 0.903 | 0.856 | ~ Hours |
| SciBERT | 0.921 | 0.867 | ~ Hours |
| Attention-GRU (this model) | 0.953 | 0.920 | ~10 min |
Efficiency Metrics (arXiv benchmark)
| Model | Val. Accuracy | Parameters (M) | Inference (ms) | Energy (kWh) |
|---|---|---|---|---|
| Attention-GRU | 96.8% | 1.06 | 0.36 | 0.15 |
| BERT (Base) | 94.4% | 109.5 | 7.22 | 0.50 |
| RoBERTa | 93.4% | 125.0 | 7.80 | 0.52 |
The Attention-GRU is ~14× faster to train and uses ~3× less energy than Transformer baselines while achieving higher accuracy on fine-grained taxonomies.
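As a quick sanity check, the ratios behind these claims can be recomputed from the efficiency table. The snippet below is a small illustrative sketch: the numbers are copied from the table, while the dictionary layout and the helper name `ratios_vs` are assumptions made for this example.

```python
# Figures copied from the efficiency table (arXiv benchmark).
metrics = {
    "Attention-GRU": {"params_m": 1.06, "inference_ms": 0.36, "energy_kwh": 0.15},
    "BERT (Base)": {"params_m": 109.5, "inference_ms": 7.22, "energy_kwh": 0.50},
    "RoBERTa": {"params_m": 125.0, "inference_ms": 7.80, "energy_kwh": 0.52},
}


def ratios_vs(baseline, reference="Attention-GRU"):
    """Return baseline/reference ratios for every metric in the table."""
    ref = metrics[reference]
    return {key: metrics[baseline][key] / ref[key] for key in ref}


for name in ("BERT (Base)", "RoBERTa"):
    r = ratios_vs(name)
    print(f"{name:<12s} params x{r['params_m']:.0f}  "
          f"latency x{r['inference_ms']:.1f}  energy x{r['energy_kwh']:.1f}")
```

The energy ratios come out at roughly 3.3 and 3.5, and the inference-latency ratios at roughly 20 and 22. The ~14× training-time figure cannot be derived from this table, since it lists no training times.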
Files
| File | Description |
|---|---|
| `attention_gru_wos.pth` | PyTorch checkpoint with `encoder_state_dict`, `classifier_state_dict`, and `hyperparameters` |
| `word2idx.json` | Vocabulary mapping (14,541 tokens) |
| `labels.json` | Class-id → discipline-name mapping (134 entries; replace placeholders with real names if needed) |
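Once the three files are loaded (the How to Use section shows one way), a quick consistency check can catch mismatched or partially downloaded files early. The following is a minimal sketch: `validate_bundle` is a hypothetical helper, the key names are the ones used elsewhere in this card, and the check that `vocab_size` equals the number of entries in `word2idx.json` is an assumption about how the vocabulary was saved.

```python
# Key names taken from the file table and the usage code in this card.
REQUIRED_KEYS = ("encoder_state_dict", "classifier_state_dict", "hyperparameters")
REQUIRED_HPARAMS = ("vocab_size", "embed_dim", "hidden_dim", "num_layers",
                    "bidirectional", "num_classes", "fc_dropout")


def validate_bundle(ckpt, vocab, labels):
    """Return a list of problems found; an empty list means the files agree."""
    problems = [f"checkpoint is missing '{k}'" for k in REQUIRED_KEYS if k not in ckpt]
    if "hyperparameters" in ckpt:
        hp = ckpt["hyperparameters"]
        problems += [f"hyperparameters is missing '{k}'"
                     for k in REQUIRED_HPARAMS if k not in hp]
        # Assumption: vocab_size counts exactly the entries of word2idx.json.
        if "vocab_size" in hp and hp["vocab_size"] != len(vocab):
            problems.append(f"vocab_size={hp['vocab_size']} but vocabulary has {len(vocab)} entries")
        if "num_classes" in hp and hp["num_classes"] != len(labels):
            problems.append(f"num_classes={hp['num_classes']} but labels has {len(labels)} entries")
    if "<UNK>" not in vocab:
        problems.append("vocabulary has no '<UNK>' entry")
    return problems
```

Calling `validate_bundle(ckpt, vocab, labels)` right after loading and printing any returned messages gives a clearer failure than a shape mismatch inside `load_state_dict`.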
How to Use
```python
import json, re, torch, torch.nn as nn
from huggingface_hub import hf_hub_download

# --- Model definitions (must match training) ---
class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attention = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, rnn_outputs):
        # One score per time step, normalized over the sequence dimension
        w = torch.softmax(self.attention(rnn_outputs).squeeze(-1), dim=1)
        return torch.bmm(w.unsqueeze(1), rnn_outputs).squeeze(1)

class GRUAttentionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, bidirectional):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, bidirectional=bidirectional)
        self.attention = Attention(hidden_dim * (2 if bidirectional else 1))

    def forward(self, x):
        out, _ = self.gru(self.embedding(x))
        return self.attention(out)

class Classifier(nn.Module):
    def __init__(self, input_dim, num_classes, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(self.dropout(x))

# --- Load files from this repo ---
REPO = "MAE07/attention-gru-model"
ckpt = torch.load(hf_hub_download(REPO, "attention_gru_wos.pth"), map_location="cpu")
vocab = json.load(open(hf_hub_download(REPO, "word2idx.json")))
labels = {int(k): v for k, v in json.load(open(hf_hub_download(REPO, "labels.json"))).items()}

hp = ckpt["hyperparameters"]
encoder = GRUAttentionEncoder(hp["vocab_size"], hp["embed_dim"], hp["hidden_dim"],
                              hp["num_layers"], hp["bidirectional"])
clf = Classifier(hp["hidden_dim"] * (2 if hp["bidirectional"] else 1),
                 hp["num_classes"], hp["fc_dropout"])
encoder.load_state_dict(ckpt["encoder_state_dict"])
clf.load_state_dict(ckpt["classifier_state_dict"])
encoder.eval(); clf.eval()

# --- Inference ---
def predict(text, max_len=250, top_k=5):
    ids = [vocab.get(t, vocab["<UNK>"]) for t in re.findall(r"\b\w+\b", text.lower())]
    ids = (ids + [0] * max_len)[:max_len]  # pad with 0, then truncate to max_len
    x = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        probs = torch.softmax(clf(encoder(x)), dim=1)[0]
    conf, idx = torch.topk(probs, k=top_k)
    return [(int(i), labels.get(int(i), f"Class {int(i)}"), float(c))
            for c, i in zip(conf, idx)]

abstract = "The exponential growth of scholarly literature necessitates automated systems."
for cid, name, conf in predict(abstract):
    print(f"{cid:3d} {name:<25s} {conf:.4f}")
```
A ready-to-use Gradio demo is available at: https://huggingface.co/spaces/MAE07/abstract-submission
Training Details
- Optimizer: Adam, lr 8e-4, weight decay 1e-4
- Scheduler: ReduceLROnPlateau (factor 0.5, patience 2) on validation accuracy
- Loss: Class-weighted cross-entropy
- Batch size: 64
- Epochs: 20 (with best-model checkpointing on val accuracy)
- Augmentation: WordNet-synonym replacement (2 substitutions per sample), 2× data expansion
- Split: 70 / 15 / 15 stratified train/val/test
- Hardware: Single GPU (training completed in ~10 minutes)
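The optimizer, scheduler, and loss settings listed above could be wired together in PyTorch roughly as follows. This is a sketch under stated assumptions, not the authors' training script: `balanced_class_weights` and `build_training_objects` are hypothetical helpers, and the inverse-frequency weighting formula is an assumption, since the card does not say how the class weights were computed.

```python
from collections import Counter

import torch
import torch.nn as nn


def balanced_class_weights(train_labels, num_classes):
    """Inverse-frequency weights: n_samples / (num_classes * class_count).

    Assumption: the card does not state which weighting formula was used.
    """
    counts = Counter(train_labels)
    n = len(train_labels)
    weights = [n / (num_classes * counts[c]) if counts[c] else 0.0
               for c in range(num_classes)]
    return torch.tensor(weights, dtype=torch.float)


def build_training_objects(model, train_labels, num_classes=134):
    """Create the loss, optimizer, and scheduler with the listed settings."""
    criterion = nn.CrossEntropyLoss(
        weight=balanced_class_weights(train_labels, num_classes))
    # Only trainable parameters are optimized; frozen embeddings are skipped.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=8e-4, weight_decay=1e-4)
    # mode="max" because the schedule tracks validation accuracy.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=2)
    return criterion, optimizer, scheduler
```

In a training loop, `scheduler.step(val_accuracy)` would be called once per epoch after validation, so the learning rate is halved once validation accuracy has failed to improve for more than two consecutive epochs.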
Limitations
- Vocabulary is frozen at 14,541 GloVe-covered tokens; out-of-vocabulary scientific terms map to `<UNK>` and may degrade performance on highly specialized abstracts (e.g., niche biochemistry, novel CS subfields).
- Trained on English abstracts only; non-English text is not supported.
- The 134 class IDs follow the WOS-46985 Level-2 ordering used during training; if you need human-readable discipline names, you must replace the placeholder values in `labels.json` with the official WOS sub-discipline names.
- Domain bias: Web of Science indexes lean toward STEM and Anglophone publication venues. Abstracts from underrepresented humanities or non-Anglophone disciplines may be misclassified.
- Soft-attention over GRU outputs is not a direct substitute for self-attention in tasks requiring deep token-token interaction (e.g., NLI, QA).
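One practical way to gauge the out-of-vocabulary limitation for a given abstract is to measure how many of its tokens would fall back to `<UNK>`. The sketch below reuses the tokenization regex from the usage code; `oov_rate` and `toy_vocab` are hypothetical names introduced for illustration, and any cut-off for "too many unknown tokens" is left to the application.

```python
import re


def oov_rate(text, vocab):
    """Fraction of word tokens absent from the vocabulary (0.0 for empty text)."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


# Toy vocabulary for demonstration only; the real one is word2idx.json.
toy_vocab = {"<UNK>": 1, "protein": 2, "folding": 3, "the": 4, "of": 5}
rate = oov_rate("The folding of metalloproteins", toy_vocab)
print(f"{rate:.2f}")  # 0.25: one of the four tokens is unknown
```

A high rate suggests the prediction rests on little lexical evidence and may deserve lower trust or manual review.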
Environmental Impact
This model was specifically designed under the Green AI paradigm. Compared to Transformer baselines on the same task, it consumes ~3× less energy during training and ~20× less during inference, while achieving higher accuracy on fine-grained taxonomies. Training a single full run requires only minutes on commodity hardware and produces a checkpoint of just ~26 MB.