Attention-GRU for Cross-Disciplinary Abstract Classification 🌿

This study has been accepted for publication in Scientific Reports and is currently in press.

A resource-efficient Attention-based Bidirectional GRU with frozen GloVe-300d embeddings, trained on the WOS-46985 benchmark to classify scientific abstracts into 134 fine-grained sub-disciplines (Web of Science Level-2).

The model achieves a Macro-F1 of 0.920, outperforming domain-specific Transformer baselines (BERT, BioBERT, SciBERT) while training in ~10 minutes instead of hours and consuming a fraction of the energy.

🧠 Model Description

Component	Configuration
Architecture	Bidirectional GRU + Soft Attention
Embeddings	Frozen GloVe-300d (Stanford, 6B tokens)
Vocabulary size	14,541
GRU hidden dim	256
GRU layers	2 (bidirectional)
Classifier	Linear (dropout 0.5)
Output classes	134 (WOS-46985 Level-2 sub-disciplines)
Trainable parameters	~1.06 M
Max sequence length	250 tokens

The architecture leverages the semantic stability of scientific terminology, sidestepping the quadratic cost of full Transformer attention while preserving long-range dependency modeling through soft-attention over GRU hidden states.
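
The soft-attention pooling mentioned above can be written out explicitly. The following is a sketch read off the `Attention` module in the usage code (a bias-free linear scoring layer followed by a softmax over time steps). The symbols $h_t$, $w$, $e_t$, $\alpha_t$, $c$, $W_c$ and $b_c$ are notation introduced here for illustration and may differ from the paper's.

```latex
% h_t \in \mathbb{R}^{2H}: concatenated forward/backward GRU state at step t (H = 256)
% T: sequence length after padding/truncation (at most 250)
e_t = w^\top h_t, \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T} \exp(e_s)}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, h_t, \qquad
\hat{y} = \operatorname{softmax}(W_c \, c + b_c)
```

Scoring needs one scalar per time step, so the pooling cost grows linearly in $T$, whereas self-attention computes a score for every pair of positions. Note that in the usage code the softmax runs over all $T$ positions, including padded ones.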

📊 Datasets

Dataset	Abstracts	Classes	Used for
arXiv	~	3 (AI, Economics, Psychology)	Coarse-grained interdisciplinary baseline
WOS-11967	11,967	35	Mid-grained sub-disciplines (L2)
WOS-46985 (this checkpoint)	46,985	134	Fine-grained sub-disciplines (L2)

🚀 Performance

State-of-the-Art Comparison on Web of Science

Model	WOS-11967 (35 classes) F1	WOS-46985 (134 classes) F1	Training Time
BERT-Base	0.903	0.850	~ Hours
BioBERT	0.903	0.856	~ Hours
SciBERT	0.921	0.867	~ Hours
Attention-GRU (this model)	0.953	0.920	~10 min

Efficiency Metrics (arXiv benchmark)

Model	Val. Accuracy	Parameters (M)	Inference (ms)	Energy (kWh)
Attention-GRU	96.8%	1.06	0.36	0.15
BERT (Base)	94.4%	109.5	7.22	0.50
RoBERTa	93.4%	125.0	7.80	0.52

The Attention-GRU is ~14× faster to train and uses ~3× less energy than Transformer baselines while achieving higher accuracy on fine-grained taxonomies.
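
As a quick sanity check, the ratios implied by the efficiency table can be recomputed directly. This is a minimal sketch: the numbers are copied from the table above, and the `metrics` dictionary and `ratios` helper are names made up for this example.

```python
# Figures copied from the efficiency table (arXiv benchmark).
metrics = {
    "Attention-GRU": {"params_m": 1.06, "inference_ms": 0.36, "energy_kwh": 0.15},
    "BERT (Base)": {"params_m": 109.5, "inference_ms": 7.22, "energy_kwh": 0.50},
    "RoBERTa": {"params_m": 125.0, "inference_ms": 7.80, "energy_kwh": 0.52},
}


def ratios(baseline, reference="Attention-GRU"):
    """Return how many times larger each baseline figure is than the reference."""
    ref, base = metrics[reference], metrics[baseline]
    return {key: base[key] / ref[key] for key in ref}


for name in ("BERT (Base)", "RoBERTa"):
    r = ratios(name)
    print(f"{name:<12s} params x{r['params_m']:.0f}  "
          f"latency x{r['inference_ms']:.1f}  energy x{r['energy_kwh']:.1f}")
```

From these columns, the energy gap comes out at a little over 3× and the inference-latency gap at roughly 20×. The training-time ratio cannot be derived from the table, since training time is not one of its columns.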

📦 Files

File	Description
`attention_gru_wos.pth`	PyTorch checkpoint with `encoder_state_dict`, `classifier_state_dict`, and `hyperparameters`
`word2idx.json`	Vocabulary mapping (14,541 tokens)
`labels.json`	Class-id → discipline-name mapping (134 entries; replace placeholders with real names if needed)
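
Replacing the placeholder names in `labels.json` could look roughly like the sketch below. It assumes the file is a JSON object keyed by stringified class ids, which is how the usage code reads it. The helper `apply_label_names` and the example names are hypothetical and are not part of this repository.

```python
import json
import tempfile
from pathlib import Path


def apply_label_names(labels_path, names):
    """Overwrite values in a labels.json-style file with real names.

    `names` maps class id (int) -> discipline name. Ids that are absent
    from `names` keep their existing placeholder value.
    """
    path = Path(labels_path)
    labels = json.loads(path.read_text(encoding="utf-8"))
    for class_id, name in names.items():
        key = str(class_id)  # JSON object keys are always strings
        if key not in labels:
            raise KeyError(f"class id {class_id} not present in {path.name}")
        labels[key] = name
    path.write_text(json.dumps(labels, indent=2, ensure_ascii=False),
                    encoding="utf-8")
    return labels


# Demo on a throwaway file; placeholder and discipline names are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "labels.json"
    demo_path.write_text(json.dumps({"0": "Class 0", "1": "Class 1"}),
                         encoding="utf-8")
    updated = apply_label_names(demo_path, {1: "Example Discipline"})
    print(updated)
```

Raising on an unknown id, rather than silently adding it, helps keep the mapping aligned with the 134 ids used at training time.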

🛠️ How to Use

import json
import re

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

# --- Model definitions (must match training) ---
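class Attention(nn.Module):
    """Soft attention: one learned score per time step, softmax-normalized."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.attention = nn.Linear(hidden_dim, 1, bias=False)
    def forward(self, rnn_outputs):
        # rnn_outputs: (batch, seq_len, hidden_dim) -> weights w: (batch, seq_len)
        w = torch.softmax(self.attention(rnn_outputs).squeeze(-1), dim=1)
        # Weighted sum of the hidden states -> (batch, hidden_dim)
        return torch.bmm(w.unsqueeze(1), rnn_outputs).squeeze(1)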

class GRUAttentionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, bidirectional):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, bidirectional=bidirectional)
        self.attention = Attention(hidden_dim * (2 if bidirectional else 1))
    def forward(self, x):
        out, _ = self.gru(self.embedding(x))
        return self.attention(out)

class Classifier(nn.Module):
    def __init__(self, input_dim, num_classes, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(input_dim, num_classes)
    def forward(self, x):
        return self.fc(self.dropout(x))

# --- Load files from this repo ---
REPO = "MAE07/attention-gru-model"
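ckpt = torch.load(hf_hub_download(REPO, "attention_gru_wos.pth"), map_location="cpu")
# Use context managers so the JSON file handles are closed after reading
with open(hf_hub_download(REPO, "word2idx.json"), encoding="utf-8") as f:
    vocab = json.load(f)
with open(hf_hub_download(REPO, "labels.json"), encoding="utf-8") as f:
    labels = {int(k): v for k, v in json.load(f).items()}  # JSON keys are strings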

hp = ckpt["hyperparameters"]
encoder = GRUAttentionEncoder(hp["vocab_size"], hp["embed_dim"], hp["hidden_dim"],
                              hp["num_layers"], hp["bidirectional"])
clf = Classifier(hp["hidden_dim"] * (2 if hp["bidirectional"] else 1),
                 hp["num_classes"], hp["fc_dropout"])
encoder.load_state_dict(ckpt["encoder_state_dict"])
clf.load_state_dict(ckpt["classifier_state_dict"])
# Switch to inference mode (disables dropout)
encoder.eval()
clf.eval()

# --- Inference ---
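def predict(text, max_len=250, top_k=5):
    """Return the top_k (class_id, label, probability) tuples for one abstract."""
    # Lowercase, split on word boundaries, map unknown tokens to <UNK>
    ids = [vocab.get(t, vocab["<UNK>"]) for t in re.findall(r"\b\w+\b", text.lower())]
    # Truncate or right-pad with the padding index 0 to exactly max_len
    ids = (ids + [0] * max_len)[:max_len]
    x = torch.tensor([ids], dtype=torch.long)  # shape: (1, max_len)
    with torch.no_grad():
        probs = torch.softmax(clf(encoder(x)), dim=1)[0]
    conf, idx = torch.topk(probs, k=top_k)
    return [(int(i), labels.get(int(i), f"Class {int(i)}"), float(c))
            for c, i in zip(conf, idx)]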

abstract = "The exponential growth of scholarly literature necessitates automated systems."
for cid, name, conf in predict(abstract):
    print(f"{cid:3d}  {name:<25s}  {conf:.4f}")
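
The preprocessing step inside `predict` can also be tried in isolation, without downloading the checkpoint. The sketch below mirrors that tokenization and padding logic. The `encode` helper and the toy vocabulary (including its `<PAD>` entry) are assumptions for illustration; the real `word2idx.json` has 14,541 entries.

```python
import re


def encode(text, vocab, max_len=250, pad_id=0, unk_token="<UNK>"):
    """Lowercase, split on word boundaries, map to ids, then truncate/right-pad."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    ids = [vocab.get(tok, vocab[unk_token]) for tok in tokens]
    return (ids + [pad_id] * max_len)[:max_len]


# Toy vocabulary for illustration only (hypothetical ids).
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "deep": 2, "learning": 3, "for": 4}
encoded = encode("Deep learning, for NLP!", toy_vocab, max_len=6)
print(encoded)  # [2, 3, 4, 1, 0, 0]
```

Punctuation is dropped by the regular expression, the unseen token "nlp" falls back to the `<UNK>` id, and the remaining positions are filled with the padding id.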

A ready-to-use Gradio demo is available at: 👉 https://huggingface.co/spaces/MAE07/abstract-submission

🧪 Training Details

Optimizer: Adam, lr 8e-4, weight decay 1e-4
Scheduler: ReduceLROnPlateau (factor 0.5, patience 2) on validation accuracy
Loss: Class-weighted cross-entropy
Batch size: 64
Epochs: 20 (with best-model checkpointing on val accuracy)
Augmentation: WordNet-synonym replacement (2 substitutions per sample), 2× data expansion
Split: 70 / 15 / 15 stratified train/val/test
Hardware: Single GPU (training completed in ~10 minutes)
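
The optimizer, scheduler, and loss settings listed above could be wired together in PyTorch roughly as follows. This is a sketch, not the authors' training script. The inverse-frequency class weights are an assumption (the card only says "class-weighted"), and `build_training_objects`, the stand-in model, and the class counts are made up for the example.

```python
import torch
import torch.nn as nn


def build_training_objects(model, class_counts):
    """Create loss, optimizer, and scheduler with the listed hyperparameters."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    # ASSUMPTION: inverse-frequency weights; the exact scheme is not documented.
    weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=weights)
    # Only parameters with requires_grad=True are optimized, so frozen
    # embeddings would be skipped.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=8e-4, weight_decay=1e-4)
    # mode="max" because the monitored quantity is validation accuracy.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=2)
    return criterion, optimizer, scheduler


# Stand-in model and made-up class counts, for illustration only.
toy_model = nn.Linear(8, 3)
criterion, optimizer, scheduler = build_training_objects(toy_model, [10, 30, 20])
for val_acc in (0.50, 0.50, 0.50, 0.50):  # validation accuracy stalls
    scheduler.step(val_acc)               # call once per epoch after validation
print(optimizer.param_groups[0]["lr"])
```

With `patience=2`, the learning rate is halved once the monitored accuracy has failed to improve for more than two consecutive epochs, which in this toy loop happens on the fourth call.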

⚠️ Limitations

Vocabulary is frozen at 14,541 GloVe-covered tokens — out-of-vocabulary scientific terms map to <UNK> and may degrade performance on highly specialized abstracts (e.g., niche biochemistry, novel CS subfields).
Trained on English abstracts only — non-English text is not supported.
The 134 class IDs follow the WOS-46985 Level-2 ordering used during training. For human-readable discipline names, replace the placeholder values in labels.json with the official WOS sub-discipline names.
Domain bias: Web of Science indexes lean toward STEM and Anglophone publication venues. Abstracts from underrepresented humanities or non-Anglophone disciplines may be misclassified.
Soft-attention over GRU outputs is not a direct substitute for self-attention in tasks requiring deep token-token interaction (e.g., NLI, QA).
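
To judge how much the frozen vocabulary affects a given abstract, one option is to measure the share of its tokens that would map to `<UNK>`. The sketch below uses the same tokenization as the usage code. The `oov_rate` helper and the toy vocabulary are hypothetical, and no particular threshold is implied by the model card.

```python
import re


def oov_rate(text, vocab):
    """Fraction of word tokens in `text` that are missing from `vocab`."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    if not tokens:
        return 0.0
    missing = [tok for tok in tokens if tok not in vocab]
    return len(missing) / len(tokens)


# Toy vocabulary for illustration only; load word2idx.json for real use.
toy_vocab = {"<UNK>": 1, "protein": 2, "folding": 3, "of": 4}
rate = oov_rate("Folding of chaperonin protein", toy_vocab)
print(f"{rate:.2f}")  # 0.25
```

A high rate on a specialized abstract is a hint that the prediction rests on few informative tokens and deserves extra scrutiny.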

🌱 Environmental Impact

This model was specifically designed under the Green AI paradigm. Compared to Transformer baselines on the same task, it consumes ~3× less energy during training and ~20× less during inference, while achieving higher accuracy on fine-grained taxonomies. Training a single full run requires only minutes on commodity hardware and produces a checkpoint of just ~26 MB.
