SciBert_sentence_transformer

pritamdeka/S-Scibert-snli-multinli-stsb

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

license: apache-2.0
language:
- en
library_name: sentence-transformers
tags:
- bert
- scibert
- sentence-transformers
- sentence-similarity
- scientific-text
- mirror
- r-compatible
base_model: pritamdeka/S-Scibert-snli-multinli-stsb
pipeline_tag: sentence-similarity

S-SciBERT (snli-multinli-stsb) — safetensors mirror for use from R

This is a format-converted mirror of pritamdeka/S-Scibert-snli-multinli-stsb, maintained for teaching a course on transformer-based topic modeling in R.

The model itself is unchanged: same architecture, same weights, same tokenizer, same outputs as the upstream original. What's different is the on-disk format and provenance, both of which matter for a teaching context.

Why this mirror exists

The upstream repo ships pytorch_model.bin (PyTorch pickle format) and no tokenizer.json. For Python users this works fine, but for R users working through the torch (libtorch) and safetensors R packages there is a more serious problem than just format inconvenience:

The upstream pickle file was saved while the model was on a CUDA device. PyTorch's pickle format records the device of every tensor, so loading on a CPU-only machine fails with an aten::empty_strided ... CUDA backend error. Python's torch.load(map_location='cpu') rescues you from this, but R-torch's loader doesn't expose that argument, so the upstream file is effectively unusable from R unless you have a CUDA GPU available.

This mirror adds:

model.safetensors — the same weights in safetensors format. Safetensors files do not record device information at all, so they load cleanly regardless of where the model was originally saved or what hardware the user has.

The fix is structural, not just cosmetic: safetensors solves a class of cross-device portability problems that pickle cannot, on top of being safer and faster to read.
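To make the "no device information" point concrete, here is a minimal sketch of the safetensors on-disk layout: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. The helper names (`write_minimal_safetensors`, `read_safetensors_header`) and the `demo.weight` tensor are hypothetical and for illustration only; real work should use the safetensors library.

```python
import json
import os
import struct
import tempfile


def write_minimal_safetensors(path, name, values):
    """Write one 1-D float32 tensor in the safetensors layout (illustration only)."""
    data = struct.pack(f"<{len(values)}f", *values)
    header = {
        name: {"dtype": "F32", "shape": [len(values)], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)                          # JSON header
        f.write(data)                                  # raw tensor bytes


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without loading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    write_minimal_safetensors(demo_path, "demo.weight", [1.0, 2.0, 3.0])
    header = read_safetensors_header(demo_path)

# Each entry carries only dtype, shape, and byte offsets; there is no device field.
print(header)
```

Pointing `read_safetensors_header` at a downloaded `model.safetensors` should list every tensor the same way, possibly alongside a `__metadata__` entry of free-form strings.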

What it is, briefly

S-SciBERT is SciBERT-cased fine-tuned for sentence similarity using the sentence-transformers framework. The fine-tuning data was the standard general-English similarity benchmark suite: SNLI (natural language inference), MultiNLI (multi-genre NLI), and STS-B (semantic textual similarity).

The result is a model that combines SciBERT's scientific vocabulary (gene names, chemical terms, ML jargon) with sentence-transformer-quality embeddings — meaning the mean-pooled output vectors actually cluster well, which base SciBERT's do not.

Property	Value
Architecture	BERT-base + mean-pooling head
Parameters	~110M
Embedding dimension	768
Layers	12
Attention heads	12
Vocabulary	SciBERT scientific (cased, ~31K tokens)
Pooling	Mean over tokens (masked by attention)
Fine-tuning data	SNLI + MultiNLI + STS-B
Training max_seq_length	75 tokens
Case sensitivity	Cased
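The "Mean over tokens (masked by attention)" row can be illustrated with a toy NumPy example. This is a sketch with made-up 2-dimensional vectors rather than real model outputs, and the function names are hypothetical; it only shows that padding positions do not affect the pooled vector.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real tokens only.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts


def l2_normalize(vectors):
    """Scale each row to unit length so that dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# One "sentence" with three positions; the last position is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(tokens, mask)  # mean of the first two rows: [[2.0, 3.0]]
unit = l2_normalize(pooled)
print(pooled, unit)
```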

When to use this model

Good fit:

Topic modeling, clustering, or semantic search over scientific text (papers, abstracts, scientific tweets, GitHub issues from research codebases).
Domains where SciBERT's vocabulary is an advantage: biomedical, computer science, computational biology, machine learning, chemistry.
Sentence-level or paragraph-level inputs.

Less good fit:

General web text — a general-purpose sentence-transformer like all-MiniLM-L6-v2 or all-mpnet-base-v2 will likely match or beat S-SciBERT on non-scientific content.
Document-level inputs (full papers): the model was fine-tuned with max_seq_length set to 75 tokens. It still runs on longer inputs (up to the BERT-base ceiling of 512 tokens), but quality degrades for content past the trained length. For long documents, split into sentences or paragraphs and embed those individually.
Languages other than English: the fine-tuning data is English-only.
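For the long-document case above, splitting can be sketched as below. This is one possible approach under stated assumptions: the `split_into_chunks` helper is hypothetical, the regex sentence splitter is naive (it will break on abbreviations such as "et al."), and whitespace word count is only a rough proxy for WordPiece token count, so the budget should sit comfortably under 75.

```python
import re


def split_into_chunks(text, max_words=50):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "CRISPR-Cas9 enables targeted gene editing. "
    "Glioblastoma exhibits invasive growth patterns. "
    "Gradient descent minimizes a loss function."
)
chunks = split_into_chunks(text, max_words=10)
for chunk in chunks:
    print(len(chunk.split()), chunk)
```

Each chunk can then be embedded individually, and the chunk vectors either used directly or averaged per document, depending on the task.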

Usage from R

This mirror works with a pure-R BERT inference pipeline built on top of the torch (libtorch) R package, with no Python at runtime:

source("bert_r.R")
enc <- load_hf_bert("NetworkIsLife/S-SciBert_DAFS")

emb <- embed_texts(enc$model, enc$tokenizer,
                   c("CRISPR-Cas9 enables targeted gene editing.",
                     "Glioblastoma exhibits invasive growth patterns.",
                     "Gradient descent minimizes a loss function."),
                   max_length = 128)
dim(emb)   # 3 x 768

# Cosine similarity (embeddings are L2-normalized by default)
sims <- emb %*% t(emb)
round(sims, 3)
# Rows 1 and 2 should be more similar to each other (both biomedical)
# than either is to row 3 (machine learning)

For topic modeling with the full pipeline:

source("bertopic_r.R")
fit <- fit_bertopic(enc, docs = my_abstracts,
                    umap_n_neighbors = 15,
                    hdbscan_min_pts  = 10)
print_topics(fit)

For long-term reproducibility in course materials, pin to a specific revision:

enc <- load_hf_bert(
  "NetworkIsLife/S-SciBert_DAFS",
  weights_path = hfhub::hub_download(
    "NetworkIsLife/S-SciBert_DAFS",
    "model.safetensors",
    revision = "MAIN_COMMIT_HASH_HERE"
  )
)

Replace MAIN_COMMIT_HASH_HERE with the commit hash visible in this repo's commit history.

Usage from Python

Either of the standard idioms works:

# Via sentence-transformers (easiest)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("NetworkIsLife/S-SciBert_DAFS")
embeddings = model.encode([
    "CRISPR-Cas9 enables targeted gene editing.",
    "Glioblastoma exhibits invasive growth patterns."
])

# Via transformers (with manual mean pooling)
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("NetworkIsLife/S-SciBert_DAFS")
mod = AutoModel.from_pretrained("NetworkIsLife/S-SciBert_DAFS").eval()

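sentences = [
    "CRISPR-Cas9 enables targeted gene editing.",
    "Glioblastoma exhibits invasive growth patterns."
]
enc = tok(sentences, padding=True, truncation=True, return_tensors="pt", max_length=128)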
with torch.no_grad():
    out = mod(**enc).last_hidden_state
    m = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (out * m).sum(1) / m.sum(1).clamp(min=1e-9)
    embeddings = F.normalize(pooled, p=2, dim=1)
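Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so a small semantic search can be sketched in a few lines of NumPy. The `top_k` helper and the 2-dimensional toy vectors are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import numpy as np


def top_k(query_vec, corpus_vecs, k=2):
    """Return (row index, score) pairs for the k corpus rows most similar to the query.

    Assumes query_vec and every row of corpus_vecs are already L2-normalized.
    """
    scores = corpus_vecs @ query_vec      # dot product == cosine similarity here
    order = np.argsort(-scores)[:k]       # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy unit vectors standing in for document embeddings.
corpus = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.8, 0.6])

hits = top_k(query, corpus, k=2)
print(hits)  # row 1 ranks first (score about 0.96), then row 0 (score about 0.8)
```

For a few thousand documents this brute-force scan is usually sufficient; an approximate nearest-neighbor index only becomes worthwhile at much larger scale.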

Files in this repo

File	Source	Purpose
`model.safetensors`	converted from upstream `pytorch_model.bin`	model weights, modern format (device-agnostic)
`pytorch_model.bin`	copied from upstream	model weights, legacy format (kept for compatibility)
`config.json`	copied from upstream	BERT architecture parameters
`vocab.txt`	copied from upstream	SciBERT WordPiece vocabulary
`tokenizer_config.json`	copied from upstream (if present)	tokenizer settings (do_lower_case, special tokens)
`README.md`	this file	provenance and usage

Provenance and verification

The model.safetensors file in this repo was produced by HuggingFace's official SFconvertbot (the same automated conversion used across thousands of HuggingFace repos). The conversion is purely a re-serialization — every tensor in the safetensors file is bit-identical to the corresponding tensor in pytorch_model.bin. No re-training, no quantization, no precision loss.

You can verify this yourself in Python:

import torch
from safetensors.torch import load_file

# map_location='cpu' is needed because the upstream pickle was saved on GPU
a = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
b = load_file("model.safetensors")
assert set(a.keys()) == set(b.keys())
for k in a:
    assert torch.equal(a[k].cpu(), b[k]), f"Mismatch in {k}"
print("Bit-identical.")

End-to-end verification (embeddings computed via the upstream model and via this mirror should agree to ~1e-6):

from sentence_transformers import SentenceTransformer
import numpy as np

sentences = [
    "CRISPR-Cas9 enables targeted gene editing.",
    "Glioblastoma exhibits invasive growth.",
    "Gradient descent minimizes a loss function."
]
a = SentenceTransformer("pritamdeka/S-Scibert-snli-multinli-stsb").encode(sentences)
b = SentenceTransformer("NetworkIsLife/S-SciBert_DAFS").encode(sentences)
print("max |Δ| =", np.abs(a - b).max())   # should be ~1e-6 or smaller

License and citation

This mirror inherits the upstream license: Apache 2.0. If you use this model in academic work, please cite the original paper:

@inproceedings{deka2021unsupervised,
  title={Unsupervised Keyword Combination Query Generation from
         Online Health Related Content for Evidence-Based Fact Checking},
  author={Deka, Pritam and Jurek-Loughrey, Anna},
  booktitle={The 23rd International Conference on Information Integration
             and Web-based Applications & Services},
  pages={267--277},
  year={2021}
}

And consider also citing the underlying SciBERT paper that this model fine-tunes from:

@inproceedings{beltagy-etal-2019-scibert,
  title = "{SciBERT}: A Pretrained Language Model for Scientific Text",
  author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
  booktitle = "Proceedings of EMNLP-IJCNLP",
  year = "2019",
  url = "https://www.aclweb.org/anthology/D19-1371"
}

Original model: pritamdeka/S-Scibert-snli-multinli-stsb by Pritam Deka. Base model: allenai/scibert_scivocab_cased by the Allen Institute for AI.

Maintenance

This is a teaching artifact for a course on transformer-based topic modeling in R. It will not be updated except to fix conversion errors. For the canonical, maintained version of S-SciBERT, see the upstream repo.
