DeCLUTR-sci-base

Model description

This is the allenai/scibert_scivocab_uncased model, with extended pretraining on over 2 million scientific papers from S2ORC using the self-supervised training strategy presented in DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.

Intended uses & limitations

The model is intended to be used as a sentence encoder, similar to Google's Universal Sentence Encoder or Sentence Transformers. It is particularly suitable for scientific text.

How to use

Please see our repo for full details. A simple example is shown below.

With SentenceTransformers
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("johngiorgi/declutr-sci-base")

# Prepare some text to embed
text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]

# Embed the text
embeddings = model.encode(text)

# Compute semantic similarity as 1 minus the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
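When more than two texts are embedded, the same cosine similarity can be computed against each candidate to rank them relative to a query. The following is a minimal, dependency-free sketch of that idea. It uses made-up 3-dimensional vectors in place of real model output, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (1 - cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for sentence embeddings (assumed values, not model output)
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
ranking = rank_by_similarity(query, candidates)
```

Because cosine similarity ignores vector length, the second candidate ranks first even though its norm differs from the query's.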
With 🤗 Transformers
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Prepare some text to embed
text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute semantic similarity as 1 minus the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
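
The mean-pooling step above multiplies each token embedding by its attention mask value before averaging, so that padding tokens do not affect the sentence embedding. A minimal pure-Python sketch of the same arithmetic for a single sequence is shown here. The function name `masked_mean_pool` and the toy values are assumptions for illustration only; in practice the vectorized torch expression above is what you would use.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 = real token, 0 = padding
            for j in range(dim):
                summed[j] += vector[j]
    # Guard against division by zero for an all-padding sequence,
    # mirroring the torch.clamp(..., min=1e-9) call in the example above
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# One sequence with three positions; the last one is padding (mask = 0).
# Values are made up for illustration.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
pooled = masked_mean_pool(token_embeddings, attention_mask)
```

Only the first two token vectors contribute here, so the large values in the padded position leave the pooled vector unchanged.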

BibTeX entry and citation info

@inproceedings{giorgi-etal-2021-declutr,
    title        = {{D}e{CLUTR}: Deep Contrastive Learning for Unsupervised Textual Representations},
    author       = {Giorgi, John  and Nitski, Osvald  and Wang, Bo  and Bader, Gary},
    year         = 2021,
    month        = aug,
    booktitle    = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    publisher    = {Association for Computational Linguistics},
    address      = {Online},
    pages        = {879--895},
    doi          = {10.18653/v1/2021.acl-long.72},
    url          = {https://aclanthology.org/2021.acl-long.72}
}