metadata

tags:
  - learned sparse
  - transformers
  - retrieval
  - passage-retrieval
  - document-expansion
  - bag-of-words
license: apache-2.0
language: en
base_model:
  - atomic-canyon/fermi-bert-1024

fermi-1024: Sparse Retrieval Model for Nuclear Power

This sparse retrieval model is optimized for nuclear-specific applications. It encodes both queries and documents into high-dimensional sparse vectors, where the non-zero dimensions correspond to specific tokens in the vocabulary, and their values indicate the relative importance of those tokens.
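
To make the representation concrete, here is a minimal sketch of two sparse vectors stored as token-to-weight dictionaries and scored with a dot product. The tokens and weights are invented for illustration and are not actual model output; `sparse_dot` is a hypothetical helper name.

```python
# Toy sparse vectors: only the non-zero vocabulary entries are stored.
# Tokens and weights are made up for illustration, not model output.
query_vec = {"heat": 1.2, "load": 0.9, "fuel": 0.7}
doc_vec = {"heat": 0.8, "load": 1.1, "assembly": 0.5, "storage": 0.4}


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict; only shared tokens contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[token] for token, weight in a.items() if token in b)


score = sparse_dot(query_vec, doc_vec)
print(round(score, 2))
```

Only "heat" and "load" appear in both vectors, so they are the only tokens that contribute to the score.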

The vocabulary, and thus the sparse embeddings, are based on a nuclear-specific tokenizer. For example, terms like "NRC" are represented as single tokens rather than being split into multiple tokens. This approach improves both accuracy and efficiency. To achieve this, we trained a nuclear-specific BERT base model.
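
The effect of a domain-specific vocabulary can be sketched with a toy WordPiece-style splitter (greedy longest match first). This is a simplified illustration, not the actual fermi tokenizer; both vocabularies below are hypothetical and tiny.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No vocabulary entry matches at this position.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabularies: the second one adds a domain term as one token.
general_vocab = {"n", "##r", "##c"}
nuclear_vocab = general_vocab | {"nrc"}

general_pieces = wordpiece("nrc", general_vocab)
nuclear_pieces = wordpiece("nrc", nuclear_vocab)
print(general_pieces)  # ['n', '##r', '##c']
print(nuclear_pieces)  # ['nrc']
```

With the term in the vocabulary, the same text consumes one token instead of three, and the sparse vector gets a dedicated dimension for it.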

Specifications

Developed by: Atomic Canyon
Finetuned from model: fermi-bert-1024
Context Length: 1024
Vocab Size: 30522
License: Apache 2.0

Training

fermi-1024 was trained on the MS MARCO Passage dataset with the LSR framework, using ms-marco-MiniLM-L-6-v2 as the teacher model. Training ran on the Frontier supercomputer at Oak Ridge National Laboratory, using AMD MI250X GPUs.

Evaluation

The sparse embedding model was primarily evaluated for its effectiveness in information retrieval within the nuclear energy domain. Due to the absence of domain-specific benchmarks, we developed FermiBench to assess the model’s performance on nuclear-related texts. In addition, the model was tested on the MS MARCO dev split and the BEIR benchmark to ensure broader applicability. The model demonstrates strong retrieval capabilities, particularly in handling nuclear-specific jargon and documents.

Standard benchmarks and tooling exist for evaluating dense embedding models, but we found no open, standardized tooling for evaluating sparse embedding models. To support the community, we are releasing our benchmark tooling, built on top of BEIR and Pyserini. All evaluation numbers were produced with that tool and should therefore be reproducible.

Model	FermiBench NDCG@10	FermiBench FLOPS	MS MARCO Dev NDCG@10	BEIR* NDCG@10	BEIR* FLOPS
fermi-512	0.74	7.07	0.45	0.46	9.14
fermi-1024	0.72	4.75	0.44	0.46	7.5
splade-cocondenser-ensembledistil	0.64	12.9	0.45	0.46	12.4

* BEIR results were computed on a subset of the benchmark containing trec-covid, nfcorpus, arguana, scidocs, and scifact.
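
For readers unfamiliar with the NDCG@10 columns, the sketch below shows how the metric is computed for a single query. It assumes the linear-gain form of DCG; it is an illustration rather than the released benchmark tooling, and the document IDs and relevance labels are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc id -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and ranking: relevant docs appear at ranks 2 and 4.
qrels = {"d2": 1, "d4": 1}
ranking = ["d1", "d2", "d3", "d4"]
ndcg = ndcg_at_k(ranking, qrels)
print(round(ndcg, 3))
```

A benchmark score is the mean of this per-query value over all queries in the evaluation set.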

Efficiency

Given the massive scale of documentation in nuclear energy, efficiency is crucial. Our model addresses this in several ways:

The 1024-token context length halves the number of embeddings needed per document relative to a 512-token model, significantly lowering computational costs.
The custom tokenizer, designed for nuclear-specific jargon, encodes documents and queries using fewer tokens, improving computational efficiency.
Our models also produce sparser vectors, as reflected in the lower FLOPS values in the evaluation table. This reduces retrieval cost and, as a secondary benefit, lowers storage requirements for indexing.
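
The first point is simple arithmetic, sketched below under simplifying assumptions: documents are split into non-overlapping chunks, special tokens are ignored, and each chunk yields one embedding. The document length is hypothetical and `num_chunks` is an illustrative helper name.

```python
import math


def num_chunks(num_tokens, context_length):
    """Chunks (and therefore embeddings) needed to cover a document."""
    return math.ceil(num_tokens / context_length)


doc_tokens = 10_000  # hypothetical document length in tokens
chunks_512 = num_chunks(doc_tokens, 512)
chunks_1024 = num_chunks(doc_tokens, 1024)
print(chunks_512, chunks_1024)  # 20 10
```

Because the custom tokenizer also tends to produce fewer tokens for the same text, the token count itself shrinks, which compounds with the longer context.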

Usage

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


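# collapse MLM logits of shape (batch_size, seq_len, vocab_size) into one
# sparse vector per input of shape (batch_size, vocab_size)
def get_sparse_vector(feature, output):
    # max-pool over the sequence; padding positions are zeroed by the attention mask
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    # log-saturate the non-negative activations
    values = torch.log(1 + torch.relu(values))
    # zero out special tokens (special_token_ids is set after the tokenizer is loaded)
    values[:, special_token_ids] = 0
    return values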
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("atomic-canyon/fermi-1024")
tokenizer = AutoTokenizer.from_pretrained("atomic-canyon/fermi-1024")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
id_to_token = [""] * tokenizer.vocab_size
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token

query = "What is the maximum heat load per spent fuel assembly for the EOS-37PTH?"
document = "For the EOS-37PTH DSC, add two new heat load zone configurations (HLZCs) for the EOS37PTH for higher heat load assemblies, up to 3.5 kW/assembly, that also allow for damaged and failed fuel storage."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
# inference only, so skip gradient tracking
with torch.no_grad():
    output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)


# print the weights of tokens shared by the query and the document,
# ordered by their weight in the query
query_token_weight, document_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_token_weight[token], token))
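
One way token-weight dictionaries like those above could be served at query time is an inverted index, sketched below in plain Python. This is an illustration of the general technique, not part of the model or its tooling; the document IDs, tokens, and weights are invented, and `search` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical per-document sparse vectors as {token: weight} dicts.
docs = {
    "doc-a": {"heat": 0.8, "load": 1.1, "assembly": 0.5},
    "doc-b": {"fuel": 0.9, "storage": 0.6},
}

# Build the inverted index: token -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for token, weight in vec.items():
        index[token].append((doc_id, weight))


def search(query_vec, index):
    """Score documents by accumulating query weight * document weight."""
    scores = defaultdict(float)
    for token, q_weight in query_vec.items():
        # Only postings for tokens present in the query are touched.
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = search({"heat": 1.2, "load": 0.9, "fuel": 0.7}, index)
for doc_id, doc_score in results:
    print(doc_id, round(doc_score, 2))
```

Sparser vectors mean shorter posting lists and fewer multiply-adds per query.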

Acknowledgement

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.