tags:
- learned sparse
- transformers
- retrieval
- passage-retrieval
- document-expansion
- bag-of-words
license: apache-2.0
language: en
base_model:
- atomic-canyon/fermi-bert-1024
fermi-1024: Sparse Retrieval Model for Nuclear Power
This sparse retrieval model is optimized for nuclear-specific applications. It encodes both queries and documents into high-dimensional sparse vectors, where the non-zero dimensions correspond to specific tokens in the vocabulary, and their values indicate the relative importance of those tokens.
The vocabulary, and thus the sparse embeddings, are based on a nuclear-specific tokenizer. For example, terms like "NRC" are represented as single tokens rather than being split into multiple tokens. This approach improves both accuracy and efficiency. To achieve this, we trained a nuclear-specific BERT base model.
Specifications
- Developed by: Atomic Canyon
- Finetuned from model: fermi-bert-1024
- Context Length: 1024
- Vocab Size: 30522
- License:
Apache 2.0
Training
fermi-1024
was trained on MS MARCO Passage Dataset using the LSR framework using the teacher model ms-marco-MiniLM-L-6-v2. Trained on the Oak Ridge National Laboratory Frontier supercomputer using MI250X AMD GPUs.
Evaluation
The sparse embedding model was primarily evaluated for its effectiveness in information retrieval within the nuclear energy domain. Due to the absence of domain-specific benchmarks, we developed FermiBench to assess the model’s performance on nuclear-related texts. In addition, the model was tested on the MS MARCO dev split and the BEIR benchmark to ensure broader applicability. The model demonstrates strong retrieval capabilities, particularly in handling nuclear-specific jargon and documents.
Although there are standard benchmarks and tooling for evaluating dense embedding models, we found no open, standardized tooling for evaluating sparse embedding models. To support the community, we are releasing our benchmark tooling, built on top of BEIR and pyserini. All evaluation numbers were produced with that tool and should therefore be reproducible.
Model | FermiBench NDCG@10 | FermiBench FLOPS | MSMarco Dev NDCG@10 | BEIR* NDCG@10 | BEIR* FLOPS |
---|---|---|---|---|---|
fermi-512 | 0.74 | 7.07 | 0.45 | 0.46 | 9.14 |
fermi-1024 | 0.72 | 4.75 | 0.44 | 0.46 | 7.5 |
splade-cocondenser-ensembledistil | 0.64 | 12.9 | 0.45 | 0.46 | 12.4 |
* BEIR benchmark was a subset containng trec-covid, nfcorpus, arguana, scidocs, scifact.
Efficiency
Given the massive scale of documentation in nuclear energy, efficiency is crucial. Our model addresses this in several ways:
- Our 1024-length embedding model reduces the number of required embeddings by half, significantly lowering computational costs.
- The custom tokenizer, designed for nuclear-specific jargon, encodes documents and queries using fewer tokens, improving computational efficiency.
- Additionally, our models produce sparser vectors, reducing FLOPs and, as a secondary benefit, lowering storage requirements for indexing.
Usage
import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
values = torch.log(1 + torch.relu(values))
values[:,special_token_ids] = 0
return values
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
tokens = [id_to_token[_id] for _id in token_indices.tolist()]
output = []
end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
for i in range(len(end_idxs)-1):
token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
output.append(dict(zip(token_strings, weights)))
return output
# load the model
model = AutoModelForMaskedLM.from_pretrained("atomic-canyon/fermi-1024")
tokenizer = AutoTokenizer.from_pretrained("atomic-canyon/fermi-1024")
# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
id_to_token = [""] * tokenizer.vocab_size
for token, _id in tokenizer.vocab.items():
id_to_token[_id] = token
query = "What is the maximum heat load per spent fuel assembly for the EOS-37PTH?"
document = "For the EOS-37PTH DSC, add two new heat load zone configurations (HLZCs) for the EOS37PTH for higher heat load assemblies, up to 3.5 kW/assembly, that also allow for damaged and failed fuel storage."
# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)
# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)
query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
if token in document_query_token_weight:
print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
Acknowledgement
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.