bert-chunker

Paper | GitHub

Introduction

bert-chunker is a text chunker built on BERT with a classifier head that predicts the start token of each chunk (for use in RAG and similar pipelines). Using a sliding window, it can cut documents of any size into chunks. It was finetuned on top of nreimers/MiniLM-L6-H384-uncased. The whole training took 10 minutes on an Nvidia P40 GPU with a 50 MB synthesized dataset.
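To make the mechanism concrete, here is a minimal sketch of how a start-token classifier plus a sliding window can chunk text of arbitrary length. This is an illustration, not the actual BertChunker code: the function name, the non-overlapping window handling, and the assumption that the model returns HF-style .logits of shape (batch, seq_len, 2) are all hypothetical.

# Illustrative sketch only, not the actual BertChunker implementation:
# a token-classification head scores every token as "starts a new chunk"
# or not, and a window slides over the token sequence so documents longer
# than the model's context are still fully covered.
import torch

def sliding_window_chunk(text, tokenizer, model, window=255, threshold=0.5):
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    ids, offsets = enc["input_ids"], enc["offset_mapping"]
    start_tokens = set()
    # non-overlapping windows for simplicity; a real implementation may overlap
    for lo in range(0, len(ids), window):
        batch = torch.tensor([ids[lo:lo + window]])
        with torch.no_grad():
            logits = model(input_ids=batch).logits  # assumed shape (1, seq_len, 2)
        probs = logits.softmax(dim=-1)[0, :, 1]  # P(token starts a chunk)
        start_tokens.update(lo + i for i, p in enumerate(probs.tolist()) if p > threshold)
    # cut the original string at the character offsets of predicted start tokens
    cuts = sorted({0} | {offsets[i][0] for i in start_tokens}) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if text[a:b].strip()]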

This repo includes the model checkpoint, the BertChunker class definition file, and all the other files needed.

Quickstart

Download this repository and enter its directory, then run the following:

import safetensors.torch
from transformers import AutoConfig, AutoTokenizer
from modeling_bertchunker import BertChunker

# load bert tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tim1900/bert-chunker",
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)

# load MiniLM-L6-H384-uncased bert config
config = AutoConfig.from_pretrained(
    "tim1900/bert-chunker",
    trust_remote_code=True,
)

# initialize model
model = BertChunker(config)
device='cpu' # or 'cuda'
model.to(device)

# load the weights (model.safetensors from tim1900/bert-chunker)
state_dict = safetensors.torch.load_file("./model.safetensors")
model.load_state_dict(state_dict)

# text to be chunked; it can be any size
text='''In the heart of the bustling city, where towering skyscrapers touch the clouds and the symphony 
of honking cars never ceases, Sarah, an aspiring

novelist, found solace in the quiet
 corners of the ancient library. 
Surrounded by shelves that whispered stories of centuries

past, she crafted her own world with words, oblivious to the rush outside. Dr. Alexander Thompson, aboard the spaceship 'Pandora's Venture', was en route to the newly discovered exoplanet Zephyr-7. 
As the lead astrobiologist of the expedition, his mission was to uncover signs of microbial life within the planet's subterranean ice caves. With each passing light year, the anticipation of unraveling secrets that could alter humanity's
understanding of life in the universe grew ever stronger.'''

# chunk the text. prob_threshold should be in (0, 1); the lower it is, the more chunks are generated.
chunks=model.chunk_text(text, tokenizer, prob_threshold=0.5)

# print chunks
for i, c in enumerate(chunks):
    print(f'-----chunk: {i}------------')
    print(c)
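
If the chunks come out too coarse or too fine, adjust prob_threshold. A quick illustration (the threshold values here are arbitrary):

# lower threshold -> more, smaller chunks; higher threshold -> fewer, larger chunks
fine_chunks = model.chunk_text(text, tokenizer, prob_threshold=0.3)
coarse_chunks = model.chunk_text(text, tokenizer, prob_threshold=0.8)
print(len(fine_chunks), len(coarse_chunks))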