license: apache-2.0
language:
  - en
  - zh
pipeline_tag: token-classification

BertChunker

Introduction

BertChunker is a text chunker, trained end to end, that splits text into chunks for retrieval-augmented generation (RAG). It is based on MiniLM-L6-H384-uncased with an adapter.
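One way to picture chunking as token classification: a model scores each token for how likely it is to start a new chunk, and the text is cut before every token whose score crosses a threshold. The sketch below is illustrative only. The function `split_at_boundaries`, the whitespace tokenization, the made-up scores and the 0.5 threshold are all assumptions for this example and are not taken from `modeling_bertchunker.py`, whose actual logic may differ.

```python
def split_at_boundaries(text, offsets, scores, threshold=0.5):
    """Cut `text` before every token whose boundary score exceeds `threshold`.

    offsets: list of (start, end) character spans, one per token.
    scores:  list of floats, one per token (hypothetical model output).
    """
    cut_points = [
        start
        for (start, _end), score in zip(offsets, scores)
        if score > threshold and start > 0
    ]
    chunks = []
    previous = 0
    for cut in cut_points:
        chunks.append(text[previous:cut].strip())
        previous = cut
    chunks.append(text[previous:].strip())
    # Drop empty pieces that can appear when two cuts are adjacent.
    return [c for c in chunks if c]


text = "Sarah wrote in the library. The ship flew to Zephyr-7."

# Whitespace tokenization stands in for a real tokenizer here.
offsets = []
position = 0
for word in text.split():
    start = text.index(word, position)
    offsets.append((start, start + len(word)))
    position = start + len(word)

# Made-up scores: only the token "The" (index 5) is marked as a chunk start.
scores = [0.0] * len(offsets)
scores[5] = 0.9

chunks = split_at_boundaries(text, offsets, scores)
print(chunks)
```

In a real model the scores would come from the classification head rather than being set by hand.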

This repo includes the model checkpoint, the BertChunker class definition file, and all other files needed to run it.

Quickstart

Download this repository and change into its directory. Then run the following:

import safetensors.torch  # import the submodule explicitly so load_file is available
from transformers import AutoConfig, AutoTokenizer
from modeling_bertchunker import BertChunker

# load bert tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./",
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)

# load MiniLM-L6-H384-uncased bert config
config = AutoConfig.from_pretrained(
    "./",
    trust_remote_code=True,
)

# initialize model
model = BertChunker(config)
device = "cuda"  # requires a CUDA-capable GPU
model.to(device)

# load parameters
state_dict = safetensors.torch.load_file("./model.safetensors")
model.load_state_dict(state_dict)
model.eval()  # switch to inference mode (disables dropout)

# text to be chunked (adjacent string literals are joined, avoiding stray indentation in the text)
text = (
    "In the heart of the bustling city, where towering skyscrapers touch the clouds and the symphony "
    "of honking cars never ceases, Sarah, an aspiring novelist, found solace in the quiet corners of the ancient library. "
    "Surrounded by shelves that whispered stories of centuries past, she crafted her own world with words, oblivious to the rush outside. "
    "Dr. Alexander Thompson, aboard the spaceship 'Pandora's Venture', was en route to the newly discovered exoplanet Zephyr-7. "
    "As the lead astrobiologist of the expedition, his mission was to uncover signs of microbial life within the planet's subterranean ice caves. "
    "With each passing light year, the anticipation of unraveling secrets that could alter humanity's "
    "understanding of life in the universe grew ever stronger."
)

# chunk the text
chunks = model.chunk_text(text, tokenizer)

# print chunks, one separator line per chunk
for i, c in enumerate(chunks):
    print(f"---------- chunk {i} ----------")
    print(c)
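Once the chunks are printed, a typical next step in a RAG pipeline is to package them as records for an index. The following is a minimal sketch, assuming `chunks` is a list of strings as the loop above suggests. The helper `build_records`, the record fields and the `source` label are hypothetical and not part of this repository.

```python
def build_records(chunks, source="example.txt"):
    """Turn chunk strings into dictionaries ready for an index or vector store."""
    records = []
    for chunk in chunks:
        cleaned = " ".join(chunk.split())  # collapse runs of whitespace
        if not cleaned:
            continue  # skip chunks that are empty after cleaning
        records.append(
            {
                "id": f"{source}-{len(records)}",
                "text": cleaned,
                "n_chars": len(cleaned),
            }
        )
    return records


# Stand-in for the output of model.chunk_text(text, tokenizer).
example_chunks = [
    "Sarah found solace   in the library.",
    "  ",
    "Dr. Thompson was en route to Zephyr-7.",
]
records = build_records(example_chunks)
for r in records:
    print(r["id"], r["n_chars"], r["text"])
```

Collapsing whitespace here is a design choice for cleaner index entries; skip it if exact character offsets into the source text matter downstream.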