tim1900 committed
Commit 6d51e4f · verified · 1 Parent(s): 6bcdbb0

Update README.md

Files changed (1)
  1. README.md +64 -3
README.md CHANGED
@@ -1,3 +1,64 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ pipeline_tag: token-classification
+ ---
+ # BertChunker
+
+ ## Introduction
+
+ BertChunker is a text chunker, trained end to end, that splits text into chunks for retrieval-augmented generation (RAG). It is trained on top of [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) with an adapter.
+
+ This repo includes the model checkpoint, the BertChunker class definition file, and all other files needed to run it.
+
+ ## Quickstart
+ Download this repository, change into its directory, and run the following:
+
+ ```python
+ import safetensors.torch  # import the submodule explicitly so load_file is available
+ import torch
+ from transformers import AutoConfig, AutoTokenizer
+ from modeling_bertchunker import BertChunker
+
+ # load bert tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(
+     "./",
+     padding_side="right",
+     model_max_length=255,
+     trust_remote_code=True,
+ )
+
+ # load MiniLM-L6-H384-uncased bert config
+ config = AutoConfig.from_pretrained(
+     "./",
+     trust_remote_code=True,
+ )
+
+ # initialize model; use the GPU when available, otherwise fall back to CPU
+ model = BertChunker(config)
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+
+ # load parameters
+ state_dict = safetensors.torch.load_file("./model.safetensors")
+ model.load_state_dict(state_dict)
+ model.eval()  # inference mode: disables dropout
+
+ # text to be chunked
+ text = "In the heart of the bustling city, where towering skyscrapers touch the clouds and the symphony \
+ of honking cars never ceases, Sarah, an aspiring novelist, found solace in the quiet corners of the ancient library. \
+ Surrounded by shelves that whispered stories of centuries past, she crafted her own world with words, oblivious to the rush outside. \
+ Dr. Alexander Thompson, aboard the spaceship 'Pandora's Venture', was en route to the newly discovered exoplanet Zephyr-7. \
+ As the lead astrobiologist of the expedition, his mission was to uncover signs of microbial life within the planet's subterranean ice caves. \
+ With each passing light year, the anticipation of unraveling secrets that could alter humanity's \
+ understanding of life in the universe grew ever stronger."
+
+ # chunk the text
+ chunks = model.chunk_text(text, tokenizer)
+
+ # print chunks, each preceded by a separator line
+ for c in chunks:
+     print("------------------")
+     print(c)
+ ```
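
Since the README positions BertChunker as a chunker for RAG, a natural next step after the quickstart is to turn the returned chunks into records that can be embedded and indexed. The following is a minimal sketch, not part of the repository. It assumes `chunk_text` returns a list of strings (which the print loop in the quickstart suggests), and the helper name `build_chunk_records`, the `doc_id` argument, and the record layout are all hypothetical.

```python
import re
from typing import Dict, List


def build_chunk_records(chunks: List[str], doc_id: str) -> List[Dict[str, str]]:
    """Turn raw chunk strings into simple records for embedding or indexing.

    Assumption: `chunks` is a list of strings, as the quickstart's print
    loop suggests. `doc_id` and the record layout are hypothetical.
    """
    records: List[Dict[str, str]] = []
    for chunk in chunks:
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = re.sub(r"\s+", " ", chunk).strip()
        if not text:
            continue  # skip chunks that are empty after cleaning
        # Number the kept chunks consecutively within the document.
        records.append({"id": f"{doc_id}-{len(records)}", "text": text})
    return records


# Stand-in for the output of model.chunk_text(text, tokenizer).
example_chunks = [
    "Sarah, an aspiring novelist, found solace   in the ancient library. ",
    "   ",
    "Dr. Alexander Thompson was en route\nto the exoplanet Zephyr-7.",
]

for record in build_chunk_records(example_chunks, doc_id="demo"):
    print(record["id"], "|", record["text"])
```

In practice, `example_chunks` would be replaced by the `chunks` list from the quickstart, and each record's `text` would be passed to whatever embedding model the retrieval pipeline uses.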