How to handle the 384-word limit?
What is the best way to chunk longer texts? Thanks
One way: split the text into chunks smaller than 384 characters, extract entities from each chunk individually, and append the results.
See my code below:
from gliner import GLiNER

# -------------------------------------
# INPUT
# CHANGE: Example long text from which to extract entities
text = "e.g. this is a very long text of over 384 characters from which I want to extract the found entities: John is in his house at 22 Street name."
text = text * 100

# CHANGE: Labels for the named entity recognition (NER) model
labels = ["person", "location", "address"]

# CHANGE: Threshold for NER model confidence
ner_threshold = 0.5

# CHANGE: Load the pre-trained GLiNER model
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")

# -------------------------------------
# Initialize a list to store all extracted entities
all_entities = []

# -------------------------------------
# Function to chunk text into segments of at most max_length characters
def chunk_text(text, max_length=384):
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

# -------------------------------------
# Check if the text needs to be chunked
if len(text) > 384:
    chunks = chunk_text(text)
    print("Number of chunks:", len(chunks))
else:
    chunks = [text]

# Predict entities for each chunk of text
for chunk in chunks:
    entities = model.predict_entities(chunk, labels, threshold=ner_threshold)
    all_entities.extend(entities)

# -------------------------------------
# Output all found entities
print(all_entities)
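Since the limit is counted in words/tokens rather than characters, splitting on whitespace is closer to what the model actually measures. Below is a minimal sketch of that variant; the helper chunk_text_by_words and the offset bookkeeping are my own additions, not part of the GLiNER API. It reuses text, model, labels and ner_threshold from the script above, assumes whitespace splitting is a reasonable proxy for the model's tokenizer, and shifts each entity's chunk-relative start/end back to approximate positions in the whitespace-normalized full text:

# Minimal sketch (not part of GLiNER): chunk by whitespace-separated words instead of characters
def chunk_text_by_words(text, max_words=384):
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

all_entities = []
offset = 0
for chunk in chunk_text_by_words(text):
    for entity in model.predict_entities(chunk, labels, threshold=ner_threshold):
        # predict_entities returns offsets relative to the chunk; shift them so they
        # refer to (approximate) positions in the re-joined full text
        entity["start"] += offset
        entity["end"] += offset
        all_entities.append(entity)
    offset += len(chunk) + 1  # +1 for the single space separating consecutive chunks

print(all_entities)

With either approach, an entity that happens to straddle a chunk boundary can still be cut in two; if that matters for your data, you could split on sentence boundaries instead and group sentences into chunks that stay under the limit.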