model max_seq_length?
What is the model max length? Is it 384 tokens or 384 characters?
Hi, it is the maximum token length.
So the maximum token length comes from the DeBERTa tokenization, not from just splitting the text by spaces.
Please correct me if I am wrong.
That is correct. The maximum token length specifies how many tokens, created by the DeBERTa tokenization, can be processed at the same time.
If more tokens are provided, the input text is truncated to the maximum token length.
To process the text sentence by sentence, I would suggest trying out gliner-spacy. It is a wrapper that enables using the GLiNER model with spaCy. With spaCy, you can split the text into sentences using the built-in methods and then use gliner-spacy to extract entities (see the sketch below).
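For illustration, here is a minimal sketch of that idea. It skips the gliner-spacy wrapper and uses spaCy only for the sentence splitting, calling GLiNER's predict_entities on each sentence directly; the labels and example text are placeholders:

import spacy
from gliner import GLiNER

# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")
labels = ["person", "date"]

# build a minimal spaCy pipeline that only splits sentences
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "John Doe was born on 15-08-1985. He now lives in Berlin."
for sent in nlp(text).sents:
    # each sentence is processed on its own, so it stays well under the limit
    for entity in model.predict_entities(sent.text, labels=labels, threshold=0.5):
        print(entity["text"], "=>", entity["label"])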
Thank you. This fine-tune is really good.
Glad to hear that 😊
So I verified with the microsoft/mdeberta-v3-base tokenization. The max length from that tokenization and from this fine-tuned model are really different.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
ids = tokenizer(text, add_special_tokens=False, max_length=256, stride=10, return_overflowing_tokens=True, truncation=True, padding=False)
len(ids.input_ids[0])  # 256
text = tokenizer.decode(ids.input_ids[0])
model.predict_entities(text, labels=labels, threshold=0.5)
gliner/data_processing/processor.py:206: UserWarning: Sentence of length 422 has been truncated to 384
I still get this warning. Please help.
The GLiNER models use a different approach to tokenization. They have a method called token_splitter, which returns a generator that yields the tokens of the original text that the GLiNER models can use. If the text is longer, it gets truncated (as you already found).
If you want to ensure the input text is not too long, you can use the token_splitter to first determine the tokens and their number.
from gliner import GLiNER
# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")
# prepare the text for entity extraction
text = "This is an example text having the name John Doe and the date 15-08-1985."
# create the token generator
token_generator = model.token_splitter(text)
# get the tokens
tokens = [t for t in token_generator]
len(tokens) # length = 15
Using this length, you can decide how you want to split your text so it does not get truncated.
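For example, here is a rough sketch of such chunking. It assumes each item yielded by the splitter is a plain string (if it yields (token, start, end) tuples instead, use the offsets to slice the original text), and rejoining with spaces is only an approximation of the original whitespace:

MAX_TOKENS = 384  # the model's maximum token length
# group the tokens into windows the model can process without truncation
chunks = [" ".join(tokens[i : i + MAX_TOKENS]) for i in range(0, len(tokens), MAX_TOKENS)]
# extract entities from each chunk separately
for chunk in chunks:
    entities = model.predict_entities(chunk, labels=labels, threshold=0.5)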
Hope this helps.
/python3.10/site-packages/torch/nn/modules/module.py:1709, in Module.__getattr__(self, name)
   1707 if name in modules:
   1708     return modules[name]
-> 1709 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'GLiNER' object has no attribute 'token_splitter'
The GLiNER module seems to have been updated. After looking into the official GLiNER code, I found that the token_splitter can be accessed in the following way:
token_generator = model.data_processor.token_splitter(text)
This should do the trick if you are using the latest version of GLiNER.
Also, since this does not seem to be a problem with the model itself but rather a question of how to solve the original author's issue, I am closing this thread.
Oh, thank you, but:
----> 1 model.data_processor.token_splitter(text)
AttributeError: 'SpanProcessor' object has no attribute 'token_splitter'
Can you tell me which version you are using? I am using the latest version.
The GLiNER module I tested with has version 0.2.10.
I double-checked the example above. The method is called words_splitter and not token_splitter; I mistakenly copied the wrong parts. Apologies.
The complete (fixed) code for counting the number of tokens is the following:
from gliner import GLiNER
# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")
# prepare the text for entity extraction
text = "This is an example text having the name John Doe and the date 15-08-1985."
# create the token generator
# NOTE: the use of `words_splitter`
token_generator = model.data_processor.words_splitter(text)
# get the tokens
tokens = [t for t in token_generator]
len(tokens) # length = 15
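As a follow-up, you can compare this count against the model's maximum token length (384, taken from the truncation warning above) before calling predict_entities; this check is just an illustration:

MAX_TOKENS = 384  # from the truncation warning above
if len(tokens) > MAX_TOKENS:
    print(f"Input has {len(tokens)} tokens; GLiNER will truncate it to {MAX_TOKENS}")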
Hope it resolves the problem.