Token Classification
GLiNER
PyTorch

model max_seq_length?

#1
by abpani1994 - opened

What is the model max length? Is it 384 tokens or 384 characters?

Department for Artificial Intelligence, Jožef Stefan Institute org

Hi, it is the maximum token length.

So the maximum token length comes from the DeBERTa tokenization, not from just splitting the text by spaces?
Please correct me if I am wrong.

Department for Artificial Intelligence, Jožef Stefan Institute org

That is correct. The maximum token length specifies how many tokens, created by the DeBERTa tokenization, can be processed at the same time.

If more tokens are provided, the input text is truncated to the maximum token length.
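The truncation behaviour can be illustrated with a short sketch (the 422/384 numbers mirror the warning quoted later in this thread; the token strings are placeholders):

```python
# Illustration of the truncation described above: inputs longer than
# the model's maximum token length keep only the first 384 tokens;
# everything after that is silently dropped.
MAX_SEQ_LENGTH = 384
tokens = [f"tok{i}" for i in range(422)]    # e.g. a 422-token input
truncated = tokens[:MAX_SEQ_LENGTH]
len(truncated)  # -> 384
```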

To use the model to process the text sentence by sentence, I would suggest trying out gliner-spacy. It is a wrapper which enables using the GLiNER model with spaCy. With spaCy, you can split the text into sentences using the built-in methods, and then use gliner-spacy to extract entities.
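A minimal sketch of the sentence-by-sentence pattern suggested above. A naive regex splitter stands in for spaCy's sentencizer so the snippet stays self-contained; the `model`/`labels` names in the final comment are the ones used elsewhere in this thread:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    # In practice, use spaCy's sentencizer (via gliner-spacy) instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = split_sentences("John Doe was born on 15-08-1985. He lives in Ljubljana.")
# Each sentence is then short enough to stay under the 384-token limit:
# for sent in sentences:
#     entities = model.predict_entities(sent, labels=labels, threshold=0.5)
```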

Thank you. This fine-tune is really good.

abpani1994 changed discussion status to closed
Department for Artificial Intelligence, Jožef Stefan Institute org

Glad to hear that 😊

So I verified with the microsoft/mdeberta-v3-base tokenizer. The max length from that tokenization and from this fine-tuned model are really different:
ids = tokenizer(text, add_special_tokens=False, max_length=256, stride=10, return_overflowing_tokens=True, truncation=True, padding=False)
len(ids.input_ids[0])  # 256
text = tokenizer.decode(ids.input_ids[0])
model.predict_entities(text, labels=labels, threshold=0.5)
gliner/data_processing/processor.py:206: UserWarning: Sentence of length 422 has been truncated to 384

Still I get this truncation warning.

abpani1994 changed discussion status to open

Please help

Department for Artificial Intelligence, Jožef Stefan Institute org

The GLiNER models use a different approach to tokenization. They have a method called token_splitter, which returns a generator yielding the tokens of the original text that the GLiNER models can use. If the text is longer, it gets truncated (as you already found).

If you want to ensure the input text is not too long, you can use the token_splitter to first determine the tokens and their number.

from gliner import GLiNER

# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")

# prepare the text for entity extraction
text = "This is an example text having the name John Doe and the date 15-08-1985."

# create the token generator
token_generator = model.token_splitter(text)

# get the tokens
tokens = [t for t in token_generator]
len(tokens) # length = 15

Using this length, you can decide how you want to split your text so it does not get truncated.
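One way to act on that advice is to chunk the word list into windows that fit under the limit. This is a sketch, not GLiNER API: the word list would come from the splitter shown above, but a whitespace split stands in here so the snippet is self-contained:

```python
# Split a long text into chunks of at most `max_words` words so that
# nothing gets truncated at the model's 384-token limit.
def chunk_words(words: list[str], max_words: int = 384) -> list[str]:
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

words = ("word " * 1000).split()          # stand-in for the splitter output
chunks = chunk_words(words, max_words=384)
[len(c.split()) for c in chunks]          # chunk sizes: 384, 384, 232
# Each chunk can then be passed to model.predict_entities(...) separately.
```

Note that entities spanning a chunk boundary would be missed; splitting on sentence boundaries (as suggested earlier) avoids that.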

Hope this helps.

/python3.10/site-packages/torch/nn/modules/module.py:1709, in Module.__getattr__(self, name)
   1707 if name in modules:
   1708     return modules[name]
-> 1709 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'GLiNER' object has no attribute 'token_splitter'

Department for Artificial Intelligence, Jožef Stefan Institute org

The GLiNER module seems to have been updated. After looking into the official GLiNER code, I found that the token_splitter can be accessed in the following way:

token_generator = model.data_processor.token_splitter(text)

This should do the trick if you are using the latest version of GLiNER.

Department for Artificial Intelligence, Jožef Stefan Institute org

Also, since this does not seem to be a problem with the model but rather with how to solve the original author's issue, I am closing this thread.

eriknovak changed discussion status to closed

Oh, thank you, but:

----> 1 model.data_processor.token_splitter(text)

AttributeError: 'SpanProcessor' object has no attribute 'token_splitter'

Can you tell which version you are using? I am using the latest version.

Department for Artificial Intelligence, Jožef Stefan Institute org
edited Sep 5

The GLiNER module I tested with is version 0.2.10.

I double-checked the example above. The method is called words_splitter, not token_splitter. I mistakenly copied the wrong parts. Apologies.

The complete (fixed) code for counting the number of tokens is the following:

from gliner import GLiNER

# load the model
model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")

# prepare the text for entity extraction
text = "This is an example text having the name John Doe and the date 15-08-1985."

# create the token generator
# NOTE: the use of `words_splitter`
token_generator = model.data_processor.words_splitter(text)

# get the tokens
tokens = [t for t in token_generator]
len(tokens) # length = 15

Hope it resolves the problem.
