`IndexError: index out of range in self` when creating embeddings

#7 opened by guido1893

I'm using T-Systems-onsite/cross-en-de-roberta-sentence-transformer as the embedding model for creating a privateGPT chatbot.

My setup is:

Hardware: Ubuntu Server with 48 CPUs
Source documents: one PDF, around 100 pages
llm_hf_repo_id: TheBloke/Leo-Mistral-Hessianai-7B-Chat-GGUF
llm_hf_model_file: leo-mistral-hessianai-7b-chat.Q4_K_M.gguf
embedding_hf_model_name: T-Systems-onsite/cross-en-de-roberta-sentence-transformer

Now, my problem is:

When I ingest the PDF file to create the embeddings, I run into the following error at around 70% of the ingestion:

 File "/*****/.cache/pypoetry/virtualenvs/private-gpt-igPs2cci-py3.11/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self

Does that mean that T-Systems-onsite/cross-en-de-roberta-sentence-transformer cannot handle long PDFs?
Or do I need to set some parameters/options?
Is this a problem with privateGPT or with T-Systems-onsite/cross-en-de-roberta-sentence-transformer?

T-Systems on site services GmbH org

Can you please use Sentence Transformers to load this model and do some tests?
Code can be found here: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

I guess this is an issue specific to privateGPT.

Hope that helps. If you still have problems, please give me code to reproduce the error.

Is there an update on this?
I'm currently having a similar issue.

Edit:
In LlamaIndex you can set the maximum token length to 512, which solves the problem for me (see the snippet below).
# Embeddings model (import path may differ depending on your llama-index version)
from llama_index.embeddings import HuggingFaceEmbedding

embed_model_base = HuggingFaceEmbedding(
    model_name=config.EMBEDDING_MODEL_NAME,
    max_length=512,
)
