`IndexError: index out of range in self` when creating embeddings

#7 opened by guido1893

I'm using T-Systems-onsite/cross-en-de-roberta-sentence-transformer as the embedding model for creating a privateGPT chatbot.

My setup is:

Hardware: Ubuntu Server with 48 CPUs
Source documents: one PDF, around 100 pages
llm_hf_repo_id: TheBloke/Leo-Mistral-Hessianai-7B-Chat-GGUF
llm_hf_model_file: leo-mistral-hessianai-7b-chat.Q4_K_M.gguf
embedding_hf_model_name: T-Systems-onsite/cross-en-de-roberta-sentence-transformer

Now, my problem is:

When I ingest the PDF file to create the embeddings, I run into the following error at around 70% of the ingestion:

 File "/*****/.cache/pypoetry/virtualenvs/private-gpt-igPs2cci-py3.11/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self

Does that mean that T-Systems-onsite/cross-en-de-roberta-sentence-transformer cannot handle long PDFs?
Or do I need to set some parameters/options?
Is this a problem with privateGPT or with T-Systems-onsite/cross-en-de-roberta-sentence-transformer?

T-Systems on site services GmbH org

Can you please use Sentence Transformers to load this model and do some tests?
Code can be found here: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

I guess this is an issue specific to privateGPT.

Hope that helps. If you still have problems, please give me code to reproduce the error.

Is there an update on this?
I'm currently having a similar issue.

Edit:
In LlamaIndex you can set the maximum token length to 512, which solves the problem for me (see the snippet below).
# Embeddings model (import path may differ depending on your llama-index version)
from llama_index.embeddings import HuggingFaceEmbedding

embed_model_base = HuggingFaceEmbedding(
    model_name=config.EMBEDDING_MODEL_NAME,
    max_length=512,
)
