Facing Issues with Model Output and Inference Times

#78
by ankity09 - opened

I am implementing a RAG architecture with ChromaDB as my vector store and Falcon-7B as my LLM, tied together with LangChain's retriever. While testing with a single PDF and the retriever set to return the top 3 matches, I ran into a number of issues.

  1. The returned answers are not accurate (I tried different temperature settings).
  2. The model takes a long time and then responds with the same sentence repeated multiple times. (Increasing the repetition penalty mitigated this to an extent.)
  3. The model sometimes does not return an answer for extended periods, in some cases more than 10-15 minutes.
  4. The model's responses are slow, up to 5x slower in some cases than models like Llama-2 7B or 13B.
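For reference, a minimal sketch of the generation settings that items 1 and 2 relate to, assuming the Hugging Face transformers `generate` API. The helper name `build_generation_kwargs` and the default values are illustrative assumptions, not the settings used in the runs described here, and reusing the EOS token as the pad token is likewise an assumption.

```python
def build_generation_kwargs(eos_token_id, max_new_tokens=256, repetition_penalty=1.2):
    """Build keyword arguments intended for `model.generate(**inputs, **kwargs)`.

    All values are illustrative placeholders, not tuned recommendations.
    """
    return {
        # Cap the answer length so a looping generation cannot run unbounded.
        "max_new_tokens": max_new_tokens,
        # Greedy decoding: temperature has no effect when sampling is off.
        "do_sample": False,
        # Values above 1.0 penalize tokens that were already generated.
        "repetition_penalty": repetition_penalty,
        # Stop when the end-of-sequence token is produced.
        "eos_token_id": eos_token_id,
        # Assumption: reuse EOS as the pad token when the tokenizer defines none.
        "pad_token_id": eos_token_id,
    }
```

The dictionary would be passed as `model.generate(**inputs, **build_generation_kwargs(tokenizer.eos_token_id))`. Whether these settings resolve the repetition or latency reported above is untested.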

I reduced the returned search results from 3 to 1, which improved both accuracy and response time somewhat; however, the model still stops responding after being queried 3-4 times.
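Since prompt size seems to affect both accuracy and latency here, a middle ground between 3 results and 1 could be to cap the retrieved context by length. A rough sketch follows; `trim_context` and the character budget are hypothetical, and character count is only a crude proxy for token count.

```python
def trim_context(chunks, max_chars=2000):
    """Keep retrieved chunks, in ranked order, until the character budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        # Stop at the first chunk that would exceed the budget.
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept
```

The trimmed chunks would then be joined into the prompt in place of the full retriever output.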

All of these issues have been reported in some form previously:

Wrong Output

while giving a input but getting the wrong output for the particular input
falcon-7b-instruct is answering out of context
Repeats the same sentence
any success in In-context question-answering?
Model keeps generating multiple rounds of conversation

Model is Slow or does not give output

Slow inference
4th inference in a row does not work for Falcon7B in 8 or 4 bit

I am using the 16-bit version of the model, running on two T4 GPUs on AWS.

Please let me know if there are any workarounds or fixes for the above.

Thanks

When I set the number of search results returned from the vector DB to 3 (larger prompt), the model takes:
1st question (answer is wrong)
CPU times: user 2min 39s, sys: 391 ms, total: 2min 39s
2nd question (answer is wrong)
CPU times: user 47.4 s, sys: 7.26 ms, total: 47.4 s
and then it does not respond from the third question onwards.

When I decrease the results to 1 (smaller prompt), the model takes:
1st question (answer is right)
CPU times: user 44.6 s, sys: 288 ms, total: 44.9 s
2nd question (answer is somewhat right)
CPU times: user 17.4 s, sys: 0 ns, total: 17.4 s
and then it does not respond from the third question onwards, as above.
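The figures above are CPU times, which can differ from the time actually spent waiting on the GPU. A small sketch for recording wall-clock time per query is shown below; `timed_query` and the `ask` callable are hypothetical stand-ins for whatever function runs the retrieval chain.

```python
import time


def timed_query(ask, question):
    """Run `ask(question)` and return (answer, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    answer = ask(question)
    elapsed = time.perf_counter() - start
    return answer, elapsed
```

Calling this for each question in sequence would show whether latency grows from query to query before the model stops responding.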
