Feed PDFs to Dolly

#50
by KiranAli - opened

I'm trying to feed PDFs to Dolly for Q&A. Below is the snippet of code that I'm using.

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = TextLoader("doc.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

Is there any other option for generating embeddings to store in the vector store, or is OpenAIEmbeddings the best option?

Databricks org

It doesn't look like you are loading PDFs there? You want this: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html
Language models don't operate on PDFs, but, if you get text from PDFs, sure.
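
For example, a minimal sketch of that (assumes the pypdf package is installed; "doc.pdf" is a placeholder path):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader("doc.pdf")    # extracts text, one Document per page of the PDF
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)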

You can plug any embedding model into LangChain, not just OpenAI's, though OpenAI's works well.
Dolly is not an encoder model, though, and it would be overkill anyway. Just use a sentence-transformers model:

from langchain.embeddings import HuggingFaceEmbeddings

hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
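
and then plug it into your snippet in place of OpenAIEmbeddings, e.g.:

docsearch = Chroma.from_documents(texts, hf_embed)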

Dolly could be used as the text-generation LLM part though.
Databricks has a whole demo at https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot
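
Very roughly, the wiring looks something like this (a sketch, not the demo's exact code; the model size and generation settings are illustrative, and it assumes the docsearch Chroma store from above):

import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Dolly as the generator (requires accelerate for device_map="auto")
dolly = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                 trust_remote_code=True, device_map="auto", max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=dolly)

# The Chroma store from above supplies retrieved chunks as context
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
print(qa.run("What is the product name?"))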

srowen changed discussion status to closed

I'm following this demo: https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot. But is it possible to fine-tune it on raw data instead of an instruction-based dataset?

Databricks org

I think that's an unrelated question. Yes, but you would modify the code at https://github.com/databrickslabs/dolly to accept different input, rather than forming question-response pairs into text strings. It's not clear whether tuning on raw data makes its output do what you want, though.
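
For illustration only (this is not the Dolly repo's training code; the file name and hyperparameters are placeholders), plain causal-LM fine-tuning on raw text looks roughly like:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token           # the GPT-NeoX tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "raw_corpus.txt"})   # placeholder raw-text file
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

# Plain causal-LM objective: no instruction/response formatting at all
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(output_dir="dolly-raw-ft", per_device_train_batch_size=1,
                         gradient_accumulation_steps=8, num_train_epochs=1, fp16=True)
Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()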

OpenAIEmbeddings works very well, but HuggingFaceEmbeddings gives very poor results.

Databricks org

HuggingFaceEmbeddings isn't itself an embedding model; it's a wrapper for applying other embedding models. Sure, use whatever embedding you like.

Hi @srowen, I know this is not the right place to ask this question, but if you have any idea, kindly guide me. I have developed a complete pipeline that reads a text doc file and feeds vectors to dolly-v2-7b. I'm running it on a VM with two V100 16GB GPUs. It takes 12s to generate an answer to a simple question like "What is the product name?" (it's mentioned at the start of the doc).

I'm following this: https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot. At the very last step it's mentioned that Optimum will greatly improve inference, so I'm using it to speed things up:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

input_model = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(input_model, padding_side="left")
model = ORTModelForCausalLM.from_pretrained(input_model, export=True, provider="CUDAExecutionProvider")

I get the following error:

     2023-05-05 04:39:44.132458586 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
     2023-05-05 04:39:44.653800957 [E:onnxruntime:, inference_session.cc:1532 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* 
     onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 78643200

I think Optimum is not using both GPUs, and I also need guidance on whether I'm on the right track.

Databricks org

It won't use 2 GPUs without device_map="auto" or something similar. You should also load in 16-bit with torch_dtype=torch.float16. That error says you ran out of GPU memory, and those changes might help.
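
For the plain transformers path that means something like (a sketch; needs accelerate installed, and use whichever Dolly size you are actually running):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b",
    device_map="auto",           # shard the layers across both V100s
    torch_dtype=torch.float16,   # 16-bit weights, roughly half the fp32 memory
)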

Hi @srowen, Optimum doesn't make use of multiple GPUs, so I converted the model to ONNX format on an A100 40GB VM. That improved inference speed by 1-2s; earlier it took 3-4s to generate answers to simple questions like "What is the product?" But it takes 34GB of memory to run Dolly in ONNX format, which seems like too much for a smaller model; earlier it was taking 6-7GB. Any pointers on this? I've been stuck on this step for a long time, and it's preventing me from moving to the next one. I want to build a chatbot that I feed lots of books to and then offer to my customers. I also think inference speed would suffer with multiple queries at the same time.

Databricks org

Use a smaller model? It takes a few seconds for me on an A10; I'm not sure what your current setup or issue is.

I'm using dolly-v2-3b. It can take 3-5s on an A100.

Databricks org

I'm seeing 1s on a smaller GPU. What are your generation settings and input/output length? These make a big difference to speed.

InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer, return_full_text=True, max_new_tokens=256, top_p=0.95, top_k=50, task='text-generation', torch_dtype=torch.bfloat16)

This simple script is taking 5.4s

import time

import torch
from instruct_pipeline import InstructionTextGenerationPipeline  # from the Dolly repo / model card
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.float16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
start = time.time()
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
end = time.time()
print(res[0]["generated_text"])
print(end - start)  # wall-clock generation time in seconds

Running on an A10 24GB on a Hugging Face Space.

Databricks org

You should load in bfloat16, but that's separate.
Please use pipeline() to load the model as shown in the model card; it might work better. This depends a lot on generation settings.
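
i.e. something like (the pattern from the model card; swap in whichever Dolly size you're using):

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])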
