Feed PDFs to Dolly

#50
by KiranAli - opened

I'm trying to feed PDFs to Dolly for Q&A. Below is the snippet of code that I'm using.

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = TextLoader("doc.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

Is there any other option for generating embeddings to store in the vector store, or is OpenAIEmbeddings the best option?

Databricks org

It doesn't look like you are loading PDFs there? You want this: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html
Language models don't operate on PDFs, but, if you get text from PDFs, sure.
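
For example, a minimal sketch of that (assumes the pypdf package is installed; "doc.pdf" is a placeholder path):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader("doc.pdf")    # extracts text, one Document per page of the PDF
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)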

You can plug any embedding model into LangChain, not just OpenAI's, though OpenAI's works well.
Dolly is not an encoder model, though, and it would be overkill anyway. Just use a sentence-transformers model:

from langchain.embeddings import HuggingFaceEmbeddings

hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
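
and then plug it into your snippet in place of OpenAIEmbeddings, e.g.:

docsearch = Chroma.from_documents(texts, hf_embed)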

Dolly could be used as the text-generation LLM part though.
Databricks has a whole demo at https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot
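
Very roughly, the wiring looks something like this (a sketch, not the demo's exact code; the model size and generation settings are illustrative, and it assumes the docsearch Chroma store from above):

import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Dolly as the generator (requires accelerate for device_map="auto")
dolly = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                 trust_remote_code=True, device_map="auto", max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=dolly)

# The Chroma store from above supplies retrieved chunks as context
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
print(qa.run("What is the product name?"))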

srowen changed discussion status to closed

I'm following this demo: https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot. But is it possible to fine-tune it on raw data instead of an instruction-based dataset?

Databricks org

I think that's an unrelated question. Yes, but you would modify the code at https://github.com/databrickslabs/dolly to accept different input, rather than forming question-response pairs into text strings. It's not clear whether tuning on raw data makes its output do what you want, though.
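
For illustration only (this is not the Dolly repo's training code; the file name and hyperparameters are placeholders), plain causal-LM fine-tuning on raw text looks roughly like:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token           # the GPT-NeoX tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "raw_corpus.txt"})   # placeholder raw-text file
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

# Plain causal-LM objective: no instruction/response formatting at all
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(output_dir="dolly-raw-ft", per_device_train_batch_size=1,
                         gradient_accumulation_steps=8, num_train_epochs=1, fp16=True)
Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()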

OpenAIEmbeddings works very well, but HuggingFaceEmbeddings gives very poor results.

Databricks org

HuggingFaceEmbeddings isn't itself an embedding model; it's a wrapper for applying other embedding models. Sure, use whatever embedding you like.

Hi @srowen, I know this is not the right place to ask this question, but if you have any idea, kindly guide me. I have developed a complete pipeline that reads a text doc file and feeds vectors to dolly-v2-7b. I'm running it on a VM with two V100 16GB GPUs. It takes 12s to generate an answer to a simple question like "What is the product name?" (it's mentioned at the start of the doc).

I'm following this: https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot. At the very last step it's mentioned that Optimum will greatly improve inference, so I'm using it to speed things up:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

input_model = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(input_model, padding_side="left")
model = ORTModelForCausalLM.from_pretrained(input_model, export=True, provider="CUDAExecutionProvider")

I get the following error:

     2023-05-05 04:39:44.132458586 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
     2023-05-05 04:39:44.653800957 [E:onnxruntime:, inference_session.cc:1532 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* 
     onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 78643200

I think Optimum is not using both GPUs, and I also need guidance on whether I'm on the right track.

Databricks org

It won't use 2 GPUs without device_map="auto" or something similar. You should also load in 16-bit with torch_dtype=torch.float16. That error says you ran out of GPU memory, and those changes might help.
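
For the plain transformers path that means something like (a sketch; needs accelerate installed, and use whichever Dolly size you are actually running):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-7b",
    device_map="auto",           # shard the layers across both V100s
    torch_dtype=torch.float16,   # 16-bit weights, roughly half the fp32 memory
)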

Hi @srowen, Optimum doesn't make use of multiple GPUs, so I converted the model to ONNX format on an A100 40GB VM. That improved inference speed by 1-2s; earlier it took 3-4s to generate answers to simple questions like "What is the product?" But it takes 34GB of memory to run Dolly in ONNX format, which seems like too much for a smaller model; earlier it was taking 6-7GB. Any pointers on this? I've been stuck on this step for a long time, and it's preventing me from moving to the next one. I want to build a chatbot that I feed lots of books to and then offer to my customers. I also think inference speed would suffer with multiple queries at the same time.

Databricks org

Use a smaller model? It takes a few seconds for me on an A10; I'm not sure what your current setup or issue is.

I'm using dolly-v2-3b. It can take 3-5s on an A100.

Databricks org

I'm seeing 1s on a smaller GPU. What are your generation settings and input/output length? These make a big difference to speed.

InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer, return_full_text=True, max_new_tokens=256, top_p=0.95, top_k=50, task='text-generation', torch_dtype=torch.bfloat16)

This simple script is taking 5.4s

import time

import torch
from instruct_pipeline import InstructionTextGenerationPipeline  # from the Dolly repo / model card
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.float16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
start = time.time()
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
end = time.time()
print(res[0]["generated_text"])
print(end - start)  # wall-clock generation time in seconds

Running on an A10 24GB on a Hugging Face Space.

Databricks org

You should load in bfloat16, but that's separate.
Please use pipeline() to load the model as shown in the model card; it might work better. This depends a lot on generation settings.
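
i.e. something like (the pattern from the model card; swap in whichever Dolly size you're using):

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])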
