rgallardo committed
Commit e9d3a78 · 1 Parent(s): e414796

Update context

Files changed (1)
  1. updated_context.txt +13 -5
updated_context.txt CHANGED
@@ -10,7 +10,7 @@ The model powering the chatbot was trained on the task of Conversational Questio
 
 Caveats
 Remember that this is just a small demo, so the answers may not be as accurate as expected; in the Improvements section, we’ll discuss how we can enhance the model’s responses! Also, the content of this blog was not included in the training set, which means that the chatbot will give you answers about new, unseen data!
- Response time may take around 10 seconds due to the use of 🤗 Hugging Face’s free-tier space, which only has access to two CPU cores. This response time is slow for a chatbot being used in production, but it's fast for a CPU deployment and for such large inputs (since the model needs to process the entire blog to generate an answer).
 The Boom of Foundation Models
 Have you ever wondered how computers can understand and respond to human language? The answer lies in a new concept called ‘Foundation Models’. Popularized under this name by Stanford University, these are machine learning models trained on immense amounts of data (over tens of terabytes and growing) that can be adapted to a wide range of downstream tasks in fields such as image, text, and audio, among others.
 
 
@@ -39,6 +39,9 @@ Our goal was to build a chatbot that could answer questions about Tryolabs’ bl
 
 Like the CoQA dataset for conversational question answering, TryoCoQA consists of questions and answers about a specific context in a conversational manner, with some questions and answers referencing previous points in the conversation. We aimed for natural language questions with a variety of writing styles and vocabulary, and answers that can be short spans of text extracted from the context or free-form text, hoping for the model to produce more human-like responses with high-quality content.
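As a concrete illustration, a TryoCoQA-style record could look like the sketch below (the field names and example text are hypothetical, patterned on the CoQA format; the actual schema isn't shown in this excerpt):

```python
# Hypothetical sketch of one conversational QA record, patterned on CoQA.
# Field names and text are illustrative, not the actual TryoCoQA schema.
record = {
    "context": "LongT5 extends T5 with efficient attention, letting it "
               "process long inputs such as entire blog posts.",
    "turns": [
        {"question": "What does LongT5 extend?", "answer": "T5"},
        # Later turns can reference earlier ones ("it"), which is what
        # makes the dataset conversational rather than single-shot QA:
        {"question": "What lets it process long inputs?",
         "answer": "efficient attention"},
    ],
}

# In this sketch both answers are short spans extracted verbatim from the
# context; the dataset also allows free-form answers.
assert all(t["answer"] in record["context"] for t in record["turns"])
```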
 
 
 2. Choosing a Foundation Model
  Selecting the right open-source LLM to fine-tune can be a tough choice. There’s a handful of them, and while they may perform quite well for a broad range of general tasks, it can be challenging to predict how well they will perform on our specific task before fine-tuning. Think of it like buying a new pair of dancing shoes - they may look great and feel comfortable, but you won't know how well they'll perform until you hit the dance floor.
@@ -88,6 +91,7 @@ In addition to deciding how to combine the datasets, we also needed to adapt our
 
 To refresh the reader’s memory, in the use-case of Conversational Question Answering, the goal of the model is to generate the answer to a question given a context and the previous questions and answers from the conversation. With this in mind, we formatted the inputs following this structure:
 
 Here, the input is a text string containing the context (i.e., the content of one of Tryolabs’ blog posts) followed by the last two question-and-answer pairs in the conversation and the current target question, with the target output being the answer to the target question. We chose to add just the last two question-and-answer pairs to limit the amount of conversation history the model needs to pay attention to while still being able to generate coherent responses. Note that this is a hyper-parameter you can adjust when fine-tuning your own model.
 
 With our data prepared and fine-tuning strategy determined, the final step was setting up our infrastructure environment and training the model.
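As a rough sketch, assembling such an input might look like this (the `context:`/`question:`/`answer:` labels and separators are our assumptions; the exact format string used for fine-tuning is not shown in this excerpt):

```python
def build_input(context, history, question, max_history=2):
    """Concatenate the context, the last `max_history` QA pairs, and the
    current question into one input string. The labels used here are
    illustrative; the real format string is a fine-tuning choice."""
    parts = [f"context: {context}"]
    for q, a in history[-max_history:]:  # keep only the most recent pairs
        parts.append(f"question: {q} answer: {a}")
    parts.append(f"question: {question}")
    return " ".join(parts)

history = [
    ("What task was the model trained on?", "Conversational Question Answering."),
    ("What model was fine-tuned?", "LongT5."),
    ("On which dataset?", "TryoCoQA."),
]
model_input = build_input("<full blog post text>", history, "How large is it?")
# Only the last two QA pairs survive; the oldest turn is dropped:
assert "What task was the model trained on?" not in model_input
```

Increasing `max_history` is the hyper-parameter trade-off mentioned above: more conversational coherence at the cost of a longer input for the model to attend to.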
@@ -116,8 +120,8 @@ To assess its performance, we used the F1 Score by validating how many tokens ap
 
 Since we had two different training steps, we also had two additional evaluation steps. The first training, on SQuAD2.0 and CoQA, resulted in a 74.29 F1 Score on the validation split after 3 epochs. The second training, on TryoCoQA, produced a 54.77 F1 Score after 166 epochs.
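For reference, the token-overlap F1 used for these numbers can be sketched as follows (a common SQuAD-style formulation; the exact tokenization and normalization used in our evaluation are assumptions):

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1: harmonic mean of precision and
    recall over shared tokens. Tokenization here is plain whitespace
    splitting after lowercasing, a simplification of real eval scripts."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A verbose but correct answer is penalized on precision:
score = token_f1("the model uses two cpu cores", "two cpu cores")
```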
 
- Analyzing quantitative metrics alone is not enough to evaluate these results, or conversational models in general. It is also essential to consider qualitative aspects of the model's answers, such as their grammatical correctness and coherence within the conversation context. Sometimes qualitatively better answers are preferable to a higher F1. So we looked at some answers from the validation set to ensure that the model was generating what we were looking for. Our analysis revealed that higher F1 scores were generally associated with higher-quality answers. As a result, we selected the checkpoint with the highest F1 score to use in constructing our demonstration chatbot.
 
  Faster inference with 🤗 Optimum and ONNX
 After fine-tuning our model, we wanted to make it available to our awesome readers, so we deployed it on 🤗 Hugging Face Spaces, which offers a free tier with two CPU cores for running inference on the model. However, this setup can lead to slow inference times, and processing large inputs like ours doesn’t make it any better. And a chatbot that takes a few minutes to answer a question doesn't strike anyone as being particularly chatty, does it? So, to improve the speed of our chatbot, we turned to 🤗 Optimum and the ONNX Runtime!
@@ -126,12 +130,16 @@ After fine-tuning our model, we wanted to make it available to our awesome reade
 In our previous blog post, A guide to optimizing Transformer-based models for faster inference, we used 🤗 Optimum and ONNX to achieve an x8 speed-up on inference for a Transformer model. Be sure to check it out!
 Using 🤗 Optimum’s recently released exporters feature, we were able to convert our PyTorch model to the ONNX format. This feature is handy for encoder-decoder models like the LongT5 model we trained, as it exports the three main components separately: the encoder, the decoder with the Language Modeling head, and the same decoder with pre-computed hidden states as additional inputs. According to 🤗 Optimum’s documentation, combining these three components can speed up sequential decoding, which results in faster text generation.
 
-
- Our fine-tuned model, exported to ONNX into these three components, is also available on 🤗 Hugging Face with the ID tryolabs/long-t5-tglobal-base-blogpost-cqa-onnx!
 Once our model was exported to ONNX, we used 🤗 Optimum’s integration with the ONNX Runtime to optimize our model and run inference on it by using the ORTModelForSeq2SeqLM class. This class can optimize and downcast the model using ONNX’s tools and then use ONNX Runtime to run inference with this new, faster model! You can even take it one step further and quantize the model for even shorter inference time on CPU and lower memory consumption.
 
- With these improvements, we could achieve an x2 speed-up on inference time! Although the model still takes around 10 seconds to answer, this is a reasonable speed for a CPU-only deployment and processing such large inputs.
 
 Takeaways
 With the ever-increasing popularity of LLMs, it can seem almost impossible to train these models without having access to millions of dollars in resources and tons of data. However, with the right skills and knowledge about Foundation Models, Deep Learning, and the Transformer architecture, we showed you that fine-tuning these huge models is possible, even with few resources and a small dataset!
 
 
 Caveats
 Remember that this is just a small demo, so the answers may not be as accurate as expected; in the Improvements section, we’ll discuss how we can enhance the model’s responses! Also, the content of this blog was not included in the training set, which means that the chatbot will give you answers about new, unseen data!
+ Response time may take around 10 seconds due to the use of 🤗 Hugging Face’s free-tier space, which only has access to two CPU cores. This response time is slow for a chatbot being used in production, but it's fast for a CPU deployment and for such large inputs (since the model needs to process the entire blog to generate an answer). We'll discuss optimizing inference time in the Optimizing with ONNX section.
 The Boom of Foundation Models
 Have you ever wondered how computers can understand and respond to human language? The answer lies in a new concept called ‘Foundation Models’. Popularized under this name by Stanford University, these are machine learning models trained on immense amounts of data (over tens of terabytes and growing) that can be adapted to a wide range of downstream tasks in fields such as image, text, and audio, among others.
 
 
 Like the CoQA dataset for conversational question answering, TryoCoQA consists of questions and answers about a specific context in a conversational manner, with some questions and answers referencing previous points in the conversation. We aimed for natural language questions with a variety of writing styles and vocabulary, and answers that can be short spans of text extracted from the context or free-form text, hoping for the model to produce more human-like responses with high-quality content.
 
+ The dataset and guidelines for building it are available in the following GitHub repository.
+
+
 
 2. Choosing a Foundation Model
  Selecting the right open-source LLM to fine-tune can be a tough choice. There’s a handful of them, and while they may perform quite well for a broad range of general tasks, it can be challenging to predict how well they will perform on our specific task before fine-tuning. Think of it like buying a new pair of dancing shoes - they may look great and feel comfortable, but you won't know how well they'll perform until you hit the dance floor.
 
 
 To refresh the reader’s memory, in the use-case of Conversational Question Answering, the goal of the model is to generate the answer to a question given a context and the previous questions and answers from the conversation. With this in mind, we formatted the inputs following this structure:
 
+
 Here, the input is a text string containing the context (i.e., the content of one of Tryolabs’ blog posts) followed by the last two question-and-answer pairs in the conversation and the current target question, with the target output being the answer to the target question. We chose to add just the last two question-and-answer pairs to limit the amount of conversation history the model needs to pay attention to while still being able to generate coherent responses. Note that this is a hyper-parameter you can adjust when fine-tuning your own model.
 
 With our data prepared and fine-tuning strategy determined, the final step was setting up our infrastructure environment and training the model.
 
 
 Since we had two different training steps, we also had two additional evaluation steps. The first training, on SQuAD2.0 and CoQA, resulted in a 74.29 F1 Score on the validation split after 3 epochs. The second training, on TryoCoQA, produced a 54.77 F1 Score after 166 epochs.
 
+ Analyzing quantitative metrics alone is not enough to evaluate these results, or conversational models in general. It is also essential to consider qualitative aspects of the model's answers, such as their grammatical correctness and coherence within the conversation context. Sometimes qualitatively better answers are preferable to a higher F1. So we looked at some answers from the validation set to ensure that the model was generating what we were looking for. Our analysis revealed that higher F1 scores were generally associated with higher-quality answers. As a result, we selected the checkpoint with the highest F1 score to use in constructing our demonstration chatbot.
 
 Faster inference with 🤗 Optimum and ONNX
 After fine-tuning our model, we wanted to make it available to our awesome readers, so we deployed it on 🤗 Hugging Face Spaces, which offers a free tier with two CPU cores for running inference on the model. However, this setup can lead to slow inference times, and processing large inputs like ours doesn’t make it any better. And a chatbot that takes a few minutes to answer a question doesn't strike anyone as being particularly chatty, does it? So, to improve the speed of our chatbot, we turned to 🤗 Optimum and the ONNX Runtime!
 
 In our previous blog post, A guide to optimizing Transformer-based models for faster inference, we used 🤗 Optimum and ONNX to achieve an x8 speed-up on inference for a Transformer model. Be sure to check it out!
 Using 🤗 Optimum’s recently released exporters feature, we were able to convert our PyTorch model to the ONNX format. This feature is handy for encoder-decoder models like the LongT5 model we trained, as it exports the three main components separately: the encoder, the decoder with the Language Modeling head, and the same decoder with pre-computed hidden states as additional inputs. According to 🤗 Optimum’s documentation, combining these three components can speed up sequential decoding, which results in faster text generation.
 
 Once our model was exported to ONNX, we used 🤗 Optimum’s integration with the ONNX Runtime to optimize our model and run inference on it by using the ORTModelForSeq2SeqLM class. This class can optimize and downcast the model using ONNX’s tools and then use ONNX Runtime to run inference with this new, faster model! You can even take it one step further and quantize the model for even shorter inference time on CPU and lower memory consumption.
 
+ With these improvements, we achieved an x2 speed-up on inference time! Although the model still takes around 10 seconds to answer, this is a reasonable speed for a CPU-only deployment and processing such large inputs.
+
+ Improvements
+ We’re just scratching the surface of what’s possible. There are numerous potential improvements that could enhance the user's overall experience and the chatbot's performance. Here are some ideas:
 
+ Fine-tune on more public datasets for different downstream tasks. By doing so, we hope that the model learns to respond in a more human-like manner. For example, we could fine-tune on context-free QA datasets or on free-form answers so that the model not only retrieves answers from the context but also generates original answers. We could also fine-tune on summarization or paraphrasing tasks so that the user can ask the chatbot to rewrite certain parts of the blog or summarize entire sections!
+ Increase the size of our dataset. We built a small dataset for this demo, so it's not surprising that the model doesn't generalize well to new, unseen blogs. We could potentially train a more accurate model by providing it with more training data.
+ Optimize training using ONNX Runtime. We could achieve better results in the same amount of time if we could train for more epochs or fit a larger version of LongT5 in the same amount of memory. To do this, we could leverage the optimization power of ONNX Runtime for training PyTorch models, which would allow us to accelerate training speeds and optimize memory usage. Thanks to the recently released 🤗 Optimum integration with ONNX Runtime for training, this can be done in just a few lines of code!
 Takeaways
 With the ever-increasing popularity of LLMs, it can seem almost impossible to train these models without having access to millions of dollars in resources and tons of data. However, with the right skills and knowledge about Foundation Models, Deep Learning, and the Transformer architecture, we showed you that fine-tuning these huge models is possible, even with few resources and a small dataset!