sgarbi committed on
Commit 0586fd8
1 Parent(s): 2576fe1

Update README.md

Files changed (1):
  1. README.md +151 -39
README.md CHANGED
@@ -1,12 +1,12 @@
  ---
  language:
  - en
- license: apache-2.0
  tags:
  - question-answering
  - t5
  - compact-model
  - sgarbi
  datasets:
  - squad2
  - quac
@@ -16,57 +16,169 @@ datasets:
  - squad_v2
  ---

- # Model Card for sgarbi/t5-compact-qa-gen

- ## Model Description
- `sgarbi/t5-compact-qa-gen` is a compact T5-based model designed to generate question and answer pairs from a given text. This model has been trained with a focus on efficiency and speed, making it suitable for deployment on devices with limited computational resources, including CPUs. It utilizes a novel data formatting approach for training, which simplifies the parsing process and enhances the model's performance.

- ## Intended Use
- This model is intended for a wide range of question-answering tasks, including but not limited to:
- - Generating study materials from educational texts.
- - Enhancing search engines with precise Q&A capabilities.
- - Supporting content creators in generating FAQs.
- - Deploying on edge devices for real-time question answering in various applications.

- ## How to Use
- Here is a simple way to use this model with the Transformers library:

  ```python
- from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

- tokenizer = AutoTokenizer.from_pretrained("sgarbi/t5-compact-qa-gen")
- model = AutoModelForSeq2SeqLM.from_pretrained("sgarbi/t5-compact-qa-gen")

- text = "INPUT: <qa_builder_context>Your context here."
- inputs = tokenizer(text, return_tensors="pt")
- output = model.generate(inputs["input_ids"])
- print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```

- ## Training Data
- The model was trained on the following datasets:

- SQuAD 2.0: A large collection of question and answer pairs based on Wikipedia articles.
- QuAC: Question Answering in Context, a dataset for modeling, understanding, and participating in information-seeking dialogues.
- Natural Questions (NQ): A dataset containing real user questions sourced from Google search.
- Training Procedure
- The model was trained using a novel input and output formatting technique, focusing on generating "shallow" training data for efficient model training. The model's architecture, flan-T5-small, was selected for its balance between performance and computational efficiency. Training involved fine-tuning the model on the specified datasets, utilizing a custom XML-like format for simplifying the data structure.

- ## Evaluation Results
- (Include any evaluation metrics and results here to showcase the model's performance on various benchmarks or tasks.)

- ## Limitations and Bias
- (Describe any limitations of the model, including potential biases in the training data and areas where the model's performance may be suboptimal.)

  ## Ethical Considerations
- (Provide guidance on ethical considerations for users of the model, including appropriate and inappropriate uses.)

- ## Citation

- @misc{sgarbi_t5_compact_qa_gen,
- author = {Erick Sgarbi},
- title = {T5 Compact QA Generator},
- year = {2024},
- publisher = {Hugging Face},
- journal = {Hugging Face Model Hub}
- }

  ---
  language:
  - en
  tags:
  - question-answering
  - t5
  - compact-model
  - sgarbi
+ license: apache-2.0
  datasets:
  - squad2
  - quac

  - squad_v2
  ---

+ ## T5-qa-builder: QA Pair Generation

+ This is a google/flan-t5-base model fine-tuned on question-answering datasets including SQuAD, QuAC, Natural Questions, and a custom Q/A dataset. The model pipeline takes input text and generates JSON output containing question-answer pairs relevant to the input text. When the model is run outside of the pipeline, it returns raw QA markup instead.

+ ## Model Details
+
+ - Model: google/flan-t5-base
+ - Training Data:
+   - SQuAD
+   - QuAC
+   - Natural Questions
+   - Custom Q/A dataset generated from Zephyr 7B β and Mixtral 8x7B Instruct.
+ - Intended Use: Question answering, generating synthetic Q/A pairs from input text
+ - Language: English
+
+ #### Example Usage

  ```python
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
+
+ model = T5ForConditionalGeneration.from_pretrained('sgarbi/t5-qa-builder')
+ tokenizer = AutoTokenizer.from_pretrained('sgarbi/t5-qa-builder')
+
+ # The <qa_builder_context> token tells the model to generate QA pairs for the text that follows.
+ input_text = '''<qa_builder_context>A new breed of AI-powered coding tools have emerged—and they’re claiming to be more autonomous versions of earlier assistants like GitHub Copilot, Amazon CodeWhisperer, and Tabnine.
+
+ One such new entrant, Devin AI, has been dubbed an “AI software engineer” by its maker, applied AI lab Cognition. According to Cognition, Devin can perform all these tasks unassisted: build a website from scratch and deploy it, find and fix bugs in codebases, and even train and fine-tune its own large language model.
+
+ Following its launch, open-source alternatives to Devin have cropped up, including Devika and OpenDevin. Meanwhile makers of established assistants have not been standing still. Researchers at Microsoft, GitHub Copilot’s developer, recently uploaded a paper to the arXiv preprint server introducing AutoDev, which uses autonomous AI agents to generate code and test cases, run tests and check the results, and fix bugs within the test cases.'''  # citation https://spectrum.ieee.org/ai-code-generator
+
+ output = model.generate(tokenizer(input_text, return_tensors='pt').input_ids, max_length=512)
+
+ # Keep special tokens: the QA markup is built from them.
+ decoded = tokenizer.decode(output[0], skip_special_tokens=False)
+ print(decoded)
  ```

+ #### QA Pairs (raw output):
+ ```xml
+ <qa_builder_question>What are some of the new AI-powered coding tools emerging?<qa_builder_answer> Some of the new AI-powered coding tools are claiming to be more autonomous versions of earlier assistants like GitHub Copilot, Amazon CodeWhisperer, and Tabnine.<qa_builder_question> How has Devin AI been dubbed by Cognition?<qa_builder_answer> Devin AI has been dubbed an "AI software engineer" by Cognition.<qa_builder_question> What tasks can Devin AI perform unassisted?<qa_builder_answer> Devin AI can build a website from scratch, find and fix bugs in codebases, train and fine-tune its own large language model.<qa_builder_question> What open-source alternatives to Devin AI have cropped up after its launch?<qa_builder_answer> Open-source alternatives to Devin AI have cropped up, including Devika and OpenDevin.<qa_builder_question> What is the new AI-powered coding tool introduced by researchers at Microsoft?<qa_builder_answer> AutoDev, which uses autonomous AI agents to generate code and test cases, run tests, check results, and fix bugs within test cases, is introduced by researchers at Microsoft.
+ ```
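The raw markup above is just the model's special tokens interleaved with text, so it can be split into (question, answer) pairs with plain string operations. A minimal sketch, assuming the token layout shown above; this parser is an illustration, not part of the model's published API:

```python
def parse_qa_markup(markup: str) -> list:
    """Split raw model output on the <qa_builder_question>/<qa_builder_answer>
    special tokens and return a list of (question, answer) pairs."""
    pairs = []
    for block in markup.split("<qa_builder_question>"):
        if "<qa_builder_answer>" not in block:
            continue  # skip anything before the first question token
        question, answer = block.split("<qa_builder_answer>", 1)
        pairs.append((question.strip(), answer.strip()))
    return pairs

sample = ("<qa_builder_question>What is Devin AI?<qa_builder_answer> An AI software engineer."
          "<qa_builder_question>Who makes it?<qa_builder_answer> Cognition.")
print(parse_qa_markup(sample))
# → [('What is Devin AI?', 'An AI software engineer.'), ('Who makes it?', 'Cognition.')]
```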
+
+ ## Custom pipeline
+
+ #### The easiest way is to use the custom pipeline, which allows more control over generation and places no restriction on text length.
+
+ ```shell
+ pip install git+https://github.com/ESgarbi/t5-qa-builder
+ ```
+
+ #### Run the custom pipeline. Weights will be downloaded from the HF hub automatically. All special tokens are handled internally; there is no need to supply the `<qa_builder_context>` token to initialise the task.
+
+ ```python
+ from sgarbi import QABuilderPipeline
+
+ producer = QABuilderPipeline()
+
+ input_text = '''What is artificial intelligence?
+ If you hear the term artificial intelligence (AI), you might think of self-driving cars, robots, ChatGPT, other AI chatbots, and artificially created images. But it's also important to look behind the outputs of AI and understand how the technology works and its impacts on this and future generations.
+
+ AI is a concept that has been around formally since the 1950s when it was defined as a machine's ability to perform a task that would've previously required human intelligence. This is quite a broad definition that has been modified over decades of research and technological advancements.
+
+ When you consider assigning intelligence to a machine, such as a computer, it makes sense to start by defining the term 'intelligence' -- especially when you want to determine if an artificial system truly deserves it.
+
+ Also: ChatGPT vs. Microsoft Copilot vs. Gemini: Which is the best AI chatbot?
+
+ Our level of intelligence sets us apart from other living beings and is essential to the human experience. Some experts define intelligence as the ability to adapt, solve problems, plan, improvise in new situations, and learn new things.
+
+ With intelligence sometimes seen as the foundation for being human, it's perhaps no surprise that we'd try and recreate it artificially in scientific endeavors.
+
+ Today's AI systems might demonstrate some traits of human intelligence, including learning, problem-solving, perception, and even a limited spectrum of creativity and social intelligence.
+
+ AI comes in different forms and has become widely available in everyday life. The smart speakers on your mantle with Alexa or Google voice assistant built-in are two great examples of AI. Other good examples include popular AI chatbots, such as ChatGPT, the new Bing Chat, and Google Bard.
+
+ When you ask ChatGPT for the capital of a country, or you ask Alexa to give you an update on the weather, the responses come from machine-learning algorithms.
+ '''  # Citation: https://www.zdnet.com/article/what-is-ai-heres-everything-you-need-to-know-about-artificial-intelligence
+
+ result = producer(context=input_text, silent_mode=False, json_output=True)
+ print(result)
+ ```
+
+ #### Output:
+
+ Generating QA pairs: 100%|██████████| 1/1 [00:13<00:00, 13.66s/it]
+
+ ```json
+ {
+   "What is artificial intelligence? If you hear the term artificial intelligence (AI), you might think of self-driving cars, robots, ChatGPT, other AI chatbots, and artificially created images. But it's also important to look behind the outputs of AI and understand how the technology works and its impacts on this and future generations. AI is a concept that has been around formally since the 1950s when it was defined as a machine's ability to perform a task that would've previously required human intelligence. This is quite a broad definition that has been modified over decades of research and technological advancements. When you consider assigning intelligence to a machine, such as a computer, it makes sense to start by defining the term 'intelligence' -- especially when you want to determine if an artificial system truly deserves it. Also: ChatGPT vs. Microsoft Copilot vs. Gemini: Which is the best AI chatbot? Our level of intelligence sets us apart from other living beings and is essential to the human experience. Some experts define intelligence as the ability to adapt, solve problems, plan, improvise in new situations, and learn new things. With intelligence sometimes seen as the foundation for being human, it's perhaps no surprise that we'd try and recreate it artificially in scientific endeavors. Today's AI systems might demonstrate some traits of human intelligence, including learning, problem-solving, perception, and even a limited spectrum of creativity and social intelligence. AI comes in different forms and has become widely available in everyday life. The smart speakers on your mantle with Alexa or Google voice assistant built-in are two great examples of AI. Other good examples include popular AI chatbots, such as ChatGPT, the new Bing Chat, and Google Bard. When you ask ChatGPT for the capital of a country, or you ask Alexa to give you an update on the weather, the responses come from machine-learning algorithms. ": [
+     {
+       "question": "What is artificial intelligence (AI) and how has it been defined?",
+       "answer": "Artificial intelligence (AI) is a concept that has been around formally since the 1950s when it was defined as a machine's ability to perform tasks that would've previously required human intelligence.",
+       "score": 85.98722666501999
+     },
+     {
+       "question": "What is the term 'intelligence' and how has it been modified?",
+       "answer": "The term 'intelligence' has been modified over decades of research and technological advancements, and it has been defined as the ability to adapt, solve problems, plan, improvise in new situations, and learn new things.",
+       "score": 85.98722666501999
+     },
+     {
+       "question": "What are some traits of human intelligence mentioned in the text?",
+       "answer": "Some traits of human intelligence mentioned in the text include learning, problem-solving, perception, and a limited spectrum of creativity and social intelligence.",
+       "score": 85.98722666501999
+     },
+     {
+       "question": "What are some examples of popular AI chatbots?",
+       "answer": "Examples of popular AI chatbots include ChatGPT, the new Bing Chat, and Google Bard.",
+       "score": 85.98722666501999
+     }
+   ]
+ }
+ ```
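The pipeline's return value maps each context string to its list of pairs, so downstream code typically flattens and filters it. A small sketch, assuming the `{context: [{question, answer, score}, ...]}` shape shown above; the helper name and threshold are illustrative, not part of the library:

```python
def top_pairs(result: dict, min_score: float = 80.0) -> list:
    """Flatten {context: [pairs]} output and keep only pairs whose
    score meets the threshold (shape assumed from the example output)."""
    kept = []
    for context, pairs in result.items():
        for pair in pairs:
            if pair["score"] >= min_score:
                # Carry a truncated context along for traceability.
                kept.append({"context": context[:60], **pair})
    return kept

# Tiny result in the same shape as the output above (scores are made up).
example = {
    "What is artificial intelligence? ...": [
        {"question": "What is AI?", "answer": "A machine's ability to perform tasks.", "score": 85.9},
        {"question": "Weak pair", "answer": "...", "score": 42.0},
    ]
}
print(top_pairs(example))
```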

  ## Ethical Considerations
+ ### Limitations and Risks
+ - The model may generate inconsistent Q/A pairs when the input context falls outside the training data and pre-training corpus.
+ - Limited to the English language.
+
+ As with any language model, this QA model may reflect biases present in its training data. The generated outputs should be used with care and fact-checked against reliable sources when possible.
+
+ Additionally, this model has no capability to open links, images, or other media; such inputs cannot be processed.
+
+ ## Broader Use Cases
+
+ While this model was specifically fine-tuned for generating question-answer pairs from input text, its capabilities can be extended to several other use cases:
+
+ ### Query Understanding and Rewriting
+ The model can be used to rephrase input queries in a question-answer format to better understand the user's intent. This can aid in developing more natural conversational AI systems.
+
+ ### Knowledge Base Population (original intended use)
+ The generated question-answer pairs can be used to automatically populate and expand knowledge bases and FAQ systems with relevant information extracted from documents or web pages that feed a RAG system.
+
+ ### Data Augmentation
+ The generated Q/A pairs can augment existing question-answering datasets, reducing annotation costs and improving performance in limited-data scenarios.
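For the augmentation use case, generated pairs can be wrapped in SQuAD-style records before being mixed into an existing training set. A minimal sketch: the record layout follows the common SQuAD format, and `to_squad_records` is a hypothetical helper, not part of this repository:

```python
import uuid

def to_squad_records(context: str, qa_pairs: list) -> list:
    """Wrap generated Q/A pairs in SQuAD-style records.

    answer_start is located by substring search; it is -1 when the
    generated answer is abstractive and does not appear verbatim in
    the context, so filter those records out before extractive training.
    """
    records = []
    for pair in qa_pairs:
        records.append({
            "id": uuid.uuid4().hex,
            "context": context,
            "question": pair["question"],
            "answers": {
                "text": [pair["answer"]],
                "answer_start": [context.find(pair["answer"])],
            },
        })
    return records

context = "Paris is the capital of France."
pairs = [{"question": "What is the capital of France?", "answer": "Paris"}]
print(to_squad_records(context, pairs)[0]["answers"])
# → {'text': ['Paris'], 'answer_start': [0]}
```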
+
+ ## Risks and Limitations
+
+ - **Domain Shift**: Performance may degrade on inputs significantly different from the combined pretraining and fine-tuning datasets.
+
+ - **Exposure Bias**: The model is trained with teacher-forced maximum likelihood, so it never conditions on its own predictions during training; errors can therefore compound during autoregressive inference.
+
+ ## Software and Dependencies
+
+ A requirements.txt is included with the custom pipeline in the https://github.com/ESgarbi/t5-qa-builder repository.
+
+ - Python 3.8
+ - Transformers 4.17.0
+ - PyTorch 1.10.0
+ - SentenceTokenizer
+ - Datasets
+
+ ## Citing This Model
+
+ If you use this model or the generated outputs in your work, please cite:
+
+ ```bibtex
+ @misc{sgarbi_t5_qa_builder,
+   author = {Erick Sgarbi},
+   title = {QA Pair Generation},
+   year = {2024}
+ }
+ ```