sgarbi committed on
Commit 0586fd8
1 Parent(s): 2576fe1

Update README.md

Files changed (1):
  1. README.md +151 -39
README.md CHANGED
@@ -1,12 +1,12 @@
  ---
  language:
  - en
- license: apache-2.0
  tags:
  - question-answering
  - t5
  - compact-model
  - sgarbi
  datasets:
  - squad2
  - quac
@@ -16,57 +16,169 @@ datasets:
  - squad_v2
  ---

- # Model Card for sgarbi/t5-compact-qa-gen

- ## Model Description
- `sgarbi/t5-compact-qa-gen` is a compact T5-based model designed to generate question and answer pairs from a given text. This model has been trained with a focus on efficiency and speed, making it suitable for deployment on devices with limited computational resources, including CPUs. It utilizes a novel data formatting approach for training, which simplifies the parsing process and enhances the model's performance.

- ## Intended Use
- This model is intended for a wide range of question-answering tasks, including but not limited to:
- - Generating study materials from educational texts.
- - Enhancing search engines with precise Q&A capabilities.
- - Supporting content creators in generating FAQs.
- - Deploying on edge devices for real-time question answering in various applications.

- ## How to Use
- Here is a simple way to use this model with the Transformers library:

  ```python
- from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

- tokenizer = AutoTokenizer.from_pretrained("sgarbi/t5-compact-qa-gen")
- model = AutoModelForSeq2SeqLM.from_pretrained("sgarbi/t5-compact-qa-gen")

- text = "INPUT: <qa_builder_context>Your context here."
- inputs = tokenizer(text, return_tensors="pt")
- output = model.generate(inputs["input_ids"])
- print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```

- ## Training Data
- The model was trained on the following datasets:

- SQuAD 2.0: A large collection of question and answer pairs based on Wikipedia articles.
- QuAC: Question Answering in Context, a dataset for modeling, understanding, and participating in information-seeking dialogues.
- Natural Questions (NQ): A dataset containing real user questions sourced from Google search.
- Training Procedure
- The model was trained using a novel input and output formatting technique, focusing on generating "shallow" training data for efficient model training. The model's architecture, flan-T5-small, was selected for its balance between performance and computational efficiency. Training involved fine-tuning the model on the specified datasets, utilizing a custom XML-like format for simplifying the data structure.

- ## Evaluation Results
- (Include any evaluation metrics and results here to showcase the model's performance on various benchmarks or tasks.)

- ## Limitations and Bias
- (Describe any limitations of the model, including potential biases in the training data and areas where the model's performance may be suboptimal.)

  ## Ethical Considerations
- (Provide guidance on ethical considerations for users of the model, including appropriate and inappropriate uses.)

- ## Citation

- @misc{sgarbi_t5_compact_qa_gen,
- author = {Erick Sgarbi},
- title = {T5 Compact QA Generator},
- year = {2024},
- publisher = {Hugging Face},
- journal = {Hugging Face Model Hub}
- }

  ---
  language:
  - en
  tags:
  - question-answering
  - t5
  - compact-model
  - sgarbi
+ license: apache-2.0
  datasets:
  - squad2
  - quac

  - squad_v2
  ---

+ ## T5-qa-builder: QA Pair Generation

+ This is a google/flan-t5-base model fine-tuned on question-answering datasets including SQuAD, QuAC, Natural Questions, and a custom Q/A dataset. The model pipeline takes input text and generates JSON output containing question-answer pairs relevant to the input text. When the model is run outside of the pipeline, it returns raw QA markup instead.

+ ## Model Details
+
+ - Model: google/flan-t5-base
+ - Training Data:
+   - SQuAD
+   - QuAC
+   - Natural Questions
+   - Custom Q/A dataset generated from Zephyr 7B β and Mixtral 8x7B Instruct.
+ - Intended Use: Question answering, generating synthetic Q/A pairs from input text
+ - Language: English
+
+ #### Example Usage

  ```python
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
+
+ model = T5ForConditionalGeneration.from_pretrained('sgarbi/t5-qa-builder')
+ tokenizer = AutoTokenizer.from_pretrained('sgarbi/t5-qa-builder')
+
+ # The <qa_builder_context> token tells the model to generate QA pairs for the text that follows.
+ input_text = '''<qa_builder_context>A new breed of AI-powered coding tools have emerged—and they’re claiming to be more autonomous versions of earlier assistants like GitHub Copilot, Amazon CodeWhisperer, and Tabnine.
+
+ One such new entrant, Devin AI, has been dubbed an “AI software engineer” by its maker, applied AI lab Cognition. According to Cognition, Devin can perform all these tasks unassisted: build a website from scratch and deploy it, find and fix bugs in codebases, and even train and fine-tune its own large language model.
+
+ Following its launch, open-source alternatives to Devin have cropped up, including Devika and OpenDevin. Meanwhile makers of established assistants have not been standing still. Researchers at Microsoft, GitHub Copilot’s developer, recently uploaded a paper to the arXiv preprint server introducing AutoDev, which uses autonomous AI agents to generate code and test cases, run tests and check the results, and fix bugs within the test cases.'''  # citation https://spectrum.ieee.org/ai-code-generator
+
+ output = model.generate(tokenizer(input_text, return_tensors='pt').input_ids, max_length=512)
+
+ # Keep special tokens: the QA markup is built from them.
+ decoded = tokenizer.decode(output[0], skip_special_tokens=False)
+ print(decoded)
  ```

+ #### QA Pairs (raw output):
+ ```xml
+ <qa_builder_question>What are some of the new AI-powered coding tools emerging?<qa_builder_answer> Some of the new AI-powered coding tools are claiming to be more autonomous versions of earlier assistants like GitHub Copilot, Amazon CodeWhisperer, and Tabnine.<qa_builder_question> How has Devin AI been dubbed by Cognition?<qa_builder_answer> Devin AI has been dubbed an "AI software engineer" by Cognition.<qa_builder_question> What tasks can Devin AI perform unassisted?<qa_builder_answer> Devin AI can build a website from scratch, find and fix bugs in codebases, train and fine-tune its own large language model.<qa_builder_question> What open-source alternatives to Devin AI have cropped up after its launch?<qa_builder_answer> Open-source alternatives to Devin AI have cropped up, including Devika and OpenDevin.<qa_builder_question> What is the new AI-powered coding tool introduced by researchers at Microsoft?<qa_builder_answer> AutoDev, which uses autonomous AI agents to generate code and test cases, run tests, check results, and fix bugs within test cases, is introduced by researchers at Microsoft.
+ ```
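The raw markup above is just the model's special tokens interleaved with text, so it can be split into (question, answer) pairs with plain string operations. A minimal sketch, assuming the token layout shown above; this parser is an illustration, not part of the model's published API:

```python
def parse_qa_markup(markup: str) -> list:
    """Split raw model output on the <qa_builder_question>/<qa_builder_answer>
    special tokens and return a list of (question, answer) pairs."""
    pairs = []
    for block in markup.split("<qa_builder_question>"):
        if "<qa_builder_answer>" not in block:
            continue  # skip anything before the first question token
        question, answer = block.split("<qa_builder_answer>", 1)
        pairs.append((question.strip(), answer.strip()))
    return pairs

sample = ("<qa_builder_question>What is Devin AI?<qa_builder_answer> An AI software engineer."
          "<qa_builder_question>Who makes it?<qa_builder_answer> Cognition.")
print(parse_qa_markup(sample))
# → [('What is Devin AI?', 'An AI software engineer.'), ('Who makes it?', 'Cognition.')]
```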
+
+ ## Custom pipeline
+
+ #### The easiest way is to use the custom pipeline, which allows more control over generation and places no restriction on text length.
+
+ ```shell
+ pip install git+https://github.com/ESgarbi/t5-qa-builder
+ ```
+
+ #### Run the custom pipeline. Weights will be downloaded from the HF hub automatically. All special tokens are handled internally; there is no need to supply the `<qa_builder_context>` token to initialise the task.
+
+ ```python
+ from sgarbi import QABuilderPipeline
+
+ producer = QABuilderPipeline()
+
+ input_text = '''What is artificial intelligence?
+ If you hear the term artificial intelligence (AI), you might think of self-driving cars, robots, ChatGPT, other AI chatbots, and artificially created images. But it's also important to look behind the outputs of AI and understand how the technology works and its impacts on this and future generations.
+
+ AI is a concept that has been around formally since the 1950s when it was defined as a machine's ability to perform a task that would've previously required human intelligence. This is quite a broad definition that has been modified over decades of research and technological advancements.
+
+ When you consider assigning intelligence to a machine, such as a computer, it makes sense to start by defining the term 'intelligence' -- especially when you want to determine if an artificial system truly deserves it.
+
+ Also: ChatGPT vs. Microsoft Copilot vs. Gemini: Which is the best AI chatbot?
+
+ Our level of intelligence sets us apart from other living beings and is essential to the human experience. Some experts define intelligence as the ability to adapt, solve problems, plan, improvise in new situations, and learn new things.
+
+ With intelligence sometimes seen as the foundation for being human, it's perhaps no surprise that we'd try and recreate it artificially in scientific endeavors.
+
+ Today's AI systems might demonstrate some traits of human intelligence, including learning, problem-solving, perception, and even a limited spectrum of creativity and social intelligence.
+
+ AI comes in different forms and has become widely available in everyday life. The smart speakers on your mantle with Alexa or Google voice assistant built-in are two great examples of AI. Other good examples include popular AI chatbots, such as ChatGPT, the new Bing Chat, and Google Bard.
+
+ When you ask ChatGPT for the capital of a country, or you ask Alexa to give you an update on the weather, the responses come from machine-learning algorithms.
+ '''  # Citation: https://www.zdnet.com/article/what-is-ai-heres-everything-you-need-to-know-about-artificial-intelligence
+
+ result = producer(context=input_text, silent_mode=False, json_output=True)
+ print(result)
+ ```
+
+ #### Output:
+
+ Generating QA pairs: 100%|██████████| 1/1 [00:13<00:00, 13.66s/it]
+
+ ```json
+ {
+   "What is artificial intelligence? If you hear the term artificial intelligence (AI), you might think of self-driving cars, robots, ChatGPT, other AI chatbots, and artificially created images. But it's also important to look behind the outputs of AI and understand how the technology works and its impacts on this and future generations. AI is a concept that has been around formally since the 1950s when it was defined as a machine's ability to perform a task that would've previously required human intelligence. This is quite a broad definition that has been modified over decades of research and technological advancements. When you consider assigning intelligence to a machine, such as a computer, it makes sense to start by defining the term 'intelligence' -- especially when you want to determine if an artificial system truly deserves it. Also: ChatGPT vs. Microsoft Copilot vs. Gemini: Which is the best AI chatbot? Our level of intelligence sets us apart from other living beings and is essential to the human experience. Some experts define intelligence as the ability to adapt, solve problems, plan, improvise in new situations, and learn new things. With intelligence sometimes seen as the foundation for being human, it's perhaps no surprise that we'd try and recreate it artificially in scientific endeavors. Today's AI systems might demonstrate some traits of human intelligence, including learning, problem-solving, perception, and even a limited spectrum of creativity and social intelligence. AI comes in different forms and has become widely available in everyday life. The smart speakers on your mantle with Alexa or Google voice assistant built-in are two great examples of AI. Other good examples include popular AI chatbots, such as ChatGPT, the new Bing Chat, and Google Bard. When you ask ChatGPT for the capital of a country, or you ask Alexa to give you an update on the weather, the responses come from machine-learning algorithms. ": [
+     {
+       "question": "What is artificial intelligence (AI) and how has it been defined?",
+       "answer": "Artificial intelligence (AI) is a concept that has been around formally since the 1950s when it was defined as a machine's ability to perform tasks that would've previously required human intelligence.",
+       "score": 85.98722666501999
+     },
+     {
+       "question": "What is the term 'intelligence' and how has it been modified?",
+       "answer": "The term 'intelligence' has been modified over decades of research and technological advancements, and it has been defined as the ability to adapt, solve problems, plan, improvise in new situations, and learn new things.",
+       "score": 85.98722666501999
+     },
+     {
+       "question": "What are some traits of human intelligence mentioned in the text?",
+       "answer": "Some traits of human intelligence mentioned in the text include learning, problem-solving, perception, and a limited spectrum of creativity and social intelligence.",
+       "score": 85.98722666501999
+     },
+     {
+       "question": "What are some examples of popular AI chatbots?",
+       "answer": "Examples of popular AI chatbots include ChatGPT, the new Bing Chat, and Google Bard.",
+       "score": 85.98722666501999
+     }
+   ]
+ }
+ ```
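The pipeline's return value maps each context string to its list of pairs, so downstream code typically flattens and filters it. A small sketch, assuming the `{context: [{question, answer, score}, ...]}` shape shown above; the helper name and threshold are illustrative, not part of the library:

```python
def top_pairs(result: dict, min_score: float = 80.0) -> list:
    """Flatten {context: [pairs]} output and keep only pairs whose
    score meets the threshold (shape assumed from the example output)."""
    kept = []
    for context, pairs in result.items():
        for pair in pairs:
            if pair["score"] >= min_score:
                # Carry a truncated context along for traceability.
                kept.append({"context": context[:60], **pair})
    return kept

# Tiny result in the same shape as the output above (scores are made up).
example = {
    "What is artificial intelligence? ...": [
        {"question": "What is AI?", "answer": "A machine's ability to perform tasks.", "score": 85.9},
        {"question": "Weak pair", "answer": "...", "score": 42.0},
    ]
}
print(top_pairs(example))
```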

  ## Ethical Considerations
+ ### Limitations and Risks
+ - The model may generate inconsistent Q/A pairs when the input context falls outside the training data and pre-training corpus.
+ - Limited to the English language.
+
+ As with any language model, this QA model may reflect biases present in its training data. The generated outputs should be used with care and fact-checked against reliable sources when possible.
+
+ Additionally, this model has no capability to open links, images, or other media; such inputs cannot be processed.
+
+ ## Broader Use Cases
+
+ While this model was specifically fine-tuned for generating question-answer pairs from input text, its capabilities can be extended to several other use cases:
+
+ ### Query Understanding and Rewriting
+ The model can be used to rephrase input queries in a question-answer format to better understand the user's intent. This can aid in developing more natural conversational AI systems.
+
+ ### Knowledge Base Population (original intended use)
+ The generated question-answer pairs can be used to automatically populate and expand knowledge bases and FAQ systems with relevant information extracted from documents or web pages that feed a RAG system.
+
+ ### Data Augmentation
+ The generated Q/A pairs can augment existing question-answering datasets, reducing annotation costs and improving performance in limited-data scenarios.
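For the augmentation use case, generated pairs can be wrapped in SQuAD-style records before being mixed into an existing training set. A minimal sketch: the record layout follows the common SQuAD format, and `to_squad_records` is a hypothetical helper, not part of this repository:

```python
import uuid

def to_squad_records(context: str, qa_pairs: list) -> list:
    """Wrap generated Q/A pairs in SQuAD-style records.

    answer_start is located by substring search; it is -1 when the
    generated answer is abstractive and does not appear verbatim in
    the context, so filter those records out before extractive training.
    """
    records = []
    for pair in qa_pairs:
        records.append({
            "id": uuid.uuid4().hex,
            "context": context,
            "question": pair["question"],
            "answers": {
                "text": [pair["answer"]],
                "answer_start": [context.find(pair["answer"])],
            },
        })
    return records

context = "Paris is the capital of France."
pairs = [{"question": "What is the capital of France?", "answer": "Paris"}]
print(to_squad_records(context, pairs)[0]["answers"])
# → {'text': ['Paris'], 'answer_start': [0]}
```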
+
+ ## Risks and Limitations
+
+ - **Domain Shift**: Performance may degrade on inputs significantly different from the combined pretraining and fine-tuning datasets.
+
+ - **Exposure Bias**: The model is trained with teacher-forced maximum likelihood, so it never conditions on its own predictions during training; errors can therefore compound during autoregressive inference.
+
+ ## Software and Dependencies
+
+ A requirements.txt is included with the custom pipeline in the https://github.com/ESgarbi/t5-qa-builder repository.
+
+ - Python 3.8
+ - Transformers 4.17.0
+ - PyTorch 1.10.0
+ - SentenceTokenizer
+ - Datasets
+
+ ## Citing This Model
+
+ If you use this model or the generated outputs in your work, please cite:
+
+ ```bibtex
+ @misc{sgarbi_t5_qa_builder,
+   author = {Erick Sgarbi},
+   title = {QA Pair Generation},
+   year = {2024}
+ }
+ ```