Commit 949149f · aorogat committed · verified · 1 Parent(s): 04a3fd4

Update README.md

Files changed (1): README.md (+158 -8)
README.md CHANGED
@@ -1,26 +1,176 @@
---
datasets:
- - aorogat/QueryBridge
+ - USERNAME/QueryBridge
---
+ - This model is a fine-tuned version of llama3 trained with LoRA. We used TorchTune to fine-tune it; the section "How We Fine-Tuned the Model" below describes the process.
+
# Model Overview

- This model employs **fine-tuning** using **Low-Rank Adaptation (LoRA)** for mapping questions to tagged questions.
+ This model is a fine-tuned version of llama3 trained on the [QueryBridge dataset](https://huggingface.co/datasets/USERNAME/QueryBridge). We used **Low-Rank Adaptation (LoRA)** to train it to tag question components with the tags in the table below. The demo video shows a tagged question and how we visualize it after converting it to a graph representation.

The tagged questions in the QueryBridge dataset are designed to train language models to understand the components and structure of a question effectively. By annotating questions with specific tags such as `<qt>`, `<p>`, `<o>`, and `<s>`, we provide a detailed breakdown of each question's elements, which helps the model grasp the roles of its different components.

For example, the video below demonstrates how a model can be trained to interpret these tagged questions. We convert the annotated questions into a graph representation that visually maps out the relationships and roles within each question. This graph-based representation facilitates the construction of queries in languages such as SPARQL, SQL, and Cypher by translating the structured understanding into executable query formats. This approach not only enhances the model's ability to parse and generate queries across different languages but also ensures consistency and accuracy in query formulation.

<a href="https://youtu.be/J_N-6m8fHz0">
<img src="https://cdn-uploads.huggingface.co/production/uploads/664adb4a691370727c200af0/sDfp7DiYrGKvH58KdXOIY.png" alt="Training Model with Tagged Questions" width="400" height="300" />
</a>

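As a concrete illustration of the pipeline described above, the sketch below pairs a tagged question with one query it could be compiled into. The tagging follows the tag table later in this README; the SPARQL, including the DBpedia-style IRIs `dbr:Canada` and `dbo:capital`, is an assumed mapping for illustration, not output of this model.

```sparql
# Tagged question (scheme from the tag table below):
#   <qt>What</qt> <p>is the capital of</p> <o>Canada</o>?
# One possible compiled query, assuming DBpedia-style IRIs:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?answer WHERE {
  dbr:Canada dbo:capital ?answer .
}
```
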
- ## Fine-Tuning with LoRA
- - Fine-tuning adjusts the model's parameters for specific tasks, enhancing its ability to handle nuanced requirements.
- - **LoRA** allows efficient updates by modifying only a subset of model weights, significantly reducing computational overhead while maintaining or improving performance.
- - The fine-tuned **llama3-Finetuned** model demonstrates exceptional stability and accuracy, achieving perfect scores with no error margins.

# Tags Used in Tagged Questions

The tagging system categorizes the different components of a question as follows:

| Tag | Description |
|-------|-------------|
| `<qt>` | **Question Type**: Tags the keywords or phrases that denote the type of question being asked, such as 'What', 'Who', 'How many', etc. This tag helps determine the type of SPARQL query to generate. Example: In "What is the capital of Canada?", the tag `<qt>What</qt>` indicates that the question is asking for an entity retrieval. |
| `<o>` | **Object Entities**: Tags entities that are objects in the question. These are usually noun phrases referring to the entities being described or queried. Example: In "What is the capital of Canada?", the term 'Canada' is tagged as `<o>Canada</o>`. |
| `<s>` | **Subject Entities**: Tags entities that are subjects in Yes-No questions. This tag is used exclusively for questions that can be answered with 'Yes' or 'No'. Example: In "Is Ottawa the capital of Canada?", the entity 'Ottawa' is tagged as `<s>Ottawa</s>`. |
| `<p>` | **Predicates**: Tags predicates that represent relationships or attributes in the knowledge graph. Predicates can be verb phrases or noun phrases that describe how entities are related. Example: In "What is the capital of Canada?", the phrase 'is the capital of' is tagged as `<p>is the capital of</p>`. |
| `<cc>` | **Coordinating Conjunctions**: Tags conjunctions that connect multiple predicates or entities in complex queries. These include words like 'and', 'or', and 'nor'. They influence how the SPARQL query combines conditions. Example: In "Who is the CEO and founder of Apple Inc?", the conjunction 'and' is tagged as `<cc>and</cc>`. |
| `<off>`| **Offsets**: Tags terms that indicate position or order in a sequence, such as 'first', 'second', etc. These are used in questions asking for ordinal positions. Example: In "What is the second largest country?", the word 'second' is tagged as `<off>second</off>`. |
| `<t>` | **Entity Types**: Tags terms that describe the type or category of the entities involved in the question, such as 'person', 'place', or 'organization'. Example: In "Which film was directed by Garry Marshall?", the type 'film' might be tagged as `<t>film</t>`. |
| `<op>` | **Operators**: Tags operators used in questions that involve comparisons or calculations, such as 'greater than', 'less than', 'more than'. Example: In "Which country has a population greater than 50 million?", the operator 'greater than' is tagged as `<op>greater than</op>`. |
| `<ref>`| **References**: Tags words that refer back to previously mentioned entities or concepts. These can indicate cycles or self-references in queries. Example: In "Who is the CEO of the company founded by himself?", the word 'himself' is tagged as `<ref>himself</ref>`. |

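Putting several rows together, fully tagged questions composed from the table's own examples look like the lines below. These are illustrative compositions, not dataset samples; plain values such as '50 million' are left untagged because the table assigns them no tag.

```text
<qt>What</qt> <p>is the capital of</p> <o>Canada</o>?
<qt>Which</qt> <t>country</t> <p>has a population</p> <op>greater than</op> 50 million?
```
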
## How to Use the Model

To use the model, run it through TorchTune commands. The Python code below automates the process. Follow these steps to get started:

### Step 1: Create a Configuration File
First, save a file named `custom_generation_config_bigModel.yaml` in `/home/USERNAME/` with the following content:

```yaml
# Config for running the InferenceRecipe in generate.py to generate output from an LLM

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/USERNAME/Meta-Llama-3-8B/
  checkpoint_files: [
    meta_model_0.pt
  ]
  output_dir: /home/USERNAME/Meta-Llama-3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/USERNAME/Meta-Llama-3-8B/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "### Instruction: \nYou are a powerful model trained to convert questions to tagged questions. Use the tags as follows: \n<qt> to surround question keywords like 'What', 'Who', 'Which', 'How many', 'Return' or any word that represents requests. \n<o> to surround entities as an object like person name, place name, etc. It must be a noun or a noun phrase. \n<s> to surround entities as a subject like person name, place name, etc. The difference between <s> and <o>, <s> only appear in yes/no questions as in the training data you saw before. \n<cc> to surround coordinating conjunctions that connect two or more phrases like 'and', 'or', 'nor', etc. \n<p> to surround predicates that may be an entity attribute or a relationship between two entities. It can be a verb phrase or a noun phrase. The question must contain at least one predicate. \n<off> for offset in questions asking for the second, third, etc. For example, the question 'What is the second largest country?', <off> will be located as follows. 'What is the <off>second</off> largest country?' \n<t> to surround entity types like person, place, etc. \n<op> to surround operators that compare quantities or values, like 'greater than', 'more than', etc. \n<ref> to indicate a reference within the question that requires a cycle to refer back to an entity (e.g., 'Who is the CEO of a company founded by himself?' where 'himself' would be tagged as <ref>himself</ref>). \nInput: Which films directed by a director died in 2014 and starring both Julia Roberts and Richard Gere?\nResponse:"
max_new_tokens: 100
temperature: 0.6
top_k: 1

quantizer: null
```

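The configuration above expects the base Llama 3 weights under `/home/USERNAME/Meta-Llama-3-8B/`. If you have not fetched them yet, the torchtune tutorial linked at the end of this README downloads them with `tune download` (available once TorchTune is installed in Step 2); a sketch, with the access token left as a placeholder:

```bash
tune download meta-llama/Meta-Llama-3-8B \
  --output-dir /home/USERNAME/Meta-Llama-3-8B \
  --hf-token <YOUR_HF_TOKEN>
```
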
### Step 2: Set Up the Environment
Create and activate a virtual environment:

```bash
python3 -m venv /home/USERNAME/myenv
source /home/USERNAME/myenv/bin/activate
```

Install TorchTune with:
```bash
pip install torchtune
```

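With TorchTune installed, you can sanity-check the Step 1 configuration by invoking the generation recipe directly; this is the same command the script in Step 3 runs:

```bash
tune run generate --config /home/USERNAME/custom_generation_config_bigModel.yaml
```
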
### Step 3: Create the Python File
Next, create a Python file called `command.py` with the following content:

```python
import subprocess
import re

def _create_config_file(question):
    # Paths to the config template and the generated config file
    template_path = "/home/USERNAME/custom_generation_config_bigModel.yaml"
    output_path = "/tmp/dynamic_generation.yaml"

    # Load the template from the file
    with open(template_path, 'r') as file:
        config_template = file.read()

    # Swap the placeholder question in the template's prompt for the actual question
    updated_prompt = config_template.replace(
        "Input: Which films directed by a director died in 2014 and starring both Julia Roberts and Richard Gere?",
        f"Input: {question}"
    )

    # Scale the generation budget with the question length
    maxLen = int(1.3 * len(question))
    print(f"maxLen: {maxLen}")
    updated_prompt = updated_prompt.replace("max_new_tokens: 100", f"max_new_tokens: {maxLen}")

    # Write the updated configuration to a new file
    with open(output_path, 'w') as file:
        file.write(updated_prompt)

    print(f"Configuration file created at: {output_path}")

def get_tagged_question(question):
    # Path to the virtual environment's activation script
    activate_env = "/home/USERNAME/myenv/bin/activate"

    # Create the configuration file for this question
    _create_config_file(question)

    # Command to run within the virtual environment
    command = "tune run generate --config /tmp/dynamic_generation.yaml"
    full_command = f"source {activate_env} && {command}"

    response_text = None
    try:
        # Run the full command in a bash shell
        result = subprocess.run(full_command, shell=True, check=True, text=True,
                                capture_output=True, executable="/bin/bash")
        output = result.stdout + result.stderr

        # Extract the echoed input and the generated response from the recipe output
        input_match = re.search(r'Input: (.*?)(?=Response:)', output, re.S)
        response_match = re.search(r'Response: (.*)', output)

        if input_match and response_match:
            response_text = response_match.group(1).strip()
            print("Input Question: ", question)
            print("Extracted Response: ", response_text)
        else:
            print("Input or Response not found in the output.")

    except subprocess.CalledProcessError as e:
        print("An error occurred:", e.stderr)

    return response_text

if __name__ == "__main__":
    # Call the function with a sample question
    get_tagged_question("Who is the president of the largest country in Africa?")
```

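Because `command.py` exposes `get_tagged_question`, you can also call it from your own code instead of editing the `__main__` block. A minimal sketch; the exact tags in the returned string depend on the model's output:

```python
# usage_example.py -- assumes command.py is importable from the working directory
from command import get_tagged_question

tagged = get_tagged_question("What is the capital of Canada?")
if tagged is not None:
    # Expected shape, per the tag table above:
    # <qt>What</qt> <p>is the capital of</p> <o>Canada</o>?
    print(tagged)
```
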
### Step 4: Run the Script
To run the script and generate tagged questions, execute the following command in your terminal:

```bash
python command.py
```

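A successful run prints output of roughly the following shape (illustrative; the tagged response comes from the model and will vary):

```text
maxLen: 70
Configuration file created at: /tmp/dynamic_generation.yaml
Input Question:  Who is the president of the largest country in Africa?
Extracted Response:  <qt>Who</qt> <p>is the president of</p> ...
```
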
## How We Fine-Tuned the Model

### Model Configuration
See https://pytorch.org/torchtune/stable/tutorials/e2e_flow.html to learn how to use torchtune.

To finetune the model: