Commit effad5d by chilly-magician
Parent(s): b762f14
[add]: base model card description
Files changed:
- README.md +180 -52
- requirements.txt +3 -1
README.md
CHANGED
@@ -3,57 +3,201 @@ library_name: peft
|
|
3 |
base_model: tiiuae/falcon-7b-instruct
|
4 |
---
|
5 |
|
6 |
-
# Model Card for
|
7 |
|
8 |
-
|
|
|
|
|
9 |
|
|
|
|
|
10 |
|
|
|
|
|
|
|
11 |
|
12 |
-
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
<!-- Provide a longer summary of what this model is. -->
|
17 |
|
|
|
|
|
|
|
18 |
|
|
|
|
|
19 |
|
20 |
-
|
21 |
-
- **Funded by [optional]:** [More Information Needed]
|
22 |
-
- **Shared by [optional]:** [More Information Needed]
|
23 |
-
- **Model type:** [More Information Needed]
|
24 |
-
- **Language(s) (NLP):** [More Information Needed]
|
25 |
-
- **License:** [More Information Needed]
|
26 |
-
- **Finetuned from model [optional]:** [More Information Needed]
|
27 |
-
|
28 |
-
### Model Sources [optional]
|
29 |
|
30 |
-
|
31 |
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
|
36 |
## Uses
|
37 |
|
38 |
-
|
39 |
|
40 |
### Direct Use
|
41 |
|
42 |
-
|
43 |
-
|
44 |
-
|
45 |
-
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
|
50 |
-
|
51 |
-
|
52 |
-
|
53 |
-
|
54 |
-
|
55 |
-
|
56 |
-
|
57 |
|
58 |
## Bias, Risks, and Limitations
|
59 |
|
@@ -168,22 +312,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
|
|
168 |
|
169 |
[More Information Needed]
|
170 |
|
171 |
-
## Citation [optional]
|
172 |
-
|
173 |
-
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
|
174 |
-
|
175 |
-
**BibTeX:**
|
176 |
-
|
177 |
-
[More Information Needed]
|
178 |
-
|
179 |
-
**APA:**
|
180 |
-
|
181 |
-
[More Information Needed]
|
182 |
-
|
183 |
-
## Glossary [optional]
|
184 |
-
|
185 |
-
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
|
186 |
-
|
187 |
[More Information Needed]
|
188 |
|
189 |
## More Information [optional]
|
|
|
3 |
base_model: tiiuae/falcon-7b-instruct
|
4 |
---
|
5 |
|
6 |
+
# Model Card for the Query Parser LLM using Falcon-7B-Instruct
|
7 |
|
8 |
+
[![version](https://img.shields.io/badge/version-0.0.1-red.svg)]()
|
9 |
+
[![Python 3.9](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
|
10 |
+
![CUDA 11.7.1](https://img.shields.io/badge/CUDA-11.7.1-green.svg)
|
11 |
|
12 |
+
EmbeddingStudio is an [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that lets you transform a combined "Embedding Model + Vector DB" setup into
|
13 |
+
a full-cycle search engine: collect clickstream -> improve search experience -> adapt embedding model and repeat, out of the box.
|
14 |
|
15 |
+
Companies very rarely use unstructured search as is. For example, by searching `brick red houses san francisco area for april`
|
16 |
+
a user most likely wants to find houses in San Francisco for a month-long rental in April, and only then, perhaps, brick-red houses.
|
17 |
+
Unfortunately, as of 15 January 2024 no embedding model is accurate enough to do this. So companies need to mix structured and unstructured search.
|
18 |
|
19 |
+
The very first step in mixing them is to parse the search query. The usual approaches are:
|
20 |
+
* Implement a bunch of rules, regexps, or grammar parsers (like [NLTK grammar parser](https://www.nltk.org/howto/grammar.html)).
|
21 |
+
* Collect search queries and annotate a dataset for an NER task.
|
|
|
|
|
22 |
|
23 |
+
This takes some time, but in the end you get a controllable and very accurate query parser.
|
24 |
+
The EmbeddingStudio team decided to dive into LLM instruct fine-tuning for the `Zero-Shot query parsing` task
|
25 |
+
to close the initial gap while a company has no rules yet and no data collected, or even, in the future, to eliminate exhausting rule implementation altogether.
|
26 |
|
27 |
+
The main idea is to align an LLM to parse short search queries knowing only the company's market and a schema of search filters. Moreover, being oriented toward applied NLP,
|
28 |
+
we are trying to serve only lightweight LLMs, i.e. `not heavier than 7B parameters`.
|
29 |
|
30 |
+
## Model Details
|
31 |
|
32 |
+
### Model Description
|
33 |
|
34 |
+
This is simply [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) aligned to follow instructions like:
|
35 |
+
```markdown
|
36 |
+
### System: Master in Query Analysis
|
37 |
+
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
|
38 |
+
#### Category: Logistics and Supply Chain Management
|
39 |
+
#### Schema: ```[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]```
|
40 |
+
#### Query: Which logistics companies in the US have a perfect 5.0 rating ?
|
41 |
+
### Response:
|
42 |
+
[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
|
43 |
+
```
|
44 |
+
|
45 |
+
**Important:** Additionally, we are trying to fine-tune the Large Language Model (LLM) to not only parse unstructured search queries but also to correct spelling.
|
46 |
+
|
47 |
+
- **Developed by EmbeddingStudio team:**
|
48 |
+
* Aleksandr Iudaev | [LinkedIn](https://www.linkedin.com/in/alexanderyudaev/) | [Email](mailto:alexander@yudaev.ru) |
|
49 |
+
* Andrei Kostin | [LinkedIn](https://www.linkedin.com/in/andrey-kostin/) | [Email](mailto:andreynitsok@gmail.com) |
|
50 |
+
* ML Doom | `AI Assistant`
|
51 |
+
- **Funded by EmbeddingStudio team**
|
52 |
+
- **Model type:** Instruct Fine-Tuned Large Language Model
|
53 |
+
- **Model task:** Zero-shot search query parsing
|
54 |
+
- **Language(s) (NLP):** English
|
55 |
+
- **License:** apache-2.0
|
56 |
+
- **Finetuned from model:** [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
|
57 |
+
- **!Maximal Length Size:** we used 1024 for fine-tuning, which differs significantly from the original model's `max_seq_length = 2048`
|
58 |
+
- **Tuning Epochs:** 3 for now; more will follow later.
|
59 |
+
|
60 |
+
**Disclaimer:** We are a small startup, and this direction forms part of our Minimum Viable Product (MVP). It's more of
|
61 |
+
an attempt to test 'product-market fit' than a well-structured scientific endeavor. Once we have checked it and gone ahead with a round, we will definitely be:
|
62 |
+
* Curating a specific dataset for more precise analysis.
|
63 |
+
* Exploring various approaches and Large Language Models (LLMs) to identify the most effective solution.
|
64 |
+
* Publishing a detailed paper to ensure our findings and methodologies can be thoroughly reviewed and verified.
|
65 |
+
|
66 |
+
We acknowledge the complexity involved in utilizing Large Language Models, particularly in the context
|
67 |
+
of `Zero-Shot search query parsing` and `AI Alignment`. Given the intricate nature of this technology, we emphasize the importance of rigorous verification.
|
68 |
+
Until our work is thoroughly reviewed, we recommend being cautious and critical of the results.
|
69 |
+
|
70 |
+
### Model Sources
|
71 |
+
|
72 |
+
- **Repository:** the inference code for the model will be [here](https://github.com/EulerSearch/embedding_studio/tree/main)
|
73 |
+
- **Paper:** Work In Progress
|
74 |
+
- **Demo:** Work In Progress
|
75 |
|
76 |
## Uses
|
77 |
|
78 |
+
We strongly recommend only the direct usage of this fine-tuned version of [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct):
|
79 |
+
* Zero-shot Search Query Parsing with a provided company market name and filters schema
|
80 |
+
* Search Query Spell Correction
|
81 |
+
|
82 |
+
For any other needs the behaviour of the model is unpredictable; please use the [original model](https://huggingface.co/tiiuae/falcon-7b-instruct) or fine-tune your own.
|
83 |
+
|
84 |
+
### Instruction format
|
85 |
+
|
86 |
+
```markdown
|
87 |
+
### System: Master in Query Analysis
|
88 |
+
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
|
89 |
+
#### Category: {your_company_category}
|
90 |
+
#### Schema: ```{filters_schema}```
|
91 |
+
#### Query: {query}
|
92 |
+
### Response:
|
93 |
+
```
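
As a minimal sketch of how the format above might be filled in, the snippet below substitutes a category, a schema, and a query into the same layout. The helper name `build_prompt` and the `FENCE` constant are our own illustrative choices, not part of the model's code; the one-filter schema is a shortened version of the example used in this card.

```python
import json

# Triple backtick, built programmatically so it does not clash with this example's own fence.
FENCE = "`" * 3

# Same layout as the instruction format above; placeholder names follow the card.
PROMPT_TEMPLATE = (
    "\n"
    "### System: Master in Query Analysis\n"
    "### Instruction: Organize queries in JSON, adhere to schema, verify spelling.\n"
    "#### Category: {your_company_category}\n"
    "#### Schema: " + FENCE + "{filters_schema}" + FENCE + "\n"
    "#### Query: {query}\n"
    "### Response:\n"
)


def build_prompt(category: str, schema: list, query: str) -> str:
    """Hypothetical helper: serialize the schema to one JSON line and fill the template."""
    return PROMPT_TEMPLATE.format(
        your_company_category=category,
        filters_schema=json.dumps(schema),
        query=query,
    )


example_schema = [
    {
        "Name": "Destination_Country",
        "Representations": [
            {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}
        ],
    }
]
prompt = build_prompt(
    "Logistics and Supply Chain Management",
    example_schema,
    "logistics companies in the US",
)
print(prompt)
```

Serializing the schema with `json.dumps` keeps it on a single line, which matches the single-line schema shown in the examples of this card.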
|
94 |
+
|
95 |
+
The filters schema is a JSON-readable line in the following format (we highly recommend that you use it):
|
96 |
+
List of filters (dict):
|
97 |
+
* Name - name of filter (better to be meaningful).
|
98 |
+
* Representations - list of possible filter formats (dict):
|
99 |
+
* Name - name of representation (better to be meaningful).
|
100 |
+
* Type - Python base type (int, float, str, bool).
|
101 |
+
* Examples - list of examples.
|
102 |
+
* Enum - if a representation is an enumeration, provide a list of possible values; the LLM should map the parsed value onto this list.
|
103 |
+
* Pattern - if a representation is pattern-like (datetime, regexp, etc.), provide the pattern text in any format.
|
104 |
+
|
105 |
+
Example:
|
106 |
+
```json
|
107 |
+
[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]
|
108 |
+
```
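
Before sending a schema to the model, it may help to check that it follows the structure described above. The sketch below is one possible sanity check, not part of the project's code: the function name `validate_schema` and the specific rules (required `Name`, non-empty `Representations`, `Type` limited to the four base types) are our assumptions drawn from the field list.

```python
ALLOWED_TYPES = {"int", "float", "str", "bool"}


def validate_schema(schema: list) -> list:
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = []
    for i, flt in enumerate(schema):
        if not isinstance(flt, dict) or "Name" not in flt:
            problems.append(f"filter #{i}: missing 'Name'")
            continue
        representations = flt.get("Representations")
        if not isinstance(representations, list) or not representations:
            problems.append(f"{flt['Name']}: 'Representations' must be a non-empty list")
            continue
        for rep in representations:
            if not isinstance(rep, dict):
                problems.append(f"{flt['Name']}: each representation must be a dict")
                continue
            label = f"{flt['Name']}.{rep.get('Name', '?')}"
            if "Name" not in rep:
                problems.append(f"{label}: missing 'Name'")
            if rep.get("Type") not in ALLOWED_TYPES:
                problems.append(f"{label}: 'Type' must be one of {sorted(ALLOWED_TYPES)}")
            if "Enum" in rep and not isinstance(rep["Enum"], list):
                problems.append(f"{label}: 'Enum' must be a list")
    return problems


good = [
    {
        "Name": "Date_Period",
        "Representations": [
            {"Name": "Quarter", "Type": "str", "Examples": ["Q1"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}
        ],
    }
]
print(validate_schema(good))
```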
|
109 |
+
|
110 |
+
As a result, the response will be a JSON-readable line in the format:
|
111 |
+
```json
|
112 |
+
[{"Value": "Corrected search phrase", "Name": "Correct"}, {"Name": "filter-name.representation", "Value": "some-value"}]
|
113 |
+
```
|
114 |
+
|
115 |
+
Field and representation names will be aligned with the provided schema. Example:
|
116 |
+
```json
|
117 |
+
[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
|
118 |
+
```
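
Downstream code usually needs the corrected query and the filters separately. The following sketch shows one way to split a response line, assuming (as in the format above) that the item named `Correct` carries the corrected query and every other item is named `filter-name.representation`. The function name `split_response` is hypothetical.

```python
import json


def split_response(response_line: str):
    """Separate the corrected query from the parsed filters in a response line."""
    corrected = None
    filters = {}
    for item in json.loads(response_line):
        if item["Name"] == "Correct":
            corrected = item["Value"]
        else:
            # "filter-name.representation" -> filter name and representation name
            filter_name, _, representation = item["Name"].partition(".")
            filters.setdefault(filter_name, {})[representation] = item["Value"]
    return corrected, filters


response = (
    '[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, '
    '{"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, '
    '{"Name": "Destination_Country.Country_Code", "Value": "US"}]'
)
corrected, filters = split_response(response)
print(corrected)
print(filters)
```

The nested dictionary keeps several representations of the same filter together, which can be convenient when building a structured search request.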
|
119 |
+
|
120 |
+
|
121 |
+
`system` phrases used for fine-tuning:
|
122 |
+
```python
|
123 |
+
[
|
124 |
+
"Expert at Deconstructing Search Queries",
|
125 |
+
"Master in Query Analysis",
|
126 |
+
"Premier Search Query Interpreter",
|
127 |
+
"Advanced Search Query Decoder",
|
128 |
+
"Search Query Parsing Genius",
|
129 |
+
"Search Query Parsing Wizard",
|
130 |
+
"Unrivaled Query Parsing Mechanism",
|
131 |
+
"Search Query Parsing Virtuoso",
|
132 |
+
"Query Parsing Maestro",
|
133 |
+
"Ace of Search Query Structuring"
|
134 |
+
]
|
135 |
+
```
|
136 |
+
|
137 |
+
`instruction` phrases used for fine-tuning:
|
138 |
+
```python
|
139 |
+
[
|
140 |
+
"Convert queries to JSON, align with schema, ensure correct spelling.",
|
141 |
+
"Analyze and structure queries in JSON, maintain schema, check spelling.",
|
142 |
+
"Organize queries in JSON, adhere to schema, verify spelling.",
|
143 |
+
"Decode queries to JSON, follow schema, correct spelling.",
|
144 |
+
"Parse queries to JSON, match schema, spell correctly.",
|
145 |
+
"Transform queries to structured JSON, align with schema and spelling.",
|
146 |
+
"Restructure queries in JSON, comply with schema, accurate spelling.",
|
147 |
+
"Rearrange queries in JSON, strict schema adherence, maintain spelling.",
|
148 |
+
"Harmonize queries with JSON schema, ensure spelling accuracy.",
|
149 |
+
"Efficient JSON conversion of queries, schema compliance, correct spelling."
|
150 |
+
]
|
151 |
+
```
|
152 |
|
153 |
### Direct Use
|
154 |
|
155 |
+
```python
import json
from json import JSONDecodeError

from transformers import AutoTokenizer, AutoModelForCausalLM

INSTRUCTION_TEMPLATE = """
### System: Master in Query Analysis
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
#### Category: {0}
#### Schema: ```{1}```
#### Query: {2}
### Response:
"""


def parse(
    query: str,
    company_category: str,
    filter_schema: list,  # list of filter dicts, as described in the schema format
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer
):
    # Fill the template with the category, the JSON-serialized schema, and the query
    input_text = INSTRUCTION_TEMPLATE.format(
        company_category,
        json.dumps(filter_schema),
        query
    )
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Generating text
    output = model.generate(input_ids.to('cuda'),
                            max_new_tokens=1024,
                            do_sample=True,
                            temperature=0.05,
                            pad_token_id=50256
                            )
    try:
        # Keep only the text after the response marker and decode it as JSON
        parsed = json.loads(tokenizer.decode(output[0], skip_special_tokens=True).split('## Response:\n')[-1])
    except JSONDecodeError:
        # The model did not return valid JSON
        parsed = dict()

    return parsed
```
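
Since the model's output is not guaranteed to respect the schema, a light post-check on the parsed result may be useful. The sketch below is an assumption on our side, not part of the project's code: the hypothetical `filter_valid_items` keeps the corrected query and drops any item whose name is absent from the schema or whose value falls outside the representation's `Enum`.

```python
def filter_valid_items(parsed, schema: list) -> list:
    """Keep the corrected query and only those filter items that match the schema."""
    # Map "Filter.Representation" -> representation definition
    known = {
        f"{flt['Name']}.{rep['Name']}": rep
        for flt in schema
        for rep in flt["Representations"]
    }
    valid = []
    for item in parsed:
        if item.get("Name") == "Correct":
            valid.append(item)
            continue
        rep = known.get(item.get("Name"))
        if rep is None:
            continue  # name not present in the schema
        if "Enum" in rep and item.get("Value") not in rep["Enum"]:
            continue  # value outside the allowed enumeration
        valid.append(item)
    return valid


schema = [
    {
        "Name": "Date_Period",
        "Representations": [
            {"Name": "Quarter", "Type": "str", "Enum": ["Q1", "Q2", "Q3", "Q4"]}
        ],
    }
]
parsed = [
    {"Value": "deliveries in the second quarter", "Name": "Correct"},
    {"Name": "Date_Period.Quarter", "Value": "Q5"},
    {"Name": "Date_Period.Quarter", "Value": "Q2"},
    {"Name": "Unknown.Field", "Value": 1},
]
print(filter_valid_items(parsed, schema))
```

An empty result from a failed JSON decode (an empty `dict`) passes through this check as an empty list, so the caller can treat both cases the same way.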
|
201 |
|
202 |
## Bias, Risks, and Limitations
|
203 |
|
|
|
312 |
|
313 |
[More Information Needed]
|
314 |
|
315 |
[More Information Needed]
|
316 |
|
317 |
## More Information [optional]
|
requirements.txt
CHANGED
@@ -1,9 +1,11 @@
|
|
|
|
1 |
datasets==2.16.1
|
2 |
nltk==3.8.1
|
3 |
huggingface-hub==0.19.4
|
|
|
4 |
torch==2.0.0+cu117
|
5 |
torchmetrics==1.2.0
|
6 |
torchsummary==1.5.1
|
7 |
torchtext==0.15.0+cpu
|
8 |
transformers==4.36.2
|
9 |
-
trl==0.7.7
|
|
|
1 |
+
bitsandbytes==0.41.0
|
2 |
datasets==2.16.1
|
3 |
nltk==3.8.1
|
4 |
huggingface-hub==0.19.4
|
5 |
+
peft==0.5.0
|
6 |
torch==2.0.0+cu117
|
7 |
torchmetrics==1.2.0
|
8 |
torchsummary==1.5.1
|
9 |
torchtext==0.15.0+cpu
|
10 |
transformers==4.36.2
|
11 |
+
trl==0.7.7
|