chilly-magician committed on
Commit
effad5d
1 Parent(s): b762f14

[add]: base model card description

Browse files
Files changed (2)
  1. README.md +180 -52
  2. requirements.txt +3 -1
README.md CHANGED
@@ -3,57 +3,201 @@ library_name: peft
  base_model: tiiuae/falcon-7b-instruct
  ---
 
- # Model Card for Model ID
 
- <!-- Provide a quick summary of what the model is/does. -->
 
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
 
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
 
- <!-- Provide the basic links for the model. -->
 
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
 
  ## Uses
 
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
  ### Direct Use
 
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
 
  ## Bias, Risks, and Limitations
 
@@ -168,22 +312,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 
  [More Information Needed]
 
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
  [More Information Needed]
 
  ## More Information [optional]
 
  base_model: tiiuae/falcon-7b-instruct
  ---
 
+ # Model Card for the Query Parser LLM using Falcon-7B-Instruct
 
+ [![version](https://img.shields.io/badge/version-0.0.1-red.svg)]()
+ [![Python 3.9](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
+ ![CUDA 11.7.1](https://img.shields.io/badge/CUDA-11.7.1-green.svg)
 
+ EmbeddingStudio is an [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that allows you to transform a joint "Embedding Model + Vector DB" setup into
+ a full-cycle search engine: collect clickstream -> improve the search experience -> adapt the embedding model, and repeat, out of the box.
 
+ It is rare for a company to use unstructured search as is. A user searching for `brick red houses san francisco area for april`
+ most likely wants to find houses in San Francisco available for a month-long rent in April, and only then, maybe, brick-red ones.
+ Unfortunately, as of January 15, 2024, there is no embedding model accurate enough for this, so companies need to mix structured and unstructured search.
 
+ The very first step of this mixing is to parse the search query. The usual approaches are:
+ * Implement a bunch of rules, regexps, or grammar parsers (like the [NLTK grammar parser](https://www.nltk.org/howto/grammar.html)).
+ * Collect search queries and annotate a dataset for a NER task.
 
+ This takes time, but in the end you get a controllable and very accurate query parser.
+ The EmbeddingStudio team decided to dive into LLM instruct fine-tuning for the `Zero-Shot query parsing` task
+ to close this gap while a company has no rules implemented and no data collected yet, and, in the future, perhaps even to eliminate exhausting rule implementation entirely.
 
+ The main idea is to align an LLM to parse short search queries knowing only the company's market and the schema of its search filters. Moreover, being oriented toward applied NLP,
+ we aim to serve only lightweight LLMs, i.e. `not heavier than 7B parameters`.
 
+ ## Model Details
 
+ ### Model Description
 
+ This is simply [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) aligned to follow instructions like:
+ ```markdown
+ ### System: Master in Query Analysis
+ ### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
+ #### Category: Logistics and Supply Chain Management
+ #### Schema: ```[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]```
+ #### Query: Which logistics companies in the US have a perfect 5.0 rating?
+ ### Response:
+ [{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
+ ```
+
+ **Important:** Additionally, we fine-tune the LLM not only to parse unstructured search queries but also to correct spelling.
+
+ - **Developed by EmbeddingStudio team:**
+   * Aleksandr Iudaev | [LinkedIn](https://www.linkedin.com/in/alexanderyudaev/) | [Email](mailto:alexander@yudaev.ru) |
+   * Andrei Kostin | [LinkedIn](https://www.linkedin.com/in/andrey-kostin/) | [Email](mailto:andreynitsok@gmail.com) |
+   * ML Doom | `AI Assistant`
+ - **Funded by EmbeddingStudio team**
+ - **Model type:** Instruct Fine-Tuned Large Language Model
+ - **Model task:** Zero-shot search query parsing
+ - **Language(s) (NLP):** English
+ - **License:** apache-2.0
+ - **Finetuned from model:** [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
+ - **Maximal sequence length:** we used 1024 for fine-tuning; note that this differs from the original model's `max_seq_length = 2048`
+ - **Tuning epochs:** 3 for now; more to come later
+
+ **Disclaimer:** As a small startup, this direction forms part of our Minimum Viable Product (MVP). It is more of
+ an attempt to test product-market fit than a well-structured scientific endeavor. Once we validate it and raise a round, we definitely will:
+ * Curate a specific dataset for more precise analysis.
+ * Explore various approaches and Large Language Models (LLMs) to identify the most effective solution.
+ * Publish a detailed paper so that our findings and methodologies can be thoroughly reviewed and verified.
+
+ We acknowledge the complexity involved in utilizing Large Language Models, particularly in the context
+ of `Zero-Shot search query parsing` and `AI Alignment`. Given the intricate nature of this technology, we emphasize the importance of rigorous verification.
+ Until our work is thoroughly reviewed, we recommend being cautious and critical of the results.
+
+ ### Model Sources
+
+ - **Repository:** inference code for the model will be available [here](https://github.com/EulerSearch/embedding_studio/tree/main)
+ - **Paper:** Work In Progress
+ - **Demo:** Work In Progress
 
  ## Uses
 
+ We strongly recommend only the following direct uses of this fine-tuned version of [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct):
+ * Zero-shot search query parsing, with the provided company market name and filters schema
+ * Search query spell correction
+
+ For any other need, the behaviour of the model is unpredictable; please utilize the [original model](https://huggingface.co/tiiuae/falcon-7b-instruct) or fine-tune your own.
+
+ ### Instruction format
+
+ ```markdown
+ ### System: Master in Query Analysis
+ ### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
+ #### Category: {your_company_category}
+ #### Schema: ```{filters_schema}```
+ #### Query: {query}
+ ### Response:
+ ```
+
+ The filters schema is a JSON-readable line in the following format (we highly recommend using it):
+ a list of filters (dict):
+ * Name - the name of the filter (better if meaningful).
+ * Representations - a list of possible filter formats (dict):
+   * Name - the name of the representation (better if meaningful).
+   * Type - a Python base type (int, float, str, bool).
+   * Examples - a list of examples.
+   * Enum - if a representation is an enumeration, provide the list of possible values; the LLM should map the parsed value onto this list.
+   * Pattern - if a representation is pattern-like (datetime, regexp, etc.), provide the pattern text in any format.
+
+ Example:
+ ```json
+ [{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]
+ ```
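Since the schema must fit on a single line inside the `#### Schema:` field, it is convenient to build it as plain Python structures and serialize it with `json.dumps`. A minimal sketch, using an illustrative one-filter schema that follows the field descriptions above (the filter and representation names here are examples, not required values):

```python
import json

# Illustrative schema: one filter with two representations, following the
# structure described above (Name, Representations, Type, Examples, Enum).
filters_schema = [
    {
        "Name": "Customer_Ratings",
        "Representations": [
            {"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0]},
            {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]},
        ],
    }
]

# Serialize to a single JSON-readable line for the "#### Schema:" field
schema_line = json.dumps(filters_schema)
print(schema_line)
```

`json.dumps` without `indent` always produces a single line, which is what the prompt format expects.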
+
+ As a result, the response will be a JSON-readable line in the following format:
+ ```json
+ [{"Value": "Corrected search phrase", "Name": "Correct"}, {"Name": "filter-name.representation", "Value": "some-value"}]
+ ```
+
+ Field and representation names will be aligned with the provided schema. Example:
+ ```json
+ [{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
+ ```
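A response in this format is easy to post-process into the corrected query plus a mapping of parsed filters. A small sketch, assuming the response format above (`response_to_filters` is our illustrative helper name, not part of the model or EmbeddingStudio's API):

```python
import json


def response_to_filters(response_line: str):
    """Split the model's JSON response into the corrected query
    and a {filter-name.representation: value} mapping."""
    items = json.loads(response_line)
    corrected = None
    filters = {}
    for item in items:
        if item["Name"] == "Correct":
            corrected = item["Value"]
        else:
            filters[item["Name"]] = item["Value"]
    return corrected, filters


response = '[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]'
corrected, filters = response_to_filters(response)
print(corrected)  # the spell-corrected query
print(filters)    # parsed filter values keyed by "filter-name.representation"
```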
+
+ The `system` phrases used for fine-tuning:
+ ```python
+ [
+     "Expert at Deconstructing Search Queries",
+     "Master in Query Analysis",
+     "Premier Search Query Interpreter",
+     "Advanced Search Query Decoder",
+     "Search Query Parsing Genius",
+     "Search Query Parsing Wizard",
+     "Unrivaled Query Parsing Mechanism",
+     "Search Query Parsing Virtuoso",
+     "Query Parsing Maestro",
+     "Ace of Search Query Structuring"
+ ]
+ ```
+
+ The `instruction` phrases used for fine-tuning:
+ ```python
+ [
+     "Convert queries to JSON, align with schema, ensure correct spelling.",
+     "Analyze and structure queries in JSON, maintain schema, check spelling.",
+     "Organize queries in JSON, adhere to schema, verify spelling.",
+     "Decode queries to JSON, follow schema, correct spelling.",
+     "Parse queries to JSON, match schema, spell correctly.",
+     "Transform queries to structured JSON, align with schema and spelling.",
+     "Restructure queries in JSON, comply with schema, accurate spelling.",
+     "Rearrange queries in JSON, strict schema adherence, maintain spelling.",
+     "Harmonize queries with JSON schema, ensure spelling accuracy.",
+     "Efficient JSON conversion of queries, schema compliance, correct spelling."
+ ]
+ ```
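One plausible way such phrase lists are used is to sample one system and one instruction phrase per training example and substitute them into the instruction format above. A hedged sketch (`build_prompt` and the shortened phrase lists are illustrative, not EmbeddingStudio's actual training code):

```python
import random

# Shortened copies of the phrase lists above, for illustration
SYSTEM_PHRASES = [
    "Expert at Deconstructing Search Queries",
    "Master in Query Analysis",
]
INSTRUCTION_PHRASES = [
    "Organize queries in JSON, adhere to schema, verify spelling.",
    "Parse queries to JSON, match schema, spell correctly.",
]


def build_prompt(category: str, schema_line: str, query: str, rng: random.Random) -> str:
    # Sample one system and one instruction phrase for this example
    system = rng.choice(SYSTEM_PHRASES)
    instruction = rng.choice(INSTRUCTION_PHRASES)
    return (
        f"### System: {system}\n"
        f"### Instruction: {instruction}\n"
        f"#### Category: {category}\n"
        f"#### Schema: ```{schema_line}```\n"
        f"#### Query: {query}\n"
        "### Response:\n"
    )


prompt = build_prompt("Logistics", '[{"Name": "Date"}]', "deliveries in june", random.Random(0))
print(prompt)
```

Varying the phrasing this way plausibly reduces overfitting to any single system/instruction wording.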
 
  ### Direct Use
 
+ ```python
+ import json
+ from json import JSONDecodeError
+
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ INSTRUCTION_TEMPLATE = """
+ ### System: Master in Query Analysis
+ ### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
+ #### Category: {0}
+ #### Schema: ```{1}```
+ #### Query: {2}
+ ### Response:
+ """
+
+
+ def parse(
+     query: str,
+     company_category: str,
+     filter_schema: dict,
+     model: AutoModelForCausalLM,
+     tokenizer: AutoTokenizer
+ ):
+     input_text = INSTRUCTION_TEMPLATE.format(
+         company_category,
+         json.dumps(filter_schema),
+         query
+     )
+     input_ids = tokenizer.encode(input_text, return_tensors='pt')
+
+     # Generate the response (the model is assumed to be on a CUDA device)
+     output = model.generate(
+         input_ids.to('cuda'),
+         max_new_tokens=1024,
+         do_sample=True,
+         temperature=0.05,
+         pad_token_id=tokenizer.eos_token_id  # pad with the EOS token
+     )
+     try:
+         # Keep only the text generated after the "### Response:" marker
+         decoded = tokenizer.decode(output[0], skip_special_tokens=True)
+         parsed = json.loads(decoded.split('### Response:')[-1])
+     except JSONDecodeError:
+         parsed = dict()
+
+     return parsed
+ ```
 
  ## Bias, Risks, and Limitations
 
313
  [More Information Needed]
314
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
315
  [More Information Needed]
 
  ## More Information [optional]
requirements.txt CHANGED
@@ -1,9 +1,11 @@
 
  datasets==2.16.1
  nltk==3.8.1
  huggingface-hub==0.19.4
  torch==2.0.0+cu117
  torchmetrics==1.2.0
  torchsummary==1.5.1
  torchtext==0.15.0+cpu
  transformers==4.36.2
- trl==0.7.7
 
+ bitsandbytes==0.41.0
  datasets==2.16.1
  nltk==3.8.1
  huggingface-hub==0.19.4
+ peft==0.5.0
  torch==2.0.0+cu117
  torchmetrics==1.2.0
  torchsummary==1.5.1
  torchtext==0.15.0+cpu
  transformers==4.36.2
+ trl==0.7.7