EmbeddingStudio
/

query-parser-falcon-7b-instruct

@@ -3,57 +3,201 @@ library_name: peft
 base_model: tiiuae/falcon-7b-instruct
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
 ## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Bias, Risks, and Limitations
@@ -168,22 +312,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 [More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 [More Information Needed]
 ## More Information [optional]

 base_model: tiiuae/falcon-7b-instruct
 ---
+# Model Card for the Query Parser LLM using Falcon-7B-Instruct
+[![version](https://img.shields.io/badge/version-0.0.1-red.svg)]()
+[![Python 3.9](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
+![CUDA 11.7.1](https://img.shields.io/badge/CUDA-11.7.1-green.svg)
+EmbeddingStudio is an [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that lets you turn a joint "Embedding Model + Vector DB" into
+a full-cycle search engine out of the box: collect clickstream -> improve search experience -> adapt the embedding model, and repeat.
+Companies rarely use unstructured search as is. A user searching for `brick red houses san francisco area for april`
+most likely wants houses in San Francisco for a month-long rental in April, and only then, perhaps, brick-red ones.
+Unfortunately, as of 15 January 2024 no embedding model is accurate enough to capture this, so companies need to mix structured and unstructured search.
+The very first step in mixing them is to parse the search query. The usual approaches are:
+* Implement a bunch of rules, regexps, or grammar parsers (like [NLTK grammar parser](https://www.nltk.org/howto/grammar.html)).
+* Collect search queries and annotate a dataset for the NER task.
+Both take some time, but in the end you get a controllable and very accurate query parser.
+The EmbeddingStudio team decided to dive into LLM instruct fine-tuning for the `Zero-Shot query parsing` task
+to close the initial gap while a company has no rules and no collected data yet, and perhaps, in the future, to eliminate the exhausting work of implementing rules altogether.
+The main idea is to align an LLM to parse short search queries knowing only the company's market and a schema of search filters. Moreover, being oriented toward applied NLP,
+we are trying to serve only lightweight LLMs, a.k.a. `not heavier than 7B parameters`.
+## Model Details
+### Model Description
+This is simply [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) aligned to follow instructions like:
+```markdown
+### System: Master in Query Analysis
+### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
+#### Category: Logistics and Supply Chain Management
+#### Schema: ```[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]```
+#### Query: Which logistics companies in the US have a perfect 5.0 rating ?
+### Response:
+[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
+```
+**Important:** We are also trying to fine-tune the Large Language Model (LLM) not only to parse unstructured search queries but also to correct spelling.
+- **Developed by EmbeddingStudio team:**
+  * Aleksandr Iudaev | [LinkedIn](https://www.linkedin.com/in/alexanderyudaev/) | [Email](mailto:alexander@yudaev.ru) |
+  * Andrei Kostin | [LinkedIn](https://www.linkedin.com/in/andrey-kostin/) | [Email](mailto:andreynitsok@gmail.com) |
+  * ML Doom | `AI Assistant`
+- **Funded by EmbeddingStudio team**
+- **Model type:** Instruct Fine-Tuned Large Language Model
+- **Model task:** Zero-shot search query parsing
+- **Language(s) (NLP):** English
+- **License:** apache-2.0
+- **Finetuned from model:** [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
+- **Maximum sequence length:** we used 1024 for fine-tuning, which differs from the original model's `max_seq_length = 2048`
+- **Tuning Epochs:** 3 for now; more are planned.
+**Disclaimer:** As a small startup, we treat this direction as part of our Minimum Viable Product (MVP). It's more of
+an attempt to test 'product-market fit' than a well-structured scientific endeavor. Once we validate it and go ahead with a round, we will definitely:
+* Curate a specific dataset for more precise analysis.
+* Explore various approaches and Large Language Models (LLMs) to identify the most effective solution.
+* Publish a detailed paper so that our findings and methodologies can be thoroughly reviewed and verified.
+We acknowledge the complexity involved in utilizing Large Language Models, particularly in the context
+of `Zero-Shot search query parsing` and `AI Alignment`. Given the intricate nature of this technology, we emphasize the importance of rigorous verification.
+Until our work is thoroughly reviewed, we recommend being cautious and critical of the results.
+### Model Sources
+- **Repository:** inference code for the model will be available [here](https://github.com/EulerSearch/embedding_studio/tree/main)
+- **Paper:** Work In Progress
+- **Demo:** Work In Progress
 ## Uses
+We strongly recommend only the direct usage of this fine-tuned version of [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct):
+* Zero-shot Search Query Parsing with a provided company market name and filters schema
+* Search Query Spell Correction
+For any other needs the behaviour of the model is unpredictable; please use the [original model](https://huggingface.co/tiiuae/falcon-7b-instruct) or fine-tune your own.
+### Instruction format
+```markdown
+### System: Master in Query Analysis
+### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
+#### Category: {your_company_category}
+#### Schema: ```{filters_schema}```
+#### Query: {query}
+### Response:
+```
+The filters schema is a single JSON-readable line in the following format (we highly recommend using it):
+A list of filters (dicts), each with:
+* Name - name of filter (better to be meaningful).
+* Representations - list of possible filter formats (dict):
+  * Name - name of representation (better to be meaningful).
+  * Type - python base type (int, float, str, bool).
+  * Examples - list of examples.
+  * Enum - if a representation is an enumeration, provide a list of possible values; the LLM should map the parsed value into this list.
+  * Pattern - if a representation is pattern-like (datetime, regexp, etc.), provide the pattern text in any format.
+Example:
+```json
+[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]
+```
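A minimal sketch of how such a schema could be assembled in Python and serialized into the single JSON line the prompt expects. The helpers `make_representation` and `make_filter` are hypothetical names introduced only for illustration; they are not part of EmbeddingStudio.

```python
import json
from typing import Any, Dict, List, Optional

# Base types allowed by the schema description above
ALLOWED_TYPES = {"int", "float", "str", "bool"}


def make_representation(
    name: str,
    type_name: str,
    examples: List[Any],
    enum: Optional[List[Any]] = None,
    pattern: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one representation entry (hypothetical helper)."""
    if type_name not in ALLOWED_TYPES:
        raise ValueError(f"Unsupported type: {type_name}")
    representation: Dict[str, Any] = {
        "Name": name,
        "Type": type_name,
        "Examples": examples,
    }
    # Enum and Pattern are optional, so they are added only when provided
    if enum is not None:
        representation["Enum"] = enum
    if pattern is not None:
        representation["Pattern"] = pattern
    return representation


def make_filter(name: str, representations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build one filter entry holding its representations (hypothetical helper)."""
    return {"Name": name, "Representations": representations}


filters_schema = [
    make_filter("Customer_Ratings", [
        make_representation("Star_Rating", "int", [4, 3, 5], enum=[1, 2, 3, 4, 5]),
    ]),
    make_filter("Date", [
        make_representation("Day_Month_Year", "str", ["01.01.2024"], pattern="dd.mm.YYYY"),
    ]),
]

# json.dumps without indent produces a single line, as the prompt format expects
schema_line = json.dumps(filters_schema)
print(schema_line)
```

Building the schema from small helpers keeps the optional `Enum` and `Pattern` keys out of entries that do not need them, which keeps the prompt shorter.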
+As a result, the response will be a JSON-readable line in the format:
+```json
+[{"Value": "Corrected search phrase", "Name": "Correct"}, {"Name": "filter-name.representation", "Value": "some-value"}]
+```
+Field and representation names will be aligned with the provided schema. Example:
+```json
+[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
+```
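The response line above can be split into the corrected query and a nested mapping of filter values. This is only a sketch: `split_response` is a hypothetical helper, and it assumes that the `Correct` entry carries the corrected query and that filter names contain no dot other than the one separating filter and representation.

```python
import json
from typing import Any, Dict, Tuple


def split_response(response_line: str) -> Tuple[str, Dict[str, Dict[str, Any]]]:
    """Split a response line into (corrected_query, {filter: {representation: value}})."""
    corrected = ""
    filters: Dict[str, Dict[str, Any]] = {}
    for item in json.loads(response_line):
        if item["Name"] == "Correct":
            # The "Correct" entry holds the spell-corrected query
            corrected = item["Value"]
            continue
        # "Filter.Representation" -> ("Filter", "Representation")
        filter_name, _, representation = item["Name"].partition(".")
        filters.setdefault(filter_name, {})[representation] = item["Value"]
    return corrected, filters


example = (
    '[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, '
    '{"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, '
    '{"Name": "Destination_Country.Country_Code", "Value": "US"}]'
)
corrected_query, filter_values = split_response(example)
print(corrected_query)
print(filter_values)
```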
+`system` phrases used for fine-tuning:
+```python
+[
+    "Expert at Deconstructing Search Queries",
+    "Master in Query Analysis",
+    "Premier Search Query Interpreter",
+    "Advanced Search Query Decoder",
+    "Search Query Parsing Genius",
+    "Search Query Parsing Wizard",
+    "Unrivaled Query Parsing Mechanism",
+    "Search Query Parsing Virtuoso",
+    "Query Parsing Maestro",
+    "Ace of Search Query Structuring"
+]
+```
+`instruction` phrases used for fine-tuning:
+```python
+[
+    "Convert queries to JSON, align with schema, ensure correct spelling.",
+    "Analyze and structure queries in JSON, maintain schema, check spelling.",
+    "Organize queries in JSON, adhere to schema, verify spelling.",
+    "Decode queries to JSON, follow schema, correct spelling.",
+    "Parse queries to JSON, match schema, spell correctly.",
+    "Transform queries to structured JSON, align with schema and spelling.",
+    "Restructure queries in JSON, comply with schema, accurate spelling.",
+    "Rearrange queries in JSON, strict schema adherence, maintain spelling.",
+    "Harmonize queries with JSON schema, ensure spelling accuracy.",
+    "Efficient JSON conversion of queries, schema compliance, correct spelling."
+]
+```
 ### Direct Use
+```python
+import json
+from json import JSONDecodeError
+
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+INSTRUCTION_TEMPLATE = """
+### System: Master in Query Analysis
+### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
+#### Category: {0}
+#### Schema: ```{1}```
+#### Query: {2}
+### Response:
+"""
+
+
+def parse(
+        query: str,
+        company_category: str,
+        filter_schema: list,
+        model: AutoModelForCausalLM,
+        tokenizer: AutoTokenizer
+) -> list:
+    # Fill the prompt; the schema is serialized into a single JSON line
+    input_text = INSTRUCTION_TEMPLATE.format(
+        company_category,
+        json.dumps(filter_schema),
+        query
+    )
+    input_ids = tokenizer.encode(input_text, return_tensors='pt')
+
+    # Generating text (a very low temperature keeps sampling close to greedy)
+    output = model.generate(
+        input_ids.to(model.device),
+        max_new_tokens=1024,
+        do_sample=True,
+        temperature=0.05,
+        # Reuse the EOS token for padding instead of a hard-coded id
+        pad_token_id=tokenizer.eos_token_id
+    )
+    # Keep only the text generated after the response marker
+    response = tokenizer.decode(output[0], skip_special_tokens=True).split('## Response:\n')[-1]
+    try:
+        parsed = json.loads(response)
+    except JSONDecodeError:
+        # The model returned malformed JSON: fall back to an empty result
+        parsed = []
+    return parsed
+```
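Since the model is not guaranteed to follow the schema, it may help to check the output of `parse` before turning it into search filters. The sketch below is one possible check, not part of the model or of EmbeddingStudio: `validate_parsed` is a hypothetical helper that drops items whose name is missing from the schema, whose value has the wrong base type, or whose value falls outside a declared `Enum`.

```python
from typing import Any, Dict, List

# Mapping from schema type names to Python types; ints are accepted for "float".
# Note: bool is a subclass of int in Python, so booleans also pass the numeric checks.
PYTHON_TYPES = {"int": int, "float": (int, float), "str": str, "bool": bool}


def validate_parsed(
    parsed: List[Dict[str, Any]],
    filter_schema: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Keep only the items that agree with the schema (hypothetical helper)."""
    # Index representations by their "Filter.Representation" name
    known = {
        f"{flt['Name']}.{rep['Name']}": rep
        for flt in filter_schema
        for rep in flt["Representations"]
    }
    valid = []
    for item in parsed:
        if not isinstance(item, dict):
            continue
        name = item.get("Name")
        if name == "Correct":
            # The corrected query is not a filter, so keep it as is
            valid.append(item)
            continue
        rep = known.get(name)
        if rep is None:
            continue  # name is not defined in the schema
        value = item.get("Value")
        expected = PYTHON_TYPES.get(rep["Type"])
        if expected is not None and not isinstance(value, expected):
            continue  # value has the wrong base type
        if "Enum" in rep and value not in rep["Enum"]:
            continue  # value is outside the allowed enumeration
        valid.append(item)
    return valid
```

Dropping invalid items rather than raising keeps the remaining filters usable; a stricter application could log or reject the whole response instead.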
 ## Bias, Risks, and Limitations
 [More Information Needed]
 [More Information Needed]
 ## More Information [optional]

requirements.txt CHANGED

@@ -1,9 +1,11 @@
 datasets==2.16.1
 nltk==3.8.1
 huggingface-hub==0.19.4
 torch==2.0.0+cu117
 torchmetrics==1.2.0
 torchsummary==1.5.1
 torchtext==0.15.0+cpu
 transformers==4.36.2
-trl==0.7.7

+bitsandbytes==0.41.0
 datasets==2.16.1
 nltk==3.8.1
 huggingface-hub==0.19.4
+peft==0.5.0
 torch==2.0.0+cu117
 torchmetrics==1.2.0
 torchsummary==1.5.1
 torchtext==0.15.0+cpu
 transformers==4.36.2
+trl==0.7.7