chilly-magician committed on
Commit 64454d4
1 Parent(s): 78a1cf9

[up]: add testing section to the model card

README.md CHANGED
@@ -447,18 +447,86 @@ The preprocessing steps are not detailed in the provided code. Typically, prepro
447
448   #### Testing Data
449
450 - <!-- This should link to a Dataset Card if possible. -->
451
452 - [More Information Needed]
453
454 - #### Factors
455
456 - <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
457
458 - [More Information Needed]
459
460   #### Metrics
461
462   ##### Total metrics
463
464   | Category | Recall | Precision | F1 | Accuracy |
 
447
448   #### Testing Data
449
450 + All relevant information is provided in the [Training Data](#training-data) section.
451
452 + ### Factors Influencing Falcon-7B-Instruct Model Performance
453
454 + #### 1. Company Category and Domain Knowledge
455 + - Performance may vary with the specific company category or domain.
456 + - Performance is likely strongest in domains the model was specifically trained on, such as Educational Institutions, Banking Services, and Logistics.
457
458 + #### 2. Filter Schema Adaptability
459 + - Ability to adapt to a variety of filter schemas.
460 + - Performance in parsing and organizing queries according to different schemas.
461
462 + #### 3. Handling of Spelling and Syntax Errors
463 + - Robustness to spelling errors and syntax variations in queries.
464 +
465 + #### 4. Representation and Type Handling
466 + - Capability to handle different data representations (e.g., date formats, enumerations, patterns).
467 + - Accurate processing of the base types (int, float, str, bool).
468 +
469 + #### 5. Length and Complexity of Queries
470 + - Query length and complexity affect performance.
471 + - The maximum sequence length of 1024 can be limiting for longer or more complex queries.
472 +
473 + #### 6. Bias and Ethical Considerations
474 + - Ethical biases inherited from the original model.
475 + - These biases should be understood in the context of each use case.
476 +
477 + #### 7. Limitations in Fine-Tuning and Data Curation
478 + - Known limitations such as extra spaces and the handling of abbreviations.
479 + - The extent of training data curation influences model accuracy.
480 +
481 + #### 8. Specific Use Cases
482 + - Recommended primarily for zero-shot search query parsing and search query spell correction.
483 + - Performance on other use cases may be unpredictable.
484 +
485 + #### 9. Training Data Quality and Diversity
486 + - Quality and diversity of the synthetic training data.
487 + - Both influence the model's effectiveness across different scenarios.
488 +
489 +
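The adaptability factors above concern the shape of filter schemas and value representations. As a purely illustrative sketch (the field names and layout here are hypothetical, not this model's actual schema format), a filter schema combining base types with representations might look like:

```python
# Hypothetical filter schema sketch -- field names and layout are assumptions,
# chosen only to illustrate base types (int, float, str, bool) alongside
# representations such as date patterns and enumerations.
FILTER_SCHEMA = [
    {"Name": "Company", "Type": "str"},
    {"Name": "Founded", "Type": "str", "Representation": {"Pattern": "%Y-%m-%d"}},
    {"Name": "EmployeeCount", "Type": "int"},
    {"Name": "IsPublic", "Type": "bool"},
    {"Name": "Sector", "Type": "str",
     "Enumeration": ["Educational Institutions", "Banking Services", "Logistics"]},
]

# Collect the base types used by this schema.
base_types = {f["Type"] for f in FILTER_SCHEMA}
print(sorted(base_types))  # ['bool', 'int', 'str']
```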
490 + ##### Testing Procedure
491 +
492 + The results of the testing procedure are provided as JSON [here](https://huggingface.co/EmbeddingStudio/query-parser-falcon-7b-instruct/blob/main/falcon-7b-instruct-test.json).
493 +
494 + The file is a list of items, where each item contains:
495 + 1. The predicted parsed query
496 + 2. The real (ground-truth) parsed query
497 + 3. The category
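The per-item layout above can be read with a short script. This is a minimal sketch: treating each item as a three-element list is an assumption based on the description; adjust the unpacking if the items are objects instead.

```python
import json
from collections import Counter

# Sketch for inspecting falcon-7b-instruct-test.json. Each item is assumed
# to hold (predicted parsed query, real parsed query, category), in that order.

def count_by_category(items):
    """Count test items per company category."""
    return Counter(category for _predicted, _real, category in items)

# Usage with an inline sample instead of the downloaded file:
sample = json.loads(
    '[["parsed-a", "parsed-a", "Banking Services"],'
    ' ["parsed-b", "parsed-c", "Logistics"],'
    ' ["parsed-d", "parsed-d", "Logistics"]]'
)
print(count_by_category(sample))  # Counter({'Logistics': 2, 'Banking Services': 1})
```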
498
499   #### Metrics
500
501 + #### Metric Overview
502 +
503 + Our zero-shot search query parsing model is designed to extract structured information from unstructured search queries with high precision. The primary metric for evaluating the model's performance is the True Positive (TP) rate, assessed using a specialized token-wise Levenshtein distance. This approach reflects our goal of semantic accuracy in parsing user queries.
504 +
505 + #### True Positives (TP)
506 +
507 + - **Definition**: A True Positive is counted when the model correctly identifies both the 'Name' and the 'Value' in a query, matching the expected result.
508 + - **Measurement Method**: The TP rate is quantified with the `levenshtein_tokenwise` function, which computes the distance between predicted and actual key-value pairs at the token level. A Levenshtein distance of 0.25 or less counts as a match.
509 + - **Importance**:
510 +   - **Token-Level Accuracy**: We prefer token-wise accuracy over traditional character-level Levenshtein distance, which can be overly strict, especially for minor spelling variations. The token-wise approach prioritizes semantic accuracy.
511 +   - **Relevance to Search Queries**: Accuracy at the token level better reflects the model's ability to understand and parse user intent in search queries.
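The matching rule above can be sketched as follows. This is a plausible reconstruction, not the repository's actual `levenshtein_tokenwise` implementation: it assumes whitespace tokenization and an edit distance normalized by the longer token count, with the 0.25 threshold taken from the text.

```python
# Plausible sketch of token-wise Levenshtein matching -- the repository's
# real `levenshtein_tokenwise` may tokenize or normalize differently.

def levenshtein(a, b):
    """Classic edit distance, applied here to sequences of tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def levenshtein_tokenwise(predicted, actual):
    """Normalized token-level distance between two values."""
    p, a = predicted.lower().split(), actual.lower().split()
    if not p and not a:
        return 0.0
    return levenshtein(p, a) / max(len(p), len(a))

def is_true_positive(predicted, actual, threshold=0.25):
    # A pair counts as a match when the normalized distance is <= 0.25.
    return levenshtein_tokenwise(predicted, actual) <= threshold

print(is_true_positive("bank of america", "Bank of America"))  # True
print(is_true_positive("bank of america", "wells fargo"))      # False
```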
512 +
513 + #### Generation Strategy
514 +
515 + - **Approach**: The model generates responses with a maximum token length of 1000, a sampling strategy (`do_sample=True`), and a low temperature of 0.05. This tightly controlled randomness yields varied yet accurate and relevant responses.
516 + - **Impact on TP**:
517 +   - The low temperature setting directly influences the TP rate by reducing randomness in the model's predictions. With a lower temperature, the model is more likely to choose the most probable token in a given context, producing more accurate and consistent outputs. This is particularly important in search query parsing, where user input must be interpreted with high precision.
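Under the `transformers` generation API, the settings above correspond to keyword arguments like the following sketch. The `parse_query` helper is illustrative (model loading and the prompt template are omitted/assumed), and mapping "maximum token length" to `max_new_tokens` is an assumption.

```python
# Generation settings mirroring the testing procedure described above.
GENERATION_KWARGS = dict(
    max_new_tokens=1000,  # "maximum token length set to 1000" (assumed mapping)
    do_sample=True,       # sampling strategy
    temperature=0.05,     # low temperature for near-deterministic output
)

def parse_query(model, tokenizer, query):
    """Generate a parsed query with the settings above (model loading omitted).

    `model`/`tokenizer` are assumed to be a loaded transformers causal LM
    and its tokenizer; the raw query as prompt is a placeholder assumption.
    """
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```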
518 +
519 + #### Additional Metrics
520 +
521 + - **False Positives (FP) and False Negatives (FN)**: Monitored to give a comprehensive view of the model's predictive behavior.
522 + - **Precision, Recall, F1 Score, Accuracy**: These standard metrics complement the TP-focused assessment, rounding out the picture of the model's performance.
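For reference, the standard metrics can be computed from TP/FP/FN/TN counts as in this sketch (the repository's own metric code is not shown in this diff, so these exact formulas are an assumption, albeit the conventional definitions):

```python
# Conventional definitions of the metrics listed above, from raw counts.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(round(f1(tp=80, fp=10, fn=10), 3))  # 0.889
```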
523 +
524 + #### Motivation for Metric Choice
525 +
526 + - **Alignment with User Intent**: Focusing on token-wise accuracy keeps the evaluation close to the structure and intent typical of search queries.
527 + - **Robustness Against Query Variations**: This approach makes the evaluation robust to the varied formulations of real-world search queries.
528 + - **Balancing Precision and Recall**: The method balances not missing relevant key-value pairs (high recall) against not over-identifying irrelevant ones (high precision).
529 +
530   ##### Total metrics
531
532   | Category | Recall | Precision | F1 | Accuracy |
calculate_metrics.py DELETED
falcon-7b-instruct-test.json ADDED
The diff for this file is too large to render.
test_query_parser.py DELETED