chilly-magician committed on
Commit 64454d4
1 Parent(s): 78a1cf9

[up]: add testing section to the model card

README.md CHANGED
@@ -447,18 +447,86 @@ The preprocessing steps are not detailed in the provided code. Typically, prepro
447
448   #### Testing Data
449
450 - <!-- This should link to a Dataset Card if possible. -->
451
452 - [More Information Needed]
453
454 - #### Factors
455
456 - <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
457
458 - [More Information Needed]
459
460   #### Metrics
461
462   ##### Total metrics
463
464   | Category | Recall | Precision | F1 | Accuracy |
 
447
448   #### Testing Data
449
450 + All relevant information is provided in the [Training Data](#training-data) section.
451
452 + ### Factors Influencing Falcon-7B-Instruct Model Performance
453
454 + #### 1. Company Category and Domain Knowledge
455 + - Performance may vary with the specific company category or domain.
456 + - Performance is likely strongest in domains the model was specifically trained on, such as Educational Institutions, Banking Services, and Logistics.
457
458 + #### 2. Filter Schema Adaptability
459 + - Ability to adapt to a variety of filter schemas.
460 + - Performance in parsing and organizing queries according to different schemas.
461
462 + #### 3. Handling of Spelling and Syntax Errors
463 + - Robustness to spelling errors and syntax variations in queries.
464 +
465 + #### 4. Representation and Type Handling
466 + - Capability to handle different data representations (e.g., date formats, enumerations, patterns).
467 + - Accurate processing of the base types (int, float, str, bool).
468 +
469 + #### 5. Length and Complexity of Queries
470 + - Query length and complexity affect performance.
471 + - The maximum sequence length of 1024 can be limiting for longer or more complex queries.
472 +
473 + #### 6. Bias and Ethical Considerations
474 + - Ethical biases inherited from the original model.
475 + - These biases should be understood in the context of each use case.
476 +
477 + #### 7. Limitations in Fine-Tuning and Data Curation
478 + - Known limitations such as extra spaces and the handling of abbreviations.
479 + - The extent of training data curation influences model accuracy.
480 +
481 + #### 8. Specific Use Cases
482 + - Recommended primarily for zero-shot search query parsing and search query spell correction.
483 + - Performance on other use cases may be unpredictable.
484 +
485 + #### 9. Training Data Quality and Diversity
486 + - Quality and diversity of the synthetic training data.
487 + - Both influence the model's effectiveness across different scenarios.
488 +
489 +
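The adaptability factors above concern the shape of filter schemas and value representations. As a purely illustrative sketch (the field names and layout here are hypothetical, not this model's actual schema format), a filter schema combining base types with representations might look like:

```python
# Hypothetical filter schema sketch -- field names and layout are assumptions,
# chosen only to illustrate base types (int, float, str, bool) alongside
# representations such as date patterns and enumerations.
FILTER_SCHEMA = [
    {"Name": "Company", "Type": "str"},
    {"Name": "Founded", "Type": "str", "Representation": {"Pattern": "%Y-%m-%d"}},
    {"Name": "EmployeeCount", "Type": "int"},
    {"Name": "IsPublic", "Type": "bool"},
    {"Name": "Sector", "Type": "str",
     "Enumeration": ["Educational Institutions", "Banking Services", "Logistics"]},
]

# Collect the base types used by this schema.
base_types = {f["Type"] for f in FILTER_SCHEMA}
print(sorted(base_types))  # ['bool', 'int', 'str']
```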
490 + ##### Testing Procedure
491 +
492 + The results of the testing procedure are provided as JSON [here](https://huggingface.co/EmbeddingStudio/query-parser-falcon-7b-instruct/blob/main/falcon-7b-instruct-test.json).
493 +
494 + The file is a list of items, where each item contains:
495 + 1. The predicted parsed query
496 + 2. The real (ground-truth) parsed query
497 + 3. The category
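The per-item layout above can be read with a short script. This is a minimal sketch: treating each item as a three-element list is an assumption based on the description; adjust the unpacking if the items are objects instead.

```python
import json
from collections import Counter

# Sketch for inspecting falcon-7b-instruct-test.json. Each item is assumed
# to hold (predicted parsed query, real parsed query, category), in that order.

def count_by_category(items):
    """Count test items per company category."""
    return Counter(category for _predicted, _real, category in items)

# Usage with an inline sample instead of the downloaded file:
sample = json.loads(
    '[["parsed-a", "parsed-a", "Banking Services"],'
    ' ["parsed-b", "parsed-c", "Logistics"],'
    ' ["parsed-d", "parsed-d", "Logistics"]]'
)
print(count_by_category(sample))  # Counter({'Logistics': 2, 'Banking Services': 1})
```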
498
499   #### Metrics
500
501 + #### Metric Overview
502 +
503 + Our zero-shot search query parsing model is designed to extract structured information from unstructured search queries with high precision. The primary metric for evaluating the model's performance is the True Positive (TP) rate, assessed using a specialized token-wise Levenshtein distance. This approach reflects our goal of semantic accuracy in parsing user queries.
504 +
505 + #### True Positives (TP)
506 +
507 + - **Definition**: A True Positive is counted when the model correctly identifies both the 'Name' and the 'Value' in a query, matching the expected result.
508 + - **Measurement Method**: The TP rate is quantified with the `levenshtein_tokenwise` function, which computes the distance between predicted and actual key-value pairs at the token level. A Levenshtein distance of 0.25 or less counts as a match.
509 + - **Importance**:
510 +   - **Token-Level Accuracy**: We prefer token-wise accuracy over traditional character-level Levenshtein distance, which can be overly strict, especially for minor spelling variations. The token-wise approach prioritizes semantic accuracy.
511 +   - **Relevance to Search Queries**: Accuracy at the token level better reflects the model's ability to understand and parse user intent in search queries.
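The matching rule above can be sketched as follows. This is a plausible reconstruction, not the repository's actual `levenshtein_tokenwise` implementation: it assumes whitespace tokenization and an edit distance normalized by the longer token count, with the 0.25 threshold taken from the text.

```python
# Plausible sketch of token-wise Levenshtein matching -- the repository's
# real `levenshtein_tokenwise` may tokenize or normalize differently.

def levenshtein(a, b):
    """Classic edit distance, applied here to sequences of tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def levenshtein_tokenwise(predicted, actual):
    """Normalized token-level distance between two values."""
    p, a = predicted.lower().split(), actual.lower().split()
    if not p and not a:
        return 0.0
    return levenshtein(p, a) / max(len(p), len(a))

def is_true_positive(predicted, actual, threshold=0.25):
    # A pair counts as a match when the normalized distance is <= 0.25.
    return levenshtein_tokenwise(predicted, actual) <= threshold

print(is_true_positive("bank of america", "Bank of America"))  # True
print(is_true_positive("bank of america", "wells fargo"))      # False
```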
512 +
513 + #### Generation Strategy
514 +
515 + - **Approach**: The model generates responses with a maximum token length of 1000, a sampling strategy (`do_sample=True`), and a low temperature of 0.05. This tightly controlled randomness yields varied yet accurate and relevant responses.
516 + - **Impact on TP**:
517 +   - The low temperature setting directly influences the TP rate by reducing randomness in the model's predictions. With a lower temperature, the model is more likely to choose the most probable token in a given context, producing more accurate and consistent outputs. This is particularly important in search query parsing, where user input must be interpreted with high precision.
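Under the `transformers` generation API, the settings above correspond to keyword arguments like the following sketch. The `parse_query` helper is illustrative (model loading and the prompt template are omitted/assumed), and mapping "maximum token length" to `max_new_tokens` is an assumption.

```python
# Generation settings mirroring the testing procedure described above.
GENERATION_KWARGS = dict(
    max_new_tokens=1000,  # "maximum token length set to 1000" (assumed mapping)
    do_sample=True,       # sampling strategy
    temperature=0.05,     # low temperature for near-deterministic output
)

def parse_query(model, tokenizer, query):
    """Generate a parsed query with the settings above (model loading omitted).

    `model`/`tokenizer` are assumed to be a loaded transformers causal LM
    and its tokenizer; the raw query as prompt is a placeholder assumption.
    """
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```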
518 +
519 + #### Additional Metrics
520 +
521 + - **False Positives (FP) and False Negatives (FN)**: Monitored to give a comprehensive view of the model's predictive behavior.
522 + - **Precision, Recall, F1 Score, Accuracy**: These standard metrics complement the TP-focused assessment, rounding out the picture of the model's performance.
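For reference, the standard metrics can be computed from TP/FP/FN/TN counts as in this sketch (the repository's own metric code is not shown in this diff, so these exact formulas are an assumption, albeit the conventional definitions):

```python
# Conventional definitions of the metrics listed above, from raw counts.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(round(f1(tp=80, fp=10, fn=10), 3))  # 0.889
```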
523 +
524 + #### Motivation for Metric Choice
525 +
526 + - **Alignment with User Intent**: Focusing on token-wise accuracy keeps the evaluation close to the structure and intent typical of search queries.
527 + - **Robustness Against Query Variations**: This approach makes the evaluation robust to the varied formulations of real-world search queries.
528 + - **Balancing Precision and Recall**: The method balances not missing relevant key-value pairs (high recall) against not over-identifying irrelevant ones (high precision).
529 +
530   ##### Total metrics
531
532   | Category | Recall | Precision | F1 | Accuracy |
calculate_metrics.py DELETED
falcon-7b-instruct-test.json ADDED
The diff for this file is too large to render.
test_query_parser.py DELETED