Cross-store matching CatBoost classifier
Binary classifier for product variant matching. The train, validation, and holdout splits come from the Hugging Face dataset listed under Config.
Thresholds
Holdout: "20260214_consideration_50k"
Recall at Precision=0.90: 0.319810 (threshold=0.7523473006041936)
Recall at Precision=0.95: 0.200178 (threshold=0.891946742060409)
Recall at Precision=0.96: 0.170566 (threshold=0.9176396207992682)
Recall at Precision=0.97: 0.143915 (threshold=0.935318100138531)
Recall at Precision=0.98: 0.087060 (threshold=0.9687166096895236)
Recall at Precision=0.99: 0.058632 (threshold=0.9820673917303737)
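As a minimal sketch of how these thresholds could be applied at inference time, the snippet below maps a target precision to its probability cutoff and turns a predicted match probability into a decision. The names `PRECISION_TO_THRESHOLD` and `is_match` are hypothetical, and treating a score equal to the cutoff as a match (`>=`) is an assumption; the card does not say which comparison was used.

```python
# Cutoffs copied from the holdout list above ("20260214_consideration_50k").
PRECISION_TO_THRESHOLD = {
    0.90: 0.7523473006041936,
    0.95: 0.891946742060409,
    0.96: 0.9176396207992682,
    0.97: 0.935318100138531,
    0.98: 0.9687166096895236,
    0.99: 0.9820673917303737,
}


def is_match(proba: float, target_precision: float = 0.95) -> bool:
    """Return True if the predicted match probability clears the cutoff.

    Assumption: a score exactly equal to the cutoff counts as a match.
    """
    return proba >= PRECISION_TO_THRESHOLD[target_precision]


if __name__ == "__main__":
    # A pair scored 0.90 passes the 0.95-precision cutoff but not the 0.96 one.
    print(is_match(0.90, 0.95), is_match(0.90, 0.96))
```

A stricter target precision trades recall for fewer false matches, so the right key depends on how costly a wrong match is downstream.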
Config
- HF_DATASET_ID: olegakimovmle/cross-store-matching-variant-catboost-features
- VERSION (dataset revision): v8_0_0_25022026
- MODEL_VERSION: v8_0_0_25022026
- EXPERIMENT_NAME: consider_300k_all_feats
TRAIN_VALID_SAMPLE_SOURCES
[
"20260206_consideration_100k",
"20260212_consideration_50k",
"20260215_consideration_50k",
"20260217_consideration_50k",
"20260219_consideration_50k"
]
HOLDOUT_SAMPLE_SOURCES
[
"20260209_consideration_10k",
"20260203_search_10k",
"20260214_consideration_50k"
]
FEATURE_COLS (43)
[
"old_phash_hamming_distance",
"old_unique_terms_count",
"old_common_terms_count",
"old_bm25_distance",
"old_product_vendor_bm25distance",
"old_same_shop",
"old_are_categories_equal",
"old_max_common_category_level",
"old_min_category_precision",
"old_options_iou",
"old_cosine_similarity",
"old_has_different_gender",
"old_avg_price_difference",
"new_title_common_prefix_words",
"new_title_common_prefix_words_pct",
"new_title_common_suffix_words",
"new_title_common_suffix_words_pct",
"new_title_common_set_words",
"new_title_common_set_words_pct",
"new_title_common_prefix_letters",
"new_title_common_prefix_letters_pct",
"new_title_common_suffix_letters",
"new_title_common_suffix_letters_pct",
"new_url_common_prefix_words",
"new_url_common_prefix_words_pct",
"new_url_common_prefix_letters",
"new_url_common_prefix_letters_pct",
"new_desc_len_ratio",
"new_desc_len_diff",
"new_desc_common_word_count",
"new_desc_overlap_ratio_min",
"new_desc_overlap_ratio_max",
"new_desc_word_jaccard",
"new_desc_left_word_count",
"new_desc_right_word_count",
"new_desc_overlap_ratio_left",
"new_desc_overlap_ratio_right",
"new_same_phash",
"new_same_product_type",
"new_same_handle",
"new_product_age_days_diff",
"new_avg_price_ratio",
"new_same_predicted_category"
]
CATBOOST_PARAMS
{
"iterations": 3000,
"learning_rate": 0.05,
"depth": 10,
"loss_function": "Logloss",
"eval_metric": "PRAUC",
"random_seed": 42,
"verbose": 100,
"early_stopping_rounds": 100,
"min_data_in_leaf": 50
}
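The parameters above can be passed straight to CatBoost. The following is a sketch under stated assumptions, not the training script behind this card: `build_model` and `fit_with_early_stopping` are hypothetical helpers, and the caller is assumed to supply train and validation feature matrices and labels. Note that `early_stopping_rounds` only takes effect when an `eval_set` is given to `fit`.

```python
# Parameters copied from the CATBOOST_PARAMS block above.
CATBOOST_PARAMS = {
    "iterations": 3000,
    "learning_rate": 0.05,
    "depth": 10,
    "loss_function": "Logloss",
    "eval_metric": "PRAUC",
    "random_seed": 42,
    "verbose": 100,
    "early_stopping_rounds": 100,
    "min_data_in_leaf": 50,
}


def build_model(params=None):
    """Create an unfitted classifier (hypothetical helper)."""
    # Imported lazily so the params can be inspected without catboost installed.
    from catboost import CatBoostClassifier

    return CatBoostClassifier(**(params or CATBOOST_PARAMS))


def fit_with_early_stopping(model, X_train, y_train, X_valid, y_valid):
    """Fit with a validation set so early stopping on PRAUC can trigger."""
    model.fit(
        X_train,
        y_train,
        eval_set=(X_valid, y_valid),
        use_best_model=True,  # keep the iteration with the best eval metric
    )
    return model
```

With `iterations` set to 3000 and early stopping after 100 rounds without improvement, the fitted model may contain far fewer trees than the configured maximum.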
Feature importance (full)
| feature | importance |
|---|---|
| old_cosine_similarity | 17.846883 |
| old_bm25_distance | 6.120002 |
| old_product_vendor_bm25distance | 5.968252 |
| new_product_age_days_diff | 5.397785 |
| old_unique_terms_count | 4.823864 |
| old_options_iou | 3.758595 |
| old_phash_hamming_distance | 3.608113 |
| new_desc_overlap_ratio_min | 3.535647 |
| old_avg_price_difference | 3.333098 |
| new_desc_len_diff | 3.206305 |
| new_avg_price_ratio | 3.181133 |
| new_desc_right_word_count | 2.997256 |
| new_desc_common_word_count | 2.893128 |
| new_desc_len_ratio | 2.713265 |
| new_title_common_set_words_pct | 2.698235 |
| new_desc_left_word_count | 2.620198 |
| old_min_category_precision | 2.277187 |
| new_desc_word_jaccard | 2.133502 |
| new_title_common_prefix_letters_pct | 1.708667 |
| old_same_shop | 1.632605 |
| new_title_common_set_words | 1.569174 |
| new_desc_overlap_ratio_max | 1.526644 |
| old_max_common_category_level | 1.518747 |
| old_common_terms_count | 1.445524 |
| new_desc_overlap_ratio_right | 1.401540 |
| new_desc_overlap_ratio_left | 1.246630 |
| new_title_common_prefix_words_pct | 1.226824 |
| new_title_common_prefix_letters | 1.174842 |
| new_url_common_prefix_letters_pct | 1.040912 |
| new_title_common_suffix_letters_pct | 0.833897 |
| new_url_common_prefix_letters | 0.827906 |
| new_url_common_prefix_words_pct | 0.669128 |
| new_same_product_type | 0.658220 |
| new_title_common_suffix_words_pct | 0.584922 |
| new_title_common_suffix_letters | 0.515657 |
| new_title_common_prefix_words | 0.346420 |
| new_title_common_suffix_words | 0.222175 |
| old_are_categories_equal | 0.215149 |
| new_url_common_prefix_words | 0.214980 |
| new_same_predicted_category | 0.160420 |
| new_same_phash | 0.081744 |
| old_has_different_gender | 0.045671 |
| new_same_handle | 0.019155 |
holdout_results["20260214_consideration_50k"]["precision_thrs"]
| precision_threshold | precision_actual | recall | proba_threshold | above_threshold | pct_above_threshold |
|---|---|---|---|---|---|
| 0.90 | 0.900000 | 0.319810 | 0.752347 | 1200.0 | 2.726839 |
| 0.95 | 0.950774 | 0.200178 | 0.891947 | 711.0 | 1.615652 |
| 0.96 | 0.960000 | 0.170566 | 0.917640 | 600.0 | 1.363419 |
| 0.97 | 0.970060 | 0.143915 | 0.935318 | 501.0 | 1.138455 |
| 0.98 | 0.980000 | 0.087060 | 0.968717 | 300.0 | 0.681710 |
| 0.99 | 0.990000 | 0.058632 | 0.982067 | 200.0 | 0.454473 |
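The columns of the precision table are tied together by simple arithmetic, which the sketch below makes explicit. It assumes the usual definitions: precision is true positives divided by pairs above the threshold, recall is true positives divided by all positive pairs, and `pct_above_threshold` is a percentage of all scored pairs. The helper `implied_counts` is hypothetical, and the counts it returns are back-calculated from rounded table values, so treat them as approximate.

```python
# (precision_actual, recall, above_threshold, pct_above_threshold) per table row.
ROWS = [
    (0.900000, 0.319810, 1200, 2.726839),
    (0.950774, 0.200178, 711, 1.615652),
    (0.960000, 0.170566, 600, 1.363419),
    (0.970060, 0.143915, 501, 1.138455),
    (0.980000, 0.087060, 300, 0.681710),
    (0.990000, 0.058632, 200, 0.454473),
]


def implied_counts(precision, recall, above, pct_above):
    """Back out true positives, total positives, and total pairs for one row."""
    true_pos = round(precision * above)         # precision = TP / above
    total_pos = round(true_pos / recall)        # recall = TP / positives
    total_pairs = round(above / (pct_above / 100.0))
    return true_pos, total_pos, total_pairs


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], implied_counts(*row))
```

Every row implies the same holdout size and the same number of positive pairs, which is a quick consistency check on the reported figures.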
holdout_results["20260214_consideration_50k"]["recall_thrs"]
| recall_threshold | recall_actual | precision | proba_threshold | above_threshold | pct_above_threshold |
|---|---|---|---|---|---|
| 0.90 | 0.900800 | 0.298323 | 0.045783 | 10197.0 | 23.171314 |
| 0.95 | 0.950252 | 0.213976 | 0.020878 | 14997.0 | 34.078669 |
| 0.96 | 0.960024 | 0.193645 | 0.016674 | 16742.0 | 38.043948 |
| 0.97 | 0.970388 | 0.172574 | 0.012527 | 18989.0 | 43.149953 |
| 0.98 | 0.980160 | 0.152093 | 0.009200 | 21763.0 | 49.453496 |
| 0.99 | 0.990228 | 0.130223 | 0.006002 | 25679.0 | 58.352080 |
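Tables like the two above can be produced by sweeping a cutoff over scored, labeled pairs. The function below is one common way to find the recall achievable at a target precision; it is an illustrative sketch, and the exact procedure used to generate this card's numbers is not documented here. The name `recall_at_precision` and the tie handling (a cutoff is only evaluated after all equal scores are included) are assumptions.

```python
from typing import Optional, Sequence, Tuple


def recall_at_precision(
    labels: Sequence[int],
    scores: Sequence[float],
    target_precision: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (proba_threshold, precision, recall) with the highest recall
    among cutoffs whose precision meets the target, or None if none does."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None

    # Rank pairs from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    best = None
    true_pos = 0
    for i, (score, label) in enumerate(ranked):
        true_pos += label
        # Skip until the last item of a group of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = true_pos / (i + 1)
        recall = true_pos / total_pos
        if precision >= target_precision and (best is None or recall > best[2]):
            best = (score, precision, recall)
    return best


if __name__ == "__main__":
    toy_labels = [1, 1, 0, 1, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    print(recall_at_precision(toy_labels, toy_scores, 0.75))
```

The recall-targeted table is the mirror image: fix a minimum recall and report the precision at the highest cutoff that still reaches it.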