classification_intro_1 = '''
**Model performance**<br>
The first table shows the model performance metrics of the current selection, i.e. how the BERT or XGB classification model performs on the currently selected lead.
These performance metrics are calculated by comparing the 'employee_is_selected' column (true labels) to the 'cutoff_prediction' column (predicted labels). The model performance also depends
on the minimum relevance probability selected in the sidebar. Note that the models were not trained on the campaigns and accompanying leads shown in this dashboard;
these campaigns are therefore 'new' to the models. Model descriptions can be found at the bottom of the page.
'''
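
# Illustrative sketch (not part of the dashboard text): how the performance metrics described
# above could be computed from the table columns. The column names follow the dashboard tables;
# the dataframe `lead_df`, the helper name and the default cutoff are assumptions for this example.
def classification_metrics_sketch(lead_df, min_relevance_probability=0.5):
    """Binarise the predicted probabilities with the sidebar cutoff and compare to the true labels."""
    from sklearn.metrics import f1_score, precision_score, recall_score

    # Derive the 'cutoff_prediction' column from the predicted probabilities.
    y_pred = (lead_df['prediction_proba_1'] >= min_relevance_probability).astype(int)
    y_true = lead_df['employee_is_selected']

    return {
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
    }
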
classification_intro_2 = '''
**Top 3**<br>
The second table below shows the top 3 relevant employees. It's possible that the top 3 contains fewer than 3 employees, as the relevance probability needs to be above the minimum selected
in the sidebar. In the third table, all employees within the selected lead can be found, including the top 3.
'''
classification_intro_3 = '''
**Prediction/relevance probabilities**<br>
The predicted employee probability/relevance scores can be found in the 'prediction_proba_1' column of the tables. Again, how many of the employees show up in the top 3 table
depends on the minimum probability percentage selected in the sidebar.
'''
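
# Illustrative sketch (not part of the dashboard text): selecting the top 3 employees of a lead
# from the 'prediction_proba_1' column. The dataframe `lead_df` and the helper name are
# assumptions; the dashboard's own selection logic lives in the source code.
def top_3_sketch(lead_df, min_relevance_probability=0.5):
    """Keep employees above the sidebar minimum and return the (at most) 3 most relevant ones."""
    relevant = lead_df[lead_df['prediction_proba_1'] >= min_relevance_probability]
    return relevant.nlargest(3, 'prediction_proba_1')
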
models_test_classification_1 = '''
The model performance metrics below are derived from testing the models on a test set (this set does not contain the campaigns in this dashboard). Note that these metrics are calculated with a standard probability
cutoff value of 0.5. A higher cutoff value, like 0.8, tends to increase model performance for all models.
'''
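
# Illustrative sketch (not part of the dashboard text): recomputing the metrics above at two
# cutoff values (0.5 versus 0.8) to see the effect described in the text. Reuses the hypothetical
# classification_metrics_sketch helper defined earlier in this file; `test_df` is an assumption.
def cutoff_comparison_sketch(test_df):
    """Return the classification metrics for two cutoff values so their effect can be compared."""
    return {
        cutoff: classification_metrics_sketch(test_df, min_relevance_probability=cutoff)
        for cutoff in (0.5, 0.8)
    }
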
models_test_classification_2 = '''
The classification models are based on two frameworks. The first is XGBoost, an optimized distributed gradient boosting library. Multiple models were created and only the best performing model is
added to this dashboard. The model variations differ in, for example, how they deal with the unbalanced training data and in their hyperparameters. The XGB model added to this dashboard uses oversampling to deal
with the unbalanced data and grid search cross-validation to optimize the hyperparameters.
<br><br>
The other classification model uses a BERT transformer to vectorize and classify the text variables used in training. In this approach, the PyTorch library is used to fine-tune the pre-trained BERT
transformer on the training data. Again, multiple models were created and the best performing model is added to this dashboard. This BERT model uses a data imbalance ratio and optimizes for the F1 score.
For specifics regarding all models, see the source code.
<br><br>
Comparing the XGB and BERT models, BERT performs slightly better when looking at the weighted F1 score. However, the 'normal' F1 score (the harmonic mean of precision and recall) is the same for both models.
That being said, XGB is significantly more efficient. This efficiency difference is not noticeable when classifying small datasets like single leads, but becomes evident when classifying larger datasets
like full campaigns on the 'comparison' tab of this dashboard.
'''
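
# Illustrative sketch (not part of the dashboard text): the general shape of the XGBoost approach
# described above (oversampling the minority class plus grid search cross-validation). The feature
# matrix `X`, labels `y`, parameter grid and helper name are assumptions; the actual features,
# training setup and hyperparameter ranges are in the source code.
def xgb_training_sketch(X, y):
    """Oversample the minority class, then grid search an XGBClassifier."""
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Oversample so both classes are equally represented in the training data.
    X_resampled, y_resampled = RandomOverSampler(random_state=42).fit_resample(X, y)

    # Small parameter grid purely for illustration.
    param_grid = {'max_depth': [3, 6], 'learning_rate': [0.1, 0.3], 'n_estimators': [100, 300]}
    search = GridSearchCV(XGBClassifier(eval_metric='logloss'), param_grid, scoring='f1', cv=5)
    search.fit(X_resampled, y_resampled)
    return search.best_estimator_
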
ranking_intro_1 = '''
**Model performance**<br>
The first table shows the model performance metrics of the current selection. For the ranking models, this is the NDCG@k score. Here, the k parameter is set to 3, as for this exercise,
correctly ranking the top 3 relevant employees is most important. The NDCG@k metric is calculated by comparing the 'employee_is_selected' column (true labels) to the 'predicted_score'
column (predicted scores). Note that the predicted scores are not binary like the true labels. Model descriptions can be found at the bottom of the page.
'''
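
# Illustrative sketch (not part of the dashboard text): computing the NDCG@3 score for a single
# lead from the two columns mentioned above, here with scikit-learn's ndcg_score. The dataframe
# `lead_df` and the helper name are assumptions.
def lead_ndcg_sketch(lead_df, k=3):
    """NDCG@k for one lead: true relevance labels versus predicted scores."""
    from sklearn.metrics import ndcg_score

    # ndcg_score expects 2D arrays of shape (n_queries, n_items); one lead is one query.
    y_true = [lead_df['employee_is_selected'].to_list()]
    y_score = [lead_df['predicted_score'].to_list()]
    return ndcg_score(y_true, y_score, k=k)
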
ranking_intro_2 = '''
**Top 3**<br>
The second table shows the top 3 relevant employees. As the 'predicted_score' column contains values with different ranges from lead to lead, these scores are scaled between 0 and 1 to make them
comparable across leads (scaling is explained below). These scaled scores are then used to determine a top 3 relevant employee ranking by checking whether the scaled score is above the minimum
prediction score selected in the sidebar. Note that ranking models like these prioritize the correct employee order over individual employee relevance scoring. Therefore, the top 3 and its dependence on
the selected minimum relevance probability is a bit less important than the total ranking order in the third table.
'''
ranking_intro_3 = '''
**Predicted relevance scores and score scaling**<br>
The predicted employee relevance scores can be found in the 'predicted_score' column of the tables. The range of predicted scores across the whole campaign is used to scale the predicted scores of single leads.
The scores are scaled to values between 0 and 1. This means that the score scaling depends on the campaign the lead falls into. The scaled relevance scores are stored in the 'scaled_predicted_score'
column of the tables. Again, how many of the employees show up in the top 3 table depends on the minimum prediction score selected in the sidebar.
'''
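
# Illustrative sketch (not part of the dashboard text): min-max scaling of the predicted scores
# as described above, using the score range of the whole campaign, followed by the top 3 selection
# from the ranking tab. The dataframes and the helper name are assumptions.
def scaled_top_3_sketch(campaign_df, lead_df, min_prediction_score=0.5):
    """Scale a lead's scores to [0, 1] using the campaign-wide range, then pick the top 3."""
    lo = campaign_df['predicted_score'].min()
    hi = campaign_df['predicted_score'].max()

    lead_df = lead_df.copy()
    lead_df['scaled_predicted_score'] = (lead_df['predicted_score'] - lo) / (hi - lo)

    relevant = lead_df[lead_df['scaled_predicted_score'] >= min_prediction_score]
    return relevant.nlargest(3, 'scaled_predicted_score')
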
models_test_ranking_1 = '''
The ranking models are based on LightGBM, a gradient boosting framework that uses tree-based learning algorithms. From this framework, models are created using the LGBMRanker. Here, the objective
parameter is set to 'lambdarank' and the boosting type parameter is set to 'gbdt' (Gradient Boosting Decision Tree). The combination of these two parameters means that the models use a LambdaMART algorithm. The models
available in this dashboard are described below. More information about, for example, the specific training parameters can be found in the source code.
<br>
- **Light GBM 1:** First implementation of the LambdaMART algorithm. Hyperparameters are not tuned. Performs the worst of the three models.
- **Light GBM 2:** Hyperparameters are tuned by testing multiple hyperparameter values using trials. These trials try to optimize the NDCG@k score on a test set. For this model, k is set to 10.
- **Light GBM 3:** Hyperparameters are tuned with different value ranges. Also, an unbalanced-data flag is added. Finally, k is set to 3 to prioritize the top 3 ranking score. This model performs the best.
'''
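
# Illustrative sketch (not part of the dashboard text): the general shape of the LambdaMART setup
# described above, using LightGBM's LGBMRanker. The data, per-lead group sizes and parameter values
# are assumptions; the real training code and tuned hyperparameters are in the source code.
def lgbm_ranker_sketch(X_train, y_train, group_train, X_val, y_val, group_val):
    """Train a LambdaMART-style ranker and track NDCG@3 on a validation set."""
    from lightgbm import LGBMRanker

    ranker = LGBMRanker(objective='lambdarank', boosting_type='gbdt', n_estimators=200)
    ranker.fit(
        X_train, y_train,
        group=group_train,            # number of employees per lead (query) in X_train
        eval_set=[(X_val, y_val)],
        eval_group=[group_val],
        eval_at=[3],                  # report NDCG@3 during training
    )
    return ranker
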
comparison_1 = '''
**Classification vs Ranking**<br>
In order to make the classification and ranking models comparable, performance metrics like F1 and recall are calculated for both. To calculate these, the predicted relevance scores
are converted to a binary classification (0 for not relevant and 1 for relevant) using the selected minimum relevance probabilities/scores. Therefore, the performance metrics depend
on the selected campaign, model and minimum relevance probabilities/scores. It should be noted that, of all the performance metrics, the NDCG score is the best estimate of the ranking model performance.
More about this is explained below.<br><br>
Comparing the two techniques, the classification models seem to perform better in terms of individual item scoring. This means that the classification models are more accurate
in determining whether or not a single employee is relevant (0 or 1). This is evident when looking at the F1 score (the harmonic mean of the precision and recall scores), which is higher for the classification models
than for the ranking models. However, this is to be expected, as the learning-to-rank models are more focused on the relative ordering of the items (employees) than on the most
accurate relevance score of each individual item. This means that metrics like F1, accuracy, precision and recall are less suitable for assessing the ranking performance of the model. The NDCG score is better
suited for this purpose.
'''
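
# Illustrative sketch (not part of the dashboard text): making the ranking output comparable to the
# classification output, by binarising the scaled scores with the selected minimum and computing the
# same F1/recall metrics. The dataframe `campaign_df` and the helper name are assumptions.
def ranking_vs_classification_sketch(campaign_df, min_score=0.5):
    """Binarise scaled ranking scores and compute classification-style metrics for comparison."""
    from sklearn.metrics import f1_score, recall_score

    y_true = campaign_df['employee_is_selected']
    y_pred = (campaign_df['scaled_predicted_score'] >= min_score).astype(int)
    return {
        'f1': f1_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
    }
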
ranking_ndcg = '''
**Ranking (NDCG scores)**<br>
The ranking models are best evaluated by their NDCG@k scores. The k parameter is set to 3, as for this exercise, correctly ranking the top 3 relevant employees is most important. On this page, the average
NDCG scores of a full campaign can be found. This means that the individual lead NDCG scores are summed and divided by the total number of leads within the selected campaign. The low average
NDCG scores of some campaigns, like campaign 256, can be explained by the fact that these campaigns contain lots of leads with zero relevant employees. When this is the case, the NDCG score of such a lead is
likely 0. Of course, lots of these 0 scores will lower the average NDCG score of the full campaign.
'''
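
# Illustrative sketch (not part of the dashboard text): the average campaign NDCG described above,
# i.e. the mean of the per-lead NDCG@3 scores. Reuses the hypothetical lead_ndcg_sketch helper
# defined earlier in this file; the 'lead_id' column name is an assumption.
def campaign_average_ndcg_sketch(campaign_df, k=3):
    """Average the per-lead NDCG@k scores over all leads in the campaign."""
    lead_scores = [
        lead_ndcg_sketch(lead_df, k=k)
        for _, lead_df in campaign_df.groupby('lead_id')
    ]
    return sum(lead_scores) / len(lead_scores)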