Where I work, we are obsessed with what happens to a model's performance after it has been deployed. We call this post-deployment data science.
Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.
How can we detect Concept Drift? 🤔
All ML models are designed to do one thing: learn a probability distribution of the form P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'. 🧠
This probability distribution, P(y|X), is also called the Concept. Therefore, if the Concept changes, the model may become invalid.
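If you prefer code to notation, here is a tiny sketch of what learning P(y|X) means in practice. The classifier and dataset below are arbitrary choices made purely for illustration (scikit-learn is used for convenience); they are not part of the algorithm described later.

```python
# Illustrative only: a trained classifier is essentially an estimate of the Concept P(y|X).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# predict_proba returns the model's estimate of P(y | X) for each row of X
print(model.predict_proba(X[:3]))
```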
❓ But how do we know if there is a new Concept in our data?
❓ Or, more importantly, how do we measure whether the new Concept is affecting the model's performance?
💡 We came up with a clever solution with two main ingredients: a reference dataset, where the model's performance is known, and a dataset with the latest data we would like to monitor.
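To make these two ingredients concrete, here is roughly what they could look like. The column names (feature_1, feature_2, y_pred, y_true) are an assumed layout for this sketch, not a required schema; note that the latest data needs targets too, since a new Concept P(y|X) can only be learned from them.

```python
import pandas as pd

# Reference dataset: features, the monitored model's predictions, and known targets,
# so the model's performance here is known.
reference = pd.DataFrame({
    "feature_1": [0.2, 1.5, -0.3, 0.8],
    "feature_2": [1.1, -0.7, 0.4, 2.0],
    "y_pred":    [0, 1, 0, 1],   # monitored model's predictions
    "y_true":    [0, 1, 1, 1],   # ground truth
})

# Monitoring chunk: the latest production data we want to check for Concept Drift.
chunk = pd.DataFrame({
    "feature_1": [0.1, 2.2, -0.9],
    "feature_2": [0.5, -1.3, 1.8],
    "y_pred":    [0, 1, 0],
    "y_true":    [1, 1, 0],      # targets for the latest data
})
```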
📣 Step-by-Step solution (sketched in code at the end of this post):
1️⃣ We start by training an internal model on a chunk of the latest data. ➡️ This allows us to learn the possible new Concept present in the data.
2️⃣ Next, we use the internal model to make predictions on the reference dataset.
3️⃣ We then estimate the monitored model's performance on the reference dataset, treating the internal model's predictions (which encode the Concept learned from the latest data) as ground truth.
4️⃣ If the estimated performance and the monitored model's actual performance on the reference dataset are very different, we say that there has been Concept Drift.
To quantify how the new Concept impacts performance, we subtract the monitored model's actual performance on the reference dataset from the estimated performance and report the delta of the performance metric. ➡️ This is what the plot below shows: the change in F1-score due to Concept Drift! 🚨
This process is repeated for every new chunk of data that we get. 🔁
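Here is a minimal, self-contained sketch of the whole procedure, reusing the column layout from the earlier snippet. To be clear, this is not the production implementation: the choice of internal model (a gradient-boosted classifier), the synthetic data, and every name in it are assumptions made for illustration; only the F1-score follows the metric mentioned above.

```python
# A minimal sketch of the procedure above -- illustrative, not the production algorithm.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

FEATURES = ["feature_1", "feature_2"]


def concept_drift_impact(reference: pd.DataFrame, chunk: pd.DataFrame) -> float:
    """Return the estimated change in F1-score caused by the Concept in `chunk`."""
    # 1) Train an internal model on the latest chunk to learn the (possibly new) Concept.
    internal = GradientBoostingClassifier(random_state=0)
    internal.fit(chunk[FEATURES], chunk["y_true"])

    # 2) Use the internal model to make predictions on the reference dataset.
    concept_labels = internal.predict(reference[FEATURES])

    # 3) Estimate the monitored model's performance on reference,
    #    treating the internal model's predictions as ground truth.
    estimated_f1 = f1_score(concept_labels, reference["y_pred"])

    # 4) Compare with the monitored model's actual performance on reference.
    actual_f1 = f1_score(reference["y_true"], reference["y_pred"])

    # Delta of the performance metric = estimated impact of Concept Drift.
    return estimated_f1 - actual_f1


# --- Tiny synthetic demo: repeat the process for every new chunk of data ---
rng = np.random.default_rng(0)


def make_frame(n_rows: int, threshold: float = 0.0) -> pd.DataFrame:
    X = rng.normal(size=(n_rows, 2))
    y_true = (X[:, 0] + X[:, 1] > threshold).astype(int)  # the current Concept P(y|X)
    y_pred = (X[:, 0] + X[:, 1] > 0.0).astype(int)        # monitored model learned the old Concept
    return pd.DataFrame({"feature_1": X[:, 0], "feature_2": X[:, 1],
                         "y_pred": y_pred, "y_true": y_true})


reference = make_frame(2_000)                       # old Concept, performance known
for threshold in [0.0, 0.0, 1.0]:                   # the last chunk carries a new Concept
    chunk = make_frame(500, threshold=threshold)
    print(f"F1 delta: {concept_drift_impact(reference, chunk):+.3f}")
```

Because both F1 numbers are computed on the very same reference inputs, the input distribution is held fixed and only the Concept differs, so the reported delta can be attributed to Concept Drift rather than to a change in the data itself.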