AI & ML interests

Post-deployment data science. ML monitoring. Performance estimation.

Recent Activity

NannyML's activity

santiviquez posted an update 2 months ago
Professors should ask students to write blog posts based on their final projects instead of having them do paper-like reports.

A single blog post, accessible to the entire internet, can have a greater career impact than dozens of reports that nobody will read.
santiviquez posted an update 2 months ago
Some exciting news...

We are open-sourcing The Little Book of ML Metrics! 🎉

The book that will be on every data scientist's desk is open source.

What does that mean?

It means hundreds of people can review it, contribute to it, and help us improve it before it's finished!

This also means that everyone will have free access to the digital version!

Meanwhile, the high-quality printed edition will remain available for purchase, as it has been for a while.

Revenue from printed copies will help us support the further development and maintenance of the book. Not to mention that reviewers and contributors will share in the revenue through their affiliate links. 🙌

Check out the book repo (make sure to leave a star 🌟):

https://github.com/NannyML/The-Little-Book-of-ML-Metrics
santiviquez posted an update 3 months ago
We can't think in more than three dimensions.

But we have no problem doing math and writing computer programs in many dimensions. It just works.

I find that extremely crazy.
santiviquez posted an update 4 months ago
ML people on a long flight

(See picture)
santiviquez posted an update 4 months ago
Some personal and professional news ✨

I'm writing a book on ML metrics.

Together with Wojtek Kuberski, I'm creating the missing piece of every ML university program and online course: a book dedicated solely to Machine Learning metrics!

The book will cover the following types of metrics:
• Regression
• Classification
• Clustering
• Ranking
• Vision
• Text
• GenAI
• Bias and Fairness

👉 Check out the book: https://www.nannyml.com/metrics
santiviquez posted an update 7 months ago
They: you need ground truth to measure performance! 😠

NannyML: hold my beer...
santiviquez posted an update 7 months ago
I ran 580 experiments (yes, 580 🤯) to check if we can quantify data drift's impact on model performance using only drift metrics.

For these experiments, I built a technique that relies on drift signals to estimate model performance. I compared its results against the current SoTA performance estimation methods and checked which technique performs best.

The plot below summarizes the general results. It shows the quality of the performance estimation versus the absolute performance change (lower is better).

Full experiment: https://www.nannyml.com/blog/data-drift-estimate-model-performance

In it, I describe the setup, datasets, models, benchmarking methods, and the code used in the project.
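
To make the comparison concrete, here's a tiny sketch (with made-up numbers, not the actual experiment) of how I score estimation quality once I have the realized performance and two sets of per-chunk estimates:

```python
# Minimal sketch with made-up numbers (not the actual experiment): score two
# performance-estimation approaches by the absolute error of their per-chunk estimates.
import numpy as np

realized        = np.array([0.91, 0.88, 0.83, 0.79, 0.75])  # realized ROC AUC per chunk
drift_based_est = np.array([0.90, 0.86, 0.86, 0.83, 0.80])  # estimates built from drift signals
cbpe_like_est   = np.array([0.91, 0.89, 0.82, 0.78, 0.76])  # estimates from a CBPE-style method

for name, est in [("drift-based", drift_based_est), ("CBPE-style", cbpe_like_est)]:
    mae = np.mean(np.abs(est - realized))  # quality of the estimation (lower is better)
    print(f"{name:12s} MAE = {mae:.3f}")
```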
santiviquez posted an update 8 months ago
Looking for someone with 10+ years of experience training Deep Kolmogorov-Arnold Networks.

Any suggestions?
santiviquez posted an update 9 months ago
More open research updates 🧵

Performance estimation is currently the best way to quantify the impact of data drift on model performance. 💡

I've been benchmarking performance estimation methods (CBPE and M-CBPE) against data drift signals.

I'm using drift results as features for many regression algorithms and then using their predictions as estimates of the model's performance. Finally, I'm measuring the Mean Absolute Error (MAE) between the regression models' predictions and the actual performance.

So far, in all my experiments, performance estimation methods do better than drift signals. 👨‍🔬

Bear in mind that these are early results; I'm running the flow on more datasets as we speak.

Hopefully, by next week, I will have more results to share 👀
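
For context, a stripped-down sketch of this setup could look like the following (the column names and numbers are invented; the real runs use many datasets and several regressors):

```python
# Toy sketch (not the exact experimental setup): estimate per-chunk model
# performance from drift signals and score the estimates with MAE.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# One row per data chunk: drift metrics as features, realized performance as target.
chunks = pd.DataFrame({
    "psi_feature_1":    [0.02, 0.05, 0.30, 0.45, 0.10, 0.60],
    "ks_feature_2":     [0.01, 0.04, 0.25, 0.35, 0.08, 0.50],
    "realized_roc_auc": [0.91, 0.90, 0.84, 0.80, 0.89, 0.76],
})

train, test = chunks.iloc[:4], chunks.iloc[4:]
features = ["psi_feature_1", "ks_feature_2"]

# Fit a regressor that maps drift signals -> realized performance.
reg = RandomForestRegressor(n_estimators=200, random_state=42)
reg.fit(train[features], train["realized_roc_auc"])

# Quality of the estimation = MAE between estimated and realized performance.
estimated = reg.predict(test[features])
mae = mean_absolute_error(test["realized_roc_auc"], estimated)
print(f"MAE of drift-based performance estimation: {mae:.3f}")
```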
santiviquez posted an update 9 months ago
How would you benchmark performance estimation algorithms vs data drift signals?

I'm working on a benchmarking analysis, and I'm currently doing the following:

- Get univariate and multivariate drift signals and measure their correlation with realized performance.
- Use drift signals as features of a regression model to predict the model's performance.
- Use drift signals as features of a classification model to predict a performance drop.
- Compare all the above experiments with results from Performance Estimation algorithms.

Any other ideas?
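
To make the third idea concrete, a toy sketch could look like this (the drift signals, the 5-point drop definition, and all numbers are invented):

```python
# Toy sketch (invented data): use drift signals to classify whether a chunk
# suffered a performance drop relative to the reference performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_chunks = 200

# Fake per-chunk drift signals and realized performance.
drift_signals = rng.uniform(0.0, 1.0, size=(n_chunks, 3))
realized_auc = 0.92 - 0.15 * drift_signals.mean(axis=1) + rng.normal(0, 0.02, n_chunks)

# Label a chunk as "dropped" if performance fell more than 5 points below reference.
reference_auc = 0.92
dropped = (realized_auc < reference_auc - 0.05).astype(int)

# Train on the first half of the chunks, evaluate on the second half.
split = n_chunks // 2
clf = LogisticRegression(max_iter=1000).fit(drift_signals[:split], dropped[:split])
preds = clf.predict(drift_signals[split:])
print("F1 for predicting a performance drop:", round(f1_score(dropped[split:], preds), 3))
```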
santiviquez posted an update 10 months ago
People in Paris 🇫🇷 🥐

Next week we'll be hosting our first Post-Deployment Data Science Meetup in Paris!

My boss will be talking about Quantifying the Impact of Data Drift on Model Performance. 👀

The event is completely free, and there's only space for 50 people, so if you are interested, RSVP as soon as possible 🤗

🗓️ Thursday, March 14
🕠 5:30 PM - 8:30 PM GMT+1
🔗 RSVP: https://lu.ma/postdeploymentparis
santiviquez posted an update 10 months ago
Where I work, we are obsessed with what happens to a model's performance after it has been deployed. We call this post-deployment data science.

Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.

How can we detect Concept Drift? 🤔

All ML models are designed to do one thing: learn a probability distribution of the form P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'. 🧠

This probability distribution, P(y|X), is also called the Concept. Therefore, if the Concept changes, the model may become invalid.

❓ But how do we know if there is a new Concept in our data?
❓ Or, more importantly, how do we measure if the new Concept is affecting the model's performance?

💡 We came up with a clever solution where the main ingredients are a reference dataset, one where the model's performance is known, and a dataset with the latest data we would like to monitor.

👣 Step-by-step solution:

1️⃣ We start by training an internal model on a chunk of the latest data. ➡️ This allows us to learn the possible new Concept present in the data.

2️⃣ Next, we use the internal model to make predictions on the reference dataset.

3️⃣ We then estimate the monitored model's performance on the reference dataset, treating the internal model's predictions as ground truth.

4️⃣ If this estimated performance is very different from the monitored model's actual performance on the reference dataset, we say that there has been Concept Drift.

To quantify how this Concept impacts performance, we subtract the actual model's performance on the reference dataset from the estimated performance and report the delta of the performance metric. ➡️ This is what the plot below shows: the change in F1-score due to Concept Drift! 🚨

This process is repeated for every new chunk of data that we get. 🔁
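
In code, a simplified sketch of these steps could look like the following (a toy illustration with scikit-learn, not NannyML's actual implementation; it assumes targets are available for the latest chunk):

```python
# Rough sketch of the steps above: train an "internal" model on the latest chunk
# to capture the new concept, let it label the reference set, and compare the
# monitored model's score against those proxy labels with its actual reference score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def concept_drift_delta(monitored_model, X_reference, y_reference, X_chunk, y_chunk):
    # 1) Learn the (possibly new) concept P(y|X) from the latest chunk.
    internal_model = LogisticRegression(max_iter=1000).fit(X_chunk, y_chunk)

    # 2) The internal model labels the reference data (proxy ground truth under the new concept).
    proxy_labels = internal_model.predict(X_reference)

    # 3) Monitored model's performance under the new concept vs. its actual reference performance.
    monitored_preds = monitored_model.predict(X_reference)
    estimated_f1 = f1_score(proxy_labels, monitored_preds)
    actual_f1 = f1_score(y_reference, monitored_preds)

    # 4) The delta is the estimated impact of concept drift on the F1-score.
    return estimated_f1 - actual_f1

# Tiny demo with synthetic data: the latest chunk has an inverted concept.
rng = np.random.default_rng(1)
X_ref = rng.normal(size=(1000, 3)); y_ref = (X_ref[:, 0] > 0).astype(int)
X_new = rng.normal(size=(500, 3));  y_new = (X_new[:, 0] < 0).astype(int)  # concept flipped
monitored = LogisticRegression(max_iter=1000).fit(X_ref, y_ref)
print("Estimated F1 change due to concept drift:",
      round(concept_drift_delta(monitored, X_ref, y_ref, X_new, y_new), 3))
```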

santiviquez posted an update 10 months ago
LLM hallucination detection papers be like:

* check the image to get the joke 👀
santiviquez posted an update 10 months ago
Fantastic Beasts (*Hallucinations*) and Where to Find Them 🔎🧌

This paper breaks down LLM hallucinations into six different types:

1️⃣ Entity: Involves errors in nouns. Changing that single entity can make the sentence correct.

2️⃣ Relation: Involves errors in verbs, prepositions, or adjectives. They can be fixed by correcting the relation.

3️⃣ Contradictory: Sentences that contradict factually correct information.

4️⃣ Invented: When the LLM generates sentences with concepts that don't exist in the real world.

5️⃣ Subjective: When the LLM generates sentences influenced by personal beliefs, feelings, biases, etc.

6️⃣ Unverifiable: When the LLM comes up with sentences containing information that can't be verified, e.g., personal or private matters.

The first two types of hallucinations are relatively easy to correct, given that we can rewrite them by changing the entity or relation. However, the other four would mostly need to be removed to make the sentence factually correct.

Paper: Fine-grained Hallucination Detection and Editing for Language Models (2401.06855)
santiviquez posted an update 10 months ago
So, I have this idea to (potentially) improve uncertainty quantification for LLM hallucination detection.

The premise is that not all output tokens of a generated response share the same importance. Hallucinations are more dangerous when they take the form of a noun, date, number, etc.

The idea is to have a "token selection" layer that filters the sequence of output token probabilities. Then, we use only the probabilities of the relevant tokens to calculate uncertainty quantification metrics.

The big question is how we know which tokens are the relevant ones. 🤔

My idea is to take the decoded output sequence, use an NLP model (it doesn't need to be a fancy one) to run entity recognition and part-of-speech tagging on it, and then do uncertainty quantification only on the tokens we have marked as relevant (nouns, dates, numbers, etc.).
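
Here's a quick sketch of what I have in mind, using spaCy for the tagging. Big simplifying assumption: the probabilities are already aligned one-to-one with spaCy's tokens; in practice, you'd first have to map the LLM's subword probabilities onto words.

```python
# Sketch of "selective" uncertainty: average the negative log-probabilities
# only over tokens tagged as nouns/numbers or named entities.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")
RELEVANT_POS = {"PROPN", "NOUN", "NUM"}  # token types we consider dangerous if hallucinated

def selective_uncertainty(text: str, token_probs: list[float]) -> float:
    """Mean negative log-probability over 'relevant' tokens only.

    Assumes token_probs is aligned 1:1 with spaCy's tokens (a simplification:
    the LLM's subword probabilities must be mapped onto these tokens first).
    """
    doc = nlp(text)
    scores = [
        -np.log(p)
        for tok, p in zip(doc, token_probs)
        if tok.pos_ in RELEVANT_POS or tok.ent_type_ != ""
    ]
    return float(np.mean(scores)) if scores else 0.0

answer = "Napoleon was born in 1769 in Corsica."
probs = [0.62, 0.97, 0.95, 0.93, 0.55, 0.96, 0.70, 0.99]  # made-up per-token probabilities
print("selective uncertainty:", round(selective_uncertainty(answer, probs), 3))
```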

What are your thoughts? Has anyone tried this before? I'm curious whether it would have an impact on the correlation with human-annotated evaluations.
santiviquez posted an update 10 months ago
Eigenvalues to the rescue? 🛟🤔

I found out about this paper thanks to @gsarti's post from last week; I got curious, so I want to post my take on it. 🤗

The paper proposes a new metric called EigenScore to detect LLM hallucinations. 📄

Their idea is that given an input question, they generate K different answers, take their internal embedding states, calculate a covariance matrix with them, and use it to calculate an EigenScore.

We can think of the EigenScore as the log-determinant of the (regularized) covariance matrix of the embeddings of the K generated answers, divided by K; equivalently, the mean of its log eigenvalues.

❓ But why eigenvalues?

Well, if the K generations have similar semantics, the sentence embeddings will be highly correlated, and most eigenvalues will be close to 0.

On the other hand, if the LLM hallucinates, the K generations will have diverse semantics, and the eigenvalues will be significantly different from 0.
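
Here's a rough numpy sketch of that intuition (random vectors stand in for the real sentence embeddings, and details like the centering and regularization follow my reading of the paper):

```python
# Rough sketch of an EigenScore-style computation: mean of the log eigenvalues
# of the regularized K x K covariance of the generations' embeddings.
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """embeddings: K x d matrix of sentence embeddings for K generations."""
    K, _ = embeddings.shape
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered @ centered.T  # K x K covariance across generations
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(K))  # regularize, then eigenvalues
    return float(np.mean(np.log(eigvals)))  # log-determinant / K

rng = np.random.default_rng(0)
consistent = np.tile(rng.normal(size=(1, 768)), (10, 1)) + 0.01 * rng.normal(size=(10, 768))
diverse = rng.normal(size=(10, 768))
print("consistent generations:", round(eigenscore(consistent), 2))  # lower score
print("diverse generations:   ", round(eigenscore(diverse), 2))     # higher score
```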

The idea is pretty neat and shows better results when compared to other methods like sequence probabilities, length-normalized entropy, and other uncertainty quantification-based methods.

💭 What I'm personally missing from the paper is that they don't compare their results with other methods like LLM-Eval and SelfCheckGPT. They do mention that EigenScore is much cheaper to run than SelfCheckGPT, but that's all they say on the topic.

Paper: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection (2402.03744)
santiviquez posted an update 10 months ago
Hey GPT, check yourself...

Here is a black-box method for hallucination detection that shows a strong correlation with human annotations. 🔥

💡 The idea is the following: ask GPT, or any other powerful LLM, to sample multiple answers for the same prompt, and then ask it whether these answers align with the statements in the original output. Make it say yes/no and measure the frequency with which the generated samples support the original statements.

This method is called SelfCheckGPT with Prompt and shows very nice results. 👀

The downside: we have to make many LLM calls just to evaluate a single generated paragraph... 🙃
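
A minimal sketch of the scoring loop (`ask_llm` is a placeholder for whatever client you use, and the prompt wording is only an approximation of the one in the paper):

```python
# Sketch of a SelfCheckGPT-with-Prompt-style score: the fraction of sampled
# answers that do NOT support a given sentence (higher = more likely hallucinated).
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

PROMPT = (
    "Context: {sample}\n"
    "Sentence: {sentence}\n"
    "Is the sentence supported by the context above? Answer Yes or No:"
)

def selfcheck_with_prompt(sentence: str, samples: list[str]) -> float:
    votes = []
    for sample in samples:
        answer = ask_llm(PROMPT.format(sample=sample, sentence=sentence))
        votes.append(0.0 if answer.strip().lower().startswith("yes") else 1.0)
    return sum(votes) / len(votes)
```

Note how the number of calls grows with (sentences to check) × (samples drawn), which is exactly where the cost comes from.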

More details and variations of this method are in the paper: SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning (2308.00436)
santiviquez posted an update 11 months ago
What if the retrieval goes wrong? 🐕

Retrieval Augmented Generation (RAG) is a strategy to alleviate LLM hallucinations and improve the quality of generated responses.

A standard RAG architecture has two main blocks: a Retriever and a Generator.

1️⃣ When the system receives an input sequence, it uses the Retriever to retrieve the top-K most relevant documents associated with the input sequence. These documents typically come from an external source (e.g., Wikipedia) and are then concatenated to the original input's context.

2️⃣ It then uses the Generator to generate a response given the gathered information in the first step.

But what happens if the retrieval goes wrong and the retrieved documents are of very low quality?

Well, in such cases, the generated response will probably be of low quality, too. 🫠

But here is where CRAG (Corrective RAG) *might* help. I say it might help because the paper is very new (only one week old), and I don't know if anyone has actually tried this in practice 😅

However, the idea is to add a Knowledge Correction block between the Retrieval and Generation steps to evaluate the retrieved documents and correct them if necessary.

This step goes as follows:

🟢 If the documents are correct, they will be refined into more precise knowledge strips and concatenated to the original context to generate a response.

🔴 If the documents are incorrect, they will be discarded, and instead, the system searches the web for complementary knowledge. This external knowledge is then concatenated to the original context to generate a response.

🟡 If the documents are ambiguous, a combination of the previous two resolutions is triggered.

The experimental results from the paper show how the CRAG strategy outperforms traditional RAG approaches in both short and long-form text generation tasks.
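
In code, the corrective flow boils down to something like this (every helper passed in here is a hypothetical placeholder, not the paper's actual components):

```python
# Very rough sketch of a CRAG-style pipeline: retrieve, judge the retrieval,
# correct the knowledge if needed, then generate. Helpers are assumed to
# return lists of text snippets (refine_to_strips, web_search) or strings.
def corrective_rag(query: str, retrieve, evaluate_retrieval, refine_to_strips,
                   web_search, generate) -> str:
    docs = retrieve(query)                     # top-K documents from the retriever
    verdict = evaluate_retrieval(query, docs)  # "correct" | "incorrect" | "ambiguous"

    if verdict == "correct":
        knowledge = refine_to_strips(query, docs)           # keep only precise strips
    elif verdict == "incorrect":
        knowledge = web_search(query)                       # discard docs, search the web
    else:  # ambiguous: combine both resolutions
        knowledge = refine_to_strips(query, docs) + web_search(query)

    return generate(query, knowledge)          # generator conditions on the corrected knowledge
```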

Paper: Corrective Retrieval Augmented Generation (2401.15884)