Santiago Viquez

santiviquez

AI & ML interests

ML @ NannyML. A bit of everything. NLP, RL, and, of course, tabular. In the GenAI era, how can you not love tabular data? Educational content and OSS.

santiviquez's activity

posted an update 3 days ago
I ran 580 experiments (yes, 580 🀯) to check if we can quantify data drift's impact on model performance using only drift metrics.

For these experiments, I built a technique that relies on drift signals to estimate model performance. I compared its results against the current SoTA performance estimation methods and checked which technique performs best.

The plot below summarizes the general results. It shows the quality of the performance estimates versus the absolute performance change (lower is better).

Full experiment: https://www.nannyml.com/blog/data-drift-estimate-model-performance

In it, I describe the setup, datasets, models, benchmarking methods, and the code used in the project.
posted an update about 1 month ago
Looking for someone with 10+ years of experience training Deep Kolmogorov-Arnold Networks.

Any suggestions?
posted an update about 2 months ago
More open research updates 🧡

Performance estimation is currently the best way to quantify the impact of data drift on model performance. πŸ’‘

I've been benchmarking performance estimation methods (CBPE and M-CBPE) against data drift signals.

I'm using drift results as features for many regression algorithms and then using their predictions as estimates of the model's performance. Finally, I'm measuring the Mean Absolute Error (MAE) between the regression models' predictions and the actual performance.
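
Roughly, the setup looks like the sketch below. This is a minimal illustration with made-up column names (the actual drift signals and performance metric depend on the dataset), not the exact benchmarking code:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical file: one row per data chunk with its drift metrics and realized performance
df = pd.read_csv("drift_signals_per_chunk.csv")
X = df[["jensen_shannon", "kolmogorov_smirnov", "pca_reconstruction_error"]]  # drift signals
y = df["realized_roc_auc"]                                                    # realized performance

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg = GradientBoostingRegressor().fit(X_train, y_train)

# How far are the drift-based estimates from the realized performance?
mae = mean_absolute_error(y_test, reg.predict(X_test))
print(f"MAE between estimated and realized performance: {mae:.4f}")
```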

So far, for all my experiments, performance estimation methods do better than drift signals. πŸ‘¨β€πŸ”¬

Bear in mind that these are early results; I'm running the flow on more datasets as we speak.

Hopefully, by next week, I will have more results to share πŸ‘€
posted an update 2 months ago
How would you benchmark performance estimation algorithms vs data drift signals?

I'm working on a benchmarking analysis, and I'm currently doing the following:

- Get univariate and multivariate drift signals and measure their correlation with realized performance (see the sketch after this list).
- Use drift signals as features of a regression model to predict the model's performance.
- Use drift signals as features of a classification model to predict a performance drop.
- Compare all the above experiments with results from Performance Estimation algorithms.
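
For the first point, a minimal sketch of what that correlation check could look like (hypothetical column names, one row per data chunk):

```python
import pandas as pd

# Hypothetical file: per-chunk drift signals plus the realized performance on that chunk
df = pd.read_csv("chunk_level_results.csv")

drift_cols = ["jensen_shannon", "kolmogorov_smirnov", "domain_classifier_auc"]
corr = df[drift_cols + ["realized_f1"]].corr()["realized_f1"].drop("realized_f1")
print(corr.sort_values(ascending=False))
```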

Any other ideas?
replied to gsarti's post 2 months ago
posted an update 3 months ago
People in Paris πŸ‡«πŸ‡· πŸ₯

Next week we'll be hosting our first Post-Deployment Data Science Meetup in Paris!

My boss will be talking about Quantifying the Impact of Data Drift on Model Performance. πŸ‘€

The event is completely free, and there's only space for 50 people, so if you are interested, RSVP as soon as possible πŸ€—

πŸ—“οΈ Thursday, March 14
πŸ•  5:30 PM - 8:30 PM GMT+1
πŸ”— RSVP: https://lu.ma/postdeploymentparis
posted an update 3 months ago
Where I work, we are obsessed with what happens to a model's performance after it has been deployed. We call this post-deployment data science.

Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.

How can we detect Concept Drift? πŸ€”

Supervised ML models are designed to do one thing: learn a probability distribution of the form P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'. 🧠

This probability distribution, P(y|X), is also called Concept. Therefore, if the Concept changes, the model may become invalid.

❓But how do we know if there is a new Concept in our data?
❓Or, more importantly, how do we measure whether the new Concept is affecting the model's performance?

πŸ’‘ We came up with a clever solution whose main ingredients are a reference dataset, where the model's performance is known, and a dataset with the latest data we would like to monitor.

πŸ‘£ Step-by-Step solution:

1️⃣ We start by training an internal model on a chunk of the latest data. ➑️ This allows us to learn the possible new Concept present in the data.

2️⃣ Next, we use the internal model to make predictions on the reference dataset.

3️⃣ We then estimate the model's performance on the reference dataset, treating the internal model's predictions (from the previous step) as ground truth.

4️⃣ If the estimated performance of the internal model and the actual monitored model are very different, we then say that there has been a Concept Drift.

To quantify how this new Concept impacts performance, we subtract the actual model's performance on the reference set from the estimated performance and report a delta of the performance metric. ➑️ This is what the plot below shows: the change in F1-score due to Concept Drift! 🚨

This process is repeated for every new chunk of data that we get. πŸ”
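
A minimal sketch of how that loop could look for a binary classifier, written with scikit-learn; this is my own paraphrase of the steps above with placeholder names, not NannyML's actual implementation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def concept_drift_delta(monitored_model, reference_df, chunk_df, features, target):
    # 1) Train an internal model on the latest chunk to capture the (possibly new) Concept P(y|X)
    internal = RandomForestClassifier().fit(chunk_df[features], chunk_df[target])

    # 2) Use the internal model to make predictions on the reference dataset
    proxy_labels = internal.predict(reference_df[features])

    # 3) Estimate the monitored model's performance on reference,
    #    treating the internal model's predictions as ground truth
    monitored_preds = monitored_model.predict(reference_df[features])
    estimated_f1 = f1_score(proxy_labels, monitored_preds)

    # 4) Compare with the realized performance on reference (labels are known there)
    realized_f1 = f1_score(reference_df[target], monitored_preds)

    # Delta of the performance metric: the estimated change in F1 due to Concept Drift
    return estimated_f1 - realized_f1

# Repeat for every new chunk of monitoring data:
# deltas = [concept_drift_delta(model, reference_df, chunk, features, target) for chunk in chunks]
```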

posted an update 3 months ago
LLM hallucination detection papers be like:

* check the image to get the joke πŸ‘€
posted an update 3 months ago
Fantastic Beasts (*Hallucinations*) and Where to Find Them πŸ”ŽπŸ§Œ

This paper breaks down LLM hallucinations into six different types:

1️⃣ Entity: Involves errors in nouns. Changing that single entity can make the sentence correct.

2️⃣ Relation: Involves errors in verbs, prepositions, or adjectives. They can be fixed by correcting the relation.

3️⃣ Contradictory: Sentences that contradict factually correct information.

4️⃣ Invented: When the LLM generates sentences with concepts that don't exist in the real world.

5️⃣ Subjective: When the LLM generates sentences influenced by personal beliefs, feelings, biases, etc.

6️⃣ Unverifiable: When the LLM comes up with sentences containing information that can't be verified, e.g., personal or private matters.

The first two types of hallucinations are relatively easy to correct, given that we can rewrite them by changing the entity or relation. However, the other four would mostly need to be removed to make the sentence factually correct.

Paper: Fine-grained Hallucination Detection and Editing for Language Models (2401.06855)
replied to their post 4 months ago

omg this is super cool! Definitely ping me when you have a demo.

replied to their post 4 months ago

@gsarti curious to know if you have seen something like this. It is very similar to a weighted version of UQ, but not exactly... haha

posted an update 4 months ago
So, I have this idea to (potentially) improve uncertainty quantification for LLM hallucination detection.

The premise is that not all output tokens of a generated response share the same importance. Hallucinations are more dangerous when they appear as a noun, a date, a number, etc.

The idea is to have a "token selection" layer that filters the output token probabilities sequence. Then, we use only the probabilities of the relevant tokens to calculate uncertainty quantification metrics.

The big question is how we know which tokens are the relevant ones. πŸ€”

My idea is to take the decoded output sequence, run entity recognition and part-of-speech tagging on it with an NLP model (it doesn't need to be a fancy one), and then do uncertainty quantification only on the tokens we have marked as relevant (nouns, dates, numbers, etc.).
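
A minimal sketch of that token-selection step, assuming spaCy for the tagging; the POS set is arbitrary and the alignment between LLM subword tokens and word-level tokens is hand-waved here:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
RELEVANT_POS = {"PROPN", "NOUN", "NUM"}

def selective_avg_logprob(text: str, token_logprobs: dict[str, float]) -> float:
    """Average log-probability over 'relevant' tokens only (nouns, numbers, named entities)."""
    doc = nlp(text)
    relevant = [t.text for t in doc if t.pos_ in RELEVANT_POS or t.ent_type_]
    # In practice the LLM's subword tokens would need to be aligned with these word-level tokens
    scores = [token_logprobs[t] for t in relevant if t in token_logprobs]
    return sum(scores) / len(scores) if scores else float("nan")
```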

What are your thoughts? I'm curious whether anyone has tried this before and whether it would have an impact on the correlation with human-annotated evaluations.
posted an update 4 months ago
Eigenvalues to the rescue? πŸ›ŸπŸ€”

I found out about this paper thanks to @gsarti 's post from last week; I got curious, so I want to post my take on it. πŸ€—

The paper proposes a new metric called EigenScore to detect LLM hallucinations. πŸ“„

Their idea is that given an input question, they generate K different answers, take their internal embedding states, calculate a covariance matrix with them, and use it to calculate an EigenScore.

We can think of the EigenScore as the mean of the eigenvalues of the covariance matrix of the embedding space of the K-generated answers.

❓But why eigenvalues?

Well, if the K generations have similar semantics, the sentence embeddings will be highly correlated, and most eigenvalues will be close to 0.

On the other hand, if the LLM hallucinates, the K generations will have diverse semantics, and the eigenvalues will be significantly different from 0.
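
A rough sketch of one way to compute such a score with NumPy; the paper's exact formulation and regularization details differ, so this is just to illustrate the eigenvalue intuition:

```python
import numpy as np

def eigen_style_score(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """embeddings: (K, d) array with the sentence embeddings of the K sampled answers."""
    k = embeddings.shape[0]
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered @ centered.T                             # K x K covariance-like matrix
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))   # regularize so eigenvalues stay > 0
    # Similar answers -> eigenvalues near 0 -> low score; diverse answers -> higher score
    return float(np.mean(np.log(eigvals)))
```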

The idea is pretty neat and shows better results when compared to other methods like sequence probabilities, length-normalized entropy, and other uncertainty quantification-based methods.

πŸ’­ What I'm personally missing from the paper is a comparison against other methods like LLM-Eval and SelfCheckGPT. They do mention that EigenScore is much cheaper to implement than SelfCheckGPT, but that's about all they say on the topic.

Paper: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection (2402.03744)
posted an update 4 months ago
Hey GPT, check yourself...

Here is a black-box method for hallucination detection that shows strong correlation with human annotations. πŸ”₯

πŸ’‘ The idea is the following: ask GPT, or any other powerful LLM, to sample multiple answers for the same prompt, and then ask it if these answers align with the statements in the original output. Make it say yes/no and measure the frequency with which the generated samples support the original statements.

This method is called SelfCheckGPT with Prompt and shows very nice results. πŸ‘€
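
A rough sketch of that check, assuming an OpenAI-style chat client; the prompt wording and model name are mine, not the paper's exact setup:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Context: {sample}\n"
    "Sentence: {sentence}\n"
    "Is the sentence supported by the context above? Answer Yes or No:"
)

def support_score(sentence: str, samples: list[str], model: str = "gpt-4o-mini") -> float:
    """Fraction of sampled answers that support the sentence (low values hint at hallucination)."""
    votes = []
    for sample in samples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(sample=sample, sentence=sentence)}],
            temperature=0.0,
        )
        votes.append(resp.choices[0].message.content.strip().lower().startswith("yes"))
    return sum(votes) / len(votes)
```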

The downside: we have to make many LLM calls just to evaluate a single generated paragraph... πŸ™ƒ

More details and variations of this method are in the paper: SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (2303.08896)
posted an update 4 months ago
What if the retrieval goes wrong? πŸ•

Retrieval Augmented Generation (RAG) is a strategy to alleviate LLM hallucinations and improve the quality of generated responses.

A standard RAG architecture has two main blocks: a Retriever and a Generator.

1️⃣ When the system receives an input sequence, it uses the Retriever to retrieve the top-K most relevant documents associated with the input sequence. These documents typically come from an external source (e.g., Wikipedia) and are then concatenated to the original input's context.

2️⃣ It then uses the Generator to generate a response given the gathered information in the first step.

But what happens if the retrieval goes wrong and the retrieved documents are of very low quality?

Well, in such cases, the generated response will probably be of low quality, too. 🫠

But here is where CRAG (Corrective RAG) *might* help. I say it might help because the paper is very new β€” only one week old, and I don't know if someone has actually tried this in practice πŸ˜…

However, the idea is to add a Knowledge Correction block between the Retrieval and Generation steps to evaluate the retrieved documents and correct them if necessary.

This step goes as follows:

🟒 If the documents are correct, they will be refined into more precise knowledge strips and concatenated to the original context to generate a response.

πŸ”΄ If the documents are incorrect, they will be discarded, and instead, the system searches the web for complementary knowledge. This external knowledge is then concatenated to the original context to generate a response.

🟑 If the documents are ambiguous, a combination of the previous two resolutions is triggered.
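
A minimal sketch of that knowledge-correction routing; the evaluator, refinement, and web-search pieces are passed in as placeholder callables, and the thresholds and exact decision rule are simplified from the paper:

```python
from typing import Callable, List

def correct_retrieval(query: str,
                      documents: List[str],
                      relevance_fn: Callable[[str, str], float],   # retrieval evaluator
                      refine_fn: Callable[[str], str],             # document -> knowledge strips
                      web_search_fn: Callable[[str], List[str]],
                      upper: float = 0.7, lower: float = 0.3) -> List[str]:
    """Return the knowledge that gets concatenated to the context before generation."""
    scores = [relevance_fn(query, doc) for doc in documents]
    best = max(scores)
    if best >= upper:   # 🟒 correct: keep the relevant documents, refined into precise strips
        return [refine_fn(doc) for doc, s in zip(documents, scores) if s >= upper]
    if best <= lower:   # πŸ”΄ incorrect: discard them and search the web for complementary knowledge
        return web_search_fn(query)
    # 🟑 ambiguous: combine both resolutions
    return [refine_fn(doc) for doc in documents] + web_search_fn(query)
```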

The experimental results from the paper show how the CRAG strategy outperforms traditional RAG approaches in both short and long-form text generation tasks.

Paper: Corrective Retrieval Augmented Generation (2401.15884)
replied to victor's post 4 months ago
posted an update 4 months ago
Super excited to share my project, ageML, here! 😊

ageML is a Python library I've been building to study the temporal performance degradation of ML models.

The goal of the project is to facilitate the exploration of performance degradation by providing tools for people to easily test how their models would evolve over time when trained and evaluated on different subsets of their data.
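
The kind of experiment it aims to make easy looks roughly like this (a generic sketch with scikit-learn and made-up column names, not ageML's actual API):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data.csv", parse_dates=["timestamp"]).sort_values("timestamp")
features, target = ["feature_1", "feature_2", "feature_3"], "y"

# Train on an early time window...
train = df[df["timestamp"] < "2023-01-01"]
model = RandomForestClassifier().fit(train[features], train[target])

# ...and track how the metric evolves on later, month-by-month chunks
test = df[df["timestamp"] >= "2023-01-01"]
for period, chunk in test.groupby(pd.Grouper(key="timestamp", freq="M")):
    if chunk[target].nunique() < 2:
        continue  # skip chunks where AUC is undefined
    auc = roc_auc_score(chunk[target], model.predict_proba(chunk[features])[:, 1])
    print(period.date(), round(auc, 3))
```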

⭐ Check it out: https://github.com/santiviquez/ageml
replied to their post 4 months ago

This is definitely on my list. Haven't gone through the paper, but planning to read it this week haha

posted an update 4 months ago
Understanding BARTScore πŸ›Ή

BARTScore is a text-generation evaluation metric that treats model evaluation as a text-generation task πŸ”„

Other metrics approach the evaluation problem from different ML task perspectives; for instance, ROUGE and BLEU formulate it as an unsupervised matching task, BLEURT and COMET as a supervised regression task, and BEER as a supervised ranking task.

Meanwhile, BARTScore formulates it as a text-generation task. Its idea is to leverage BART's pre-trained contextual embeddings to return a score that measures the faithfulness, precision, recall, or F-score of the main text-generation model's output.

For example, to measure faithfulness, we take the source and the text generated by our model and use BART to calculate the log probability of each generated token given the source; we can then weight those token log probabilities and return their sum.

BARTScore correlates nicely with human scores, and it is relatively simple to implement.
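
A rough sketch of that faithfulness direction with Hugging Face transformers (not the official BARTScore code; the checkpoint and the length normalization are choices I made here):

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_score(source: str, generated: str) -> float:
    """Average log-probability of the generated tokens given the source (higher = more faithful)."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    # out.loss is the mean negative log-likelihood over the target tokens,
    # so its negation is a length-normalized log-probability
    return -out.loss.item()

print(bart_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```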

πŸ“‘ Here is the original BARTScore paper: BARTScore: Evaluating Generated Text as Text Generation (2106.11520)
πŸ§‘β€πŸ’» And the GitHub repo to use this metric: https://github.com/neulab/BARTScore
replied to their post 4 months ago

Here is a colorblind-friendly option :)

posted an update 4 months ago
Some of my results from experimenting with hallucination detection techniques for LLMs πŸ«¨πŸ”

First, the two main ideas used in the experimentsβ€”using token probabilities and LLM-Eval scoresβ€”are taken from these three papers:

1. Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation (2208.05309)
2. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (2303.08896)
3. LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models (2305.13711)

In the first two, the authors claim that computing the average of the sentence-level token probabilities is the best heuristic for detecting hallucinations. And from my results, we do see a weak positive correlation between average token probabilities and ground truth. πŸ€”

The nice thing about this method is that it comes with almost no implementation cost, since we only need the output token probabilities of the generated text.

The third paper proposes an evaluation schema where we do an extra call to an LLM and kindly ask it to rate, on a scale from 0 to 5, how good the generated text is on a set of different criteria. πŸ“πŸ€–
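
The evaluation prompt looks roughly like the sketch below; the wording and the set of criteria here are my paraphrase, not the paper's exact prompt:

```python
EVAL_PROMPT = """Score the following response on a scale from 0 to 5 for each criterion:
appropriateness, content, grammar, and relevance.

Context: {context}
Response: {response}

Answer in the format: appropriateness: X, content: X, grammar: X, relevance: X"""

def parse_scores(llm_answer: str) -> dict[str, int]:
    """Parse the 'criterion: score' pairs returned by the judge LLM."""
    scores = {}
    for part in llm_answer.split(","):
        name, _, value = part.partition(":")
        scores[name.strip().lower()] = int(value.strip())
    return scores
```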

I was able to reproduce similar results to those in the paper. There is a moderate positive correlation between the ground truth scores and the ones produced by the LLM.

Of course, this method is much more expensive since we would need one extra call to the LLM for every prediction that we would like to evaluate, and it is also very sensitive to prompt engineering. 🀷
replied to their post 4 months ago

Yes, of course, I was actually gonna add the explanation as a comment, but I forgot πŸ™ƒ

The idea is that models have confident and less confident areas. The confidence is influenced by the characteristics and distribution of the training data.

In the example above, during testing, the model classifies all data points almost perfectly. And we observe only a small portion of them gathering in the center (the model's less confident area).

However, in production, more and more examples start coming from the conflicted region. A shift like that one will definitely translate into a performance drop.

So, you need monitoring to realize that the model might be underperforming.

The issue is that monitoring performance changes in production is hard because we rarely have ground truth there. The good news is that we could monitor the estimated performance instead!

posted an update 4 months ago
Had a lot of fun making this plot today.

If someone ever asks you why you need ML monitoring, show them this picture πŸ˜‚
posted an update 4 months ago
Pretty novel idea on how to estimate *semantic* uncertainty. πŸ€”

Text generation tasks are challenging because a sentence can be written in multiple ways but still preserve its meaning.

For instance, "France's capital is Paris" means the same as "Paris is France's capital." πŸ‡«πŸ‡·

In uncertainty quantification, we often look at token-level probabilities to quantify how "confident" an LLM is about its output. However, in this paper, the authors look at uncertainty at a meaning level.

Their motivation is that meanings are especially important for LLMs' trustworthiness; a system can be reliable even with many different ways to say the same thing, but answering with inconsistent meanings shows poor reliability.

To estimate semantic uncertainty, they introduce an algorithm for clustering sequences that mean the same thing, based on the principle that two sentences share the same meaning if each one can be inferred from the other (bidirectional entailment). πŸ”„πŸ€

Then, they determine the likelihood of each meaning by summing the probabilities of the sequences that share it, and estimate the semantic entropy over those meaning-level probabilities.
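
A minimal sketch of that clustering-plus-entropy step, assuming an off-the-shelf NLI model for the entailment check and that we already have the K sampled answers with their sequence probabilities:

```python
import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = nli(**inputs).logits.argmax(dim=-1).item()
    return nli.config.id2label[pred].upper() == "ENTAILMENT"

def semantic_entropy(answers: list[str], probs: list[float]) -> float:
    # Greedily cluster answers that entail each other in both directions (same meaning)
    clusters: list[dict] = []
    for ans, p in zip(answers, probs):
        for c in clusters:
            if entails(ans, c["rep"]) and entails(c["rep"], ans):
                c["p"] += p
                break
        else:
            clusters.append({"rep": ans, "p": p})
    # Sum sequence probabilities within each meaning, then compute entropy over meanings
    total = sum(c["p"] for c in clusters)
    return -sum((c["p"] / total) * math.log(c["p"] / total) for c in clusters)
```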

There's a lot more to it, but their results look quite nice when compared with non-semantic approaches.

Paper: Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation (2302.09664)
replied to their post 4 months ago

Ohh that’s so cool! I actually played with the space last week when I was reading the paper. Don’t remember how I found it πŸ€”

posted an update 4 months ago
Confidence *may be* all you need.

A simple average of the log probabilities of the output tokens from an LLM might be all it takes to tell if the model is hallucinating.🫨

The idea is that if a model is not confident (low output token probabilities), the model may be inventing random stuff.

In these two papers:
1. https://aclanthology.org/2023.eacl-main.75/
2. https://arxiv.org/abs/2303.08896

The authors claim that this simple method is the best heuristic for detecting hallucinations. The beauty is that it only uses the generated token probabilities, so it can be implemented at inference time ⚑
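
A minimal sketch, assuming the runtime exposes per-token log probabilities (e.g., via a `logprobs` option or the scores returned by `generate()`); the threshold is a made-up number that would need tuning per model and task:

```python
def avg_logprob(token_logprobs: list[float]) -> float:
    """Average log-probability of the generated tokens (higher = more confident)."""
    return sum(token_logprobs) / len(token_logprobs)

def looks_hallucinated(token_logprobs: list[float], threshold: float = -2.5) -> bool:
    # Low average confidence suggests the model may be inventing things
    return avg_logprob(token_logprobs) < threshold
```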
replied to abhishek's post 5 months ago