Where I work, we are obsessed with what happens to a model's performance after it has been deployed. We call this post-deployment data science.
Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.
How can we detect Concept Drift? 🤔
All ML models are designed to do one thing: learn a probability distribution of the form P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'. 🧠
This probability distribution, P(y|X), is also called the Concept. Therefore, if the Concept changes, the model may become invalid.
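If you prefer code to notation, here is a tiny sketch of what learning P(y|X) means in practice. The classifier and dataset below are arbitrary choices made purely for illustration (scikit-learn is used for convenience); they are not part of the algorithm described later.

```python
# Illustrative only: a trained classifier is essentially an estimate of the Concept P(y|X).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# predict_proba returns the model's estimate of P(y | X) for each row of X
print(model.predict_proba(X[:3]))
```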
❓ But how do we know if there is a new Concept in our data?
❓ Or, more importantly, how do we measure whether the new Concept is affecting the model's performance?
💡 We came up with a clever solution with two main ingredients: a reference dataset, where the model's performance is known, and a dataset with the latest data we would like to monitor.
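To make these two ingredients concrete, here is roughly what they could look like. The column names (feature_1, feature_2, y_pred, y_true) are an assumed layout for this sketch, not a required schema; note that the latest data needs targets too, since a new Concept P(y|X) can only be learned from them.

```python
import pandas as pd

# Reference dataset: features, the monitored model's predictions, and known targets,
# so the model's performance here is known.
reference = pd.DataFrame({
    "feature_1": [0.2, 1.5, -0.3, 0.8],
    "feature_2": [1.1, -0.7, 0.4, 2.0],
    "y_pred":    [0, 1, 0, 1],   # monitored model's predictions
    "y_true":    [0, 1, 1, 1],   # ground truth
})

# Monitoring chunk: the latest production data we want to check for Concept Drift.
chunk = pd.DataFrame({
    "feature_1": [0.1, 2.2, -0.9],
    "feature_2": [0.5, -1.3, 1.8],
    "y_pred":    [0, 1, 0],
    "y_true":    [1, 1, 0],      # targets for the latest data
})
```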
📣 Step-by-Step solution (sketched in code at the end of this post):
1️⃣ We start by training an internal model on a chunk of the latest data. ➡️ This allows us to learn the possible new Concept present in the data.
2️⃣ Next, we use the internal model to make predictions on the reference dataset.
3️⃣ We then estimate the monitored model's performance on the reference dataset, treating the internal model's predictions (which encode the Concept learned from the latest data) as ground truth.
4️⃣ If the estimated performance and the monitored model's actual performance on the reference dataset are very different, we say that there has been Concept Drift.
To quantify how the new Concept impacts performance, we subtract the monitored model's actual performance on the reference dataset from the estimated performance and report the delta of the performance metric. ➡️ This is what the plot below shows: the change in F1-score due to Concept Drift! 🚨
This process is repeated for every new chunk of data that we get. 🔁
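Here is a minimal, self-contained sketch of the whole procedure, reusing the column layout from the earlier snippet. To be clear, this is not the production implementation: the choice of internal model (a gradient-boosted classifier), the synthetic data, and every name in it are assumptions made for illustration; only the F1-score follows the metric mentioned above.

```python
# A minimal sketch of the procedure above -- illustrative, not the production algorithm.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

FEATURES = ["feature_1", "feature_2"]


def concept_drift_impact(reference: pd.DataFrame, chunk: pd.DataFrame) -> float:
    """Return the estimated change in F1-score caused by the Concept in `chunk`."""
    # 1) Train an internal model on the latest chunk to learn the (possibly new) Concept.
    internal = GradientBoostingClassifier(random_state=0)
    internal.fit(chunk[FEATURES], chunk["y_true"])

    # 2) Use the internal model to make predictions on the reference dataset.
    concept_labels = internal.predict(reference[FEATURES])

    # 3) Estimate the monitored model's performance on reference,
    #    treating the internal model's predictions as ground truth.
    estimated_f1 = f1_score(concept_labels, reference["y_pred"])

    # 4) Compare with the monitored model's actual performance on reference.
    actual_f1 = f1_score(reference["y_true"], reference["y_pred"])

    # Delta of the performance metric = estimated impact of Concept Drift.
    return estimated_f1 - actual_f1


# --- Tiny synthetic demo: repeat the process for every new chunk of data ---
rng = np.random.default_rng(0)


def make_frame(n_rows: int, threshold: float = 0.0) -> pd.DataFrame:
    X = rng.normal(size=(n_rows, 2))
    y_true = (X[:, 0] + X[:, 1] > threshold).astype(int)  # the current Concept P(y|X)
    y_pred = (X[:, 0] + X[:, 1] > 0.0).astype(int)        # monitored model learned the old Concept
    return pd.DataFrame({"feature_1": X[:, 0], "feature_2": X[:, 1],
                         "y_pred": y_pred, "y_true": y_true})


reference = make_frame(2_000)                       # old Concept, performance known
for threshold in [0.0, 0.0, 1.0]:                   # the last chunk carries a new Concept
    chunk = make_frame(500, threshold=threshold)
    print(f"F1 delta: {concept_drift_impact(reference, chunk):+.3f}")
```

Because both F1 numbers are computed on the very same reference inputs, the input distribution is held fixed and only the Concept differs, so the reported delta can be attributed to Concept Drift rather than to a change in the data itself.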