
Nandan Thakur

nthakur


https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

reacted to clem's post with 🔥 3 days ago

Before 2020, most of the AI field was open and collaborative. For me, that was the key factor that accelerated scientific progress and made the impossible possible—just look at the “T” in ChatGPT, which comes from the Transformer architecture openly shared by Google. Then came the myth that AI was too dangerous to share, and companies started optimizing for short-term revenue. That led many major AI labs and researchers to stop sharing and collaborating. With OAI and sama now saying they're willing to share open weights again, we have a real chance to return to a golden age of AI progress and democratization—powered by openness and collaboration, in the US and around the world. This is incredibly exciting. Let’s go, open science and open-source AI!

reacted to their post with 🔥 3 days ago

Last year, I curated and generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 model. I hope they help the community with pretraining and instruction tuning of multilingual LLMs! I added a small diagram that briefly describes which datasets are included and their sources. Happy to collaborate with anyone who wants to use these datasets for instruction fine-tuning, or who wishes to extend the collection with translated versions of newer English SFT/DPO datasets! https://huggingface.co/collections/nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

updated a collection 3 days ago

🏜️MIRAGE-Bench [NAACL'25]



nthakur's activity

upvoted 2 collections 10 days ago

🌐 NoMIRACL Dataset [EMNLP'24]

A collection of multilingual relevance assessment datasets. We also have SFT fine-tuned models (Mistral-7B & Llama-3 8B) • 7 items • Updated 3 days ago • 1

🏜️MIRAGE-Bench [NAACL'25]

Dataset Collection from the MIRAGE-Bench paper • 13 items • Updated 3 days ago • 1

upvoted a collection about 1 month ago

DRAMA

A collection of small (sub-1B) multilingual dense retrievers that generalize well across a number of tasks and languages. • 3 items • Updated Feb 26 • 5

upvoted a paper 2 months ago

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper • 2501.12948 • Published Jan 22 • 368

upvoted a paper 4 months ago

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

Paper • 2312.11361 • Published Dec 18, 2023 • 1

upvoted a paper 10 months ago

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Paper • 2406.01574 • Published Jun 3, 2024 • 46

upvoted an article 10 months ago

Training and Finetuning Embedding Models with Sentence Transformers v3

Article • May 28, 2024 • 205

upvoted a collection 11 months ago

🦢SWIM-IR Dataset [NAACL'24]

29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs. • 4 items • Updated 3 days ago • 7

upvoted a paper 11 months ago

Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

Paper • 2311.05800 • Published Nov 10, 2023 • 3

upvoted a paper about 1 year ago

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Paper • 2104.08663 • Published Apr 17, 2021 • 3