Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

reacted to clem's post with 🔥 3 days ago

Before 2020, most of the AI field was open and collaborative. For me, that was the key factor that accelerated scientific progress and made the impossible possible—just look at the “T” in ChatGPT, which comes from the Transformer architecture openly shared by Google. Then came the myth that AI was too dangerous to share, and companies started optimizing for short-term revenue. That led many major AI labs and researchers to stop sharing and collaborating. With OAI and sama now saying they're willing to share open weights again, we have a real chance to return to a golden age of AI progress and democratization—powered by openness and collaboration, in the US and around the world. This is incredibly exciting. Let’s go, open science and open-source AI!

reacted to their post with 🔥 3 days ago

Last year, I curated and generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 model. I hope they help the community with pretraining and instruction tuning of multilingual LLMs! I added a small diagram that briefly describes which datasets are included and their sources. Happy to collaborate, either on using these datasets for instruction fine-tuning or on extending the translations to newer English SFT/DPO datasets! https://huggingface.co/collections/nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74
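The translation step described in the post above could look roughly like the sketch below. It is illustrative only, not the pipeline that produced these datasets: the prompt wording, the `prompt`/`chosen`/`rejected` field names, and the `generate` callable (a stand-in for a call to an instruction-tuned model such as Mistral-7B-Instruct-v0.2) are all assumptions.

```python
from typing import Callable, Dict

# Hypothetical prompt wording; the post does not say which prompt was used.
PROMPT_TEMPLATE = (
    "Translate the following text into {language}. "
    "Return only the translation.\n\n{text}"
)

# Assumed DPO-style field names; real datasets may use different columns.
DPO_FIELDS = ("prompt", "chosen", "rejected")


def build_prompt(text: str, language: str) -> str:
    """Wrap one English text in a translation instruction."""
    return PROMPT_TEMPLATE.format(language=language, text=text)


def translate_dpo_row(
    row: Dict[str, str],
    language: str,
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Translate every text field of one DPO example.

    `generate` is any function mapping a prompt string to model output,
    so the model backend stays swappable (and mockable in tests).
    """
    return {
        field: generate(build_prompt(row[field], language))
        for field in DPO_FIELDS
    }
```

Passing the model call in as a function keeps the row-level logic independent of how the model is served, so the same code can run against a local model, a hosted endpoint, or a stub.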

updated a collection 3 days ago

🏜️MIRAGE-Bench [NAACL'25]

nthakur's activity

liked 5 datasets 3 days ago

liked a dataset 22 days ago

habedi/stack-exchange-dataset

Viewer • Updated Nov 29, 2023 • 82.2k • 344 • 6

liked a dataset 25 days ago

nthakur/bge-retrieval-data

Viewer • Updated 22 days ago • 680k • 332 • 1

liked 5 models about 1 month ago

facebook/drama-base

Qwen/Qwen2.5-7B

Text Generation • Updated Sep 25, 2024 • 419k • 162

intfloat/e5-mistral-7b-instruct

Feature Extraction • Updated Apr 23, 2024 • 164k • 504

Qwen/Qwen2.5-3B

Text Generation • Updated Sep 20, 2024 • 294k • 97

nthakur/contriever-base-msmarco

liked a dataset 2 months ago

nthakur/bge-full-data

Viewer • Updated Feb 4 • 1.6M • 209 • 1

liked 2 models 2 months ago

meta-llama/Llama-3.2-1B

Text Generation • Updated Oct 24, 2024 • 3.42M • 1.78k

Alibaba-NLP/gte-modernbert-base

liked a dataset 2 months ago

cfli/bge-full-data

Updated Oct 11, 2024 • 541 • 34

liked a dataset 5 months ago

google/frames-benchmark

Viewer • Updated Oct 15, 2024 • 824 • 1.95k • 194

liked a dataset 7 months ago

princeton-nlp/SWE-bench

Viewer • Updated Mar 3 • 21.5k • 53.1k • 108

liked a model 8 months ago

BAAI/bge-reranker-v2-m3

Text Classification • Updated Jun 24, 2024 • 1.17M • 587

liked a dataset 8 months ago

argilla/distilabel-intel-orca-dpo-pairs

Viewer • Updated 16 days ago • 12.9k • 4.13k • 172