Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

updated a dataset about 11 hours ago

nthakur/bge-retrieval-data-gpt4o-7-datasets-680K-removed

published a dataset about 11 hours ago

nthakur/bge-retrieval-data-gpt4o-7-datasets-680K-removed

updated a dataset about 12 hours ago

nthakur/bge-retrieval-data-gpt4o-7-datasets-680K-replaced

Posts 2

Post

1546

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope they help the community with pretraining and instruction-tuning multilingual LLMs! I added a small diagram that briefly shows which datasets are included and their sources.

Happy to collaborate with anyone who wants to use these datasets for instruction fine-tuning or to extend the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

Post

3452

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7

Collections 5

Papers 12

arxiv:2502.13595

arxiv:2410.13716

arxiv:2406.16828

arxiv:2312.11361

Models 34

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft-teacher-mixtral

Updated 10 days ago • 3

nthakur/Meta-Llama-3-8B-Instruct-mirage-bench-sft

Updated 10 days ago • 9

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft

Updated 10 days ago • 6

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2

Updated Aug 23, 2024 • 1

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-final

Updated Aug 13, 2024

nthakur/Meta-Llama-3-8B-Instruct-mirage-all-teacher-instruct-llama-3-sft

Updated Aug 13, 2024 • 1

nthakur/Mistral-7B-Instruct-v0.2-mirage-all-teacher-instruct-mistral-sft

Updated Aug 13, 2024 • 3

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0

Updated Aug 12, 2024

nthakur/Mistral-7B-Instruct-v0.2-multilingual-deita-10k-v0-sft-v0.1

Updated Aug 12, 2024 • 2

nthakur/Meta-Llama-3-8B-Instruct-mirage-bench-sft-teacher-llama-3

Updated Aug 10, 2024 • 2

Datasets 106

nthakur/bge-retrieval-data-gpt4o-7-datasets-680K-removed

Viewer • Updated about 11 hours ago • 324k

nthakur/bge-retrieval-data-gpt4o-7-datasets-680K-replaced

Viewer • Updated about 12 hours ago • 649k • 8

nthakur/bge-retrieval-data-7-datasets-680K-replaced

Viewer • Updated about 18 hours ago • 675k • 24

nthakur/bge-retrieval-data-7-datasets-680K-removed

Viewer • Updated about 18 hours ago • 554k • 10

nthakur/bge-retrieval-data-gpt4o-7-datasets-400K-removed

Viewer • Updated 3 days ago • 248k • 13

nthakur/bge-retrieval-data-gpt4o-7-datasets-400K-replaced

Viewer • Updated 3 days ago • 390k • 27

nthakur/bge-retrieval-data-gpt4o-7-datasets-250K-removed

Viewer • Updated 3 days ago • 151k • 12

nthakur/bge-retrieval-data-gpt4o-7-datasets-250K-replaced

Viewer • Updated 3 days ago • 248k • 20

nthakur/bge-retrieval-data-gpt4o-7-datasets-100K-removed

Viewer • Updated 3 days ago • 61k • 11

nthakur/bge-retrieval-data-gpt4o-7-datasets-100K-replaced

Viewer • Updated 3 days ago • 93.6k • 21