Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Synthetic dataset generation techniques: generating custom sentence similarity data

about 8 hours ago

• 7

Synthetic dataset generation techniques: Self-Instruct

8 days ago

• 5

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

16 days ago

• 6

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 22

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 9

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

Jupyter X Hugging Face

Mar 23, 2023

• 2

Image search with 🤗 datasets

Mar 16, 2022

Organizations

davanstrien's activity

upvoted a paper 1 day ago

Retrieving Texts based on Abstract Descriptions

Paper • 2305.12517 • Published May 21, 2023 • 1

upvoted an article 1 day ago

Article

Synthetic data: save money, time and carbon with open source

Feb 16

• 27

upvoted a paper 2 days ago

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Paper • 2405.11143 • Published 4 days ago • 27

upvoted a collection 3 days ago

Phi-3

Collection

Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 20 items • Updated 1 day ago • 266

upvoted 2 papers 5 days ago

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published 10 days ago • 14

LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published 8 days ago • 65

upvoted a collection 7 days ago

Arabic DPO Datasets

Collection

Our synthetic DPO datasets for Arabic. • 3 items • Updated 7 days ago • 2

upvoted a paper 7 days ago

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Paper • 2405.09496 • Published 8 days ago • 3

upvoted a paper 8 days ago

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published 10 days ago • 55

upvoted 2 papers 9 days ago

Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Paper • 2307.03692 • Published Jul 5, 2023 • 24

Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 38

upvoted an article 9 days ago

Article

Introducing the Open Arabic LLM Leaderboard

10 days ago

• 45

upvoted 2 papers 15 days ago

Typhoon: Thai Large Language Models

Paper • 2312.13951 • Published Dec 21, 2023 • 4

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Paper • 2405.04086 • Published 16 days ago • 1

upvoted a paper 17 days ago

Aloe: A Family of Fine-tuned Open Healthcare LLMs

Paper • 2405.01886 • Published 20 days ago • 2

upvoted a paper 18 days ago

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Paper • 2306.08568 • Published Jun 14, 2023 • 27

upvoted an article 20 days ago

Article

🧑‍⚖️ "Replacing Judges with Juries" using distilabel

20 days ago

• 14

upvoted a collection 20 days ago

Domain Specific Data

Collection

This is a collection of tools for building domain specific datasets using human domain expertise and synthetic data generation. • 3 items • Updated Apr 15 • 2

upvoted a paper 21 days ago

HistNERo: Historical Named Entity Recognition for the Romanian Language

Paper • 2405.00155 • Published 23 days ago • 3

upvoted a paper 23 days ago

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Paper • 2404.18796 • Published 24 days ago • 63

upvoted an article 23 days ago

Article

Jupyter X Hugging Face

Mar 23, 2023

• 2

upvoted an article 24 days ago

Article

⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together

24 days ago

• 26

upvoted 2 papers 24 days ago

Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer

Paper • 2404.16506 • Published 28 days ago • 1

Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Paper • 2404.17342 • Published 27 days ago • 1

upvoted 2 articles 27 days ago

Article

Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM

27 days ago

• 10

Article

🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets

27 days ago

• 55

upvoted an article 29 days ago

Article

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

upvoted a paper 30 days ago

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published Apr 22 • 235

upvoted an article about 1 month ago

Article

Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

Apr 18

• 20

upvoted a paper about 1 month ago

nEMO: Dataset of Emotional Speech in Polish

Paper • 2404.06292 • Published Apr 9 • 1

upvoted an article about 1 month ago

Article

Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B

Apr 4

• 20

upvoted 3 articles about 2 months ago

Article

Deploying 🤗 Hub models in Vertex AI

Feb 27

• 3

Article

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Mar 15

• 5

Article

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 22

upvoted a paper about 2 months ago

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Paper • 2404.03543 • Published Apr 4 • 15

upvoted a collection about 2 months ago

boulderspot

Collection

find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3

upvoted 3 papers about 2 months ago

ORPO: Monolithic Preference Optimization without Reference Model

Paper • 2403.07691 • Published Mar 12 • 57

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Paper • 2403.17859 • Published Mar 26 • 2

QuRating: Selecting High-Quality Data for Training Language Models

Paper • 2402.09739 • Published Feb 15 • 3

upvoted 2 collections 2 months ago

Preference Datasets for KTO

Collection

This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10

Common Corpus

Collection

The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 103

upvoted a paper 2 months ago

KTO: Model Alignment as Prospect Theoretic Optimization

Paper • 2402.01306 • Published Feb 2 • 11

upvoted 3 collections 2 months ago

DIBT Prompt Collective Outputs

Collection

An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1

2024 Paper Reading Sessions

Collection

Contains some papers that we are reading through internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2

DIBT Prompt collective SPIN

Collection

This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7

upvoted a paper 3 months ago

SaulLM-7B: A pioneering Large Language Model for Law

Paper • 2403.03883 • Published Mar 6 • 66

upvoted a collection 3 months ago

UDOP

Collection

UDOP is a general multimodal model for document AI • 4 items • Updated 1 day ago • 20

upvoted 12 papers 3 months ago

Major TOM: Expandable Datasets for Earth Observation

Paper • 2402.12095 • Published Feb 19 • 8

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

Paper • 2403.00231 • Published Mar 1 • 1

NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images

Paper • 2402.18196 • Published Feb 28 • 1

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Paper • 2312.01552 • Published Dec 4, 2023 • 26

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Paper • 2402.14658 • Published Feb 22 • 77

A Survey on Data Selection for LLM Instruction Tuning

Paper • 2402.05123 • Published Feb 4 • 3

Airavata: Introducing Hindi Instruction-tuned LLM

Paper • 2401.15006 • Published Jan 26 • 3

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

Paper • 2402.02318 • Published Feb 4 • 2

DsDm: Model-Aware Dataset Selection with Datamodels

Paper • 2401.12926 • Published Jan 23 • 2

SelectLLM: Can LLMs Select Important Instructions to Annotate?

Paper • 2401.16553 • Published Jan 29 • 3

LESS: Selecting Influential Data for Targeted Instruction Tuning

Paper • 2402.04333 • Published Feb 6 • 3

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

Paper • 2402.04833 • Published Feb 7 • 6

upvoted a collection 3 months ago

WebLINX

Collection

https://mcgill-nlp.github.io/weblinx • 10 items • Updated 28 days ago • 4

Articles

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!
