Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Synthetic dataset generation techniques: Self-Instruct

3 days ago

• 3

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

11 days ago

• 6

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 17

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 9

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

Jupyter X Hugging Face

Mar 23, 2023

• 2

Image search with 🤗 datasets

Mar 16, 2022

Organizations

davanstrien's activity

upvoted a collection 1 day ago

Arabic DPO Datasets

Collection

Our synthetic DPO datasets for Arabic. • 3 items • Updated 1 day ago • 2

upvoted a paper 2 days ago

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Paper • 2405.09496 • Published 2 days ago • 1

upvoted 3 papers 3 days ago

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published 5 days ago • 51

Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Paper • 2307.03692 • Published Jul 5, 2023 • 24

Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 38

upvoted an article 4 days ago

Article

Introducing the Open Arabic LLM Leaderboard

4 days ago

• 37

upvoted 2 papers 10 days ago

Typhoon: Thai Large Language Models

Paper • 2312.13951 • Published Dec 21, 2023 • 4

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Paper • 2405.04086 • Published 11 days ago • 1

upvoted 2 papers 12 days ago

Aloe: A Family of Fine-tuned Open Healthcare LLMs

Paper • 2405.01886 • Published 15 days ago • 2

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Paper • 2306.08568 • Published Jun 14, 2023 • 27

upvoted an article 15 days ago

Article

🧑‍⚖️ "Replacing Judges with Juries" using distilabel

15 days ago

• 14

upvoted a collection 15 days ago

Domain Specific Data

Collection

This is a collection of tools for building domain specific datasets using human domain expertise and synthetic data generation. • 3 items • Updated Apr 15 • 2

upvoted a paper 16 days ago

HistNERo: Historical Named Entity Recognition for the Romanian Language

Paper • 2405.00155 • Published 17 days ago • 2

upvoted a paper 18 days ago

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Paper • 2404.18796 • Published 19 days ago • 62

upvoted an article 18 days ago

Article

Jupyter X Hugging Face

Mar 23, 2023

• 2

upvoted an article 19 days ago

Article

⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together

19 days ago

• 25

upvoted 2 papers 19 days ago

Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer

Paper • 2404.16506 • Published 23 days ago • 1

Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Paper • 2404.17342 • Published 22 days ago • 1

upvoted 2 articles 22 days ago

Article

Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM

22 days ago

• 10

Article

🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets

21 days ago

• 54

upvoted an article 24 days ago

Article

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

upvoted a paper 24 days ago

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published 26 days ago • 230

upvoted an article 30 days ago

Article

Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

30 days ago

• 20

upvoted a paper about 1 month ago

nEMO: Dataset of Emotional Speech in Polish

Paper • 2404.06292 • Published Apr 9 • 1

upvoted 4 articles about 1 month ago

Article

Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B

Apr 4

• 20

Article

Deploying 🤗 Hub models in Vertex AI

Feb 27

• 3

Article

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Mar 15

• 4

Article

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 17

upvoted a paper about 1 month ago

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Paper • 2404.03543 • Published Apr 4 • 15

upvoted a collection about 2 months ago

boulderspot

Collection

find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3

upvoted 3 papers about 2 months ago

ORPO: Monolithic Preference Optimization without Reference Model

Paper • 2403.07691 • Published Mar 12 • 54

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Paper • 2403.17859 • Published Mar 26 • 2

QuRating: Selecting High-Quality Data for Training Language Models

Paper • 2402.09739 • Published Feb 15 • 3

upvoted 2 collections about 2 months ago

Preference Datasets for KTO

Collection

This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10

Common Corpus

Collection

The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 102

upvoted a paper 2 months ago

KTO: Model Alignment as Prospect Theoretic Optimization

Paper • 2402.01306 • Published Feb 2 • 11

upvoted 3 collections 2 months ago

DIBT Prompt Collective Outputs

Collection

An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1

2024 Paper Reading Sessions

Collection

Contains some papers that we are reading through internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2

DIBT Prompt collective SPIN

Collection

This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7

upvoted a paper 2 months ago

SaulLM-7B: A pioneering Large Language Model for Law

Paper • 2403.03883 • Published Mar 6 • 65

upvoted a collection 2 months ago

UDOP

Collection

UDOP is a general multimodal model for document AI • 4 items • Updated 9 days ago • 20

upvoted 2 papers 2 months ago

Major TOM: Expandable Datasets for Earth Observation

Paper • 2402.12095 • Published Feb 19 • 8

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

Paper • 2403.00231 • Published Mar 1 • 1

upvoted 10 papers 3 months ago

NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images

Paper • 2402.18196 • Published Feb 28 • 1

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Paper • 2312.01552 • Published Dec 4, 2023 • 26

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Paper • 2402.14658 • Published Feb 22 • 77

A Survey on Data Selection for LLM Instruction Tuning

Paper • 2402.05123 • Published Feb 4 • 3

Airavata: Introducing Hindi Instruction-tuned LLM

Paper • 2401.15006 • Published Jan 26 • 3

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

Paper • 2402.02318 • Published Feb 4 • 2

DsDm: Model-Aware Dataset Selection with Datamodels

Paper • 2401.12926 • Published Jan 23 • 2

SelectLLM: Can LLMs Select Important Instructions to Annotate?

Paper • 2401.16553 • Published Jan 29 • 3

LESS: Selecting Influential Data for Targeted Instruction Tuning

Paper • 2402.04333 • Published Feb 6 • 3

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

Paper • 2402.04833 • Published Feb 7 • 6

upvoted 7 collections 3 months ago

WebLINX

Collection

https://mcgill-nlp.github.io/weblinx • 10 items • Updated 22 days ago • 4

LLM as a Judge

Collection

Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs. • 15 items • Updated 1 day ago • 15

Information Extraction Datasets

Collection

Collection of datasets for various information extraction tasks. • 3 items • Updated Feb 9 • 5

datasets-SPIN

Collection

Generated synthetic data used to finetune SPIN. • 8 items • Updated Feb 9 • 10

Model Merging

Collection

Model merging is a very popular technique for LLMs nowadays. Here is a chronological list of papers in this space that will help you get started with it! • 28 items • Updated Mar 23 • 180

LLM Leaderboard best models ❤️‍🔥

Collection

A daily uploaded list of models with best evaluations on the LLM leaderboard: • 70 items • Updated 1 day ago • 304

Qwen1.5

Collection

Qwen1.5 is the improved version of Qwen, the large language model series developed by Alibaba Cloud. • 55 items • Updated 5 days ago • 169

Articles

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!
