Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Synthetic dataset generation techniques: generating custom sentence similarity data

11 days ago

• 11

Synthetic dataset generation techniques: Self-Instruct

19 days ago

• 5

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

27 days ago

• 6

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 24

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 11

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

• 1

Jupyter X Hugging Face

Mar 23, 2023

• 2

Image search with 🤗 datasets

Mar 16, 2022

• 1

Organizations

davanstrien's activity

upvoted an article 2 days ago

Article

Training and Finetuning Embedding Models with Sentence Transformers v3

6 days ago

• 63

upvoted a collection 3 days ago

Arabic NoRobots DPO Datasets

Collection

Our synthetic DPO datasets for Arabic NoRobots. • 4 items • Updated 5 days ago • 3

upvoted a paper 3 days ago

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

Paper • 2403.02839 • Published Mar 5 • 1

upvoted an article 4 days ago

Article

⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2

5 days ago

• 20

upvoted a collection 5 days ago

sentence-transformers-from-synthetic-data

Collection

Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 3 items • Updated 3 days ago • 15

upvoted a paper 11 days ago

Retrieving Texts based on Abstract Descriptions

Paper • 2305.12517 • Published May 21, 2023 • 2

upvoted an article 12 days ago

Article

Synthetic data: save money, time and carbon with open source

Feb 16

• 29

upvoted a paper 13 days ago

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Paper • 2405.11143 • Published 14 days ago • 33

upvoted a collection 14 days ago

Phi-3

Collection

Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated 3 days ago • 302

upvoted 2 papers 15 days ago

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published 21 days ago • 14

LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published 18 days ago • 73

upvoted a collection 17 days ago

Arabic Aya DPO Datasets

Collection

Our synthetic DPO datasets for Arabic Aya. • 4 items • Updated 5 days ago • 3

upvoted a paper 18 days ago

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Paper • 2405.09496 • Published 18 days ago • 3

upvoted 3 papers 19 days ago

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published 21 days ago • 60

Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Paper • 2307.03692 • Published Jul 5, 2023 • 24

Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 38

upvoted an article 20 days ago

Article

Introducing the Open Arabic LLM Leaderboard

20 days ago

• 47

upvoted 2 papers 26 days ago

Typhoon: Thai Large Language Models

Paper • 2312.13951 • Published Dec 21, 2023 • 4

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Paper • 2405.04086 • Published 27 days ago • 1

upvoted 2 papers 28 days ago

Aloe: A Family of Fine-tuned Open Healthcare LLMs

Paper • 2405.01886 • Published about 1 month ago • 2

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Paper • 2306.08568 • Published Jun 14, 2023 • 27

upvoted an article about 1 month ago

Article

🧑‍⚖️ "Replacing Judges with Juries" using distilabel

about 1 month ago

• 14

upvoted a collection about 1 month ago

Domain Specific Data

Collection

This is a collection of tools for building domain specific datasets using human domain expertise and synthetic data generation. • 3 items • Updated Apr 15 • 2

upvoted 2 papers about 1 month ago

HistNERo: Historical Named Entity Recognition for the Romanian Language

Paper • 2405.00155 • Published Apr 30 • 4

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Paper • 2404.18796 • Published Apr 29 • 66

upvoted 2 articles about 1 month ago

Article

Jupyter X Hugging Face

Mar 23, 2023

• 2

Article

⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together

Apr 29

• 27

upvoted 2 papers about 1 month ago

Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer

Paper • 2404.16506 • Published Apr 25 • 1

Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Paper • 2404.17342 • Published Apr 26 • 1

upvoted 3 articles about 1 month ago

Article

Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM

Apr 26

• 10

Article

🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets

Apr 26

• 55

Article

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

upvoted a paper about 1 month ago

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published Apr 22 • 239

upvoted an article about 2 months ago

Article

Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

Apr 18

• 20

upvoted a paper about 2 months ago

nEMO: Dataset of Emotional Speech in Polish

Paper • 2404.06292 • Published Apr 9 • 1

upvoted 4 articles about 2 months ago

Article

Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B

Apr 4

• 20

Article

Deploying 🤗 Hub models in Vertex AI

Feb 27

• 3

Article

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Mar 15

• 5

Article

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 24

upvoted a paper about 2 months ago

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Paper • 2404.03543 • Published Apr 4 • 15

upvoted a collection 2 months ago

boulderspot

Collection

find places to climb outside from aerial imagery • 4 items • Updated Apr 1 • 3

upvoted 3 papers 2 months ago

ORPO: Monolithic Preference Optimization without Reference Model

Paper • 2403.07691 • Published Mar 12 • 58

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Paper • 2403.17859 • Published Mar 26 • 2

QuRating: Selecting High-Quality Data for Training Language Models

Paper • 2402.09739 • Published Feb 15 • 3

upvoted 2 collections 2 months ago

Preference Datasets for KTO

Collection

This collection contains a list of curated preference datasets for KTO fine-tuning for intent alignment of LLMs through signals. • 5 items • Updated Mar 19 • 10

Common Corpus

Collection

The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 103

upvoted a paper 3 months ago

KTO: Model Alignment as Prospect Theoretic Optimization

Paper • 2402.01306 • Published Feb 2 • 11

upvoted 3 collections 3 months ago

DIBT Prompt Collective Outputs

Collection

An overview of some of the outputs from the first Data is Better Together task focused on rating the quality of prompts • 5 items • Updated Mar 12 • 1

2024 Paper Reading Sessions

Collection

Contains some papers that we are reading through internally hosted paper reading sessions. Topics: LLMs, RLHF, synthetic data, benchmarking, etc. • 2 items • Updated Jan 8 • 2

DIBT Prompt collective SPIN

Collection

This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7

upvoted a paper 3 months ago

SaulLM-7B: A pioneering Large Language Model for Law

Paper • 2403.03883 • Published Mar 6 • 67

upvoted a collection 3 months ago

UDOP

Collection

UDOP is a general multimodal model for document AI • 4 items • Updated 12 days ago • 20

upvoted 6 papers 3 months ago

Major TOM: Expandable Datasets for Earth Observation

Paper • 2402.12095 • Published Feb 19 • 8

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

Paper • 2403.00231 • Published Mar 1 • 1

NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images

Paper • 2402.18196 • Published Feb 28 • 1

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Paper • 2312.01552 • Published Dec 4, 2023 • 26

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Paper • 2402.14658 • Published Feb 22 • 78

A Survey on Data Selection for LLM Instruction Tuning

Paper • 2402.05123 • Published Feb 4 • 3

upvoted 2 papers 4 months ago

Airavata: Introducing Hindi Instruction-tuned LLM

Paper • 2401.15006 • Published Jan 26 • 3

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

Paper • 2402.02318 • Published Feb 4 • 2

Articles

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!
