joy larkin

joylarkin

https://cleverhack.com/2026

AI & ML interests

Global AI, Multilingual AI, European AI, Superintelligence, AGI, ASI, LLMs, World Models ••• AI Marketing/Comms, GTM, Ecosystem, Community

Recent Activity

updated a dataset 29 days ago

joylarkin/AI-Coding-Models

updated a dataset 29 days ago

joylarkin/AI-Coding-Tools

updated a dataset 30 days ago

joylarkin/openclaw-security-news


Posts 2

Post

2933

💬 Chat as a way to query SQL! The Airtrain AI team is happy to share a new Hugging Face Space that lets you interact with Hugging Face Hub datasets using a natural language chatbot. 🤗

Start Exploring 👉 airtrain-ai/hf-dataset-chat-to-sql

This Space is forked from davidberenstein1957/text-to-sql-hub-datasets by @davidberenstein1957 and features chat capability with improved table naming. The tool works with Hugging Face’s recently released in-browser DuckDB-based SQL query engine for datasets.
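The post does not say how the table naming was improved, so the following is only an illustrative sketch of the general idea: turning a Hub dataset ID into a SQL-safe table name that a generated query can reference. The helper `table_name_for` is hypothetical, the toy columns are made up, and Python's built-in `sqlite3` stands in for the DuckDB engine purely so the example runs without extra dependencies.

```python
import re
import sqlite3


def table_name_for(dataset_id):
    """Hypothetical helper: map a Hub dataset ID to a SQL-safe table name."""
    # Replace every run of non-alphanumeric characters with one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", dataset_id).strip("_").lower()
    # SQL identifiers should not be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "t_" + name
    return name


table = table_name_for("airtrain-ai/fineweb-edu-fortified")

# sqlite3 is a stand-in here; the Space itself uses a DuckDB-based engine.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {table} (text TEXT, count INTEGER)")
conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", [("a", 3), ("b", 1)])

# A chatbot would generate a query like this from a natural language question.
total = conn.execute(f"SELECT SUM(count) FROM {table}").fetchone()[0]
```

A readable, predictable table name matters because the language model has to reference it correctly in every generated query.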

Post

3370

Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚

This dataset is tailored for NLP tasks and helps streamline model training by offering a more refined, deduplicated dataset. It is well suited to startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the larger Fineweb dataset and includes:

- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- Count column indicating duplication frequency
- Data from 95 Common Crawl crawls (2013-2024)
- Row count reduced from 1.279B to 0.324B after deduplication
- ~375B tokens (down from 1,320B in Fineweb-Edu)
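The exact-match deduplication and count column listed above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the function name `dedupe_exact`, the `text` and `crawl` field names, and the toy rows are assumptions, and `count` is assumed to mean the total number of occurrences of a row's text.

```python
def dedupe_exact(rows):
    """Keep the first row for each distinct text and count its occurrences."""
    seen = {}  # dicts preserve insertion order, so first occurrences stay in order
    for row in rows:
        text = row["text"]
        if text in seen:
            # Exact duplicate: bump the counter instead of keeping the row.
            seen[text]["count"] += 1
        else:
            seen[text] = {**row, "count": 1}
    return list(seen.values())


# Toy rows standing in for documents from different crawls (hypothetical data).
rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "crawl": "crawl-a"},
    {"text": "Water boils at 100 C at sea level.", "crawl": "crawl-b"},
    {"text": "Photosynthesis converts light into chemical energy.", "crawl": "crawl-c"},
]

deduped = dedupe_exact(rows)
```

At the scale described in the post, an in-memory dictionary would not be practical; a real pipeline would more likely hash each text and deduplicate in a distributed system.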

Access the entire Fineweb-Edu-Fortified dataset on Hugging Face → airtrain-ai/fineweb-edu-fortified

Try a semantic search demo via this Hugging Face Space → airtrain-ai/fineweb-edu-fortified-search-demo
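Semantic search over per-row embeddings generally ranks rows by vector similarity to the query. The sketch below shows that idea with cosine similarity on tiny hand-written vectors. It makes no claim about how the demo Space is implemented; the functions `cosine_similarity` and `search` and the three-dimensional vectors are assumptions for illustration, whereas real embeddings would come from a model such as TaylorAI/bge-micro.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, corpus, top_k=2):
    """Return the texts of the top_k corpus rows most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Hypothetical (text, embedding) pairs; real embeddings have far more dimensions.
corpus = [
    ("plant biology", [0.9, 0.1, 0.0]),
    ("world history", [0.0, 0.2, 0.9]),
    ("cell chemistry", [0.7, 0.6, 0.1]),
]

results = search([1.0, 0.2, 0.0], corpus, top_k=2)
```

A brute-force scan like this is fine for a handful of rows; hundreds of millions of rows would call for an approximate nearest-neighbor index.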

Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗


Collections 5

The ATOM Report: Measuring the Open Language Model Ecosystem

Deep Research Agents: A Systematic Examination And Roadmap

VCBench: Benchmarking LLMs in Venture Capital

models 0

datasets 5

joylarkin/AI-Coding-Models

joylarkin/AI-Coding-Tools

joylarkin/openclaw-security-news

joylarkin/2026AIMarketMaps

joylarkin/cleverhack-llms-txt
