---
title: README
emoji: 🍫
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---
<p align="center">
<img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" alt="Fondant banner" height="200">
<i>Large-scale data processing made easy and reusable</i>
<br>
<a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
</p>
<p float="left" align="middle">
<a href="https://discord.gg/HnTdWhydGp"><img src="https://img.shields.io/badge/Discord-%235865F2.svg?style=for-the-badge&logo=discord&logoColor=white" alt="Discord badge" width="100"></a> <a href="https://www.github.com/ml6team/fondant"><img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white" alt="Github badge" width="100"></a>
</p>
---
🍫 **Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
containerized components reusable across pipelines and execution environments and shareable within the community.**
It offers:
- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
    - Large language model fine-tuning (LLaMA, Falcon)
    - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
    - Extracting data from public sources such as Common Crawl, LAION, ...
    - Filtering on
        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
        - Context, e.g. copyright license, origin
        - Metadata
    - Removal of unwanted data such as toxic, NSFW or generated content
    - Removal of unwanted data patterns such as societal bias
    - Transforming data (resizing, cropping, reformatting, …)
    - Tuning the data for model performance (normalization, deduplication, …)
    - Enriching data (captioning, metadata generation, synthetics, …)
    - Transparency, auditability, compliance
- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations
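
To make the idea of reusable, composable components concrete, here is a minimal sketch in plain Python. It does not use Fondant's API: the names `filter_language`, `deduplicate`, and `run_pipeline` are hypothetical, and lists of dicts stand in for the dataframes a real component would process. It only illustrates how small, single-purpose steps such as filtering and deduplication can be chained in any order and reused across pipelines.

```python
from functools import reduce
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not part of Fondant.
Record = Dict[str, str]
Component = Callable[[List[Record]], List[Record]]


def filter_language(language: str) -> Component:
    """Build a component that keeps only records in the given language."""
    def component(records: List[Record]) -> List[Record]:
        return [r for r in records if r["language"] == language]
    return component


def deduplicate(records: List[Record]) -> List[Record]:
    """Drop records whose text was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    return unique


def run_pipeline(records: List[Record], components: List[Component]) -> List[Record]:
    """Apply each component in order, feeding its output to the next one."""
    return reduce(lambda data, component: component(data), components, records)


raw = [
    {"text": "hello world", "language": "en"},
    {"text": "hello world", "language": "en"},
    {"text": "bonjour", "language": "fr"},
]
result = run_pipeline(raw, [filter_language("en"), deduplicate])
print(result)
```

Because each step takes records in and returns records out, the same `deduplicate` function can be dropped into a different pipeline without modification, which is the property the list above describes.
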
## 🪤 Why Fondant?
In the age of Foundation Models, control over your data is key, and building pipelines
for large-scale data processing is costly, especially when they require advanced
machine learning-based operations. This need not be the case, however, if processing
components were reusable and exchangeable and pipelines were easily composable.
Realizing this is the main vision behind Fondant.
<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>