---
title: README
emoji: 🍫
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---



<p align="center">
    <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" alt="Fondant banner" height="200">
    <i>Large-scale data processing made easy and reusable</i>
    <br>
    <a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
</p>


<p float="left" align="middle">
  <a href="https://discord.gg/HnTdWhydGp"><img src="https://img.shields.io/badge/Discord-%235865F2.svg?style=for-the-badge&logo=discord&logoColor=white" alt="Discord badge" width="100"></a> <a href="https://www.github.com/ml6team/fondant"><img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white" alt="Github badge" width="100"></a>
</p>


---
🍫 **Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making 
containerized components reusable across pipelines and execution environments and shareable within the community.**

It offers:
- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
    - Large language model fine-tuning (LLaMA, Falcon)
    - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
    - Extracting data from public sources such as Common Crawl, LAION, ...
    - Filtering on 
        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
        - Context, e.g. copyright license, origin
        - Metadata
    - Removal of unwanted data such as toxic, NSFW or generated content
    - Removal of unwanted data patterns such as societal bias
    - Transforming data (resizing, cropping, reformatting, …)
    - Tuning the data for model performance (normalization, deduplication, …)
    - Enriching data (captioning, metadata generation, synthetics, …)
    - Transparency, auditability, compliance
- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations
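
To make the idea of Pandas-based, composable components concrete, here is a minimal sketch in plain pandas. It does not use Fondant's actual API: the function names (`filter_by_language`, `add_text_length`, `run_pipeline`) and the column names (`text`, `language`, `text_length`) are hypothetical and chosen only for illustration. Each "component" is assumed to be a function that takes a DataFrame and returns a new one, so components can be chained in any order.

```python
import pandas as pd


def filter_by_language(df: pd.DataFrame, language: str = "en") -> pd.DataFrame:
    """Hypothetical filter component: keep rows whose `language` matches."""
    return df[df["language"] == language].reset_index(drop=True)


def add_text_length(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical enrichment component: add a `text_length` column."""
    return df.assign(text_length=df["text"].str.len())


def run_pipeline(df: pd.DataFrame, components) -> pd.DataFrame:
    """Apply each component in order, feeding its output to the next one."""
    for component in components:
        df = component(df)
    return df


# Toy input data (made up for this example).
data = pd.DataFrame(
    {
        "text": ["hello world", "bonjour", "data"],
        "language": ["en", "fr", "en"],
    }
)

result = run_pipeline(data, [filter_by_language, add_text_length])
print(result)
```

Because every step shares the same DataFrame-in, DataFrame-out shape, a step can be swapped or reused in another pipeline without touching the others. A real Fondant component adds containerization and a declared data schema on top of this idea; see the linked docs for the actual interface.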

## 🪤 Why Fondant?

In the age of Foundation Models, control over your data is key, yet building pipelines
for large-scale data processing is costly, especially when they require advanced
machine learning-based operations. This need not be the case if processing
components are reusable and exchangeable and pipelines are easily composable.
Realizing this is the main vision behind Fondant.

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>