janvanlooy committed on
Commit
ff11f2b
•
1 Parent(s): 3dc70f2

Update README.md

Files changed (1)
  1. README.md +33 -28
README.md CHANGED
@@ -11,44 +11,49 @@ pinned: false
11
  <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
12
  </p>
13
  <p align="center">
14
- <i>Sweet data-centric foundation model fine-tuning</i>
15
  <br>
16
  <a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
17
  <br>
18
  <br>
19
  <a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
20
  </p>
21
 
22
  ---
23
- **Fondant helps you create high quality datasets to train or fine-tune foundation models such as:**
24
-
25
- - 🎨 Stable Diffusion
26
- - 📄 GPT-like Large Language Models (LLMs)
27
- - 🔎 CLIP
28
- - ✂️ Segment Anything (SAM)
29
- - ➕ And many more
30
 
31
  ## 🪤 Why Fondant?
32
 
33
- Foundation models simplify inference by solving multiple tasks across modalities with a simple
34
- prompt-based interface. But what they've gained in the front, they've lost in the back.
35
- **These models require enormous amounts of data, moving complexity towards data preparation**, and
36
- leaving few parties able to train their own models.
37
-
38
- We believe that **innovation is a group effort**, requiring collaboration. While the community has
39
- been building and sharing models, everyone is still building their data preparation from scratch.
40
- **Fondant is the platform where we meet to build and share data preparation workflows.**
41
-
42
- Fondant offers a framework to build **composable data preparation pipelines, with reusable
43
- components, optimized to handle massive datasets**. Stop building from scratch, and start
44
- reusing components to:
45
-
46
- - Extend your data with public datasets
47
- - Generate new modalities using captioning, segmentation, translation, image generation, ...
48
- - Distill knowledge from existing foundation models
49
- - Filter out low quality data
50
- - Deduplicate data
51
-
52
- And create high quality datasets to fine-tune your own foundation models.
53
 
54
  <p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
 
11
  <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
12
  </p>
13
  <p align="center">
14
+ <i>Large-scale data processing made easy and reusable</i>
15
  <br>
16
  <a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
17
  <br>
18
  <br>
19
  <a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
20
+ <a href="https://pypi.org/project/fondant/"><img alt="PyPI version" src="https://img.shields.io/pypi/v/fondant?color=brightgreen&style=flat-square"></a>
21
+ <a href="https://fondant.readthedocs.io/en/latest/license/"><img alt="License" src="https://img.shields.io/github/license/ml6team/fondant?style=flat-square&color=brightgreen"></a>
22
+ <a href="https://github.com/ml6team/fondant/actions/workflows/pipeline.yaml"><img alt="GitHub Workflow Status" src="https://img.shields.io/github/actions/workflow/status/ml6team/fondant/pipeline.yaml?style=flat-square"></a>
23
+ <a href="https://coveralls.io/github/ml6team/fondant?branch=main"><img alt="Coveralls" src="https://img.shields.io/coverallsCoverage/github/ml6team/fondant?style=flat-square"></a>
24
  </p>
25
 
26
  ---
27
+ 🍫**Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
28
+ containerized components reusable across pipelines and execution environments and shareable within the community.**\
29
+ It offers:
30
+ - 🔧 Plug ‘n’ play composable pipelines for creating datasets for
31
+   - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
32
+   - Large language model fine-tuning (LLaMA, Falcon)
33
+   - Code generation model fine-tuning (StarCoder)
34
+ - 🧱 Library of off-the-shelf reusable components for
35
+   - Extracting data from public sources such as Common Crawl, LAION, ...
36
+   - Filtering on
37
+     - Content, e.g. language, visual style, topic, format, aesthetics, etc.
38
+     - Context, e.g. copyright license, origin
39
+     - Metadata
40
+   - Removal of unwanted data such as toxic, NSFW or generated content
41
+   - Removal of unwanted data patterns such as societal bias
42
+   - Transforming data (resizing, cropping, reformatting, …)
43
+   - Tuning the data for model performance (normalization, deduplication, …)
44
+   - Enriching data (captioning, metadata generation, synthetics, …)
45
+   - Transparency, auditability, compliance
46
+ - 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
47
+ - 🐍 Standardized, Python/Pandas-based way of creating custom components
48
+ - 🏭 Production-ready, scalable deployment
49
+ - ☁️ Multi-cloud integrations
50
 
51
  ## 🪤 Why Fondant?
52
 
53
+ In the age of Foundation Models, control over your data is key and building pipelines
54
+ for large-scale data processing is costly, especially when they require advanced
55
+ machine learning-based operations. This need not be the case, however, if processing
56
+ components would be reusable and exchangeable and pipelines were easily composable.
57
+ Realizing this is the main vision behind Fondant.
58
 
59
  <p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>