---
strip-comments: true
bibliography: ["ref.bib"]
format:
  revealjs:
    logo: ./figures/logo/sustech.png
    slide-number: true
    multiplex: false
    show-notes: false
    theme: sustech.scss
    show-slide-number: all
    controls: false
    preview-links: true
    transition: "slide"
    preload-iframes: true
    view-distance: 10
    width: 1280
    height: 720
    mermaid:
      theme: dark
    code-overflow: wrap
    callout-icon: false
execute:
  echo: false
revealjs-plugins:
  - verticator
  - codewindow
  - qrcode
---

## {.theme-title .center}

::: {.titlebox style="text-align:center; font-size: 2em;"}
[Modeling on Internet-scale Data]{.adlery style="color:#320005;"}
[Bingyi Jing@ML-Summit]{style="font-size:0.5em;"}
[Apr 25th, 2024]{style="font-size:0.5em;"}
:::

## {.theme-content}

:::: columns
::: {.column width="30%"}
:::
::: {.column width="70%"}
![](./figures/etl-ai.jpg)
:::
::: {.column width="40%"}
ETL pipelines for multimodal large models are growing increasingly complex

- E: many data modalities, heterogeneous sources, long fetch times
- T: complex processing pipelines
- L: high storage footprint
:::
::::

## What's the Problem? {auto-animate=true}

For copyright and storage reasons, multimodal datasets are mostly distributed as download links, so acquisition throughput is limited.

```{=html}
{{< include components/webvid.qmd >}}
```

::: {style="text-align:center; font-size: 0.4em;"}
WebVid is distributed as URLs, 10,730,233 entries in total
:::

::: {.notes}
- This means data must be fetched from within China over expensive international bandwidth. For a small data center, downloading a dataset comparable to Sora's training volume could take years.
- Even for a medium-sized dataset such as WebVid, download and processing time can become the training bottleneck.
:::

## What's the Problem? {auto-animate=true}

Processing is complex and time-consuming, and can cost even more than training itself

:::: columns
::: {.column width="60%"}
::: {style="margin-top: 50px;"}
![](./figures/caption.jpg)
:::
:::
::: {.column width="40%"}
GPT-4V (20s/it)

::: {style="font-size: 0.4em;"}
An aerial video sweeps over a majestic ocean cliff with striated rock formations in rich hues of red, green, and orange. The sun's rays enhance the colorful palette of the landscape, while the sea's azure waters gently crash against the cliff's base. Visible are the textured details of the cliff face and the contrast of the green algae and seagrass coating parts of the rock. Seabirds can be seen flying close to the water around the rocky outcrop. The video conveys a serene yet dynamic coastal scenery, highlighting the natural beauty and geological diversity of a rugged coastline.
:::

🌋 **LLaVA-1.6-Yi-34B (3s/it)**

::: {style="font-size: 0.4em;"}
A breathtaking aerial view of a rocky cliff jutting out into the ocean. The cliff, adorned with vibrant green moss and patches of yellow and red lichen, is bathed in the warm glow of the sun. The ocean, a deep blue, is dotted with small white waves that crash against the cliff's base. The perspective of the video is from above, providing a bird's eye view of the cliff and the surrounding ocean. The sun's rays illuminate the scene, casting a beautiful light on the cliff and the ocean. The video is a stunning representation of the raw beauty of nature.
:::
:::
::::

## What's the Problem? {auto-animate=true}

> [Storage]{.red} plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of [image, video, and text data]{.red}, the need for data storage grows rapidly. [^llama3]

- Filtering 100 million minutes of data out of the raw corpus may imply a raw data volume of tens of petabytes or more
- A typical small data center cannot build storage infrastructure suited to video pre-training

[^llama3]: [Building Meta’s GenAI Infrastructure](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/)

## What's the Problem? {auto-animate=true}

:::: columns
::: {.column width="50%"}
![](./figures/etl-problem.webp)
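:::
::: {.column width="50%"}
::: {.incremental}
- Heterogeneous data sources
- Sources cannot be pulled immediately
- Complex processing pipelines
- Data processing is coupled to model training
- Data volume is too large to process in a single pass
- ...
:::
:::
::::

## What's the Problem? {auto-animate=true}

:::: columns
::: {.column width="50%"}
- The data stream is drifting ever further from model training
- Data is still processed the traditional way

![](./figures/rain1.svg){width=80%}
:::
::::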
## Streaming to the rescue {auto-animate=true .smaller}

:::: columns
::: {.column width="60%"}
![](./figures/rain1.svg){width=80%}
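:::
::: {.column width="40%"}
::: {.incremental}
- [x] Zero startup overhead
- [x] Data-processing processes are fully decoupled from training processes
- [x] Intra-node communication via `SharedMemory`; inter-node communication via an in-memory database
- [x] The data-processing cluster topology is independent of the GPU topology and can be resized dynamically
- [x] The database is sunk periodically, so the data stream can be replayed
- [x] Deterministic sharding and shuffling algorithms keep replays consistent
:::
:::
::::

## {auto-animate=true background="./figures/mosaicml-streaming-dataset-img-1.gif"}

::: {.notes}
Samples within each cloud shard are partitioned and shuffled deterministically, which keeps replays consistent and independent of the training topology.
:::

## Training on the internet {auto-animate=true .smaller background="./figures/mosaicml-streaming-dataset-img-1.gif" background-opacity=0.25}

S3 serves as the storage backend for both data and weights, enabling seamless cloud migration across different scales

```{=html}
{{< include components/cloud-switch.qmd >}}
```

## Training on the internet {auto-animate=true .smaller}

Introducing a DPU cluster lets data be transferred directly to the GPU, eliminating the overhead of the in-memory database

![](./figures/dpu.png){width=100%}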
Powered by

::: {.notes}
In partnership with UCloud, a neutral cloud provider.
:::

## Training on the internet {auto-animate=true .smaller}

![](./figures/rain2.svg)
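
## Sketch: deterministic sharding and shuffling {.smaller}

The earlier claim that sharding and shuffling are deterministic and independent of the training topology can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the algorithm used in this system: the names `epoch_order` and `remaining_for_rank` are hypothetical, and the global order is assumed to depend only on a seed and an epoch number.

```python
import random


def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    """Global sample order for one epoch; depends only on (seed, epoch)."""
    order = list(range(num_samples))
    # A private RNG seeded from (seed, epoch) makes the shuffle reproducible.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def remaining_for_rank(order: list[int], consumed: int,
                       rank: int, world_size: int) -> list[int]:
    """Samples still owed to `rank` after `consumed` samples of the global
    order were used. `world_size` may differ from the one used before the
    restart, because the global order never depends on it."""
    return order[consumed:][rank::world_size]


order = epoch_order(num_samples=10, seed=0, epoch=0)
# Replay after 4 consumed samples, now with 3 consumers instead of 2.
for rank in range(3):
    print(rank, remaining_for_rank(order, consumed=4, rank=rank, world_size=3))
```

Because only the number of consumed samples has to be recorded, a replay can resume on a cluster of a different size and still cover exactly the unseen samples.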
## Training on the internet {auto-animate=true .smaller}

:::: columns
::: {.column width="50%"}
![](./figures/rain2.svg)
:::
::: {.column width="50%"}
- Further decouples data processing from model training
- Lets ETL run fully in parallel with model training
:::
::::

::: {.fragment}
```{=html}
{{< include components/profile-stream.qmd >}}
```
:::

# {.theme-section}

::: {.title}
Scaling Exact Attention
:::

## Efficient distributed training infra {auto-animate="true"}

|             | Flash-Attn-2 | FP8 (H100) | 3D Parallel + ZeRO | Padding Free | Fused Kernel | Static Graph | TGS[^l] |
|------------:|:------------:|:----------:|:------------------:|:------------:|:------------:|:------------:|:-------:|
| Platformers | ✔️ | ✔️ | ✔️ | ✔️ | [100%]{style="color:red;"} | ✔️ | [3743]{style="color:red;"} |
| Megatron-LM | ✖️ | ✔️ | ✔️ | ✖️ | 80% | ✖️ | 3581 |
| DeepSpeed   | ✔️ | ✖️ | ✔️ | ✖️ | 60% | ✖️ | ✖️ |
| Colossal-AI | ✖️ | ✖️ | ✔️ | ✖️ | 40% | ✖️ | 2610 |

[^l]: TGS (tokens per GPU per second), measured training LLaMA2 7B on a DGX (8×A100 40GB) with a sequence length of 4096

## Scaling exact attention to ultra long sequence {auto-animate="true"}

![](./figures/context_parallel.svg)

## Scaling exact attention to ultra long sequence {auto-animate="true"}

![](./figures/computation_reduce.svg){width=80%}
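
## Sketch: exact attention over key/value blocks {.smaller}

The slides do not spell out how the context-parallel scheme is implemented. One building block commonly used for exact attention over sequence chunks is the online-softmax accumulation found in FlashAttention-style kernels. The following is a minimal sketch of that idea, not the Platformers implementation: it handles a single query in plain Python, and the function names are hypothetical. Real kernels do the same on tensors, with each block potentially living on a different device.

```python
import math


def attention(q, K, V):
    """Reference: softmax(q . K^T / sqrt(d)) V for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]


def blockwise_attention(q, K, V, block):
    """Same result, visiting K/V one block at a time (online softmax)."""
    d = len(q)
    m = -math.inf             # running maximum of the scores seen so far
    z = 0.0                   # running softmax denominator
    acc = [0.0] * len(V[0])   # running weighted sum of value rows
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescales old statistics; 0.0 on the first block
        w = [math.exp(s - m_new) for s in scores]
        z = z * scale + sum(w)
        acc = [a * scale + sum(wi * v[j] for wi, v in zip(w, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]
```

Only the running maximum, the denominator, and the accumulator are carried between blocks, so the full score row never has to be materialized and the result stays exact rather than approximate.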
## Scaling exact attention to ultra long sequence {auto-animate="true"}

:::: columns
::: {.column width="50%"}
```{=html}
{{< include ./components/seq-time.qmd >}}
```
:::
::: {.column width="50%"}
```{=html}
{{< include ./components/seq-tflops.qmd >}}
```
:::
::::

## Scaling exact attention to ultra long sequence {auto-animate="true"}

```{=html}
{{< include mocha.qmd >}}
```

# {.theme-end}

::: columns
::: {.column width="50%"}
::: {.r-fit-text}
Thanks
:::
:::
::: {.column width="25%"}
::: {style="text-align:center;"}
![wechat](./figures/qr/code.png)
:::
:::
::: {.column width="25%"}
::: {style="text-align:center;"}
![e-mail](./figures/qr/mail-data.png)
:::
:::
:::