# Datasets

## Datasets used for now

### HD-VG-130M

[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs, with captions generated by BLIP-2. We find both the scene cuts and the caption quality to be relatively poor. The dataset contains 20 splits; for OpenSora 1.0, we use only the first split. We plan to use the whole dataset and re-process it.

### Inter4K

[Inter4K](https://github.com/alexandrosstergiou/Inter4K) is a dataset of 1K video clips at 4K resolution, originally proposed for super-resolution tasks. We use it for HQ training. The videos are processed as described [here](/README.md#data-processing).

### Pexels.com

[Pexels.com](https://www.pexels.com/) is a website that provides free stock photos and videos. We collected 19K video clips from it for HQ training. The videos are processed as described [here](/README.md#data-processing).

## Datasets watching list

We are also watching the following datasets and may use them in the future, depending on our disk space and their quality.

| Name              | Size         | Description                   |
| ----------------- | ------------ | ----------------------------- |
| Panda-70M         | 70M videos   | High-quality video-text pairs |
| WebVid-10M        | 10M videos   | Low quality                   |
| InternVid-10M-FLT | 10M videos   |                               |
| EGO4D             | 3670 hours   |                               |
| OpenDV-YouTube    | 1700 hours   |                               |
| VidProM           | 6.69M videos |                               |
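
As a minimal sketch of the "use only the first of the 20 splits" setup for HD-VG-130M: the snippet below enumerates hypothetical split metadata files and keeps only the first. The file-naming scheme (`hdvg_00.json`, …) is an assumption for illustration, not the dataset's actual layout.

```python
# Hypothetical sketch: HD-VG-130M is distributed in 20 splits;
# for OpenSora 1.0 only the first split is used.
# The split file names below are assumed, not the real layout.

def select_splits(num_splits: int = 20, used: int = 1) -> list[str]:
    """Return the metadata file names for the splits used in training."""
    all_splits = [f"hdvg_{i:02d}.json" for i in range(num_splits)]  # assumed naming
    return all_splits[:used]

print(select_splits())  # → ['hdvg_00.json']
```

Once the full dataset is re-processed, raising `used` to `num_splits` would bring in all 20 splits with no other changes.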