Added missing content in the docs && fixed i18n #165

Closed
Added missing content in the docs && fixed i18n
celaraze committed Mar 20, 2024
commit c2f7760dde0781f89046115df4d78d95675aa18c
116 changes: 78 additions & 38 deletions README.md

Large diffs are not rendered by default.

436 changes: 229 additions & 207 deletions docs/README_zh.md → docs/zh_CN/README.md

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions docs/zh_CN/acceleration.md
@@ -0,0 +1,65 @@
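# Acceleration

Open-Sora aims to provide a high-speed training framework for diffusion models. When training on 64-frame 512x512 videos, we achieve a **55%** training speedup. Our framework also supports training on **1-minute 1080p videos**.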

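## Accelerated Transformer

Open-Sora improves training speed through:

- Kernel optimization, including [flash attention](https://github.com/Dao-AILab/flash-attention), a fused layernorm kernel, and kernels compiled by ColossalAI.
- Hybrid parallelism, including ZeRO.
- Gradient checkpointing, which allows larger batch sizes.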

Our training speed on images is comparable to [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), a project that accelerates DiT training. The training speed is measured on 8 H800 GPUs with batch size 128 and image size 256x256.

| Model    | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
|----------|-----------------|--------------------|
| DiT | 100 | 26k |
| OpenDiT | 175 | 45k |
| OpenSora | 175 | 45k |

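## Efficient STDiT

Our STDiT uses spatial-temporal attention to model video data. Compared with applying full attention directly in DiT, STDiT becomes more efficient as the number of frames increases. Our current framework supports sequence parallelism only for very long sequences.

The training speed is measured on 8 H800 GPUs with the acceleration techniques applied; GC denotes gradient checkpointing.
Both models use T5 conditioning, as in PixArt.

| Model            | Setting        | Throughput (sample/s/GPU) | Throughput (tokens/s/GPU) |
|------------------|----------------|--------------------|--------------------|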
| DiT | 16x256 (4k) | 7.20 | 29k |
| STDiT | 16x256 (4k) | 7.00 | 28k |
| DiT | 16x512 (16k) | 0.85 | 14k |
| STDiT | 16x512 (16k) | 1.45 | 23k |
| DiT (GC) | 64x512 (65k) | 0.08 | 5k |
| STDiT (GC) | 64x512 (65k) | 0.40 | 25k |
| STDiT (GC, sp=2) | 360x512 (370k) | 0.10 | 18k |

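With 4x temporal downsampling from a Video-VAE, a 24fps video has 450 frames. The speed gap between STDiT on videos (28k tokens/s) and DiT on images (up to 45k tokens/s) mainly comes from T5 and VAE encoding and from temporal attention.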
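The token counts in the "Setting" column of the table above (4k, 16k, 65k, 370k) can be reproduced with a short calculation. This is a minimal sketch, assuming an 8x spatial VAE downsampling and a patch size of 2 (both are assumptions chosen because they match the table; the function name is hypothetical):

```python
def tokens_per_sample(num_frames, height, width, vae_downsample=8, patch_size=2):
    """Count transformer tokens for one video sample.

    Assumes the VAE only downsamples spatially (no temporal compression)
    and that each patch covers patch_size x patch_size latent pixels.
    """
    tokens_h = height // (vae_downsample * patch_size)
    tokens_w = width // (vae_downsample * patch_size)
    return num_frames * tokens_h * tokens_w


# Settings taken from the benchmark table: frames x resolution.
for frames, size in [(16, 256), (16, 512), (64, 512), (360, 512)]:
    print(f"{frames}x{size}: {tokens_per_sample(frames, size, size)} tokens")
```

Under these assumptions a 512x512 frame contributes 1024 tokens, so sequence length grows linearly with the number of frames.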

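## Accelerated Encoder (T5, VAE)

During training, text is encoded by T5 and videos are encoded by the VAE. There are typically two ways to accelerate training:

1. Preprocess the text and video data in advance and save the results to disk.
2. Encode the text and video data during training, and speed up the encoding process.

For option 1, the 120 tokens of one sample require 1M of disk space, and a 64x64x64 latent requires about 4M. For a training dataset of 10M video clips, the total disk space required is 50TB. Our storage system is not yet ready for data at this scale.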
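The 50TB figure follows directly from the per-sample sizes. A minimal sketch of the arithmetic, assuming "1M" and "4M" mean decimal megabytes (the helper name is hypothetical):

```python
MB = 10**6   # decimal megabyte (assumption about the units in the text)
TB = 10**12  # decimal terabyte


def preprocessed_dataset_bytes(num_clips, text_bytes=1 * MB, latent_bytes=4 * MB):
    """Disk space needed to cache T5 embeddings and VAE latents for every clip."""
    return num_clips * (text_bytes + latent_bytes)


total_bytes = preprocessed_dataset_bytes(10_000_000)
print(f"{total_bytes / TB:.0f} TB")
```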

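For option 2, we improve T5's speed and memory footprint. For the VAE, following [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT), we find that it consumes a large amount of GPU memory, so we split each batch into smaller batches for VAE encoding. With both techniques, we greatly accelerate training.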
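The batch-splitting idea can be sketched independently of any deep learning framework. The example below is an illustration only, not the project's implementation: `encode_in_micro_batches` and `toy_encoder` are hypothetical names, and the toy encoder stands in for a real VAE.

```python
def encode_in_micro_batches(samples, encode_fn, micro_batch_size):
    """Encode a batch in chunks so the encoder never sees more than
    micro_batch_size samples at once, which bounds its peak memory."""
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    outputs = []
    for start in range(0, len(samples), micro_batch_size):
        chunk = samples[start:start + micro_batch_size]
        outputs.extend(encode_fn(chunk))
    return outputs


seen_chunk_sizes = []


def toy_encoder(chunk):
    # Stand-in for a VAE encoder: records the chunk size and doubles each value.
    seen_chunk_sizes.append(len(chunk))
    return [value * 2 for value in chunk]


batch = list(range(10))
latents = encode_in_micro_batches(batch, toy_encoder, micro_batch_size=4)
print(latents)
print(seen_chunk_sizes)
```

The result is identical to encoding the whole batch at once; only the peak number of samples held by the encoder changes.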

The training speed is measured on 8 H800 GPUs with STDiT.

| Acceleration mode | Setting | Throughput (img/s/GPU) | Throughput (tokens/s/GPU) |
|--------------|---------------|-----------------|--------------------|
| Baseline | 16x256 (4k) | 6.16 | 25k |
| w. faster T5 | 16x256 (4k) | 7.00 | 29k |
| Baseline | 64x512 (65k) | 0.94 | 15k |
| w. both | 64x512 (65k) | 1.45 | 23k |
File renamed without changes.
31 changes: 31 additions & 0 deletions docs/zh_CN/datasets.md
@@ -0,0 +1,31 @@
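# Datasets

## Datasets in use

### HD-VG-130M

[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) contains 130M text-video pairs. The captions are generated by BLIP-2. We find that the scene cuts and the text quality are relatively poor. The dataset contains 20 splits. For OpenSora 1.0, we use the first split. We plan to use the whole dataset and re-process it.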

### Inter4k

[Inter4k](https://github.com/alexandrosstergiou/Inter4K) is a dataset of 1k video clips at 4K resolution. It was proposed for super-resolution tasks. We use it for HQ training. The processed videos are available [here](README.md#数据处理).

### Pexels.com

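[Pexels.com](https://www.pexels.com/) is a website that provides free stock photos and videos. We collected 19K video clips from this website for HQ training. The processed videos are available [here](README.md#数据处理).

## Dataset watch list

We are also watching the following datasets and are considering using them in the future, depending on our storage space and the quality of each dataset.

| Name              | Size         | Description                   |
|-------------------|--------------|-------------------------------|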
| Panda-70M | 70M videos | High quality video-text pairs |
| WebVid-10M | 10M videos | Low quality |
| InternVid-10M-FLT | 10M videos | |
| EGO4D | 3670 hours | |
| OpenDV-YouTube | 1700 hours | |
| VidProM | 6.69M videos | |
47 changes: 47 additions & 0 deletions docs/zh_CN/report_v1.md
@@ -0,0 +1,47 @@
# Open-Sora v1 Report

OpenAI's Sora is amazing at generating one-minute high-quality videos. However, it reveals almost no information about its details. To make AI more "open", we are dedicated to building an open-source version of Sora. This report describes our first attempt to train a transformer-based video diffusion model.

## Efficiency in choosing the architecture

To lower the computational cost, we want to reuse existing VAE models. Sora uses a spatial-temporal VAE to reduce the temporal dimension. However, we found no open-source, high-quality spatial-temporal VAE. [MAGVIT](https://github.com/google-research/magvit)'s 4x4x4 VAE is not open-sourced, while [VideoGPT](https://wilson1yan.github.io/videogpt/index.html)'s 2x4x4 VAE showed low quality in our experiments. Thus, we decided to use a 2D VAE (from [Stability-AI](https://huggingface.co/stabilityai/sd-vae-ft-mse-original)) in our first version.

Video training involves a large number of tokens. A 1-minute video at 24fps has 1440 frames. With VAE downsampling 4x and patch size downsampling 2x, we have 1440x1024≈1.5M tokens. Full attention over 1.5M tokens is prohibitively expensive. Thus, following [Latte](https://github.com/Vchitect/Latte), we use spatial-temporal attention to reduce the cost.
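To make the saving concrete, the sketch below counts query-key pairs, a rough proxy for attention cost that ignores heads, hidden size, and constant factors. It assumes spatial attention attends within each frame and temporal attention attends across frames at each spatial location; the function names are hypothetical.

```python
def full_attention_pairs(num_frames, tokens_per_frame):
    """Every token attends to every token in the whole video."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def spatial_temporal_attention_pairs(num_frames, tokens_per_frame):
    """Spatial attention within each frame plus temporal attention
    across frames at each spatial position."""
    spatial = num_frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * num_frames ** 2
    return spatial + temporal


frames, tokens = 1440, 1024  # figures from the paragraph above
full = full_attention_pairs(frames, tokens)
factored = spatial_temporal_attention_pairs(frames, tokens)
print(f"full: {full:.3e} pairs, spatial-temporal: {factored:.3e} pairs")
print(f"reduction factor: {full / factored:.0f}x")
```

The ratio simplifies to `T * S / (T + S)` for `T` frames and `S` tokens per frame, so the advantage grows with both video length and resolution.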

As shown in the figure, we insert a temporal attention layer right after each spatial attention layer in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper. However, we did not control for parameter count across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, with the same number of iterations, performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for its efficiency. A speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).

![Architecture Comparison](https://i0.imgs.ovh/2024/03/15/eLk9D.png)

To focus on video generation, we want to build on a powerful image generation model. [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha) is an efficiently trained, high-quality image generation model with a T5-conditioned DiT structure. We initialize our model with PixArt-α and initialize the projection layer of each inserted temporal attention with zeros. This initialization preserves the model's image generation ability at the start of training, which Latte's architecture cannot do. The inserted attention increases the number of parameters from 580M to 724M.
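The effect of zero-initializing the projection can be shown with a toy example. This is a minimal sketch, assuming the inserted temporal attention sits on a residual branch (`x + proj(attn(x))`); the matrices are arbitrary stand-ins and all names are hypothetical.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def block_with_temporal_branch(x, temporal_mix, proj_weight):
    """Residual branch: x + proj(temporal_mix(x)).
    temporal_mix is a stand-in for the newly inserted temporal attention."""
    branch = matvec(proj_weight, matvec(temporal_mix, x))
    return [xi + bi for xi, bi in zip(x, branch)]


x = [0.5, -1.0, 2.0]
temporal_mix = [[0.2, 0.1, 0.0], [0.3, -0.4, 0.5], [1.0, 0.0, -1.0]]
zero_proj = [[0.0] * 3 for _ in range(3)]

out = block_with_temporal_branch(x, temporal_mix, zero_proj)
print(out == x)
```

With a zero projection the new branch contributes nothing, so the pretrained image model's output is unchanged at initialization, while gradients can still flow into the projection weights during training.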

![Architecture](https://i0.imgs.ovh/2024/03/16/erC1d.png)

Drawing on the success of PixArt-α and Stable Video Diffusion, we also adopt a progressive training strategy: 16x256x256 on a 366K-clip pretraining dataset, and then 16x256x256, 16x512x512, and 64x512x512 on a 20K-clip dataset. With scaled position embedding, this strategy greatly reduces the computational cost.

We also tried using a 3D patch embedder in DiT. However, with 2x downsampling on the temporal dimension, the generated videos have low quality. Thus, we leave temporal downsampling to a temporal VAE in our next version. For now, we sample every 3 frames when training with 16 frames and every 2 frames when training with 64 frames.
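The frame sampling described above can be sketched as follows. This is an illustration under the assumption that sampling starts at a given frame and takes evenly spaced frames; the function names are hypothetical.

```python
def sample_frame_indices(num_frames, frame_interval, start=0):
    """Indices of the source frames used for one training clip."""
    return [start + i * frame_interval for i in range(num_frames)]


def source_span(num_frames, frame_interval):
    """Number of consecutive source frames covered by one clip."""
    return (num_frames - 1) * frame_interval + 1


print(sample_frame_indices(16, 3))
print(source_span(16, 3), source_span(64, 2))
```

A 16-frame clip sampled every 3 frames covers 46 source frames, and a 64-frame clip sampled every 2 frames covers 127.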

## Data is the key to high quality

We find that the quantity and quality of data have a great impact on the quality of generated videos, even greater than the model architecture and training strategy. So far, we have only prepared the first split (366K video clips) of [HD-VG-130M](https://github.com/daooshee/HD-VG-130M). The quality of these videos varies greatly, and the captions are not very accurate. Thus, we further collected 20k relatively high-quality videos from [Pexels](https://www.pexels.com/), which provides freely licensed videos. We caption each video with LLaVA, an image captioning model, using three frames and a designed prompt. With the designed prompt, LLaVA generates good-quality captions.

![Caption](https://i0.imgs.ovh/2024/03/16/eXdvC.png)

As we place more emphasis on data quality, we plan to collect more data and build a video preprocessing pipeline in our next version.

## Training Details

With a limited training budget, we explored only a few settings. We found that a learning rate of 1e-4 is too large and scaled it down to 2e-5. When training with a large batch size, we found `fp16` less stable than `bf16`, and it may lead to generation failure. Thus, we switched to `bf16` for training on 64x512x512. For other hyper-parameters, we follow previous works.

## Loss curves

16x256x256 Pretraining Loss Curve

![16x256x256 Pretraining Loss Curve](https://i0.imgs.ovh/2024/03/16/erXQj.png)

16x256x256 HQ Training Loss Curve

![16x256x256 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/ernXv.png)

16x512x512 HQ Training Loss Curve

![16x512x512 HQ Training Loss Curve](https://i0.imgs.ovh/2024/03/16/erHBe.png)
178 changes: 178 additions & 0 deletions docs/zh_CN/structure.md
@@ -0,0 +1,178 @@
# Repo & Config Structure

## Repo Structure

```plaintext
Open-Sora
├── README.md
├── docs
│   ├── acceleration.md            -> Acceleration & Speed benchmark
│   ├── command.md                 -> Commands for training & inference
│   ├── datasets.md                -> Datasets used in this project
│   ├── structure.md               -> This file
│   └── report_v1.md               -> Report for Open-Sora v1
├── scripts
│   ├── train.py                   -> diffusion training script
│   └── inference.py               -> diffusion inference script
├── configs                        -> Configs for training & inference
├── opensora
│   ├── __init__.py
│   ├── registry.py                -> Registry helper
│   ├── acceleration               -> Acceleration related code
│   ├── dataset                    -> Dataset related code
│   ├── models
│   │   ├── layers                 -> Common layers
│   │   ├── vae                    -> VAE as image encoder
│   │   ├── text_encoder           -> Text encoder
│   │   │   ├── classes.py         -> Class id encoder (inference only)
│   │   │   ├── clip.py            -> CLIP encoder
│   │   │   └── t5.py              -> T5 encoder
│   │   ├── dit
│   │   ├── latte
│   │   ├── pixart
│   │   └── stdit                  -> Our STDiT related code
│   ├── schedulers                 -> Diffusion schedulers
│   │   ├── iddpm                  -> IDDPM for training and inference
│   │   └── dpms                   -> DPM-Solver for fast inference
│   └── utils
└── tools                          -> Tools for data processing and more
```

## Configs

Our config files follow [MMEngine](https://github.com/open-mmlab/mmengine). MMEngine reads a config file (a `.py` file) and parses it into a dictionary-like object.

```plaintext
Open-Sora
└── configs                         -> Configs for training & inference
    ├── opensora                    -> STDiT related configs
    │   ├── inference
    │   │   ├── 16x256x256.py       -> Sample videos 16 frames 256x256
    │   │   ├── 16x512x512.py       -> Sample videos 16 frames 512x512
    │   │   └── 64x512x512.py       -> Sample videos 64 frames 512x512
    │   └── train
    │       ├── 16x256x256.py       -> Train on videos 16 frames 256x256
    │       ├── 16x512x512.py       -> Train on videos 16 frames 512x512
    │       └── 64x512x512.py       -> Train on videos 64 frames 512x512
    ├── dit                         -> DiT related configs
    │   ├── inference
    │   │   ├── 1x256x256-class.py  -> Sample images with ckpts from DiT
    │   │   ├── 1x256x256.py        -> Sample images with clip condition
    │   │   └── 16x256x256.py       -> Sample videos
    │   └── train
    │       ├── 1x256x256.py        -> Train on images with clip condition
    │       └── 16x256x256.py       -> Train on videos
    ├── latte                       -> Latte related configs
    └── pixart                      -> PixArt related configs
```

## Inference config demos

To change the inference settings, you can directly modify the corresponding config file. Or you can pass arguments to overwrite the config file ([config_utils.py](/opensora/utils/config_utils.py)). To change sampling prompts, you should modify the `.txt` file passed to the `--prompt_path` argument.

```plaintext
--prompt_path ./assets/texts/t2v_samples.txt -> prompt_path
--ckpt-path ./path/to/your/ckpt.pth -> model["from_pretrained"]
```

The explanation of each field is provided below.

```python
# Define sampling size
num_frames = 64 # number of frames
fps = 24 // 2 # frames per second (divided by 2 for frame_interval=2)
image_size = (512, 512) # image size (height, width)

# Define model
model = dict(
    type="STDiT-XL/2",  # Select model type (STDiT-XL/2, DiT-XL/2, etc.)
    space_scale=1.0,  # (Optional) Space positional encoding scale (new height / old height)
    time_scale=2 / 3,  # (Optional) Time positional encoding scale (new frame_interval / old frame_interval)
    enable_flashattn=True,  # (Optional) Speed up training and inference with flash attention
    enable_layernorm_kernel=True,  # (Optional) Speed up training and inference with fused kernel
    from_pretrained="PRETRAINED_MODEL",  # (Optional) Load from pretrained model
    no_temporal_pos_emb=True,  # (Optional) Disable temporal positional encoding (for image)
)
vae = dict(
    type="VideoAutoencoderKL",  # Select VAE type
    from_pretrained="stabilityai/sd-vae-ft-ema",  # Load from pretrained VAE
    micro_batch_size=128,  # VAE with micro batch size to save memory
)
text_encoder = dict(
    type="t5",  # Select text encoder type (t5, clip)
    from_pretrained="./pretrained_models/t5_ckpts",  # Load from pretrained text encoder
    model_max_length=120,  # Maximum length of input text
)
scheduler = dict(
    type="iddpm",  # Select scheduler type (iddpm, dpm-solver)
    num_sampling_steps=100,  # Number of sampling steps
    cfg_scale=7.0,  # Classifier-free guidance scale
)
dtype = "fp16" # Computation type (fp16, fp32, bf16)

# Other settings
batch_size = 1 # batch size
seed = 42 # random seed
prompt_path = "./assets/texts/t2v_samples.txt" # path to prompt file
save_dir = "./samples" # path to save samples
```
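The command-line overrides described earlier in this section map each flag onto a config field. The sketch below illustrates that mapping with `argparse` and plain dictionaries. It is not the project's `config_utils.py`: the function names are hypothetical, and only the two flags shown in this document are handled.

```python
import argparse


def parse_overrides(argv):
    """Parse the two override flags shown in this document."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt_path", default=None)
    parser.add_argument("--ckpt-path", dest="ckpt_path", default=None)
    return parser.parse_args(argv)


def apply_overrides(cfg, args):
    """Return a copy of cfg with any provided flags written into it."""
    merged = dict(cfg)
    merged["model"] = dict(cfg.get("model", {}))
    if args.prompt_path is not None:
        merged["prompt_path"] = args.prompt_path
    if args.ckpt_path is not None:
        merged["model"]["from_pretrained"] = args.ckpt_path
    return merged


base_cfg = {
    "prompt_path": "./assets/texts/t2v_samples.txt",
    "model": {"type": "STDiT-XL/2", "from_pretrained": "PRETRAINED_MODEL"},
}
args = parse_overrides(["--ckpt-path", "./path/to/your/ckpt.pth"])
cfg = apply_overrides(base_cfg, args)
print(cfg["model"]["from_pretrained"])
```

Flags that are not passed leave the value from the config file untouched, so the file remains the source of defaults.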

## Training config demos

```python
# Define sampling size
num_frames = 64
frame_interval = 2 # sample every 2 frames
image_size = (512, 512)

# Define dataset
root = None # root path to the dataset
data_path = "CSV_PATH" # path to the csv file
use_image_transform = False # True if training on images
num_workers = 4 # number of workers for dataloader

# Define acceleration
dtype = "bf16" # Computation type (fp16, bf16)
grad_checkpoint = True # Use gradient checkpointing
plugin = "zero2" # Plugin for distributed training (zero2, zero2-seq)
sp_size = 1 # Sequence parallelism size (1 for no sequence parallelism)

# Define model
model = dict(
    type="STDiT-XL/2",
    space_scale=1.0,
    time_scale=2 / 3,
    from_pretrained="YOUR_PRETRAINED_MODEL",
    enable_flashattn=True,  # Enable flash attention
    enable_layernorm_kernel=True,  # Enable layernorm kernel
)
vae = dict(
    type="VideoAutoencoderKL",
    from_pretrained="stabilityai/sd-vae-ft-ema",
    micro_batch_size=128,
)
text_encoder = dict(
    type="t5",
    from_pretrained="./pretrained_models/t5_ckpts",
    model_max_length=120,
    shardformer=True,  # Enable shardformer for T5 acceleration
)
scheduler = dict(
    type="iddpm",
    timestep_respacing="",  # Default 1000 timesteps
)

# Others
seed = 42
outputs = "outputs" # path to save checkpoints
wandb = False # Use wandb for logging

epochs = 1000 # number of epochs (just large enough, kill when satisfied)
log_every = 10
ckpt_every = 250
load = None # path to resume training

batch_size = 4
lr = 2e-5
grad_clip = 1.0 # gradient clipping
```