bruefire
committed on
Commit 99b4771
Parent(s): f66431c
fixed workflow.md a bit.
- config.yaml +73 -0
- workflow.md +12 -13
config.yaml
ADDED
@@ -0,0 +1,73 @@
+pretrained_model_path: ./outputs/train_2023-05-02T00-50-05/checkpoint-15000/
+output_dir: ./outputs/
+train_data:
+  width: 512
+  height: 512
+  use_bucketing: true
+  sample_start_idx: 1
+  fps: 24
+  frame_step: 5
+  n_sample_frames: 45
+  single_video_path: ''
+  single_video_prompt: ''
+  fallback_prompt: ''
+  path: E:/userdata/Pictures/ai_trainning/t2v-v2/gif/vid/old/
+  json_path: ./json/anime-v2.json
+  image_dir: E:/userdata/Pictures/ai_trainning/t2v-v2/img/
+  single_img_prompt: ''
+validation_data:
+  prompt: ''
+  sample_preview: true
+  num_frames: 16
+  width: 512
+  height: 512
+  num_inference_steps: 25
+  guidance_scale: 9
+dataset_types:
+- json
+- image
+validation_steps: 100
+extra_unet_params: null
+extra_text_encoder_params: null
+train_batch_size: 1
+max_train_steps: 10000
+learning_rate: 5.0e-06
+scale_lr: false
+lr_scheduler: constant
+lr_warmup_steps: 0
+adam_beta1: 0.9
+adam_beta2: 0.999
+adam_weight_decay: 0.01
+adam_epsilon: 1.0e-08
+max_grad_norm: 1.0
+gradient_accumulation_steps: 1
+checkpointing_steps: 2500
+resume_from_checkpoint: null
+mixed_precision: fp16
+use_8bit_adam: false
+enable_xformers_memory_efficient_attention: false
+enable_torch_2_attn: true
+seed: 64
+extend_dataset: false
+cached_latent_dir: null
+use_unet_lora: true
+unet_lora_modules:
+- ResnetBlock2D
+text_encoder_lora_modules:
+- CLIPEncoderLayer
+lora_rank: 25
+lora_path: ''
+kwargs: {}
+cache_latents: true
+gradient_checkpointing: true
+offset_noise_strength: 0.1
+text_encoder_gradient_checkpointing: false
+train_text_encoder: false
+trainable_modules:
+- attn1
+- attn2
+- temp_conv
+trainable_text_modules:
+- all
+use_offset_noise: false
+use_text_lora: true
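
For reference, here is a minimal sketch of how a config like the one above can be loaded and sanity-checked before launching training. It assumes OmegaConf, which Text-To-Video-Finetuning-style training scripts commonly use for YAML configs; the repo's own loader and launch command may differ, so treat this as illustrative only.

```python
# Minimal sketch: load the config above and sanity-check it before training.
# Assumes OmegaConf; the actual loader used by the training script may differ.
from pathlib import Path

from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")

# Nested keys map directly onto the YAML structure above.
print(cfg.pretrained_model_path)                     # ./outputs/train_2023-05-02T00-50-05/checkpoint-15000/
print(cfg.train_data.width, cfg.train_data.height)   # 512 512
print(cfg.learning_rate, cfg.lora_rank)              # 5e-06 25

# Quick check that the dataset paths referenced in train_data actually exist.
for key in ("path", "json_path", "image_dir"):
    p = Path(cfg.train_data[key])
    if not p.exists():
        print(f"warning: train_data.{key} does not exist: {p}")
```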
workflow.md
CHANGED
@@ -1,5 +1,5 @@
 # Workflow for fine-tuning ModelScope in anime style
-Here is a brief description of my process for fine-tuning ModelScope in an
+Here is a brief description of my process for fine-tuning ModelScope in an anime style with [Text-To-Video-Finetuning](https://github.com/ExponentialML/Text-To-Video-Finetuning).
 Most of it may be basic, but I hope it will be useful.
 There is no guarantee that what is written here is correct and will lead to good results!
 
@@ -7,12 +7,12 @@ There is no guarantee that what is written here is correct and will lead to good
 The goal of my training was to change the model to an overall anime style.
 Only the art style needed to override the ModelScope content, so I did not need a huge data set.
 The total number of videos and images was only a few thousand.
-Most of the video was taken from Tenor.
+Most of the videos were taken from [Tenor](https://tenor.com/).
 Many of the videos were posted as gifs and mp4s of one short scene.
 It seems to be possible to automate the process using the API.
-
+
 I also used some smooth and stable motions and videos of 3d models with toon shading.
-Short videos are sufficient, as we are not able to
+Short videos of a few seconds are sufficient, as we are not yet able to train on long data.
 
 ### Notes on data collection
 Blurring and noise are also learned. This is especially noticeable in the case of high-resolution training.
@@ -27,7 +27,7 @@ I collected data while checking if common emotions and actions were included.
 
 ## Correcting data before training
 
-### Fixing resolution,
+### Fixing resolution, blurring, and noise
 It is safe to use a resolution at least equal to or higher than the training resolution.
 The ratio should also match the training settings.
 Trimming is possible with ffmpeg.
@@ -42,22 +42,21 @@ If you cannot improve the image quality as well as the resolution, it may be bet
 Since many animations have a small number of frames, the training results are likely to collapse.
 In addition to body collapse, the appearance of the character will no longer be consistent. Less variation between frames seems to improve consistency.
 The following tool may be useful for frame interpolation:
-https://github.com/google-research/frame-interpolation
+https://github.com/google-research/frame-interpolation.
 If the variation between frames is too large, you will not get a clean result.
 
 ## Tagging
-For anime, WaifuTagger can extract content with good accuracy, so I created a slightly modified script for video and used it for animov512x.
-https://github.com/
-Nevertheless, Blip2-Preprocessor can also extract enough general scene content. It may be a better idea to use them together.
-https://github.com/ExponentialML/Video-BLIP2-Preprocessor
+For anime, WaifuTagger can extract content with good accuracy, so I created [a slightly modified script](https://github.com/bruefire/WaifuTaggerForVideo) for video and used it for animov512x.
+Nevertheless, [BLIP2-Preprocessor](https://github.com/ExponentialML/Video-BLIP2-Preprocessor) can also extract enough general scene content. It may be a better idea to use them together.
 
-##
-
+## config.yaml settings
+I'm still not quite sure what is appropriate for this.
+[config.yaml for animov512x](https://huggingface.co/strangeman3107/animov-512x/blob/main/config.yaml)
 
 ## Evaluate training results
 If there are any poorly done results in the sample videos during training, I search the json for the prompts of that sample. With a training dataset of a few thousand or so, you can usually find the training source videos, which may be helpful to see where the problem lies.
 I dared to train all videos with 'anime' tags.
-Comparing videos with the positive prompts and negative ones with anime tag after training (comparing a fine-tuned
+Comparing videos generated with the anime tag in the positive prompt and videos with it in the negative prompt after training (i.e. comparing a fine-tuned result with one close to the original ModelScope) may help improve training.
 
 It is difficult to add additional training to specific things afterwards, even if they are tagged, so I avoided that.
 Note that the number of frames in anime is small to begin with, so overfitting tends to freeze the characters.
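
The data-collection step above notes that pulling clips from Tenor "seems to be possible to automate using the API". Below is an illustrative Python sketch of that idea; the v2 search endpoint, query parameters, and response fields (`results`, `media_formats`, `gif`, `url`) are assumptions to verify against Tenor's current API documentation, and `TENOR_API_KEY` is a placeholder for your own key.

```python
# Illustrative sketch only: bulk-download short GIFs from Tenor for a training set.
# The v2 endpoint and JSON field names are assumptions -- check Tenor's API docs.
import os
import pathlib

import requests

API_KEY = os.environ["TENOR_API_KEY"]  # placeholder; obtain your own key
SEARCH_URL = "https://tenor.googleapis.com/v2/search"  # assumed v2 search endpoint


def download_gifs(query: str, out_dir: str, limit: int = 50) -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    resp = requests.get(
        SEARCH_URL,
        params={"q": query, "key": API_KEY, "limit": limit, "media_filter": "gif"},
        timeout=30,
    )
    resp.raise_for_status()
    for i, item in enumerate(resp.json().get("results", [])):
        gif_url = item["media_formats"]["gif"]["url"]  # assumed response layout
        data = requests.get(gif_url, timeout=60).content
        (out / f"{query.replace(' ', '_')}_{i:04d}.gif").write_bytes(data)


if __name__ == "__main__":
    download_gifs("anime running", "./gifs/running")
```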
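
The "Fixing resolution" step says clips should at least match the training resolution and aspect ratio, and that trimming is possible with ffmpeg. Here is one way to do that step in bulk by calling ffmpeg from Python; the scale-then-center-crop filter chain targeting the 512x512 from config.yaml is a suggestion, not necessarily the exact command the author used.

```python
# Sketch of the resolution-fixing step: scale each clip so its short side is 512,
# then center-crop to 512x512 to match train_data.width/height in config.yaml.
# Suggested ffmpeg invocation, not the author's exact command.
import subprocess
from pathlib import Path


def fit_to_training_res(src: Path, dst: Path, size: int = 512) -> None:
    vf = (
        f"scale={size}:{size}:force_original_aspect_ratio=increase,"
        f"crop={size}:{size}"
    )
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-vf", vf, str(dst)], check=True)


if __name__ == "__main__":
    out_dir = Path("./vid_512")
    out_dir.mkdir(exist_ok=True)
    for clip in Path("./vid_raw").glob("*.mp4"):  # hypothetical input folder
        fit_to_training_res(clip, out_dir / clip.name)
```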
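
The evaluation step searches the dataset json for the prompts behind a poorly generated sample. Since the exact json schema depends on the tagging tool, here is a schema-agnostic sketch that walks whatever structure is in ./json/anime-v2.json (the json_path from config.yaml) and prints every caption string containing a query; the query string is just an example.

```python
# Sketch for "Evaluate training results": find which training captions match a
# sample prompt by walking the dataset JSON, without assuming a fixed schema.
import json
from typing import Any, Iterator, Tuple


def find_prompts(node: Any, query: str, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (json_path, text) pairs for every string value containing `query`."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield from find_prompts(v, query, f"{path}.{k}")
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from find_prompts(v, query, f"{path}[{i}]")
    elif isinstance(node, str) and query.lower() in node.lower():
        yield path, node


if __name__ == "__main__":
    with open("./json/anime-v2.json", encoding="utf-8") as f:
        data = json.load(f)
    for where, text in find_prompts(data, "girl running"):  # example query
        print(where, "->", text)
```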