(The English version follows the Japanese one.)

SD-XL 1.0-jp-base Model Card

総計5.8Bのパラメータを持つ画像生成モデル，SDXLを日本語入力に対応させたモデルです．ここではベースモデル(stable-diffusion-xl-base-1.0)の日本語対応版を公開しています．

学習戦略

ファインチューニング

stable-diffusion-xl-base-1.0に使われているテキストエンコーダである，OpenCLIP-ViT/G, CLIP-ViT/Lのみをファインチューニングすることにより，日本語入力に対応したテキストエンコーダを学習した．具体的には，英語のデータセットで学習されたオリジナルのテキストエンコーダに対して，英文を入力した際の出力(hidden states)と，新たに学習する日本語テキストエンコーダに同じ意味の日本語を入力した際の出力が一致するように学習を行った．学習データとして日英対訳データを利用し，日本語のtokenizerとしてはline-corporation/japanese-large-lm-3.6bを利用した．

語彙の類似度をベースとした単語埋め込みの初期化

日本語テキストエンコーダの効率的な学習と，対訳データに含まれない単語へのある程度の適応を期待して，オリジナルの英語のテキストエンコーダの単語埋め込みを利用した日本語の単語埋め込みの初期化を行なった．具体的には，日本語トークナイザーの語彙と，オリジナルの英語のトークナイザーの語彙全ての単語ベクトルをmultilingual-e5-largeを用いて計算し，全ての日本語・英語の語彙の組み合わせについて類似度を求めた．その後，日本語の各語彙に対応する単語(サブワード)ベクトルと最も類似する英語の単語を求め，その類似する英単語に対応するベクトルを日本語単語の単語埋め込みの初期値とした．

学習データ

WMT

WMT2023 Shared Task: General Machine Translationで利用される日英対訳コーパスである．本モデルの学習にはSKIM at WMT 2023 General Translation Taskでのモデルの学習のために利用されたフィルタリング済みのデータセットを利用した．対訳ペアの総数は28155494件である．

laion2B-multi

Schuhmann et al. (2022)によって公開された，画像とそのキャプションのペアで構成された大規模なデータセットである．本モデルの学習にはキャプションのみを用いた．前処理としてfasttextを用いて日本語キャプションのフィルタリングを行なった後，画像とキャプションの類似度が高い上位13221368件のキャプションを利用した．画像とキャプションの類似度の計算にはrinna/japanese-cloob-vit-b-16を用いた．日本語のキャプションは日英翻訳モデルを用いて翻訳し，英語のキャプションを生成した．翻訳モデルにはWMT22 General Machine TranslationタスクのチームNT5の提出システムの中で用いられている日英翻訳モデル，ABCI-baeeを利用した．

使用例

import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base_model_name_or_path = "cl-tohoku/stable-diffusion-xl-jp-base-1.0"
refiner_model_name_or_path = "cl-tohoku/stable-diffusion-xl-jp-refiner-1.0"

pipeline_base = StableDiffusionXLPipeline.from_pretrained(
    base_model_name_or_path,
    torch_dtype=torch.float16,
)

pipeline_refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    refiner_model_name_or_path,
    torch_dtype=torch.bfloat16,
)
pipeline_base = pipeline_base.to("cuda")
pipeline_refiner = pipeline_refiner.to("cuda")

n_steps = 100
high_noise_frac = 0.8
guidance_scale = 7.5
text = "かわいすぎる子猫"

with torch.autocast(
    device_type="cuda",
    dtype=torch.bfloat16
):
    image = pipeline_base(
        prompt=text,
        num_inference_steps=n_steps,
        denoising_end=high_noise_frac,
        guidance_scale=guidance_scale,
        output_type="latent",
    ).images[0]

    image = pipeline_refiner(
        prompt=text,
        num_inference_steps=n_steps,
        denoising_start=high_noise_frac,
        guidance_scale=guidance_scale,
        image=image,
    ).images[0]
    
image.save("image.png")

ライセンス

モデルはOpen RAIL++-Mライセンスの下で配布されています．

謝辞

このモデルの学習にあたり様々な面でご協力いただきましたTohoku NLPグループの皆様に感謝いたします．

SD-XL 1.0-jp-base Model Card

This model adapts SDXL, an image generation model with 5.8B parameters in total, to Japanese input. Here we release the Japanese-input version of the base model (stable-diffusion-xl-base-1.0).

Training Strategy

Fine-tuning

We trained text encoders that accept Japanese input by fine-tuning only the two text encoders used in stable-diffusion-xl-base-1.0, OpenCLIP-ViT/G and CLIP-ViT/L. Specifically, we trained each new Japanese text encoder so that its output (hidden states) for a Japanese sentence matches the output of the original text encoder, which was trained on English data, for an English sentence with the same meaning. We used Japanese-English parallel data as training data and line-corporation/japanese-large-lm-3.6b as the Japanese tokenizer.
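The matching objective can be written compactly as below. This is a sketch under assumptions, not the authors' stated formulation: the card only says that the two outputs should match, so the squared-error loss is an assumption, and the symbols $E_{\mathrm{en}}$, $E_{\mathrm{ja}}$, $\theta$ and $\mathcal{D}$ are notation introduced here for illustration. It also assumes both token sequences are padded to the same length so that the hidden states have the same shape.

```latex
% Assumed objective: squared error between hidden states.
% The model card does not name the loss; it only requires the outputs to match.
\mathcal{L}(\theta)
  = \frac{1}{|\mathcal{D}|}
    \sum_{(x_{\mathrm{en}},\, x_{\mathrm{ja}}) \in \mathcal{D}}
    \left\lVert E_{\mathrm{ja}}(x_{\mathrm{ja}};\theta) - E_{\mathrm{en}}(x_{\mathrm{en}}) \right\rVert_2^2
```

Here $E_{\mathrm{en}}$ is the original text encoder, used only to produce target hidden states, $E_{\mathrm{ja}}$ is the Japanese text encoder with trainable parameters $\theta$, and $\mathcal{D}$ is the set of English-Japanese sentence pairs.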

Vocabulary Similarity-Based Initialization of Word Embeddings

To train the Japanese text encoder efficiently, and in the hope of some degree of adaptation to words that do not appear in the parallel data, we initialized the Japanese word embeddings from the word embeddings of the original English text encoder. Specifically, we computed vectors for every entry in the Japanese tokenizer's vocabulary and in the original English tokenizer's vocabulary using multilingual-e5-large, and computed the similarity for every Japanese-English vocabulary pair. For each Japanese entry (subword), we then found the most similar English entry and used that English entry's embedding as the initial value of the Japanese entry's embedding.
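The initialization of the Japanese word embeddings from the English ones (each Japanese vocabulary entry starts from the embedding of its most similar English entry) can be sketched in a few lines. This is a minimal toy sketch, not the authors' code. The function names, the tiny vocabularies, and the 2-dimensional vectors are hypothetical stand-ins for multilingual-e5-large outputs, and cosine similarity is an assumption, since the card does not name the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def init_japanese_embeddings(ja_vocab_vecs, en_vocab_vecs, en_embeddings):
    """For each Japanese token, copy the embedding of its most similar English token."""
    ja_embeddings = {}
    for ja_tok, ja_vec in ja_vocab_vecs.items():
        # Exhaustive search over the English vocabulary for the nearest entry.
        best_en = max(
            en_vocab_vecs,
            key=lambda en_tok: cosine(ja_vec, en_vocab_vecs[en_tok]),
        )
        # Copy, so later training of the Japanese row does not alter the English row.
        ja_embeddings[ja_tok] = list(en_embeddings[best_en])
    return ja_embeddings


# Toy vocabulary vectors (stand-ins for multilingual-e5-large outputs).
en_vocab_vecs = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
ja_vocab_vecs = {"猫": [0.8, 0.2], "犬": [0.2, 0.7]}
# Toy embedding rows of the original English text encoder.
en_embeddings = {"cat": [1.0, 2.0, 3.0], "dog": [4.0, 5.0, 6.0]}

ja_embeddings = init_japanese_embeddings(ja_vocab_vecs, en_vocab_vecs, en_embeddings)
print(ja_embeddings)
```

With real vocabularies the all-pairs comparison is large, so a practical version would likely batch it as a matrix product rather than loop in Python.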

Training Data

WMT

A Japanese-English parallel corpus used in the WMT 2023 Shared Task: General Machine Translation. We used the filtered dataset that was used to train the model in SKIM at WMT 2023 General Translation Task. It contains 28,155,494 parallel pairs.

laion2B-multi

A large-scale dataset of image-caption pairs released by Schuhmann et al. (2022). We used only the captions to train this model. As preprocessing, we filtered for Japanese captions using fasttext, then kept the 13,221,368 captions with the highest image-caption similarity, computed with rinna/japanese-cloob-vit-b-16. We then translated the Japanese captions into English with ABCI-baee, the Japanese-English translation model used in team NT5's submission to the WMT 2022 General Machine Translation task.
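The caption selection described above (language filtering followed by a top-k cut on image-caption similarity) might look roughly like the sketch below. It is illustrative only: `select_captions`, `toy_is_japanese`, and the record fields are hypothetical names, the language check is a toy stand-in for a fasttext language-ID model, and the similarity scores are made up rather than computed with rinna/japanese-cloob-vit-b-16.

```python
import heapq


def select_captions(records, is_japanese, top_k):
    """Keep Japanese captions, then return the top_k by image-caption similarity."""
    japanese = [r for r in records if is_japanese(r["caption"])]
    best = heapq.nlargest(top_k, japanese, key=lambda r: r["similarity"])
    return [r["caption"] for r in best]


def toy_is_japanese(text):
    # Toy stand-in for a language-ID model: true if any hiragana or katakana appears.
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


# Made-up records; in the card, similarity comes from a Japanese CLOOB model.
records = [
    {"caption": "かわいい子猫の写真", "similarity": 0.31},
    {"caption": "a photo of a kitten", "similarity": 0.35},
    {"caption": "セール開催中", "similarity": 0.12},
    {"caption": "海辺を走る犬", "similarity": 0.28},
]

selected = select_captions(records, toy_is_japanese, top_k=2)
print(selected)
```

In the card's setting `top_k` would be 13221368, and the selected Japanese captions would then be machine-translated to form the English side of each pair.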

Example

import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base_model_name_or_path = "cl-tohoku/stable-diffusion-xl-jp-base-1.0"
refiner_model_name_or_path = "cl-tohoku/stable-diffusion-xl-jp-refiner-1.0"

# Load the Japanese-input base and refiner pipelines.
pipeline_base = StableDiffusionXLPipeline.from_pretrained(
    base_model_name_or_path,
    torch_dtype=torch.float16,
)

pipeline_refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    refiner_model_name_or_path,
    torch_dtype=torch.bfloat16,
)
pipeline_base = pipeline_base.to("cuda")
pipeline_refiner = pipeline_refiner.to("cuda")

# Total number of denoising steps, shared by both pipelines.
n_steps = 100
# Fraction of the denoising schedule handled by the base model;
# the refiner handles the remainder.
high_noise_frac = 0.8
guidance_scale = 7.5
# Japanese prompt: "a kitten that is too cute".
text = "かわいすぎる子猫"

with torch.autocast(
    device_type="cuda",
    dtype=torch.bfloat16
):
    # Run the base model up to high_noise_frac and return latents
    # instead of a decoded image.
    image = pipeline_base(
        prompt=text,
        num_inference_steps=n_steps,
        denoising_end=high_noise_frac,
        guidance_scale=guidance_scale,
        output_type="latent",
    ).images[0]

    # Continue denoising from the base latents with the refiner,
    # which decodes the final image.
    image = pipeline_refiner(
        prompt=text,
        num_inference_steps=n_steps,
        denoising_start=high_noise_frac,
        guidance_scale=guidance_scale,
        image=image,
    ).images[0]

image.save("image.png")

License

The model is distributed under the Open RAIL++-M license.

Acknowledgments

We thank the members of the Tohoku NLP Group for their support in many aspects of training this model.