metadata

license: other
license_name: mitsua-likes-by-nc
license_link: LICENSE
datasets:
  - Mitsua/vrm-color-concept-550k
  - Mitsua/art-museums-pd-440k
  - Mitsua/safe-commons-pd-3m
language:
  - ja
  - en
pipeline_tag: text-to-image
tags:
  - text-to-image
  - image-generation
  - mitsua-likes
  - legal
inference: true
extra_gated_prompt: >-
  By clicking "Agree", you agree to the terms of the [Mitsua Likes
  Attribution-NonCommercial
  License](https://elanmitsua.notion.site/Mitsua-Likes-Attribution-NonCommercial-License-15baa85a9b278038be5dc7f47a9c26cc)
  and acknowledge Abstract Engine's [Privacy
  Policy](https://abstractengine.ltd/en/privacypolicy/).
  "Agree"をクリックすることによって、[Mitsua Likes
  表示-非営利ライセンス](https://elanmitsua.notion.site/Mitsua-Likes-15baa85a9b278005bba5f30866a35f48)の規約を遵守し、
  Abstract
  Engineの[プライバシーポリシー](https://abstractengine.ltd/privacypolicy/)に同意するものとします。
extra_gated_fields:
  Name/名前: text
  Email: text
  Organization or Affiliation/組織または所属: text
  What do you intend to use the model for?/モデルの利用目的:
    type: select
    options:
      - Research
      - Personal use
      - Others
  I accept the terms of the Mitsua Likes BY-NC License and acknowledge that the information I provide will be collected and stored in accordance with Abstract Engine's Privacy Policy / 日本語：Mitsua Likes 表示-非営利ライセンスの規約を遵守し、Abstract Engineのプライバシーポリシーに従って提供した情報が収集及び保存されることに同意します: checkbox

Mitsua Likes : A Text-to-Image Diffusion Model trained on Opt-In Contributors' "Likes"

The Mitsua Likes model is Fairly Trained Certified.

Summary

Mitsua Likes is a Japanese/English text-to-image latent diffusion model and the base model for the activities of ELAN MITSUA (絵藍ミツア), an AI VTuber working under the concept of "art made together with everyone". It is trained solely on explicitly opted-in data, openly licensed data, and public domain data. Data generated by other AI models (images or text) is not included in the training data. The entire architecture (CLIP Text Encoder, VAE, and UNet) is trained from scratch without using knowledge from any other pre-trained model. In other words, the model does not depend, directly or indirectly, on existing datasets built from images or text scraped without permission. Mitsua Likes is certified as a licensed model by the US non-profit Fairly Trained, which indicates that it is not trained on copyrighted works without a license.

Mitsua Likes struggles with most modern concepts and has difficulty understanding complex prompts. Because of the characteristics of its training data, however, it does well at specific kinds of images, such as simple anime-style portraits and landscapes.

Trainable model waitlist

The VAE encoder weights in this repository have been re-initialized to prevent misuse. As a result, fine-tuning on images and image2image are not only prohibited by the terms but also technically impossible. For non-commercial research or personal creative purposes, you can join the waitlist for full model access, including the VAE encoder weights, through the Google Form below. Any additional training data or input data must be data you hold the copyright to or have explicit permission to use, and a summary of the training data will be made public. The other conditions are described in the Google Form.

My Mitsua Likes Waitlist Registration

Model Details

Developed by: ELAN MITSUA Project / Abstract Engine
Model type: Text-to-Image Latent Diffusion Model
Language(s): Japanese and English
License: Mitsua Likes Attribution-NonCommercial License (in English) / Mitsua Likes 表示-非営利ライセンス (in Japanese)
- "Mitsua Likes" attribution is required when sharing generated results. Commercial use is limited to an individual's own creative purposes. Using this model to train other models is prohibited.
For corporate commercial use, please get in touch through the contact form.

Usage

Install the Python packages

pip install transformers sentencepiece diffusers

Verified on the following versions.

transformers==4.44.2
diffusers==0.31.0
sentencepiece==0.2.0

from diffusers import DiffusionPipeline
import torch

# Use half precision on GPU; fall back to float32 on CPU,
# where float16 inference is poorly supported.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# trust_remote_code=True loads the custom pipeline code from the model repository.
pipe = DiffusionPipeline.from_pretrained("Mitsua/mitsua-likes", trust_remote_code=True).to(device, dtype=dtype)

# Japanese or English prompt
prompt = "滝の中の絵藍ミツア、先生アート"
# prompt = "elanmitsua in waterfall, sensei art, analog, impressionism painting"
negative_prompt = "elan doodle, lowres"

ret = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=5.0,
    guidance_rescale=0.7,
    width=768,
    height=768,
    num_inference_steps=40,
)

# Always check the output of the similarity detection model
print("Similarity Restriction:", ret.detected_public_fictional_characters[0])
print("Similarity Measure:")
for k, v in ret.detected_public_fictional_characters_info[0].items():
    print(f"{k} : {v:.3%}")

image = ret.images[0]
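
A small helper can make the similarity check above harder to skip. The sketch below assumes, based only on how the snippet above prints them, that `detected_public_fictional_characters[0]` is a boolean-like flag and that `detected_public_fictional_characters_info[0]` maps character names to scores between 0 and 1. The function name `summarize_similarity` and the character names in the usage lines are hypothetical and not part of the pipeline.

```python
def summarize_similarity(restricted, info, top_k=3):
    """Return (ok_to_share, top_matches) from the similarity check output.

    Assumption: `restricted` is truthy when the image resembles a licensed
    character, and `info` maps character names to scores in [0, 1].
    """
    ranked = sorted(info.items(), key=lambda kv: kv[1], reverse=True)
    return (not restricted), ranked[:top_k]


# Stand-in values; with a real result you would pass
# ret.detected_public_fictional_characters[0] and
# ret.detected_public_fictional_characters_info[0].
ok, top = summarize_similarity(False, {"character_a": 0.02, "character_b": 0.31})
for name, score in top:
    print(f"{name} : {score:.3%}")
if ok:
    print("No similarity restriction detected")
```

Returning the flag together with the ranked scores keeps the decision and its evidence in one place, which is convenient when logging generations.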

Model Architecture

CLIP Text Encoder

12-Layer masked text transformer
Tokenizer : SentencePiece tokenizer with a 64k vocabulary
Max length : 64 tokens
This text encoder is from Mitsua Japanese CLIP

VAE

The VAE is trained with a fully formula-based wavelet loss, so it does not depend on ImageNet in any way. (Note: the LPIPS perceptual loss is based on ImageNet.)
The VAE decoder is fine-tuned so that it embeds an invisible watermark in the image. We referred to the Stable Signature paper, but the implementation is our own.
Because the watermarking happens inside the VAE rather than as a post-process, the watermarking step cannot be removed when generating an image, which makes it easier to identify images generated by Mitsua Likes.
Num latent channels : 8
- Most latent diffusion models use 4 or 16 latent channels, but 8 latent channels provide a better balance between detail and compression efficiency for a smaller UNet.
Note : The VAE encoder weights in this repo are re-initialized to prevent unauthorized fine-tuning. If you need the VAE encoder weights, please apply through My Mitsua Likes Waitlist Registration.
Total training steps : 280k steps with batch size 240 at 256x256 resolution, taking about 800 RTX 4090 hours.
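
To illustrate what a formula-based reconstruction loss can look like, here is a minimal NumPy sketch of a multi-level Haar wavelet L1 loss. It is not the loss used to train this VAE, whose exact form is not described here. The choice of the Haar wavelet, the number of levels, the L1 distance, and the function names `haar_dwt2` and `wavelet_l1_loss` are all assumptions made for illustration. The point is only that such a loss needs no pre-trained network, unlike LPIPS.

```python
import numpy as np


def haar_dwt2(x):
    """One level of a 2D Haar transform on an (H, W) array with even H and W.

    Returns the low-pass band and a tuple of three detail bands.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0
    d1 = (a - b + c - d) / 2.0  # difference across columns
    d2 = (a + b - c - d) / 2.0  # difference across rows
    d3 = (a - b - c + d) / 2.0  # diagonal difference
    return low, (d1, d2, d3)


def wavelet_l1_loss(pred, target, levels=3):
    """Sum of mean absolute differences between Haar coefficients."""
    loss = 0.0
    for _ in range(levels):
        pred, pred_details = haar_dwt2(pred)
        target, target_details = haar_dwt2(target)
        for p, t in zip(pred_details, target_details):
            loss += np.mean(np.abs(p - t))
    # Compare the remaining coarse approximation as well.
    loss += np.mean(np.abs(pred - target))
    return float(loss)


rng = np.random.default_rng(0)
image = rng.random((256, 256))
noisy = image + 0.05 * rng.standard_normal((256, 256))
print(wavelet_l1_loss(image, image))  # identical inputs give zero loss
print(wavelet_l1_loss(noisy, image) > 0.0)
```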

UNet

The UNet architecture closely follows SDXL's UNet, but the number of parameters is reduced to fit our relatively small training data, based on the scalability survey by Hao Li et al.
- Transformer depth is reduced from [0,2,10] to about [0,2,3], with detailed adjustments.
- The number of input text encoders is reduced to one.
- The input channel width is increased to 384.
- The number of input latent channels is increased to 8.
- Cross-attention transformer layers in the midblocks are removed.
- As a result, this UNet has approximately 1.2B parameters, which is about half of SDXL's UNet.
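
The differences listed above can be summarized side by side. The sketch below is only an illustration. The `UNetSketch` class and its field names are hypothetical and do not correspond to the actual model config, the SDXL values are the commonly reported ones, and the 8x spatial downsampling factor of the VAE is an assumption that is not stated in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UNetSketch:
    """Illustrative hyperparameters only; field names are hypothetical."""
    base_channels: int
    latent_channels: int
    transformer_depth: tuple
    num_text_encoders: int
    mid_block_cross_attention: bool


# SDXL values as commonly reported for its UNet.
sdxl = UNetSketch(320, 4, (0, 2, 10), 2, True)
# Values taken from the list above ("about [0,2,3]" is approximate).
mitsua_likes = UNetSketch(384, 8, (0, 2, 3), 1, False)


def latent_shape(width, height, cfg, downsample=8):
    """Latent tensor shape (C, H, W); the 8x spatial factor is an assumption."""
    return (cfg.latent_channels, height // downsample, width // downsample)


print(latent_shape(768, 768, mitsua_likes))
print(latent_shape(1024, 576, mitsua_likes))
```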
The training procedure is almost the same as for existing diffusion models: progressive resolution training that ends with aspect bucket training.
- 256x256 --> 512x512 --> 768x768 w/ aspect bucket (1024x576 ~ 896x672 ~ 768x768 ~ 672x896 ~ 576x1024)
- Total training steps: 550k with batch size 216 to 1920, depending on resolution
- Training starts with an epsilon-prediction loss, then switches to v-prediction with zero terminal SNR in the final training stage.
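
Aspect bucket training groups images by shape so that each batch shares one resolution. The sketch below shows one common way to assign an image to a bucket, using only the five resolutions named above. The actual bucket set may be finer, and the log-ratio matching rule and the name `nearest_bucket` are assumptions for illustration.

```python
import math

# (width, height) buckets listed for the final 768px stage.
BUCKETS = [(1024, 576), (896, 672), (768, 768), (672, 896), (576, 1024)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest in log space."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


print(nearest_bucket(1920, 1080))  # 16:9 source image
print(nearest_bucket(1000, 1000))  # square source image
print(nearest_bucket(600, 800))    # 3:4 portrait
```

Comparing ratios in log space treats landscape and portrait symmetrically, so a 16:9 image and a 9:16 image are equally close to their respective buckets.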
Training the UNet is the most compute-intensive part. To stay on budget, UNet training had to be sped up.
- For faster convergence, the Min-SNR formulation and the Immiscible Diffusion technique are applied.
- We trained an extremely distilled version of the VAE encoder beforehand and used it for most of the training.
- To speed up training, MosaicML introduced pre-computed latents, but this is not an option for us because we need to augment images on the fly due to the small amount of training data.
- Instead, we noticed that very intensive inter-GPU communication was the bottleneck of the training.
- Thus, by splitting UNet and VAE encoder processing onto separate GPUs and concentrating UNet training on fewer GPUs, we minimized UNet synchronization overhead.
- These changes resulted in a 67% speed-up of UNet training. All training was done on a single 8xH100 node, and total UNet training took about 2,000 H100 GPU hours.
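
For readers unfamiliar with Min-SNR weighting and zero terminal SNR, the sketch below shows the published formulations in NumPy. It is not the training code of this model. The scaled-linear beta schedule, the 1000 steps, and `gamma=5.0` are assumed values, and the function names are hypothetical.

```python
import numpy as np


def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so the last timestep has SNR = 0."""
    alphas_bar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas


def min_snr_weight(snr, gamma=5.0, v_prediction=True):
    """Min-SNR-gamma loss weight per timestep.

    For v-prediction the weight is min(SNR, gamma) / (SNR + 1), which stays
    finite at SNR = 0; the epsilon form min(SNR, gamma) / SNR does not.
    """
    clipped = np.minimum(snr, gamma)
    return clipped / (snr + 1.0) if v_prediction else clipped / snr


# Scaled-linear schedule with 1000 steps (an assumed, SD-style schedule).
betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
betas = rescale_zero_terminal_snr(betas)
alphas_bar = np.cumprod(1.0 - betas)
snr = alphas_bar / (1.0 - alphas_bar)
weights = min_snr_weight(snr)
print(snr[-1], weights[-1])  # terminal SNR and its weight are both zero
print(weights.max() <= 1.0)
```

The v-prediction form of the weight is the reason the two techniques combine cleanly: with zero terminal SNR, the epsilon form would divide by zero at the last timestep.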

Character Similarity Determination Model

This model is a Swin Transformer multi-label classification model fine-tuned from Swin Base Multi Fractal 1k, which is pre-trained on Multi Fractal images.
The training data is a subset of the Mitsua Japanese CLIP model's training data.

This is an additional post-processing classifier that checks whether a generated image resembles certain licensed fictional characters. Because the training data lacks diversity, generated images often unintentionally resemble licensed characters. An additional check is therefore required to comply safely with the terms of the licensed characters.
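
In a multi-label classifier, each label gets its own independent probability, so an image can be flagged for several characters at once or for none. The sketch below illustrates that decision rule only. The label names, the logits, the 0.5 threshold, and the name `multilabel_flags` are hypothetical and say nothing about the thresholds the actual model uses.

```python
import math


def multilabel_flags(logits, threshold=0.5):
    """Independent sigmoid per label, unlike softmax, so several can fire."""
    probs = {name: 1.0 / (1.0 + math.exp(-z)) for name, z in logits.items()}
    flagged = [name for name, p in probs.items() if p >= threshold]
    return probs, flagged


# Hypothetical label names and logits, for illustration only.
probs, flagged = multilabel_flags(
    {"character_a": 2.0, "character_b": -3.0, "character_c": 0.4}
)
print(flagged)
```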

Intended Use

Generation of artworks for further creative endeavors
Research or education on generative models

Out-of-Scope Use

Infringing others' rights in any way (copyright, publicity rights, privacy, etc.) or causing harm to others is a misuse of this model. This includes, but is not limited to:

Discriminating against, defaming, or insulting others, thereby damaging their honor or credibility.
Infringing or potentially infringing the intellectual property rights or privacy of others.
Disseminating information or content that unjustly harms the interests of others.
Disseminating false information or content.

Please read the Mitsua Likes BY-NC "Prohibitions" for more details.

Limitations

These limitations come from the lack of diversity in the training data.

This model is hardly capable of depicting photorealistic people.
This model is hardly capable of interpreting long natural-language prompts.
This model is hardly capable of generating complex compositions.
This model is not very familiar with modern concepts.

Opt-in Contributors Credit

Sponsor Sensei

霧太郎/HAnS N Erhard sensei
pikurusu39 sensei
ムスビイト sensei
夢前黎 / つくよみちゃんプロジェクト sensei
Hussini sensei
力ナディス sensei
るな sensei

Thank you, as always!

All Mitsua Contributors Credit

霧太郎/HAnS N Erhard, pikurusu39, Hussini, 灯坂アキラ, ムスビイト, ネセヨレワ, 亞襲, E-Ken, とまこ, Nr. N, RI-YAnks, mkbt, 最中亜梨香/中森あか, 夢観士, KIrishusei, 長岡キヘイ, username_Kk32056, 相生創, 柊華久椰, nog, 加熱九真, amabox, 野々村のの, 嘯(しゃお), 夢前黎 / つくよみちゃんプロジェクト, みきうさぎ, るな, テラリソース / Tera Resource (素材系サークル), 力ナディス, とあ, 莉子, Roach=Jinx, ging ging.jpeg, 毛玉, 寝てる猫, ぽーたー, やえしたみえ, mizuchi, 262111, 乙幡皇斗羽, ゆう, とどめの35番, WAYA, 明煉瓦, 桐生星斗(投稿物生成物使用自由), rcc, ask, L, 弐人, 石川すゐす, Sulphuriy, 602e, 中屋, IRICOMIX, 琵來山まろり(画像加工可), とりとめ, cha, 鏡双司, YR, えれいた, mariedoi, あると, あああ, らどん, netai98, 脂質, つあ🌠, ろすえん, 善良, UranosEBi, lenbrant, 長谷川, 輝竜司 / citrocube, 詩原るいか, 末広うた, 翠泉, 月波清火, ゆぬ, 駒込ぴぺっこ, 原動機, ふわふわわ
(honorifics omitted)
Latest Mitsua Contributors Credit

Official Public Characters

We have obtained official permission to train on these Japanese fictional characters. The dataset includes officially provided images and fan art from opt-in contributors.

Training Data

For CLIP training data, please see the Mitsua Japanese CLIP model card.

For generative model training, our dataset is a mix of opt-in / openly licensed data and public domain / CC0 data. Pre-filtering based on metadata and captions is applied to exclude potentially rights-infringing, harmful, or NSFW data. For this pre-filtering, we built a database of 146,041 words containing artist names, celebrity names, fictional character names, trademarks, and bad words, based on Wikidata, which is licensed under CC0. We do not use actual photographs of recognizable human faces without explicit permission.
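
As a rough illustration of word-based pre-filtering, the sketch below drops any caption that contains a blocked term. It is not the project's filtering code. The two blocked terms are hypothetical stand-ins for the Wikidata-derived database, and the whole-word regular expression suits space-delimited text such as English. Japanese captions would need substring or tokenizer-based matching instead.

```python
import re


def build_blocklist_pattern(words):
    """Compile one case-insensitive regex matching any blocked term as a whole word."""
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def keep_caption(caption, pattern):
    """True if the caption contains no blocked term."""
    return pattern.search(caption) is None


# Hypothetical stand-ins for the Wikidata-derived word database.
pattern = build_blocklist_pattern(["Example Artist", "SomeBrand"])
captions = [
    "A landscape painting of a river at dawn",
    "Portrait in the style of Example Artist",
    "somebrand logo on a mug",
]
print([c for c in captions if keep_caption(c, pattern)])
```

Compiling all terms into one pattern means each caption is scanned once, which matters when the word list has six figures.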

"Mitsua Likes" Dataset : Our licensed data from opt-in contributors
- Contributors Credit (Attribution)
- Thumbnails of part of the training images can be browsed on our official website
- All training data can be browsed on our Discord server "Mitsua Contributors"
- All contributors were screened upon entry and all submitted images were human verified.
- An AI-generated content detector is used to exclude potentially AI-generated images.
- "3RG" licensed images and their captions are used to train this model. "3" or "3R" images are not used for training.
- Poly Haven HDRI images licensed under CC0 are used to augment background composition.
VRM Color Concept 550K (CC BY-NC 4.0, We curated this dataset.)
- Created by ELAN MITSUA Project / Abstract Engine
- Although this dataset is licensed under NC, we own it, and all assets used in it are under commercially permissive licenses (CC0 or explicit permission), so we can use this dataset commercially.
Safe Commons PD 3M (CC BY-SA 4.0, We curated this dataset.)
- This is a balanced and safe-to-use public domain / CC0 images dataset.
- All images and texts come from Wikimedia Commons and Wikidata with strict filtering.
- We used category tags to limit the data to safe use, and then applied word-based filtering to ensure the highest level of safety.
- Image licenses are either Public Domain or CC0 (varies by image).
- Text licenses are either CC0 or CC BY-SA (varies by caption source).
- No synthetic data (AI generated images or captions) is in the dataset.
- The Share-Alike condition does not apply, because we use only CC0 / public domain images for generative model training, even though the dataset itself is licensed under CC BY-SA.
- Curated by ELAN MITSUA Project / Abstract Engine.
Art Museums PD Dataset (CC BY 4.0, We curated this dataset.)
- Images and metadata were collected from these museums' open access programs. All images and metadata are shared under CC0 or Public Domain.
- We created image captions only from this metadata.
- Smithsonian Open Access (CC0)
- The Metropolitan Museum of Art Open Access (CC0)
- The Cleveland Museum of Art Open Access (CC0)
- The Art Institute of Chicago Open Access (CC0)
- Curated by ELAN MITSUA Project / Abstract Engine.

Even if a dataset itself is CC-licensed, we did not use any image in it that is not properly licensed, is based on unauthorized use of copyrighted works, or is based on the synthetic output of other pre-trained models.
English captions are translated into Japanese using the ElanMT model, which is trained solely on openly licensed corpora.
For additional tagging, the Mitsua Japanese Tagger model, which is trained solely on opt-in / openly licensed data, is used.

Disclaimer

The generated results may be highly inaccurate, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small amount of licensed data, and it is not suitable for use cases requiring high generation accuracy. ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.