Model Card for Model ID

This repo contains models that rate media into the categories PG, PG13, R, X, and XXX. Each model handles a single modality, and the models can be combined into an ensemble or a multimodal model. In the multimodal case, the single-modality models act as processor components that produce the inputs for a smaller Multilayer Perceptron (MLP).
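The multimodal arrangement can be sketched in a few lines: each single-modality model contributes one feature vector, the vectors are joined end to end, and a small MLP maps the result to five rating logits. This is a dependency-free illustration in plain Python, not the repo's implementation. The feature widths, the hidden size, the random weights, and the helper names (`fuse`, `mlp_forward`, `init_params`) are all assumptions made for the example.

```python
import random

# Illustrative feature widths per component model (assumed, not taken from the repo).
FEATURE_DIMS = {
    "resnet18": 512,
    "vit": 768,
    "resnet50_cv": 2048,
    "prompt_bert": 768,
    "prompt_roberta": 1024,
}
NUM_CLASSES = 5  # PG, PG13, R, X, XXX


def fuse(features):
    """Concatenate the per-model vectors in a fixed order."""
    fused = []
    for name, dim in FEATURE_DIMS.items():
        vec = features[name]
        if len(vec) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")
        fused.extend(vec)
    return fused


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_forward(x, params):
    """One hidden layer with ReLU, then a linear layer producing class logits."""
    hidden = [max(0.0, h) for h in linear(x, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def init_params(in_dim, hidden_dim, out_dim, seed=0):
    """Small random weights, only so the sketch can run end to end."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, in_dim),
        "b1": [0.0] * hidden_dim,
        "w2": matrix(out_dim, hidden_dim),
        "b2": [0.0] * out_dim,
    }


# Stand-in feature vectors; in practice these come from the finetuned models.
features = {name: [0.1] * dim for name, dim in FEATURE_DIMS.items()}
fused = fuse(features)
params = init_params(len(fused), hidden_dim=16, out_dim=NUM_CLASSES)
logits = mlp_forward(fused, params)
print(len(fused), len(logits))
```

Because the heavy feature extractors sit in the processor, only the small MLP has to learn from the fused vector. A real version would more likely use `torch.nn` layers than hand-written loops.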

Model Details

Model Description

The main model here is the multimodal model trained on 7/22/24. It was trained with a weighted soft F1 loss that emphasizes class 0 (PG). Its MultiModalProcessor uses a finetuned resnet18, ViT, resnet50 with cross validation, prompt Bert, and prompt Roberta. The processor passes each modality through the matching models and returns each model's last hidden layer. These vectors are concatenated to form the input to the multimodal model's MLP.
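For readers unfamiliar with soft F1, the idea is to replace hard true/false positive counts with sums of predicted probabilities, then average `1 - F1` over classes with per-class weights. The sketch below is one common formulation, written in plain Python so the arithmetic is visible. It is not the training code used for this model, and the weight values are hypothetical: the card only states that class 0 (PG) is emphasized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_soft_f1_loss(batch_logits, labels, class_weights, eps=1e-8):
    """Weighted mean over classes of (1 - soft F1).

    batch_logits:  list of per-sample logit lists, shape [n][num_classes]
    labels:        list of integer class ids, length n
    class_weights: one non-negative weight per class
    """
    probs = [softmax(row) for row in batch_logits]
    loss = 0.0
    for c, weight in enumerate(class_weights):
        # Soft counts: probabilities stand in for hard 0/1 predictions.
        tp = sum(p[c] for p, y in zip(probs, labels) if y == c)
        fp = sum(p[c] for p, y in zip(probs, labels) if y != c)
        fn = sum(1.0 - p[c] for p, y in zip(probs, labels) if y == c)
        soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
        loss += weight * (1.0 - soft_f1)
    return loss / sum(class_weights)


# Hypothetical weights: class 0 (PG) counts twice as much as each other class.
weights = [2.0, 1.0, 1.0, 1.0, 1.0]
labels = [0, 1, 2, 3, 4]

# Logits that put almost all probability on the true class.
good_logits = [[10.0 if c == y else 0.0 for c in range(5)] for y in labels]

# Same batch, but the PG sample is confidently predicted as class 1.
pg_missed = [row[:] for row in good_logits]
pg_missed[0] = [0.0, 10.0, 0.0, 0.0, 0.0]

print(weighted_soft_f1_loss(good_logits, labels, weights))
print(weighted_soft_f1_loss(pg_missed, labels, weights))
```

With these weights, a missed PG sample costs more than the mirror-image mistake on another class, which is the effect the emphasis on class 0 is meant to achieve. A training version would express the same sums with tensor operations so gradients can flow.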

Each model was trained on the same balanced, downsampled dataset found here. Please note: this dataset contains some mislabeled data in every label. The resnet50-CV is the only model that may have different training/test set data because of the cross-validation search; however, no data used for evaluation was found in the training/test sets. The evaluation data is a private dataset labeled by Wolfgang Black and Seb at CivitAI.

Developed by: Wolfgang Black
Model type: Multimodal
Language(s) (NLP): English
Finetuned from model [optional]: Various, due to the multimodal nature. However, only the MLP was truly trained from scratch.

Model Sources [optional]

ResNets

Link - https://pytorch.org/vision/main/models/resnet.html
Note: models were initialized with weights = 'ImageNetV1'

ViT

Repository: https://huggingface.co/google/vit-base-patch16-224
Paper [optional]: https://arxiv.org/abs/2010.11929

DistilBert

This model is the basis for promptBert

Repository: https://huggingface.co/distilbert/distilbert-base-uncased
Paper: https://arxiv.org/abs/1910.01108

Roberta

This model is the basis for promptRoberta

Repository: https://huggingface.co/FacebookAI/roberta-large-mnli
Paper: https://arxiv.org/abs/1907.11692

Uses

These models should be used to classify generated images or text into movie-style ratings (PG, PG13, R, X, XXX).

How to Get Started with the Model

Warning: The code for the Multimodal Config, Processor, and Model is not included here. The code snippet below assumes users already have that code.

import torch
from src.multimodal_model import MultimodalConfig, MultimodalModel, MultimodalProcessor

model_dir = ''  # where the multimodal directory is
config = MultimodalConfig.from_pretrained(model_dir)
# assumes composite models exist in directories as specified by config
model = MultimodalModel.from_pretrained(model_dir, config=config)
processor = MultimodalProcessor(models=config.models)
model.eval()

# `inputs` must be prepared before this point (the processor call is not shown here).
# assumes inputs as pil.Image, text = None | str(prompt), tags = None | str(tags), label = None | str
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs['logits']

# index of the highest logit, then its rating label from the config
pred_id = torch.argmax(logits, dim=1).item()
prediction = model.config.id2label[pred_id]
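
The final step of the snippet, turning logits into a rating, can be tried in isolation. The sketch below also reports the softmax probability of the chosen class, which can help flag uncertain predictions. The `ID2LABEL` mapping is an assumption for illustration: the card confirms only that class 0 is PG, and the real mapping lives in the model config's `id2label`. The `rate` helper is hypothetical as well.

```python
import math

# Assumed label order; only "class 0 is PG" is stated in this card.
ID2LABEL = {0: "PG", 1: "PG13", 2: "R", 3: "X", 4: "XXX"}


def rate(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Example logits for a single input (made-up values).
label, confidence = rate([0.1, 2.0, 0.3, 0.0, -1.0])
print(label, round(confidence, 3))
```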

Out-of-Scope Use

Currently, all models are untested on videos.

Bias, Risks, and Limitations

All models were either finetuned (the composite models) or trained from scratch (the MLP) entirely on generated images, so they may not work well on real images or non-digital media.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. These include poor labels for PG13/R, which stem from personal bias in the dataset, and the fact that all training data consists of generated images.