Sales Forecasting with Image Regression

Community Article Published May 24, 2024



In this article we're going to train our own image regression model and then use it to predict sales given a product image. Image regression is a machine learning technique that predicts a continuous numerical value from an image. We'll use my simple Image Regression Model Trainer tool for training, uploading and inference.

One of the primary motivations for this project was that, at the time of writing, I couldn't find any resources on 🤗 about image regression. Image Regression Model Trainer is built on top of 🤗 Transformers and PyTorch and was designed to integrate into the 🤗 ecosystem.


The model trainer takes a 🤗 dataset id as input so your dataset must be uploaded to 🤗. It should have a column of images and a column of values (floats or ints). Check out 🤗 Create an image dataset if you need help creating a 🤗 image dataset. You'll need to format your images in a folder with a metadata.csv file like so:


Your metadata.csv file will look something like this:
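As a rough sketch of the layout described above: the imagefolder loader expects the images and a metadata.csv in the same folder, with a file_name column linking each row to an image and one more column holding the value. The file names, the sales column name and the numbers below are made up for illustration.

```python
import csv
import io

# Hypothetical rows: the image file names and sales values are placeholders.
rows = [
    {"file_name": "0001.png", "sales": 120},
    {"file_name": "0002.png", "sales": 45},
    {"file_name": "0003.png", "sales": 300},
]

def build_metadata_csv(rows, value_column="sales"):
    """Return metadata.csv content with a file_name column and a value column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file_name", value_column], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

metadata_text = build_metadata_csv(rows)
print(metadata_text)
```

Saving metadata_text as metadata.csv next to the images gives a folder that load_dataset("imagefolder", ...) can read.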


To upload it to the 🤗 Hub:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
dataset.push_to_hub("your-username/your-dataset-name")  # placeholder repo id

Your dataset should look like tonyassi/clothing-sales-ds (the values column can be named whatever you'd like).


Image Regression using PyTorch and 🤗 Transformers

Our image regression model will be a fine-tuned version of Google's Vision Transformer (ViT). Google ViT processes 224x224 pixel images by dividing them into 16x16 pixel patches for tasks like image classification. You can read more about it in the paper.
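To make the patching concrete, here's the arithmetic for a 224x224 image cut into 16x16 patches. The extra [CLS] token in the last line is how the ViT architecture summarizes the whole image; the variable names are just for illustration.

```python
image_size = 224   # input resolution in pixels (from the model name)
patch_size = 16    # side length of each square patch

patches_per_side = image_size // patch_size      # patches along one edge
num_patches = patches_per_side ** 2              # patches in the whole image
values_per_patch = patch_size * patch_size * 3   # raw RGB values in one patch
sequence_length = num_patches + 1                # patches plus one [CLS] token

print(patches_per_side, num_patches, values_per_patch, sequence_length)
# prints: 14 196 768 197
```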


We'll need to customize the model to output a continuous numerical value instead of an image classification label. Let's take a peek under the hood of the Image Regression Model Trainer to see how the model is defined.

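import torch.nn as nn
from transformers import ViTModel

class ViTRegressionModel(nn.Module):
    def __init__(self):
        super(ViTRegressionModel, self).__init__()
        # Pre-trained ViT backbone from the 🤗 Hub
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224')
        # Regression head: hidden_size features in, one value out
        self.classifier = nn.Linear(self.vit.config.hidden_size, 1)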

Let's break it down. This line of code loads a pre-trained Vision Transformer model (ViT) google/vit-base-patch16-224 from the Hugging Face model hub.

self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224')

This line defines a linear layer (fully connected layer) that takes the hidden size of the ViT model as input and outputs a single value. This layer will be used to output the predicted regression value.

self.classifier = nn.Linear(self.vit.config.hidden_size, 1)
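To see what that linear layer does, here's a minimal sketch of how such a head might turn a ViT embedding into one number per image. It uses random tensors in place of real ViT outputs so it runs without downloading the model. Feeding the first ([CLS]) token into the head and scoring it with MSE are assumptions for illustration, not a quote of the trainer's code, and the target values are made up.

```python
import torch
import torch.nn as nn

hidden_size = 768              # hidden size of vit-base
batch_size, seq_len = 4, 197   # 196 patches + 1 [CLS] token

# Stand-in for the ViT's last_hidden_state: (batch, tokens, hidden_size).
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Same shape as self.classifier above: hidden_size in, one value out.
head = nn.Linear(hidden_size, 1)

cls_embedding = last_hidden_state[:, 0, :]       # (batch, hidden_size)
predictions = head(cls_embedding).squeeze(-1)    # (batch,)

# Hypothetical targets, e.g. the sales figure for each image.
targets = torch.tensor([120.0, 45.0, 300.0, 80.0])
loss = nn.functional.mse_loss(predictions, targets)

print(predictions.shape, loss.item())
```

Each image ends up with exactly one predicted value, which is what makes this regression rather than classification.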

The other important component of the Image Regression Model Trainer is the 🤗 Transformers Trainer. The Trainer is a complete training and evaluation loop for PyTorch models so that you only need to pass it the necessary pieces for training (model, dataset, training hyperparameters, etc.) and the Trainer class takes care of the rest.

Here is what the TrainingArguments look like:

training_args = TrainingArguments(...)

All we need to do is give the Trainer our model, arguments, and dataset:

model = ViTRegressionModel()
trainer = Trainer(...)

The Image Regression Model Trainer abstracts these details away from us so we don't need to dive too deep into the PyTorch/Transformers code.


Download Image Regression Model Trainer from GitHub:

git clone
cd ImageRegression


Install the required libraries:

pip install -r requirements.txt


Let's finally train our model! Image Regression Model Trainer makes it really easy. If you're using your own dataset make sure it's uploaded to the 🤗 Hub correctly. The value_column_name variable will be the column name of your values. Feel free to experiment with test_split, num_train_epochs and learning_rate (the values below are a good starting place).

  • dataset_id: 🤗 dataset id
  • value_column_name: column name of the prediction values in the dataset
  • test_split: fraction of the dataset held out as the test split
  • output_dir: the directory where the checkpoints will be saved
  • num_train_epochs: number of training epochs
  • learning_rate: learning rate
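To illustrate what test_split controls, here's a small sketch of a shuffled train/test split over example indices. The function name and the seed are hypothetical, and the trainer's own splitting logic may differ.

```python
import random

def split_indices(num_examples, test_split=0.2, seed=42):
    """Shuffle example indices and hold out a test_split fraction for evaluation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_examples * test_split))
    # The first num_test shuffled indices become the test set, the rest train.
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(100, test_split=0.2)
print(len(train_idx), len(test_idx))
# prints: 80 20
```

A larger test_split gives a more reliable evaluation score but leaves fewer images to learn from, which matters for small datasets.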

The trainer will save the checkpoints in the output_dir location. The model.safetensors file contains the trained weights you'll use for inference (prediction).

As the model is being trained you should see some information being printed out. The Mean Squared Error (MSE) is a common loss function used in regression tasks: it averages the squared differences between the predicted values and the actual values. You should see this value going down after each epoch.
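Here's a tiny worked example of how MSE is computed, using made-up predicted and actual sales numbers.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical values for three products.
predicted_sales = [110.0, 50.0, 290.0]
actual_sales = [120.0, 45.0, 300.0]

mse = mean_squared_error(predicted_sales, actual_sales)
print(mse)
# prints: 75.0  (the squared errors are 100, 25 and 100)
```

Because the errors are squared, MSE is in squared units of your value column and large misses are penalized much more than small ones.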

Upload Model

Uploading your model to the 🤗 Hub is recommended because it'll make inference a lot easier and it'll autogenerate a model card. You'll need to pick a unique name for the model (model_id), generate a token, and define the checkpoint folder. Go to the output_dir location of the training run and you should see checkpoint folders; pick the latest checkpoint.

  • model_id: the name of the model id
  • token: a 🤗 access token (create a new one in your 🤗 account settings)
  • checkpoint_dir: checkpoint folder that will be uploaded
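If you'd rather not pick the latest checkpoint by eye, here's a small sketch that finds it for you. It assumes the folders follow the checkpoint-<step> naming that the 🤗 Trainer normally uses; the helper name is hypothetical, and the demo builds throwaway folders so it runs anywhere.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    candidates = []
    for path in Path(output_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by step number, not by name, so 200 beats 30.
    return max(candidates, key=lambda item: item[0])[1]

# Demo with temporary folders standing in for a real output_dir.
with tempfile.TemporaryDirectory() as tmp:
    for step in (10, 200, 30):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(tmp).name)
    # prints: checkpoint-200
```

Sorting by the numeric step matters because a plain alphabetical sort would put checkpoint-30 after checkpoint-200.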

Go to your 🤗 profile to find your uploaded model; it should look similar to tonyassi/sales-prediction. The upload function autogenerates a model card for you which lists the dataset information, training parameters and instructions on how to use it.


Now we can use our custom trained model to predict a float value given an image. In our example our dataset is product images and sales, so we can use this model to forecast the sales for new products. You'll need the model repo id from the previous step and an image path to predict its value.

  • repo_id: 🤗 repo id of the model
  • image_path: path to the image

The first time this function is called it'll download the model's safetensors file. Subsequent function calls will run faster.

Additional Applications

This approach to image regression can be used for many different applications besides sales forecasting.

  • Predicting a person's age from their image
  • Scoring an image based on aesthetics
  • Predicting the size of tumors in medical images such as MRIs or CT scans
  • Estimating the yield of crops based on aerial or satellite images
  • Assessing air or water quality by analyzing images of skies or water bodies
  • Predicting traffic density or vehicle count from road images
  • Any machine learning problem where you need to predict a number given an image

About Me

Hello, my name is Tony Assi. I'm a designer based in Los Angeles. I have a background in software, fashion, and marketing. I currently work for an e-commerce fashion brand. Check out my 🤗 profile for more apps, models and datasets.

Feel free to send me an email with any questions, comments, business inquiries or job offers.