Tabular Regression

Tabular regression is the task of predicting a numerical value given a set of attributes.

Example: given a table of car attributes, predict fuel efficiency.

Car Name         Horsepower  Weight (lbs)
ford torino      140         3,449
amc hornet       97          2,774
toyota corolla   65          1,773

A tabular regression model takes these features as input and predicts a numerical target such as MPG (miles per gallon).

About Tabular Regression

About the Task

Tabular regression is the task of predicting a numerical value given a set of attributes/features. "Tabular" means the data is stored in a table (like a spreadsheet), with each sample in its own row. The features used to predict the target can be both numerical and categorical. However, categorical features usually require additional preprocessing/feature engineering, although a few models, such as CatBoost, accept categorical features directly. An example of tabular regression would be predicting the weight of a fish given its species and length.
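A minimal sketch of handling a categorical feature for the fish example, using one-hot encoding so a standard model sees only numbers (all data here is made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical fish measurements: "species" is categorical, "length_cm" is numerical
data = pd.DataFrame({
    "species": ["bream", "pike", "bream", "perch"],
    "length_cm": [25.0, 40.0, 30.0, 20.0],
    "weight_g": [340.0, 430.0, 500.0, 150.0],
})

# One-hot encode the categorical column into indicator columns
X = pd.get_dummies(data[["species", "length_cm"]], columns=["species"])
y = data["weight_g"]

model = LinearRegression().fit(X, y)
```

A model like CatBoost would skip the `get_dummies` step and accept the `species` column as-is.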

Use Cases

Sales Prediction: a Use Case for Predicting a Continuous Target Variable

Here the objective is to predict a continuous variable based on a set of input variables. For example, predicting the sales of an ice cream shop based on the temperature and the number of hours the shop was open. We can build a regression model with temperature and opening hours as input variables and sales as the target variable.
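A minimal sketch of this use case (all numbers are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical ice cream shop records
shop = pd.DataFrame({
    "temperature_c": [18, 24, 30, 35, 22],
    "hours_open": [6, 8, 10, 10, 7],
    "sales": [120, 200, 310, 400, 180],
})

X = shop[["temperature_c", "hours_open"]]  # input variables
y = shop["sales"]                          # continuous target

model = LinearRegression().fit(X, y)

# Predict sales for a 28°C day with the shop open 9 hours
pred = model.predict(pd.DataFrame({"temperature_c": [28], "hours_open": [9]}))
```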

Missing Value Imputation for Other Tabular Tasks

In real-world applications, some input values can be missing due to human error or simply because they were never recorded. Considering the example above, say the shopkeeper's watch was broken and they forgot to record how long the shop was open. This leads to a missing value in the dataset. Such missing values could be replaced with zero, or with the average number of hours the shop is kept open. Another approach is to use the temperature and sales variables to predict the missing hours value.
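Both imputation strategies can be sketched as follows (made-up numbers, continuing the shop example):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical records with one missing "hours_open" value
shop = pd.DataFrame({
    "temperature_c": [18, 24, 30, 35],
    "hours_open": [6.0, 8.0, np.nan, 10.0],
    "sales": [120, 200, 310, 400],
})

# Option 1: fill the gap with the column mean
mean_filled = shop["hours_open"].fillna(shop["hours_open"].mean())

# Option 2: treat "hours_open" itself as a regression target and
# predict the missing value from the temperature and sales columns
known = shop[shop["hours_open"].notna()]
imputer = LinearRegression().fit(known[["temperature_c", "sales"]],
                                 known["hours_open"])
missing = shop[shop["hours_open"].isna()]
predicted_hours = imputer.predict(missing[["temperature_c", "sales"]])
```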

Model Training

A simple regression model can be created using sklearn as follows:

from sklearn.linear_model import LinearRegression

# set the input features
X = data[["Feature 1", "Feature 2", "Feature 3"]]
# set the target variable
y = data["Target Variable"]
# initialize the model
model = LinearRegression()
# fit the model
model.fit(X, y)
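The snippet above assumes a pandas DataFrame named `data` already exists. A self-contained version with made-up numbers, including a prediction on unseen rows:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data standing in for the "data" DataFrame;
# the target is an exact linear function of the features
data = pd.DataFrame({
    "Feature 1": [1.0, 2.0, 3.0, 4.0],
    "Feature 2": [4.0, 3.0, 5.0, 1.0],
    "Feature 3": [10.0, 20.0, 10.0, 20.0],
    "Target Variable": [6.0, 7.0, 9.0, 7.0],
})

X = data[["Feature 1", "Feature 2", "Feature 3"]]
y = data["Target Variable"]
model = LinearRegression().fit(X, y)

# Predict on a new row with the same feature columns
new_rows = pd.DataFrame({"Feature 1": [5.0],
                         "Feature 2": [2.0],
                         "Feature 3": [30.0]})
prediction = model.predict(new_rows)  # → 10.0, since the relationship is exactly linear
```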

Model Hosting and Inference

You can use skops for model hosting and inference on the Hugging Face Hub. This library is built to improve the production workflows of various libraries used to train tabular models, including sklearn and XGBoost. Using skops you can:

  • Easily use Inference Endpoints,
  • Build neat UIs with one line of code,
  • Programmatically create model cards,
  • Securely serialize your models. (See limitations of using pickle here.)

You can push your model as follows:

from skops import hub_utils
# initialize a repository with a trained model
local_repo = "/path_to_new_repo"
hub_utils.init(model, dst=local_repo)
# push to Hub!
hub_utils.push("username/my-awesome-model", source=local_repo)

Once the model is pushed, you can run inference easily.

import skops.hub_utils as hub_utils
import pandas as pd
data = pd.DataFrame(your_data)
# Load the model from the Hub
res = hub_utils.get_model_output("username/my-awesome-model", data)

You can launch a UI for your model with only one line of code!

import gradio as gr
# "username/my-awesome-model" is a placeholder; the repo must be
# supported by the Hub's hosted inference API
gr.load("models/username/my-awesome-model").launch()

Useful Resources

Training your own model in just a few seconds

We have built a baseline trainer application to which you can drag and drop your dataset. It will train a baseline model and push it to your Hugging Face Hub profile, together with a model card containing information about the model.

This page was made possible thanks to the efforts of Brenden Connors and Ayush Bihani.

Models for Tabular Regression

Example: fish weight prediction based on length measurements and species.

Datasets for Tabular Regression

Example: a comprehensive curation of datasets covering all benchmarks.

Spaces using Tabular Regression

Example: an application that predicts the weight of a fish based on a set of attributes.

Metrics for Tabular Regression
Mean Squared Error (MSE) is the average of the squared differences between the predicted and actual values.
The coefficient of determination (R-squared) is a measure of how well the model fits the data; a higher R-squared indicates a better fit.
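Both metrics are available in sklearn; a minimal sketch with hypothetical actual and predicted values:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual vs. predicted targets
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true, y_pred)  # 0.125
r2 = r2_score(y_true, y_pred)             # 0.975
```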