Tabular Classification / Regression
Using AutoTrain, you can train a model to classify or regress tabular data easily. All you need to do is select from a list of models and upload your dataset. Parameter tuning is done automatically.
Models
The following models are available for tabular classification / regression.
- xgboost
- random_forest
- ridge
- logistic_regression
- svm
- extra_trees
- gradient_boosting
- adaboost
- decision_tree
- knn
Data Format
```csv
id,category1,category2,feature1,target
1,A,X,0.3373961604172684,1
2,B,Z,0.6481718720511972,0
3,A,Y,0.36824153984054797,1
4,B,Z,0.9571551589530464,1
5,B,Z,0.14035078041264515,1
6,C,X,0.8700872583584364,1
7,A,Y,0.4736080452737105,0
8,C,Y,0.8009107519796442,1
9,A,Y,0.5204774795512048,0
10,A,Y,0.6788795301189603,0
...
```
Columns
Your CSV dataset must have two columns: `id` and `target`.
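For reference, a minimal dataset in this format can be created and sanity-checked from the shell. The file name and values below are illustrative, not part of AutoTrain itself:

```shell
# Write a tiny dataset matching the format above (values are made up).
cat > train.csv <<'EOF'
id,category1,category2,feature1,target
1,A,X,0.34,1
2,B,Z,0.65,0
3,A,Y,0.37,1
EOF

# Check that the required id and target columns appear in the header.
head -n 1 train.csv | grep -q 'id' && head -n 1 train.csv | grep -q 'target' && echo "header OK"
```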
Parameters
❯ autotrain tabular --help
usage: autotrain <command> [<args>] tabular [-h] [--train] [--deploy] [--inference] [--username USERNAME]
[--backend {local-cli,spaces-a10gl,spaces-a10gs,spaces-a100,spaces-t4m,spaces-t4s,spaces-cpu,spaces-cpuf}]
[--token TOKEN] [--push-to-hub] --model MODEL --project-name PROJECT_NAME [--data-path DATA_PATH]
[--train-split TRAIN_SPLIT] [--valid-split VALID_SPLIT] [--batch-size BATCH_SIZE] [--seed SEED]
--target-columns TARGET_COLUMNS [--categorical-columns CATEGORICAL_COLUMNS]
[--numerical-columns NUMERICAL_COLUMNS] --id-column ID_COLUMN --task {classification,regression}
[--num-trials NUM_TRIALS] [--time-limit TIME_LIMIT] [--categorical-imputer {most_frequent,None}]
[--numerical-imputer {mean,median,None}] [--numeric-scaler {standard,minmax,normal,robust}]
✨ Run AutoTrain Tabular Data Training
options:
-h, --help show this help message and exit
--train Command to train the model
--deploy Command to deploy the model (limited availability)
--inference Command to run inference (limited availability)
--username USERNAME Hugging Face Hub Username
--backend {local-cli,spaces-a10gl,spaces-a10gs,spaces-a100,spaces-t4m,spaces-t4s,spaces-cpu,spaces-cpuf}
Backend to use: default or spaces. Spaces backend requires push_to_hub & username. Advanced users only.
--token TOKEN Your Hugging Face API token. Token must have write access to the model hub.
  --push-to-hub        Push the trained model to the Hugging Face model hub after training.
--model MODEL Base model to use for training
--project-name PROJECT_NAME
Output directory / repo id for trained model (must be unique on hub)
--data-path DATA_PATH
Train dataset to use. When using cli, this should be a directory path containing training and validation data in appropriate
formats
--train-split TRAIN_SPLIT
Train dataset split to use
--valid-split VALID_SPLIT
Validation dataset split to use
--batch-size BATCH_SIZE
Training batch size to use
--seed SEED Random seed for reproducibility
--target-columns TARGET_COLUMNS
Specify the names of the target or label columns separated by commas if multiple. These columns are what the model will
predict. Required for defining the output of the model.
--categorical-columns CATEGORICAL_COLUMNS
List the names of columns that contain categorical data, useful for models that need explicit handling of such data.
Categorical data is typically processed differently from numerical data, such as through encoding. If not specified, the
model will infer the data type.
--numerical-columns NUMERICAL_COLUMNS
Identify columns that contain numerical data. Proper specification helps in applying appropriate scaling and normalization
techniques, which can significantly impact model performance. If not specified, the model will infer the data type.
--id-column ID_COLUMN
Specify the column name that uniquely identifies each row in the dataset. This is critical for tracking samples through the
model pipeline and is often excluded from model training. Required field.
--task {classification,regression}
Define the type of machine learning task, such as 'classification', 'regression'. This parameter determines the model's
architecture and the loss function to use. Required to properly configure the model.
--num-trials NUM_TRIALS
Set the number of trials for hyperparameter tuning or model experimentation. More trials can lead to better model
configurations but require more computational resources. Default is 100 trials.
--time-limit TIME_LIMIT
Impose a time limit (in seconds) for training or searching for the best model configuration. This helps manage resource
allocation and ensures the process does not exceed available computational budgets. The default is 3600 seconds (1 hour).
--categorical-imputer {most_frequent,None}
Select the method or strategy to impute missing values in categorical columns. Options include 'most_frequent' and
'None'. Correct imputation can prevent biases and improve model accuracy.
--numerical-imputer {mean,median,None}
Choose the imputation strategy for missing values in numerical columns. Common strategies include 'mean' and 'median'.
Accurate imputation is vital for maintaining the integrity of numerical data.
--numeric-scaler {standard,minmax,normal,robust}
Determine the type of scaling to apply to numerical data. Examples include 'standard' (zero mean and unit variance) and
'minmax' (scaled to a given range). Scaling is essential for many algorithms to perform optimally.
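Putting the required flags together, a local training run might look like the sketch below. The model choice, project name, trial count, and time limit are illustrative, and `--data-path` should point at a directory containing your CSV in the expected split layout:

```shell
# Sketch of a local tabular classification run; names and paths are placeholders.
autotrain tabular --train \
  --model xgboost \
  --project-name my-tabular-project \
  --data-path ./data \
  --target-columns target \
  --id-column id \
  --task classification \
  --num-trials 10 \
  --time-limit 600
```

The `--target-columns`, `--id-column`, `--task`, `--model`, and `--project-name` flags are required, per the help output above; everything else falls back to its documented default.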