How to run a regression using Hugging Face — An example with financial news to predict stock returns

Community Article Published October 4, 2025

Do you have a dataset with text and numerical labels?

The common approach is to discretize the numerical labels into categories, and then run a text classification task, following the standard Hugging Face tutorials.

But what if you want your model to predict the exact numerical value instead of a category?

In other words: you have a text as input and a number as the label, and you want the model to learn how to map the text directly to that number.

This article shows you how to do exactly that: run a regression with Hugging Face.

Step 1: Load the dataset

First, we need a dataset containing both text and numerical labels. For this tutorial, I’ll use a dataset I created from CNBC news. Each entry contains the reason behind a stock price variation and the corresponding numeric change. I processed the dataset to capture the causes behind specific stock movements.

You can download it manually from my Hugging Face page or load it directly with the datasets library:

from datasets import load_dataset

dataset = load_dataset("SelmaNajih001/FinancialClassification")
print(dataset)

Step 2: Filter the dataset

For training, we only need the text and the numerical label. We cannot simply run a regression on the entire dataset without filtering for a specific stock, because different stocks move for different reasons.

For example:

An interest rate hike affects growth and value stocks differently.
Supply chain issues may impact Tesla but not a financial institution.

Thus, it’s important to train the model on a subset focusing on a single stock. Here, I’ll filter the dataset for Tesla-related news and remove extreme outliers in price variation:

from datasets import DatasetDict

def clean_dataset(ds):
    # Keep only relevant columns
    ds = ds.select_columns(["Reasons", "PriceVariation", "Stock", "explanation_summary"])
    # Filter out extreme price variations
    ds = ds.filter(lambda x: -5 < x["PriceVariation"] < 5)
    # Keep only Tesla-related rows (case-insensitive)
    ds = ds.filter(lambda x: "tesla" in x["Stock"].lower())
    return ds

cleaned_dataset = DatasetDict({
    split: clean_dataset(ds)
    for split, ds in dataset.items()
})

# Keep only text and label columns
cleaned_dataset = cleaned_dataset.select_columns(["Reasons", "PriceVariation"])
print(cleaned_dataset)
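To make the two filter conditions concrete, here is a small sketch on hand-written rows. The rows and the helper name keep_row are invented for illustration; only the column names and the threshold of 5 come from the code above. Note that the bounds are exclusive, so a variation of exactly 5 is dropped.

```python
# Toy rows that mimic the dataset's columns; the values are invented for illustration.
rows = [
    {"Stock": "Tesla", "PriceVariation": 2.3},
    {"Stock": "TESLA Inc", "PriceVariation": -7.1},
    {"Stock": "JPMorgan", "PriceVariation": 0.4},
    {"Stock": "Tesla", "PriceVariation": 5.0},
]

def keep_row(row, limit=5):
    """Apply the same two conditions as clean_dataset to a single row."""
    # Strictly inside (-limit, limit): the boundary values themselves are excluded.
    in_range = -limit < row["PriceVariation"] < limit
    # Case-insensitive substring match on the stock name.
    is_tesla = "tesla" in row["Stock"].lower()
    return in_range and is_tesla

kept = [row for row in rows if keep_row(row)]
print(kept)
```

Only the first row survives: the second is an outlier, the third is a different stock, and the fourth sits exactly on the boundary.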

Step 3: Prepare the dataset for training

Rename columns to match Hugging Face conventions (text for input, labels for regression targets):

cleaned_dataset = cleaned_dataset.rename_column("Reasons", "text")
cleaned_dataset = cleaned_dataset.rename_column("PriceVariation", "labels")

Tokenization

We’ll use the allenai/longformer-base-4096 tokenizer to handle long news texts:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
# max_length=512 keeps memory use low; Longformer accepts up to 4096 tokens if your texts are longer
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = cleaned_dataset.map(preprocess_function, batched=True)
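Since truncation=True silently cuts longer texts, it can help to check how many examples are affected before settling on max_length. The sketch below is one way to do that; the helper name truncation_rate and the token counts are hypothetical. With the real tokenizer, the counts could be collected with len(tokenizer(text)["input_ids"]) for each text.

```python
def truncation_rate(token_counts, max_length=512):
    """Return the fraction of texts whose token count exceeds max_length."""
    if not token_counts:
        return 0.0
    too_long = sum(1 for n in token_counts if n > max_length)
    return too_long / len(token_counts)

# Hypothetical token counts, one per news text.
example_counts = [120, 480, 512, 900, 2048]
rate = truncation_rate(example_counts, max_length=512)
print(f"{rate:.0%} of texts would be truncated at 512 tokens")
```

If the rate is high, raising max_length is worth the extra memory; if it is close to zero, 512 is probably enough.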

Step 4: Model setup

For regression, we configure AutoModelForSequenceClassification with problem_type="regression" and num_labels=1:

from transformers import AutoModelForSequenceClassification, AutoConfig

config = AutoConfig.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=1,
    problem_type="regression"
)
model = AutoModelForSequenceClassification.from_pretrained("allenai/longformer-base-4096", config=config)
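With num_labels=1 the model outputs a single number per text, and with problem_type="regression" the sequence classification head is trained with a mean squared error loss instead of cross-entropy. The plain-Python sketch below shows that computation on made-up numbers; it illustrates the formula and is not the library's own code.

```python
def mse_loss(logits, labels):
    """Mean squared error between one-value-per-example outputs and float targets."""
    # Each row holds a single value because num_labels=1.
    preds = [row[0] for row in logits]
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

logits = [[1.0], [-2.0], [0.5]]  # made-up model outputs, shape (3, 1)
labels = [2.0, -2.0, 1.5]        # made-up price variations
loss = mse_loss(logits, labels)
print(loss)
```

This is also why the labels column has to contain floats: the target is a value to get close to, not a class index.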

Metrics

We’ll use RMSE to evaluate regression performance:

import numpy as np
from sklearn.metrics import mean_squared_error

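def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # The model returns shape (n, 1); flatten it so it lines up with the labels
    predictions = np.asarray(predictions).reshape(-1)
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    return {"rmse": float(rmse)}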
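An RMSE value is easier to interpret next to a baseline. A simple one is to always predict the mean label, in which case the RMSE equals the standard deviation of the labels. The sketch below uses made-up numbers and a hypothetical helper rmse written in plain Python; a trained model should score below the baseline to be useful.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error; accepts flat values or one-element rows."""
    flat = [p[0] if isinstance(p, (list, tuple)) else p for p in predictions]
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, flat)]
    return math.sqrt(sum(squared_errors) / len(labels))

labels = [1.0, -1.0, 3.0, -3.0]            # made-up price variations
mean_label = sum(labels) / len(labels)

# Baseline: predict the mean label for every example.
baseline = rmse(labels, [mean_label] * len(labels))

# Made-up model outputs with shape (4, 1), as a regression head would return.
model_like = rmse(labels, [[0.5], [-0.5], [2.0], [-2.0]])

print(f"baseline RMSE: {baseline:.3f}, model RMSE: {model_like:.3f}")
```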

Step 5: Training

from transformers import TrainingArguments, Trainer

train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["test"]

training_args = TrainingArguments(
    output_dir="model",
    eval_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=6,
    weight_decay=0.1,
    report_to="none",
    save_strategy="epoch",
    logging_strategy="epoch",
    logging_first_step=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
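Because the arguments above save and evaluate once per epoch, the Trainer can also be asked to reload the best checkpoint when training ends. This is a minimal sketch, assuming the metric key rmse returned by compute_metrics. The dictionary name best_model_args is my own, and the values are illustrative rather than tuned.

```python
# Optional extras, meant to be unpacked into the call above:
# TrainingArguments(output_dir="model", ..., **best_model_args)
best_model_args = {
    "load_best_model_at_end": True,   # reload the best checkpoint after training
    "metric_for_best_model": "rmse",  # key returned by compute_metrics
    "greater_is_better": False,       # a lower RMSE is better
    "save_total_limit": 2,            # cap the number of checkpoints kept on disk
}
```

Setting greater_is_better to False matters here, since RMSE is an error and the default assumption for a custom metric is that higher is better.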

And that’s how you train a model for regression. I hope you find this tutorial useful.

[Figure: example of the output]
