How to run a regression using Hugging Face — An example with financial news to predict stock returns

Community Article Published October 4, 2025

Do you have a dataset with text and numerical labels?

The common approach is to discretize the numerical labels into categories, and then run a text classification task, following the standard Hugging Face tutorials.

But what if you want your model to predict the exact numerical value instead of a category?

In other words: you have a text as input and a number as the label, and you want the model to learn how to map the text directly to that number.
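
To make the contrast concrete, here is a minimal sketch in plain Python. The bin edges (-1 and 1) and the class names are hypothetical, chosen only for illustration; they do not come from the dataset used below.

```python
# Toy price variations; the values are made up for illustration.
price_variations = [-3.2, -0.4, 0.1, 2.7]

def discretize(value, low=-1.0, high=1.0):
    """Map a number to a class name using hypothetical bin edges."""
    if value < low:
        return "down"
    if value > high:
        return "up"
    return "flat"

# Classification: the model only ever sees the category.
class_labels = [discretize(v) for v in price_variations]

# Regression: the model is trained on the number itself.
regression_labels = [float(v) for v in price_variations]

print(class_labels)       # ['down', 'flat', 'flat', 'up']
print(regression_labels)  # [-3.2, -0.4, 0.1, 2.7]
```

Discretizing throws away the magnitude: with these edges, -3.2 and -1.1 would land in the same class, while a regression target keeps them apart.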

This article shows you how to do exactly that — run a regression with Hugging Face.


Step 1: Load the dataset

First, we need a dataset containing both text and numerical labels. For this tutorial, I’ll use a dataset I created from CNBC news. Each entry pairs the reason behind a stock price move (the Reasons column) with the corresponding numeric change (the PriceVariation column).

You can download it manually from my Hugging Face page or load it directly with the datasets library:

from datasets import load_dataset

dataset = load_dataset("SelmaNajih001/FinancialClassification")
print(dataset)

Step 2: Filter the dataset

For training, we only need the text and the numerical label. Running a regression on the entire dataset without filtering for a specific stock is unlikely to work well, because different stocks move for different reasons.

For example:

  • An interest rate hike affects growth and value stocks differently.
  • Supply chain issues may impact Tesla but not a financial institution.

Thus, it’s important to train the model on a subset focusing on a single stock. Here, I’ll filter the dataset for Tesla-related news and remove extreme outliers in price variation:

from datasets import DatasetDict

def clean_dataset(ds):
    """Keep Tesla rows whose price variation lies strictly between -5 and 5."""
    # Keep only relevant columns
    ds = ds.select_columns(["Reasons", "PriceVariation", "Stock", "explanation_summary"])
    # Drop extreme price variations (keep values strictly between -5 and 5)
    ds = ds.filter(lambda x: -5 < x["PriceVariation"] < 5)
    # Keep only Tesla-related rows (case-insensitive)
    ds = ds.filter(lambda x: "tesla" in x["Stock"].lower())
    return ds

cleaned_dataset = DatasetDict({
    split: clean_dataset(ds)
    for split, ds in dataset.items()
})

# Keep only text and label columns
cleaned_dataset = cleaned_dataset.select_columns(["Reasons", "PriceVariation"])
print(cleaned_dataset)
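
The two filters can be checked on a few hand-written rows before running them on the full dataset. The sketch below mirrors the same conditions in plain Python; the rows are invented for illustration, only the column names come from the dataset, and keep_row is a hypothetical helper.

```python
# Invented rows that reuse the dataset's column names.
rows = [
    {"Stock": "Tesla", "PriceVariation": 2.1, "Reasons": "Strong deliveries"},
    {"Stock": "TESLA Inc", "PriceVariation": -7.4, "Reasons": "Recall news"},
    {"Stock": "Apple", "PriceVariation": 1.3, "Reasons": "New product"},
    {"Stock": "tesla", "PriceVariation": -0.8, "Reasons": "Price cuts"},
]

def keep_row(row, stock="tesla", limit=5):
    """Same conditions as the filters above: bounded variation, matching stock."""
    within_range = -limit < row["PriceVariation"] < limit
    is_stock = stock in row["Stock"].lower()
    return within_range and is_stock

kept = [row for row in rows if keep_row(row)]
print([(r["Stock"], r["PriceVariation"]) for r in kept])
# [('Tesla', 2.1), ('tesla', -0.8)]
```

The second row is dropped as an outlier even though it is a Tesla row, and the Apple row is dropped by the stock filter.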

Step 3: Prepare the dataset for training

Rename columns to match Hugging Face conventions (text for input, labels for regression targets):

cleaned_dataset = cleaned_dataset.rename_column("Reasons", "text")
cleaned_dataset = cleaned_dataset.rename_column("PriceVariation", "labels")

Tokenization

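We’ll use the allenai/longformer-base-4096 tokenizer, whose model accepts inputs of up to 4,096 tokens, so it can handle long news texts. To keep memory use low, the example below truncates each text to 512 tokens: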

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
# Pad and truncate every text to max_length tokens.
# You can increase max_length (up to 4096 for this model) for longer articles.
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = cleaned_dataset.map(preprocess_function, batched=True)
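
To see what padding="max_length" and truncation=True do to each example, here is a toy sketch. It uses a whitespace split, a made-up vocabulary, and a hypothetical pad ID of 0 as a stand-in for the real tokenizer, so the IDs are not Longformer IDs; only the shape logic carries over.

```python
def toy_encode(text, max_length, pad_id=0):
    """Whitespace 'tokenizer' that pads or truncates to a fixed length."""
    vocab = {}
    ids = []
    for word in text.lower().split():
        # Hypothetical vocabulary: assign IDs in order of first appearance.
        ids.append(vocab.setdefault(word, len(vocab) + 1))
    ids = ids[:max_length]                 # what truncation=True does
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)          # what padding="max_length" does
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = toy_encode("Tesla beats delivery estimates", max_length=6)
long = toy_encode("Tesla shares fall after the company cuts prices again", max_length=6)

print(short["input_ids"], short["attention_mask"])
# [1, 2, 3, 4, 0, 0] [1, 1, 1, 1, 0, 0]
print(long["input_ids"], long["attention_mask"])
# [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Every example ends up with the same length. Anything beyond max_length is dropped, which is why raising max_length matters when the relevant information sits late in a long article.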

Step 4: Model setup

For regression, we configure AutoModelForSequenceClassification with problem_type="regression" and num_labels=1:

from transformers import AutoModelForSequenceClassification, AutoConfig

config = AutoConfig.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=1,
    problem_type="regression"
)
model = AutoModelForSequenceClassification.from_pretrained("allenai/longformer-base-4096", config=config)
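
With problem_type="regression" and num_labels=1, the classification head outputs one number per text, and the training loss is the mean squared error between that number and the label. The sketch below reproduces the computation in plain Python on made-up values; it illustrates the loss and is not the library's own code.

```python
# Made-up model outputs: one value per example, shape (batch, 1).
logits = [[0.5], [-1.0], [2.0]]
# Made-up targets, shape (batch,).
labels = [1.0, -1.5, 2.0]

# Drop the trailing dimension so predictions line up with the labels.
predictions = [row[0] for row in logits]

squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
mse_loss = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [0.25, 0.25, 0.0]
print(mse_loss)        # about 0.1667
```

Because the loss works on real-valued targets, the labels need to be floats rather than integer class IDs.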

Metrics

We’ll use RMSE to evaluate regression performance:

import numpy as np
from sklearn.metrics import mean_squared_error

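def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Predictions have shape (n, 1); flatten them to match the labels' shape (n,)
    predictions = np.asarray(predictions).reshape(-1)
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    return {"rmse": float(rmse)}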
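
A point worth checking in compute_metrics: with num_labels=1 the predictions typically arrive with shape (n, 1), while the labels have shape (n,). The NumPy sketch below, on made-up numbers, shows RMSE computed by hand, why flattening matters if you subtract the arrays yourself, and a constant-mean baseline to compare against.

```python
import numpy as np

# Made-up values: predictions shaped (n, 1), labels shaped (n,).
predictions = np.array([[1.0], [-2.0], [0.5], [3.0]])
labels = np.array([0.0, -2.0, 1.5, 1.0])

# Subtracting without flattening broadcasts to an (n, n) matrix.
print((predictions - labels).shape)  # (4, 4)

# Flatten first, then RMSE = sqrt(mean((prediction - label)^2)).
flat = predictions.reshape(-1)
rmse = float(np.sqrt(np.mean((flat - labels) ** 2)))

# Baseline: always predict the mean label.
baseline = np.full_like(labels, labels.mean())
baseline_rmse = float(np.sqrt(np.mean((baseline - labels) ** 2)))

print(rmse, baseline_rmse)
```

The baseline RMSE equals the standard deviation of the labels. A trained model is only informative if its RMSE falls below that figure; in practice the baseline mean would come from the training split rather than the evaluation labels used here for brevity.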

Step 5: Training

from transformers import TrainingArguments, Trainer

train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["test"]

training_args = TrainingArguments(
    output_dir="model",
    eval_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=6,
    weight_decay=0.1,
    report_to="none",
    save_strategy="epoch",
    logging_strategy="epoch",
    logging_first_step=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,  # on transformers < 4.46, pass tokenizer=tokenizer instead
    compute_metrics=compute_metrics
)

trainer.train()
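
As a rough guide to how long training runs, the number of optimizer steps follows from the arguments above. The sketch assumes a single device, no gradient accumulation, and a hypothetical training set of 1,000 rows; substitute len(train_dataset) for the real figure.

```python
import math

num_examples = 1000   # hypothetical; use len(train_dataset) in practice
batch_size = 8        # per_device_train_batch_size
num_epochs = 6        # num_train_epochs

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 125 750
```

Since evaluation and saving both happen once per epoch, this configuration should report RMSE six times and write six checkpoints under the model output directory.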

And that’s how you train a model for regression. I hope you find this tutorial useful.

Here is an example of the output: [image]
