How to run a regression using Hugging Face — An example with financial news to predict stock returns

Do you have a dataset with text and numerical labels?
The common approach is to discretize the numerical labels into categories, and then run a text classification task, following the standard Hugging Face tutorials.
But what if you want your model to predict the exact numerical value instead of a category?
In other words: you have a text as input and a number as the label, and you want the model to learn how to map the text directly to that number.
This article shows you how to do exactly that — run a regression with Hugging Face.
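To make the difference concrete, here is a minimal sketch of what discretizing throws away. The `discretize` helper, its 0.5 threshold, and the sample values are made up for illustration; they are not taken from the dataset used later in this article.

```python
def discretize(change, threshold=0.5):
    """Map a numeric price change to a coarse class (hypothetical threshold)."""
    if change > threshold:
        return "up"
    if change < -threshold:
        return "down"
    return "flat"

# Hypothetical price changes, in percent
changes = [0.6, 4.8, -0.1, -3.2]

# Classification target: 0.6 and 4.8 collapse into the same class
class_labels = [discretize(c) for c in changes]

# Regression target: the number itself, so the magnitude is preserved
regression_labels = [float(c) for c in changes]

print(class_labels)
print(regression_labels)
```

A classifier trained on `class_labels` cannot tell a 0.6 move from a 4.8 move, while a regression model trained on `regression_labels` is asked to reproduce that gap.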
Step 1: Load the dataset
First, we need a dataset containing both text and numerical labels. For this tutorial, I’ll use a dataset I built from CNBC news and processed to capture the causes behind specific stock movements. Each entry pairs the reason behind a stock price variation with the corresponding numeric change.
You can download it manually from my Hugging Face page or load it directly with the datasets library:
from datasets import load_dataset
dataset = load_dataset("SelmaNajih001/FinancialClassification")
print(dataset)
Step 2: Filter the dataset
For training, we only need the text and the numerical label. We cannot simply run a regression on the entire dataset without filtering for a specific stock, because different stocks move for different reasons.
For example:
- An interest rate hike affects growth and value stocks differently.
- Supply chain issues may impact Tesla but not a financial institution.
Thus, it’s important to train the model on a subset focusing on a single stock. Here, I’ll filter the dataset for Tesla-related news and remove extreme outliers in price variation:
from datasets import DatasetDict
def clean_dataset(ds):
    # Keep only relevant columns
    ds = ds.select_columns(["Reasons", "PriceVariation", "Stock", "explanation_summary"])
    # Filter out extreme price variations
    ds = ds.filter(lambda x: -5 < x["PriceVariation"] < 5)
    # Keep only Tesla-related rows (case-insensitive)
    ds = ds.filter(lambda x: "tesla" in x["Stock"].lower())
    return ds
cleaned_dataset = DatasetDict({
    split: clean_dataset(ds)
    for split, ds in dataset.items()
})
# Keep only text and label columns
cleaned_dataset = cleaned_dataset.select_columns(["Reasons", "PriceVariation"])
print(cleaned_dataset)
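If you want to check the filtering logic without downloading anything, the two conditions can be tried on a few hand-written rows. This is a sketch in plain Python: the rows are invented, and only the column names and the thresholds come from the code above.

```python
# Invented rows that mimic the columns used in clean_dataset
rows = [
    {"Reasons": "Deliveries beat expectations", "PriceVariation": 2.1, "Stock": "Tesla"},
    {"Reasons": "Surprise guidance cut", "PriceVariation": -9.4, "Stock": "TESLA"},
    {"Reasons": "Strong trading revenue", "PriceVariation": 1.3, "Stock": "JPMorgan"},
    {"Reasons": "Price cuts in China", "PriceVariation": -1.7, "Stock": "Tesla Inc"},
]

def keep(row):
    """Same two conditions as clean_dataset: no extreme moves, Tesla only."""
    within_range = -5 < row["PriceVariation"] < 5
    is_tesla = "tesla" in row["Stock"].lower()
    return within_range and is_tesla

# Keep only the text and the label, as in the final select_columns call
kept = [(r["Reasons"], r["PriceVariation"]) for r in rows if keep(r)]
print(kept)
```

Both bounds are exclusive, so a variation of exactly 5 or -5 is dropped along with the larger moves.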
Step 3: Prepare the dataset for training
Rename columns to match Hugging Face conventions (text for input, labels for regression targets):
cleaned_dataset = cleaned_dataset.rename_column("Reasons", "text")
cleaned_dataset = cleaned_dataset.rename_column("PriceVariation", "labels")
Tokenization
We’ll use the allenai/longformer-base-4096 tokenizer to handle long news texts:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
# Longformer accepts up to 4096 tokens; raise max_length if your texts are longer than 512
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
tokenized_dataset = cleaned_dataset.map(preprocess_function, batched=True)
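The two tokenizer options matter because every example in a batch must have the same length. The following is a simplified illustration of what `truncation=True` and `padding="max_length"` do to a list of token ids. It is not the tokenizer's actual implementation, and the ids, the `pad_id` value, and the helper name `pad_or_truncate` are assumptions made for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop the overflow
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# A short text gets padded, a long one gets cut (hypothetical token ids)
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
long_ids, long_mask = pad_or_truncate([0, 713, 16, 10, 251, 2750, 2], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

A real tokenizer also keeps the closing special token when it truncates; the sketch ignores that detail to stay short. The practical takeaway is that any text beyond `max_length` tokens never reaches the model.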
Step 4: Model setup
For regression, we configure AutoModelForSequenceClassification with problem_type="regression" and num_labels=1, so the model outputs a single number per text:
from transformers import AutoModelForSequenceClassification, AutoConfig
config = AutoConfig.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=1,
    problem_type="regression"
)
model = AutoModelForSequenceClassification.from_pretrained("allenai/longformer-base-4096", config=config)
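With this configuration the model is trained with a mean squared error loss between its single output and the float label. The sketch below reproduces that computation in plain Python on made-up numbers; it illustrates the loss and is not the library's code.

```python
def mse_loss(predictions, labels):
    """Mean of the squared differences between predictions and targets."""
    assert len(predictions) == len(labels)
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical model outputs and true price variations for a batch of 4
batch_predictions = [0.5, -1.0, 2.0, 0.0]
batch_labels = [1.5, -1.0, 0.0, 1.0]

loss = mse_loss(batch_predictions, batch_labels)
print(loss)
```

Because the loss compares floating-point numbers, the labels column should hold floats rather than integers or strings. Squaring also means one large miss costs more than several small ones, which is one reason to remove extreme outliers before training.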
Metrics
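We’ll use the root mean squared error (RMSE) to evaluate regression performance. It is expressed in the same unit as the labels, so it reads directly as a typical error on the price variation: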
import numpy as np
from sklearn.metrics import mean_squared_error
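def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Logits arrive with shape (n, 1); flatten them to match the labels
    predictions = np.asarray(predictions).reshape(-1)
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    return {"rmse": float(rmse)}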
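An RMSE value is easier to judge next to a naive baseline. One common reference point is a model that always predicts the mean label: a fine-tuned model should score below it. Here is a minimal sketch in plain Python on invented numbers (the helper `rmse` and all values are assumptions for the example):

```python
import math

def rmse(predictions, labels):
    """Root mean squared error, in the same unit as the labels."""
    mse = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)
    return math.sqrt(mse)

# Hypothetical true price variations and model predictions
labels = [2.0, -2.0, 4.0, -4.0]
model_predictions = [1.0, -1.0, 3.0, -3.0]

# Baseline: always predict the mean of the labels
mean_label = sum(labels) / len(labels)
baseline_predictions = [mean_label] * len(labels)

model_rmse = rmse(model_predictions, labels)
baseline_rmse = rmse(baseline_predictions, labels)
print(model_rmse, baseline_rmse)
```

In a real run, compute the mean on the training split and score the baseline on the evaluation split, so the comparison with the model is fair.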
Step 5: Training
from transformers import TrainingArguments, Trainer
train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["test"]
training_args = TrainingArguments(
    output_dir="model",
    eval_strategy="epoch",  # older transformers releases name this evaluation_strategy
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=6,
    weight_decay=0.1,
    report_to="none",
    save_strategy="epoch",
    logging_strategy="epoch",
    logging_first_step=True
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,  # newer transformers releases rename this argument to processing_class
    compute_metrics=compute_metrics
)
trainer.train()
And that’s how you train a model for regression. I hope you find this tutorial useful.