Supervised Versus Unsupervised Learning
Most statistical learning problems fall into one of two categories: supervised supervised
or unsupervised. The examples that we have discussed so far in this chap- unsupervised
all fall into the supervised learning domain. For each observation of the
predictor measurement(s) xi, i = 1,...,n there is an associated response
measurement yi. We wish to fit a model that relates the response to the
predictors, with the aim of accurately predicting the response for future
observations (prediction) or better understanding the relationship between
the response and the predictors (inference). Many classical statistical learning methods such as linear regression and logistic regression (Chapter 4), as logistic well as more modern approaches such as GAM, boosting, and support vec- regression
tor machines, operate in the supervised learning domain. The vast majority
of this book is devoted to this setting.
In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1,...,n, we observe
a vector of measurements xi but no associated response yi. It is not possible to fit a linear regression model, since there is no response variable
to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis. What sort of statistical analysis is possible? We can seek to understand the relationships between the variables
or between the observations. One statistical learning tool that we may use
in this setting is cluster analysis, or clustering. The goal of cluster analysis cluster
is to ascertain, on the basis of x1,...,xn, whether the observations fall into analysis
relatively distinct groups. For example, in a market segmentation study we
might observe multiple characteristics (variables) for potential customers,
such as zip code, family income, and shopping habits. We might believe
that the customers fall into different groups, such as big spenders versus
low spenders. If the information about each customer’s spending patterns
were available, then a supervised analysis would be possible. However, this
information is not available—that is, we do not know whether each potential customer is a big spender or not. In this setting, we can try to cluster
the customers on the basis of the variables measured, in order to identify
distinct groups of potential customers. Identifying such groups can be of
interest because it might be that the groups differ with respect to some
property of interest, such as spending habits.