Supervised Learning

The term supervised learning comes from the supervisor, who acts as a teacher in the learning process: during training, the teacher tells the learner what its output should be. Together with unsupervised learning and reinforcement learning, supervised learning is one of the main branches of the scientific discipline of machine learning.


The goal of supervised learning is to build a learning system that models a target function. Training is performed on a list of tuples, where each tuple consists of an input vector and a target vector. The targets can also be class labels, in the case of a classification problem. This list of tuples is called the training set. A good learner predicts targets as accurately as possible for any given input vector. This property is called generalization (the ability to generalize to new data).
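As a minimal sketch of these definitions, the following represents a training set as a list of (input vector, target) tuples and uses a 1-nearest-neighbour rule as a very simple learner; all names and data here are illustrative, not part of any particular library.

```python
# Illustrative sketch: a training set as a list of (input, target) tuples
# and a 1-nearest-neighbour learner that predicts the target of the
# closest training input.

def predict_1nn(training_set, x):
    """Return the target of the training input closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, target = min(training_set, key=lambda pair: sq_dist(pair[0], x))
    return target

# Tiny classification training set: 2-D input vectors, class-label targets.
training_set = [
    ((0.0, 0.0), "A"),
    ((0.1, 0.2), "A"),
    ((1.0, 1.0), "B"),
    ((0.9, 1.1), "B"),
]

# Generalization: the learner is queried on inputs it has never seen.
print(predict_1nn(training_set, (0.05, 0.1)))  # close to the "A" examples
print(predict_1nn(training_set, (1.0, 0.9)))   # close to the "B" examples
```

The nearest-neighbour rule is only one of many possible learners; it is used here because it makes the role of the training tuples explicit.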

Training data is essential in supervised learning. It can come from different sources such as artificial data, sensor data, databases, or human-generated data. Each dataset has its own underlying function, which the learner tries to model. The error of the model is the difference between the true and the predicted target. The proper error measure depends on the problem or data set; it yields a real-valued number that quantifies the performance of the learner on a test set. Our goal is to minimize the prediction error on the test set.

There are two related problem types in supervised learning: the first is regression, the second classification. For example, the Boston Housing dataset from the UCI machine learning repository is a regression problem. The goal is to predict the median value of a home (a real value) from a number of features such as the per-capita crime rate, the pupil-teacher ratio, accessibility to highways, or the nitric oxides concentration. A popular classification data set is MNIST, where the aim is to predict the label of handwritten digits (0...9). The input vectors are 784-dimensional (28x28 pixel grayscale images) and the targets are the digit labels.

The error measure typically differs between supervised regression and classification. In regression it is usual to optimize the root mean square error (RMSE), the mean square error (MSE), or the mean absolute error (MAE). In classification problems it is more convenient to use the classification error (equivalently, the accuracy) or the area under the ROC curve (AUC, for binary problems). For more than two target classes it is common to use a confusion matrix to visualize the misclassified examples. The rows of the square confusion matrix are the predicted labels, the columns are the true labels. The sum of the diagonal elements is the number of correctly classified examples, so the accuracy is the trace of the matrix divided by the total number of examples.
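The error measures above can be sketched in a few lines of plain Python; the data values are made up for illustration, and the confusion matrix follows the row/column convention stated in the text (rows = predicted, columns = true).

```python
import math

def mse(y_true, y_pred):
    """Mean square error over paired lists of targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = predicted label, columns = true label."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[p][t] += 1
    return m

# Regression: true vs. predicted real values.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))

# Classification: accuracy is the trace divided by the number of examples.
labels_true = [0, 1, 2, 2]
labels_pred = [0, 1, 1, 2]
cm = confusion_matrix(labels_true, labels_pred, 3)
accuracy = sum(cm[i][i] for i in range(3)) / len(labels_true)
print(accuracy)  # 3 of 4 examples correct -> 0.75
```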

Furthermore, supervised learning can be categorized into offline and online learning. In offline learning, a data set is available on which the learner can build its internal model without any limits on accessing the data. The Netflix Prize, for example, is such an offline data set: a movie rating data set with about 100M samples collected from 1998 to 2005. The competitors had time to carefully analyze the dataset, build large predictive models, and combine them in sophisticated ways; this is a sketch of the winning solution, which was delivered approximately three years after the start of the competition. In contrast to offline learning, online learning has access to each sample only once. The goal is the same: predicting targets as accurately as possible. Stock market prediction, for example, can be seen as online learning. The algorithm makes a prediction for a stock; a little later the real stock price becomes available, and this information can be incorporated into the learner to further improve prediction accuracy. In general, very much data is available in an online learning setup, since the data set grows continuously. Offline learning has equal or superior accuracy compared to online learning when the same amount of data is used.
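The online setting can be sketched as follows: each (input, target) pair arrives once, the model predicts first, and only then is the true target revealed and incorporated. The example below uses a single-weight linear predictor updated by stochastic gradient descent; the data, learning rate, and function names are illustrative assumptions.

```python
# Sketch of online learning: every sample is seen exactly once and the
# model is updated immediately after the true target becomes known.

def online_linear_fit(stream, lr=0.1):
    w = 0.0                      # single weight of a 1-D linear predictor
    for x, y in stream:          # each (input, target) pair arrives once
        pred = w * x             # predict first ...
        error = pred - y         # ... then the true target is revealed
        w -= lr * error * x      # incorporate it into the model (SGD step)
    return w

# Stream generated by the underlying function y = 2x; the learned weight
# should approach 2 as more samples arrive.
stream = [(x, 2.0 * x) for x in (1.0, 2.0, 1.5, 0.5) * 10]
w = online_linear_fit(stream)
print(w)  # close to 2.0
```

Note that, unlike the offline case, no sample is revisited; the model quality depends on the order and amount of streamed data.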

Input features are also called input vectors or attribute vectors; the targets are sometimes called outputs or desired values. The type of the features plays a crucial role in supervised learning. For most data sets, a single feature is one real value, and the available features assemble the feature vector. Each training example is a tuple of a feature vector and the corresponding target vector. A single feature can also be categorical, which means only a discrete set of values is allowed. For example, the feature "State" can take the values {"Nevada", "Washington", "Utah", ...}, and "Sex" can take the values {"male", "female"}. There is a third type of value, namely "missing". Missing features are a problem for most machine learning algorithms, although some, such as decision trees, can handle missing features natively. A possible workaround is to replace missing values with the mean value (or the median) of that feature in the training set. This approximation works well in most cases.
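The mean-imputation workaround can be sketched as follows, with None marking a missing value; the column means are computed on the training set and then used to fill the gaps. The data and helper names are illustrative.

```python
# Illustrative sketch of mean imputation for missing feature values.

def column_means(rows):
    """Per-column mean, ignoring missing (None) entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        vals = [r[j] for r in rows if r[j] is not None]
        means.append(sum(vals) / len(vals))
    return means

def impute(rows, means):
    """Replace each None with the mean of its column."""
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Training set with two features; None marks a missing value.
train = [[1.0, 10.0],
         [3.0, None],
         [None, 30.0]]

means = column_means(train)   # [2.0, 20.0]
filled = impute(train, means)
print(filled)
```

Using the median instead of the mean only requires swapping the aggregation in `column_means`; the means computed on the training set would also be reused to fill missing values at test time.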

To be more general, the supervised learning problem can be formulated as function approximation from an input space to an output space. A classical trainable model for function approximation is the neural network. Usually, the input dimension is much larger than the output dimension. Looking at the supervised learning problems in the UCI machine learning repository, almost all data sets have more training examples than features (a feature being one dimension of the input space). This is the standard case when learning from data. There are a small number of exceptions, mainly in the domain of biochemistry, where many features but few observations are available; such data sets arise, for example, from genomes, DNA microarrays, drug discovery, or mass-spectrometric data.
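To make the function-approximation view concrete, here is a hedged sketch of a tiny neural network (one hidden layer of tanh units, trained with plain stochastic gradient descent) fitting a 1-D target function; the layer size, learning rate, and target function are illustrative choices, not a prescription.

```python
import math
import random

# Tiny one-hidden-layer network approximating y = sin(x) on [-3, 3].
random.seed(0)
H = 8                                            # hidden units
w1 = [random.uniform(-1, 1) for _ in range(H)]   # input->hidden weights
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]   # hidden->output weights
b2 = 0.0

def forward(x):
    h = [math.tanh(w1[i] * x + b1[i]) for i in range(H)]
    y = sum(w2[i] * h[i] for i in range(H)) + b2
    return y, h

def mse_on(data):
    return sum((forward(x)[0] - y) ** 2 for x, y in data) / len(data)

def train_step(x, y, lr=0.05):
    global b2
    pred, h = forward(x)
    err = pred - y                               # gradient of 0.5 * err^2
    for i in range(H):
        dh = err * w2[i] * (1.0 - h[i] ** 2)     # backprop through tanh
        w2[i] -= lr * err * h[i]
        w1[i] -= lr * dh * x
        b1[i] -= lr * dh
    b2 -= lr * err

data = [(x / 10.0, math.sin(x / 10.0)) for x in range(-30, 31)]
mse_before = mse_on(data)
for epoch in range(200):
    for x, y in data:
        train_step(x, y)
mse_after = mse_on(data)
print(mse_before, "->", mse_after)  # training error shrinks
```

The same machinery scales to the high-dimensional inputs discussed above; only the weight matrices grow with the input dimension.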