Using Learning Curves to Analyse Machine Learning Model Performance

A learning curve is a plot of a model's learning performance over experience or time.

Learning curves are a widely used diagnostic tool in machine learning for algorithms that learn incrementally from a training dataset. After each update during training, the model can be evaluated on the training dataset and on a hold-out validation dataset, and the measured performance can be plotted to show learning curves.

Examining model learning curves during training can help to detect learning issues such as an underfit or overfit model, as well as whether the training and validation datasets are sufficiently representative.

Learning Curves in Machine Learning

In general, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.

"Learning curves (LCs) are deemed effective tools for monitoring the performance of workers exposed to a new task. LCs provide a mathematical representation of the learning process that takes place as task repetition occurs." – Learning curve models and applications: Literature review and research directions, 2011.

For example, if you were learning a musical instrument, your skill could be evaluated and given a numerical score each week for a year. A plot of the scores over the 52 weeks is a learning curve that shows how your skill on the instrument has changed over time.

Learning Curve: Line plot of learning (y-axis) over experience (x-axis).

Learning curves are widely used for algorithms that learn (optimise their internal parameters) incrementally over time, such as deep learning neural networks.

The metric used to evaluate learning might be maximising, meaning that higher scores (larger numbers) indicate more learning. One example is classification accuracy.

It is more common to use a minimising score, such as loss or error, where better scores (smaller numbers) indicate more learning and a value of 0.0 indicates that the training dataset was learned perfectly, with no mistakes.

During training, the current state of the model can be evaluated at each step. It can be evaluated on the training dataset to give an idea of how well the model is "learning." It can also be evaluated on a hold-out validation dataset that is not part of the training dataset. Evaluation on the validation dataset gives an idea of how well the model is "generalising."

Train Learning Curve: A learning curve derived from the training dataset that indicates how effectively the model is learning.
Validation Learning Curve: A learning curve derived from a hold-out validation dataset that indicates how effectively the model generalises.

It is common to create dual learning curves for a machine learning model during training, one on the training dataset and one on the validation dataset.
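As a minimal sketch of how the two curves can be recorded, assume a toy one-variable linear regression trained by batch gradient descent. The function names (`mse`, `train_with_history`), the synthetic data and the hyperparameters are illustrative assumptions, not part of any particular library.

```python
import random


def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_with_history(train, val, epochs=50, lr=0.1):
    """Fit y = w*x + b by batch gradient descent and record both curves."""
    w, b = 0.0, 0.0
    history = {"train_loss": [], "val_loss": []}
    n = len(train)
    for _ in range(epochs):
        # One gradient step, computed on the training data only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
        # After the update, evaluate on both datasets.
        history["train_loss"].append(mse(w, b, train))
        history["val_loss"].append(mse(w, b, val))
    return history


# Synthetic data (an assumption for illustration): y = 3x + 1 plus noise.
rng = random.Random(0)
points = [(i / 100, 3 * (i / 100) + 1 + rng.gauss(0, 0.1)) for i in range(100)]
rng.shuffle(points)
train, val = points[:80], points[80:]

history = train_with_history(train, val)
```

Plotting `history["train_loss"]` and `history["val_loss"]` against the epoch index would give the train and validation learning curves described above.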

In some cases, it is also common to create learning curves for several metrics. In classification predictive modelling problems, for example, the model may be optimised using cross-entropy loss while its performance is evaluated using classification accuracy. In this case, two plots are created, one for the learning curves of each metric, and each plot can show two learning curves, one for the training dataset and one for the validation dataset.

Optimisation Learning Curve: Learning curves based on the measure used to optimise the model's parameters, such as loss.
Performance Learning Curve: Learning curves based on the statistic used to evaluate and choose the model, such as accuracy.
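The difference between the two kinds of metric can be sketched for a binary classifier. The helper names (`binary_cross_entropy`, `accuracy`), the 0.5 threshold and the example probabilities are assumptions chosen for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average cross-entropy loss (a minimising, optimisation-style metric)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correct predictions (a maximising, performance-style metric)."""
    correct = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == (y == 1))
    return correct / len(y_true)


y_true = [1, 0, 1, 0]
hesitant = [0.9, 0.2, 0.6, 0.4]       # correct, but not very confident
confident = [0.99, 0.01, 0.99, 0.01]  # correct and confident

loss_hesitant = binary_cross_entropy(y_true, hesitant)
loss_confident = binary_cross_entropy(y_true, confident)
acc_hesitant = accuracy(y_true, hesitant)
acc_confident = accuracy(y_true, confident)
```

Both sets of predictions have the same accuracy, but the confident set has a lower loss. This is one reason the two curves can look different for the same training run.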

Analysing a Model's Behaviour

The shape and dynamics of a learning curve can be used to diagnose the behaviour of a machine learning model and, in turn, to suggest the kind of configuration changes that might improve learning and/or performance.

There are three common dynamics that you are likely to observe in learning curves:

Underfit;
Overfit;
Good Fit.

Underfit Learning Curves

Underfitting refers to a model that cannot learn the training dataset.

"Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set." – Deep Learning, 2016.

An underfit model can be identified from the learning curve of the training loss alone.

The curve may show a flat line or noisy values of relatively high loss, indicating that the model was unable to learn the training dataset at all.

This is common when the model does not have enough capacity for the complexity of the dataset, as shown below.

Example of Training Learning Curve Showing An Underfit Model.

An underfit model may also be identified by a training loss that is still decreasing at the end of the plot.

This indicates that the model is capable of further learning and improvement, and that the training process was halted prematurely.

Example of Training Learning Curve Showing an Underfit Model That Requires Further Training.

Underfitting is shown by a plot of learning curves if:

The training loss remains flat regardless of training.
The training loss continues to decrease until the end of training.
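The two signs above can be turned into a rough heuristic check. This is only a sketch: the function name `diagnose_underfit`, the window size and both tolerances are hypothetical choices that would need tuning for a real problem, and a flat curve only suggests underfitting when the loss is also high.

```python
def mean(values):
    return sum(values) / len(values)


def diagnose_underfit(train_loss, window=5, flat_tol=0.01, slope_tol=0.01):
    """Return 'flat', 'still_decreasing', or None for a training loss curve."""
    if len(train_loss) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    first = mean(train_loss[:window])
    previous = mean(train_loss[-2 * window:-window])
    last = mean(train_loss[-window:])
    # Sign 1: almost no overall improvement between the start and the end.
    if first - last <= flat_tol * abs(first):
        return "flat"
    # Sign 2: the loss is still dropping across the final two windows.
    if previous - last > slope_tol * abs(previous):
        return "still_decreasing"
    return None


flat_curve = [1.0, 1.01, 0.99, 1.0, 1.0, 1.0, 0.995, 1.0, 1.005, 1.0]
falling_curve = [1.0 - 0.05 * i for i in range(12)]
plateau_curve = [1.0, 0.5, 0.3, 0.2, 0.15] + [0.1] * 10

flat_result = diagnose_underfit(flat_curve)
falling_result = diagnose_underfit(falling_curve)
plateau_result = diagnose_underfit(plateau_curve)
```

Averaging over a window rather than comparing single values makes the check less sensitive to noise in individual epochs.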

Overfit Learning Curves

Overfitting refers to a model that has learned the training dataset too well, including its statistical noise or random fluctuations.

The problem with overfitting is that the more specialised the model becomes to the training data, the less well it can generalise to new data, resulting in an increase in generalisation error. This increase in generalisation error can be measured by the model's performance on the validation dataset.

This is common when the model has more capacity than the problem requires and, as a result, too much flexibility. It can also happen if the model is trained for too long.

Overfitting is indicated by a plot of learning curves if:

The training loss continues to decrease with experience.
The validation loss decreases to a point and then begins to increase again.

The point at which the validation loss stops decreasing and starts to rise (its minimum, sometimes loosely called the inflection point) may be the point at which training should be halted, because experience beyond that point shows the dynamics of overfitting.
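Locating that turning point can be sketched as follows. `best_epoch` simply finds the minimum of a recorded validation curve, while `early_stopping_epoch` simulates a patience-based stopping rule. Both names, the patience value and the example losses are assumptions made for illustration.

```python
def best_epoch(val_loss):
    """Index of the lowest validation loss in a recorded curve."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


def early_stopping_epoch(val_loss, patience=3):
    """Simulate stopping after `patience` epochs without a new best loss.

    Returns (epoch at which training would stop, epoch of the best loss).
    """
    best = float("inf")
    best_idx = 0
    wait = 0
    for epoch, loss in enumerate(val_loss):
        if loss < best:
            best, best_idx, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_idx
    # Validation loss never stalled for `patience` epochs in a row.
    return len(val_loss) - 1, best_idx


# Validation loss falls, bottoms out at epoch 3, then climbs again.
val_loss = [0.9, 0.7, 0.55, 0.5, 0.52, 0.56, 0.61, 0.7]
turning_point = best_epoch(val_loss)
stop_epoch, kept_epoch = early_stopping_epoch(val_loss, patience=3)
```

In this example the rule would stop at epoch 6 and keep the model from epoch 3. A patience greater than one avoids stopping on a single noisy epoch.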

The plot below is an example of overfitting.

Good Fit Learning Curves

A good fit is the goal of the learning algorithm and lies between an overfit and an underfit model.

A good fit is defined by a training and validation loss that declines to a stable point with a small difference between the two final loss values.

The model's loss is nearly always smaller on the training dataset than on the validation dataset. This implies that there will be some disparity between the train and validation loss learning curves. This is known as the "generalisation gap."

A plot of learning curves indicates a good fit if:

The training loss decreases to a point of stability.
The validation loss decreases to a point of stability and has a small gap with the training loss.
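These two conditions can be expressed as a rough check on recorded curves. The names (`generalisation_gap`, `is_stable`, `looks_like_good_fit`) and the thresholds are hypothetical. What counts as a "small" gap depends on the scale of the loss and on the problem.

```python
def generalisation_gap(train_loss, val_loss):
    """Difference between the final validation and training losses."""
    return val_loss[-1] - train_loss[-1]


def is_stable(curve, window=5, tol=0.02):
    """True if the last `window` values vary by at most `tol` (relative)."""
    tail = curve[-window:]
    return max(tail) - min(tail) <= tol * max(tail)


def looks_like_good_fit(train_loss, val_loss, window=5, tol=0.02, max_gap=0.1):
    """Both curves have levelled off and the final gap is small."""
    stable = is_stable(train_loss, window, tol) and is_stable(val_loss, window, tol)
    small_gap = abs(generalisation_gap(train_loss, val_loss)) <= max_gap
    return stable and small_gap


train_curve = [1.0, 0.6, 0.4, 0.3, 0.25] + [0.2] * 5
good_val = [1.1, 0.7, 0.5, 0.4, 0.35] + [0.25] * 5
overfit_val = [1.1, 0.7, 0.5, 0.4, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8]

gap = generalisation_gap(train_curve, good_val)
good_result = looks_like_good_fit(train_curve, good_val)
overfit_result = looks_like_good_fit(train_curve, overfit_val)
```

The rising validation curve fails the stability test, so it is not reported as a good fit even though the training curve has levelled off.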

Continued training of a good fit will likely lead to overfitting.

The plot below shows an example of a good fit.

Diagnosing Unrepresentative Datasets

Learning curves can also be used to diagnose the properties of a dataset and whether it is relatively representative.

An unrepresentative dataset is one that does not capture the statistical characteristics of another dataset drawn from the same domain, such as a training dataset compared with a validation dataset. This commonly occurs when the number of samples in one dataset is too small relative to the other.

There are two common scenarios that may be observed:

The training dataset is not very representative;
The validation dataset is not very representative.

Unrepresentative Training Dataset

An unrepresentative training dataset is one that does not provide enough information to learn the problem, relative to the validation dataset used to evaluate it.

This can happen if the training dataset has too few examples compared with the validation dataset.

This situation can be identified by a learning curve for training loss that shows improvement and a learning curve for validation loss that also shows improvement, but with a large gap remaining between the two curves.
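A rough check for this pattern is sketched below. The names `improved` and `suggests_unrepresentative_train`, the minimum drop and the gap ratio are hypothetical. The same pattern can have other causes, so a positive result is a prompt to inspect the data rather than a diagnosis.

```python
def improved(curve, min_drop=0.1):
    """True if the final loss is at least `min_drop` (relative) below the first."""
    return curve[-1] <= (1 - min_drop) * curve[0]


def suggests_unrepresentative_train(train_loss, val_loss, gap_ratio=2.0):
    """Both curves improve, yet the final validation loss stays far above training."""
    both_improve = improved(train_loss) and improved(val_loss)
    large_gap = val_loss[-1] >= gap_ratio * train_loss[-1]
    return both_improve and large_gap


train_curve = [1.0, 0.6, 0.3, 0.15, 0.1]
distant_val = [1.2, 1.0, 0.85, 0.75, 0.7]  # improves, but stays far away
close_val = [1.1, 0.7, 0.4, 0.2, 0.12]     # improves and tracks training

distant_result = suggests_unrepresentative_train(train_curve, distant_val)
close_result = suggests_unrepresentative_train(train_curve, close_val)
```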

Unrepresentative Validation Dataset

An unrepresentative validation dataset is one that does not provide enough information to evaluate the model's ability to generalise.

This can happen if the validation dataset has too few examples compared with the training dataset.

This situation can be identified by a learning curve for training loss that looks like a good fit (or another fit) and a learning curve for validation loss that shows noisy movements around the training loss.

It may also be identified by a validation loss that is lower than the training loss. In this case, the model may find the validation dataset easier to predict than the training dataset.
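Both symptoms can be quantified from recorded curves. In the sketch below, `roughness` uses the mean absolute second difference as a simple noise measure, and `fraction_val_below_train` counts how often the validation loss sits below the training loss. Both function names and the example values are illustrative assumptions.

```python
def roughness(curve):
    """Mean absolute second difference: near zero for a smooth curve,
    large for a curve that zig-zags from epoch to epoch."""
    if len(curve) < 3:
        raise ValueError("need at least 3 values")
    second_diffs = [
        curve[i + 1] - 2 * curve[i] + curve[i - 1]
        for i in range(1, len(curve) - 1)
    ]
    return sum(abs(d) for d in second_diffs) / len(second_diffs)


def fraction_val_below_train(train_loss, val_loss):
    """Share of epochs in which validation loss is lower than training loss."""
    below = sum(1 for t, v in zip(train_loss, val_loss) if v < t)
    return below / len(train_loss)


train_curve = [1.0, 0.8, 0.65, 0.55, 0.5, 0.47, 0.45, 0.44]
noisy_val = [0.9, 1.0, 0.6, 0.75, 0.4, 0.6, 0.38, 0.55]

train_roughness = roughness(train_curve)
val_roughness = roughness(noisy_val)
below_share = fraction_val_below_train(train_curve, noisy_val)
```

Here the validation curve is many times rougher than the training curve and falls below it in half of the epochs, which matches the noisy behaviour described above.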