Feature Scaling

In chapters 2.1, 2.2, 2.3 we used the gradient descent algorithm (or variants of) to minimize a loss function, and thus achieve a line of best fit. However, it turns out that the optimization in chapter 2.3 was much, much slower than it needed to be. While this isn’t a big problem for these fairly simple linear regression models that we can train in seconds anyways, this inefficiency becomes a much more drastic problem when dealing with large data sets and models.

Example of the Problem

First, let’s look at a concrete example of the problem, by again considering a synthetic data set. Like in chapter 2.3 I generated another simple synthetic data set consisting of 2 independent variables \(x_1\) and \(x_2\), and one dependent variable \(y = 2x_1 + 0.013x_2\). However, note that the range for \(x_1\) is 0 to 10, but the range for \(x_2\) is 0 to 1000. Let's call this data set \(D\). A few sample data points look like this:

\(x_1\) \(x_2\) \(y\)
7.36 1000 34.25
9.47 0 19.24
0.52 315.78 3.50
1.57 315.78 11.02
6.31 263.15 13.93
1.57 526.31 10.21
0.52 105.26 3.41
4.21 842.10 16.27
2.63 894.73 19.04
3.68 210.52 4.60
... ... ...

For simplification I have excluded a constant term from this synthetic data set, and when training models I "cheated" by not training the constant term \(b\) and just setting it to 0, so as to simplify visualization. The model that I used was simply:

def model(x_input):
    y_predicted = A.mm(x_input) # A is a 1x2 matrix

Since this is just a simple application of the ideas from chapter 2.3, one would expect this to work fairly easily. However, when using an optim.SGD optimizer, training goes rather poorly. Here is a plot showing the training progress of A[0] and the loss function side-by-side:

Training progress without scaling

This optimization was performed with a learning rate of 0.0000025, which is about the largest before it would diverge. The loss function quickly decreases at first, but then quickly stalls, and decreases quite slowly. Meanwhile, \(A_0\) is increasing towards the expected value of approximately 2.0, but very slowly. Within 5000 iterations this model fails to finish training.

Now let's do something rather strange: take the data set \(D\), and create a new data set \(D'\) by dividing all the \(x_2\) values by 100. The resulting data set \(D'\) looks roughly like this:

\(x_1'\) \(x_2'\) \(y'\)
7.36 10 34.25
9.47 0 19.24
0.52 3.1578 3.50
1.57 3.1578 11.02
6.31 2.6315 13.93
1.57 5.2631 10.21
0.52 1.0526 3.41
4.21 8.4210 16.27
2.63 8.9473 19.04
3.68 2.1052 4.60
... ... ...

A crucial note is that while \(D'\) is technically different from \(D\), it contains exactly the same information: one can convert between them freely, by dividing or multiplying the second column by 100. In fact, since this transformation is linear, and we are using a linear model, we can train our model on \(D'\) instead. We would just expect to obtain a value of 1.3 for A[1] rather than 0.013. So let's give it a try!

The first interesting observation is that we can use a much larger learning rate. The largest learning rate we could use with \(D\) was 0.0000025, but with \(D'\) we can use a learning rate of 0.01. And when we plot A[0] and the loss function for both \(D\) and \(D'\) we see something pretty crazy:

Training progress with scaling

While training on \(D\) wasn't even close to done after 5000 iterations, training on \(D'\) seems to have completed nearly instantly. If we zoom in on the first 60 iterations, we can see the training more clearly:

Training progress with scaling zoom

So this incredibly simple data set transformation has changed a problem that was untrainable within 5000 iterations to one that can be trained practically instantly with 50 iterations. What is this black magic, and how does it work?

Visualizing Gradient Descent with Level Sets

One of the best ways to gain insight in machine learning is by visualization. As seen in chapter 2.1 we can visualize loss functions using plots. Since we have 2 parameters A[0] and A[1] of the loss function, it would be a 3D plot. We used this for visualization in chapter 2.1, but frankly it's a bit messy looking. Instead, we will use level sets (also called contour plots), which use lines to indicate where the loss function has a constant value. An example is easiest, so here is the contour plot for the loss function for \(D'\), the one that converges quickly:

Contour plot of D'

Each ellipse border is where the loss function has a particular constant value (the innermost ones are labeled), and the red X marks the spot of the optimal values of A[0] and A[1]. By convention the particular constant values of the loss function are evenly spaced, which means that contour lines that are closer together indicate a "steeper" slope. In this plot, the center near the X is quite shallow, while far away is pretty steep. In addition, one diagonal axis of the ellipses is steeper than the other diagonal axis.

We can also plot how A[0] and A[1] evolve during training on the contour plot, to get a feel for how gradient descent is working:

Contour plot of D' with training

Here, we can see that the initial values for A[0] and A[1] start pretty far off, but gradient descent quickly zeros in towards the X. As a side note, the line connecting each pair of dots is perpendicular to the line of the level set: see if you can figure out why.

So if that is what the contour plot looks like for the "good" case of \(D'\), how does it look for \(D\)? Well, it looks rather strange:

Contour plot of D

Don't be fooled: like the previous plot the level sets also form ellipses, but they are so stretched that they are nearly straight lines at this scale. This means that vertically (A[0] is constant and we vary A[1]) there is substantial gradient, but horizontally (A[1] is constant and we vary A[0]) there is practically no slope. Put another way, gradient descent only knows to vary A[1], and (almost) doesn't vary A[0]. We can test this hypothesis by plotting how gradient descent updates A[0] and A[1], like above. Since gradient descent makes such little progress though, we have to zoom in a lot to see what is going on:

Contour plot of D

We can clearly see that gradient descent applies large updates to A[1] (a bit too large, a smaller learning rate would have been a bit better) due to the large gradient in the A[1] direction. But due to the (comparatively) tiny gradient in the A[0] direction very small updates are done to A[0]. Gradient descent quickly converges on the optimal value of A[1], but is very very far away from finding the optimal value of A[0].

Let's take a quick look at what is going on mathematically to see why this happens. The model we are using is: \[ y'(x, A) = Ax = a_1 x_1 + a_2 x_2 \] Here, \(a_1\) is in the role of A[0] and \(a_2\) is A[1]. We substitute this into the loss function to get: \[ L(a_1, a_2) = \sum_{i=1}^m (a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})^2 \] Now if we differentiate \(L\) in the direction of \(a_1\) and \(a_2\) separately, we get: \[ \frac{\partial L}{\partial a_1} = \sum_{i=1}^m 2(a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})x_1^{(i)} \\ \frac{\partial L}{\partial a_2} = \sum_{i=1}^m 2(a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})x_2^{(i)} \] Here we can see the problem: the inner terms of the derivatives are the same, except one is multiplied by \(x_1^{(i)}\) and the other by \(x_2^{(i)}\). If \(x_2\) is on average 100 times bigger than \(x_1\) (which it is in the original data set), then we would expect \(\frac{\partial L}{\partial a_2}\) to be roughly 100 times bigger than \(\frac{\partial L}{\partial a_1}\). It isn't exactly 100 times larger, but with any reasonable data set it should be close. Since the derivatives in the directions of \(a_1\) and \(a_2\) are scaled completely differently, gradient descent fails to update both of them adequately.

The solution is simple: we need to rescale the input features before training. This is exactly what happened when we mysteriously divided by 100: we rescaled \(x_2\) to be comparable to \(x_1\). But we should work out a more methodical way of rescaling, rather than randomly dividing by 100.

Rescaling Features

This wasn't explored above, but there are really two ways that we potentially need to rescale features. Consider an example where \(x_1\) ranges between -1 and 1, but \(x_2\) ranges between 99 and 101: both of these features have (at least approximately) the same standard deviation, but \(x_2\) has a much larger mean. On the other hand, consider an example where \(x_1\) still ranges between -1 and 1, but \(x_2\) ranges between -100 and 100. This time, they have the same mean, but \(x_2\) has a much larger standard deviation. Both of these situations can make gradient descent and related algorithms slower and less reliable. So, our goal is to ensure that all features have the same mean and standard deviation.

Note: There are other methods to measure how different features are, and then subsequently rescale them. For a look at more of them, feel free to consult the Wikipedia article. However, after learning the technique presented here, any other technique is fairly similar and easy to implement if needed.

Without digressing too far into statistics, let's quickly review how to calculate the mean \(\mu\) and standard deviation \(\sigma\). Suppose we have already read in our data set in Python into a Numpy vector / matrix, and have all the values for the feature \(x_j\), for each \(j\). Mathematically, the mean for feature \(x_j\) is just the average: \[ \mu_j = \frac{\sum_{i=1}^m x_j^{(i)}}{m} \] In PyTorch there is a convenient .mean() method we will use.

The standard deviation \(\sigma\) is a bit more tricky. We measure how far each point \(x_j^{(i)}\) is from the mean \(\mu_j\), square this, then take the mean of all of this, and finally square root it: \[ \sigma_j = \sqrt{\frac{\sum_{i=1}^m (x_j^{(i)} - \mu_j)^2 }{m}} \] Again, PyTorch provides a convenient .std() method.

Note: Statistics nerds might point out that in the above equation we should divide by \(m - 1\) instead of \(m\) to obtain an unbiased estimate of the standard deviation. This might well be true, but in this usage it does not matter since we are only using this to rescale features relative to each other, and not make a rigorous statistical inference. To be more precise, doing so would involve multiplying each scaled feature by only a constant factor, and will not change any of their standard deviations or means relative to each other.

Once we have every mean \(\mu_j\) and standard deviation \(\sigma_j\), rescaling is easy: we simply rescale every feature like so:

\[ \begin{equation} \label{eq:scaling} x_j' = \frac{x_j - \mu_j}{\sigma_j} \end{equation} \]

This will force every feature to have a mean of 0 and a standard deviation of 1, and thus be scaled well relative to each other.

Note: Be careful to make sure to perform the rescaling at both training time and prediction time. That is, we first have to perform the rescaling on the whole training data set, and then train the model so as to achieve good training performance. Once the model is trained and we have a new, never before seen input \(x\), we also need to rescale its features to \(x'\) because the trained model only understands inputs that have already been rescaled (because we trained it that way).

Implementation and Experiments

Feature scaling can be applied to just about any multi-variable model. Since the only multi-variable model we have seen so far is multi-variable linear regression, we'll look at implementing feature scaling in that context, but the ideas and the code are pretty much the same for other models. First let's just copy-paste the original multi-variable linear regression code, and modify it to load the new synthetic data set:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Load the data

# First we load the entire CSV file into an m x 3
D = torch.tensor(pd.read_csv("linreg-scaling-synthetic.csv", header=None).values, dtype=torch.float)

# We extract all rows and the first 2 columns, and then transpose it
x_dataset = D[:, 0:2].t()

# We extract all rows and the last column, and transpose it
y_dataset = D[:, 2].t()

# And make a convenient variable to remember the number of input columns
n = 2


### Model definition ###

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return A.mm(x_input) + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([A, b], lr=0.1)

# Main optimization loop
for t in range(2000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to A and b.
    current_loss.backward()
    # Update A and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, A = {A.detach().numpy()}, b = {b.item()}")

If you run this code right now, you might see output like the following:

...
t = 1995, loss = 0.505296, A = [[1.9920335  0.01292674]], b = 0.089939
t = 1996, loss = 0.5042, A = [[1.9920422  0.01292683]], b = 0.0898404
t = 1997, loss = 0.5031, A = [[1.9920509  0.01292691]], b = 0.0897419
t = 1998, loss = 0.501984, A = [[1.9920596  0.01292699]], b = 0.0896435
t = 1999, loss = 0.50089, A = [[1.9920683  0.01292707]], b = 0.0895451

Note that if you run this code multiple times you will get different results each time due to the random initialization. The synthetic data was generated with the equation \(y = 2x_1 + 0.013x_2 + 0\). So after 2000 iterations of training we are getting close-ish to the correct values, but it's not fully trained. So let's implement feature scaling to fix this.

The first step is to compute the mean and standard deviation for each feature in the training data set. Add the following after x_dataset is loaded:

means = x_dataset.mean(1, keepdim=True)
deviations = x_dataset.std(1, keepdim=True)

The argument of 1 means that we compute the mean for each feature, averaged over all data points. That is, x_dataset is a \(2 \times 400\) matrix, and means is a \(2 \times 1\) matrix (vector). If we had used 0, means would be a \(1 \times 400\) matrix, which is not what we want. We also have to use keepdims=True to tell PyTorch to keep means as an actual \(2 \times 1\) matrix rather than squishing it into an array of length 2.

Now that we know the means and standard deviations of each feature, we have to use them to transform the inputs via Equation \((\ref{eq:scaling})\). We have two options: we could implement the transformation directly by transforming x_dataset before training, or we could include the transformation within the PyTorch model computations. But since we need to do the transformation again when we want to predict output given new inputs, including it in the PyTorch computations will be a bit more convenient.

When we define the model, we want to first transform the input according to Equation \((\ref{eq:scaling})\):

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    x_transformed = (x_input - means) / deviations
    return A.mm(x_transformed) + b

One important concept is that when this line of code runs, means and deviations have already been computed, so to PyTorch they are constants. This means that PyTorch will not compute gradients for them.

Also, note that since x_input and means are compatibly sized matrices, the subtraction (and division) will be done separately for each feature, automatically. So while Equation \((\ref{eq:scaling})\) technically says how to scale one feature individually, this single line of code scales all the features.

Note that when we want to use the model, either for training or later prediction, we should feed unscaled values to the model, since the model already performs the scaling.

That's all that we need to implement for feature scaling, it's really pretty easy. If we run this code now we should see:

...
t = 1995, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1996, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1997, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1998, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1999, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5

The fact that none of the trained weights are updating anymore and that the loss function is very small is a good sign that we have achieved convergence. But the weights are not the values we expect... shouldn't we get A = [[2, 0.013]], b = 0, since that is the equation that generated the synthetic data? Actually, the weights we got are correct because the model is now being trained on the rescaled data set, so we get different weights at the end. See Challenge Problem 3 below to explore this more.

Concluding Remarks

We've seen how to implement feature scaling for a simple multi-variable linear regression model. While this is a fairly simple model, the principles are basically the same for applying feature scaling to more complex models (which we will see very soon). In fact, the code is pretty much identical: just compute the means and standard deviations, apply the formula to x_input to compute x_transformed, and anywhere you use x_input in the model, just use x_transformed instead.

Whether or not feature scaling helps is dependent on the problem and the model. If you have having trouble getting your model to converge, you can always implement feature scaling and see how it affects the training. While it wasn't used in this chapter, using TensorBoard (see chapter 2.2) is a great way to run experiments to compare training with and without feature scaling.

Challenge Problems

  1. Coding: Take one of the challenge problems from the previous chapter, and implement it with and without feature scaling, and then compare how training performs.
  2. Coding: Add TensorBoard support to the code from this chapter, and use it to explore for yourself the effect feature scaling has. Also, you can compare with different optimization algorithms.
  3. Theory: One problem with feature scaling is that we learn different model parameters. In this chapter, the original data set was generated with \(y = 2x_1 + 0.013x_2 + 0\), but the parameters that the model learned were \(a_1' = 6.0697703, a_2' = 3.9453504, b' = 16.5\). Note that I have called these learned parameters \(a_1'\) rather than \(a_1\) to indicate that these learned parameters are for the rescaled data.

    One important use of linear regression is to explore and understand a data set, not just predict outputs. After doing a regression, one can understand through the learned weights how severely each feature affects the output. For example, if we have two features \(A\) and \(B\) which are used to predict rates of cancer, and the learned weight for \(A\) is much larger than the learned weight for \(B\), this indicates that occurrence of \(A\) is more correlated with cancer than occurrence of \(B\) is, which is interesting in its own right. Unfortunately, performing feature scaling destroys this: since we have rescaled the training data, the weight for \(A\) (\(a_1'\)) has lost its relevance to the values of \(A\) in the real world. But all is not lost, since we can actually recover the "not-rescaled" parameters from the learned parameters, which we will derive now.

    First, suppose that we have learned the "rescaled" parameters \(a_1', a_2', b'\). That is, we predict the output \(y\) to be: \[\begin{equation}\label{eq:start}y = a_1' x_1' + a_2' x_2' + b'\end{equation}\]where \(x_1', x_2'\) are the rescaled features. Now, we want a model that uses some (yet unknown) weights \(a_1, a_2, b\) to predict the same output \(y\), but using the not-rescaled features \(x_1, x_2\). That is, we want to find \(a_1, a_2, b\) such that this is true:\[\begin{equation}\label{eq:goal}y = a_1 x_1 + a_2 x_2 + b\end{equation}\]To go about doing this, substitute for \(x_1'\) and \(x_2'\) using Equation \((\ref{eq:scaling})\) into Equation \((\ref{eq:start})\), and then compare it to Equation \((\ref{eq:goal})\) to obtain 3 equations: one to relate each scaled and not-scaled parameter. Finally, you can use these equations to be able to compute the "not-scaled" parameters in terms of the learned parameters. You can check your work by computing the not-scaled parameters in terms of the learned parameters for the synthetic data set, and verify that they match the expected values.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Load the data

# First we load the entire CSV file into an m x 3
D = torch.tensor(pd.read_csv("linreg-scaling-synthetic.csv", header=None).values, dtype=torch.float)

# We extract all rows and the first 2 columns, and then transpose it
x_dataset = D[:, 0:2].t()

# We extract all rows and the last column, and transpose it
y_dataset = D[:, 2].t()

# And make a convenient variable to remember the number of input columns
n = 2


### Feature Scaling computations

# Pre-compute the means and standard deviations of independent variables
means = x_dataset.mean(1, keepdim=True)
deviations = x_dataset.std(1, keepdim=True)


### Model definition ###

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    x_transformed = (x_input - means) / deviations
    return A.mm(x_transformed) + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([A, b], lr=0.1)

# Main optimization loop
for t in range(2000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to A and b.
    current_loss.backward()
    # Update A and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, A = {A.detach().numpy()}, b = {b.item()}")