Introduction

Welcome to Machine Learning with PyTorch, an easy to follow guide / blog about understanding machine learning concepts and their implementation in Python and PyTorch. My goal is to create an intuitive explanation of machine learning methods, paired with practical knowledge of implementation. This guide does assume some small working knowledge of programming, but Python specific syntax will be briefly introduced as needed. In addition, some mathematics background will be useful, in particular linear algebra and calculus, but is not needed to follow the intuition and implementation. Resources to review relevant math concepts will be provided as required. Finally, this guide is open source on GitHub, and I would be happy to consider edits, both tiny and large. Feel free to open pull requests with changes to improve this guide!

What exactly is machine learning?

Though the exact definition of machine learning is up for debate, we can roughly define machine learning as "a field of computer science that gives computers the ability to learn without being explicitly programmed" (Wikipedia). Traditionally, programming has been used to great effect to solve problems which have explicit steps. For example, consider the problem of determining if a number \(n\) is prime. We can easily determine if \(n\) is prime using fairly simple steps: check if \(n\) is divisible by \(2\), check if \(n\) is divisible by \(3\), \(\cdots\), check if \(n\) is divisible by \(n-1\). If \(n\) is divisible by any of these, then \(n\) is not prime, and otherwise it is prime. With traditional programming we then need to write these specific steps (an algorithm) in a programming language that our computer can interpret, and we have a working program which can classify numbers as prime much faster than a human ever could.

However, some tasks seem difficult or impossible to solve using specific steps like above. Rather than classifying prime numbers, let us instead try to classify images of dogs as either a Golden Retriever or Poodle.

Golden Retriever Poodle
Golden Retriever Image Poodle Image

Most people would easily be able to differentiate the two breeds above, and accurately dogs of those breeds. But, can you explain how you came to this conclusion, in very specific steps? Perhaps, one might say that the hair of the Golden Retriever seems less curly, but then one would need to explain how to determine that the hair is less curly. There does not seem to exist a natural way to describe the curliness of hair in specific steps. Maybe, the shape of facial features such as nose shape are useful for determining dog breed, but it still seems nearly impossible to describe a precise algorithm that uses this data.

Qualitatively, humans do not seem to comprehend the breed of a dog by performing some precise algorithm, and this suggests that a computer program to solve this problem should not either. By using machine learning, we can use data to help the computer learn to recognize a dog breed, objects in general, human speech, etc., without explicitly programming it.

So what can machine learning accomplish?

Lots! Machine learning is finding applications nearly everywhere, and it's still growing. Machine learning is full of constantly evolving research with new and exciting results, such as:

  1. Image Synthesis with Semantic Manipulation of Images:
  2. Natural Language Translation
  3. Magically Enhancing Image Resolution

For more, there is no shortage of inspiration and useful information on Reddit.

So, what is PyTorch, and why do we need it?

Without diving into the details of various machine learning methods, I'll generally state that there could be a fairly significant amount of math involved. Not necessarily super difficult math, but some pretty cumbersome math. Roughly, PyTorch takes care of the calculus grunt work for us: instead of spending a ton of time finding messy derivatives and many lines of code to compute them, PyTorch lets us focus on the fairly elegant work of defining our machine learning methods, and the more important math necessary to obtain intuition.

So, without further ado, we can get started by installing Python, PyTorch, and a few other useful tools. Just follow the appropriate directions for your operating system.

macOS Setup

Without a doubt, Python 3 is the future of Python, and thus we will use Python 3 with PyTorch. While I summarize the instructions to install PyTorch with Python 3 here, the official PyTorch documentation should always provide up-to-date and comprehensive instructions for installation as well.

Make sure Python 3 is installed

First, we need to make sure that Python 3 is installed. Your system might already have it installed. First, you can see if the python command is Python 3:

python --version

If the above is Python 3.x, then you can continue to the next section. Otherwise, you can see if python3 works:

python3 --version

If this still doesn't work, then you need to install Python 3. First, install Homebrew:

curl -fsSL "https://raw.githubusercontent.com/Homebrew/install/master/install" | /usr/bin/ruby

Then you can use Homebrew to install Python 3:

brew install python3

In the future, make sure you run the appropriate Python command for Python 3 (either python or python3), depending on your system.

Installing PyTorch and friends

First, we install numpy (used for linear algebra computations), matplotlib (useful for plotting) and pandas (useful for loading data sets):

pip3 install --upgrade numpy
pip3 install --upgrade matplotlib
pip3 install --upgrade pandas

Finally we can install PyTorch:

pip3 install --upgrade torch torchvision

Testing the installation

Let's just quickly test the installation, to verify everything is installed and ready to go! Run the python (or python3 command), and type the following code into Python:

import numpy as np
import pandas
import matplotlib.pyplot as plt
import torch
x = torch.rand(5, 3)
print(x)

If everything is installed correctly, you should see the output similar to:

tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])

If the above test produces errors, please carefully go through the official PyTorch installation guide to try to troubleshoot your problems.

Optional: Installing an IDE

At this point, everything we need for writing PyTorch code is installed. We can write our Python code in a text editor of our choice, and run it using python from terminal. In the future I won't assume use of an IDE specifically, but rather assume that you can create new Python files and run them, either from terminal or an IDE. However, some people may prefer using an IDE, which can reduce the time spent in terminal, and help provide better syntax and semantic checking of Python code. Personally, I use and recommend PyCharm, which you can download for free here. I won't provide specific installation instructions, but I found it very easy to install and configure.

Linux (Ubuntu) Setup

Without a doubt, Python 3 is the future of Python, and thus we will use Python 3 with PyTorch. While I summarize the instructions to install PyTorch with Python 3 here, the official PyTorch documentation should always provide up-to-date and comprehensive instructions for installation as well.

Make sure Python 3 is installed

First, we need to make sure that Python 3 is installed. Your system might already have it installed. First, you can see if the python command is Python 3:

python --version

If the above is Python 3.x, then you can continue to the next section. Otherwise, you can see if python3 works:

python3 --version

If this still doesn't work, then you need to follow your Linux distribution's method for installing Python 3. On Ubuntu this would be

sudo apt-get install python3-pip python3-dev

In the future, make sure you run the appropriate Python command for Python 3 (either python or python3), depending on your system.

Installing PyTorch and friends

First, we install numpy (used for linear algebra computations), matplotlib (useful for plotting) and pandas (useful for loading data sets):

pip3 install --upgrade numpy
pip3 install --upgrade matplotlib
pip3 install --upgrade pandas

Finally we can install PyTorch:

pip3 install --upgrade torch torchvision

Testing the installation

Let's just quickly test the installation, to verify everything is installed and ready to go! Run the python (or python3 command), and type the following code into Python:

import numpy as np
import pandas
import matplotlib.pyplot as plt
import torch
x = torch.rand(5, 3)
print(x)

If everything is installed correctly, you should see the output similar to:

tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])

If the above test produces errors, please carefully go through the official PyTorch installation guide to try to troubleshoot your problems.

Optional: Installing an IDE

At this point, everything we need for writing PyTorch code is installed. We can write our Python code in a text editor of our choice, and run it using python from terminal. In the future I won't assume use of an IDE specifically, but rather assume that you can create new Python files and run them, either from terminal or an IDE. However, some people may prefer using an IDE, which can reduce the time spent in terminal, and help provide better syntax and semantic checking of Python code. Personally, I use and recommend PyCharm, which you can download for free here. I won't provide specific installation instructions, but I found it very easy to install and configure.

Windows Setup

To be blunt, I only use macOS and Linux, and I have no experience installing PyTorch or even Python on Windows. Instead, I will simply direct you to the official documentation for installing PyTorch.

When installing Python, please ensure that you use Python 3, rather than Python 2. Also, install the Python packages NumPy, Pandas and matplotlib in addition to PyTorch.

Optional: Installing an IDE

At this point, everything we need for writing PyTorch code in installed. We can write our Python code in a text editor of our choice, and run it using python from terminal. In the future I won't assume use of an IDE specifically, but rather assume that you can create new Python files and run them, either from terminal or an IDE. However, some people may prefer using an IDE, which can reduce the time spent in terminal, and help provide better syntax and semantic checking of Python code. Personally, I use and recommend PyCharm, which you can download for free here. I won't provide specific installation instructions, but I found it very easy to install and configure.

Linear Regression

Roughly, there are two categories of machine learning:

  1. In supervised learning one already has access to a data set of example inputs and desired outputs, and teaches the computer to produce the desired outputs given the example inputs.
  2. In unsupervised learning one only has access to a data set: there is not necessarily any notion of "input" and "output". The goal is to teach the computer to autonomously discover underlying patterns in the data set.

For example, if we have a data set containing students' GPAs and SAT scores, we could attempt to train a supervised machine learning model to predict SAT scores given GPAs. Here, GPA would be the input, and SAT score would be the desired output. On the other hand, instead of trying to predict SAT scores based on GPA, we could use both SAT and GPA as input, and see if the computer can discover interesting patterns. In the plots below, supervised learning can easily predict the output variable based on the input variable via a straight line. A good unsupervised learning algorithm could hopefully discover the two clusters in the right plot.

Supervised Learning Unsupervised Learning
Regression Image Clustering Image

Unsupervised machine learning will be touched on later, but for now we focus on supervised machine learning: in general it is easier, and the objective is clearer. Specifically, there are two main types of supervised learning:

  1. Regression aims to predict a continuous output, given the inputs.
  2. Classification aims to predict a classification, given the inputs.

The GPA and SAT score example from above is an example of regression, since we are trying to predict the continuous output variable of SAT scores. We aren't trying to classify students based on GPAs, we just want to estimate what their SAT score will be. On the other hand, suppose we also have data about UC Davis college admissions: we could then try to predict whether or not a student will be accepted based on their GPA and SAT combined. This would be an example of classification, since the output of the prediction is not continuous, it is a classification (accepted or rejected). In the example plots below, the left plot shows predicting continuous SAT from GPA, while the right plot shows distinguishing UC Davis rejected applicants (marked with 'x') from UC Davis accepted applicants (marked with 'o'). The plot looks similar to the unsupervised plot from above, but the crucial difference is that we know have data about who was accepted and rejected.

Regression Classification
Regression Image Classification Image

In this chapter we look at regression only, and specifically the simplest form: linear regression. We start with the most basic form of linear regression (single variable linear regression), and then move on to consider variants of linear regression. Using these variants we can obtain models that are both fairly simple and powerful.

Single Variable Regression

Since this is the very first tutorial in this guide and no knowledge is assumed about machine learning or PyTorch, this tutorial is a bit on the long side. This tutorial will give you an overview of how to do machine learning work in general, a mathematical understanding of single variable linear regression, and how to implement it in PyTorch. If you already feel comfortable with the mathematical concept of linear regression, feel free to skip to the PyTorch implementation.

Motivation

Single variable linear regression is one of the fundamental tools for any interpretation of data. Using linear regression, we can predict continuous variable outcomes given some data, if the data has a roughly linear shape, i.e. it generally has the shape a line. For example, consider the plot below of 2015 US homicide deaths per age1, and the line of best fit next to it.

Original data Result of single variable linear regression
Homicide Plot Homicide Regression Plot

Visually, it appears that this data is approximated pretty well by a "line of best fit". This is certainly not the only way to approximate this data, but for now it's pretty good. Single variable linear regression is the tool to find this line of best fit. The line of best fit can then be used to guess how many homicide deaths there would be for ages we don't have data on. By the end of this tutorial you can run linear regression on this homicide data, and in fact solve any single variable regression problem.

Theory

Since we don't have any theory yet to understand linear regression, first we need to develop the theory necessary to program it.

Data set format

For regression problems, the goal is to predict a continuous variable output, given some input variables (also called features). For single variable regression, we only have one input variable, called \(x\), and our desired output \(y\). Our data set \(D\) then consists of many examples of \(x\) and \(y\), so: \[ D = \{ (x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m) \} \] where \(m\) is the number of examples in the data set. For a concrete example, the homicide data set plotted above looks like: \[ D = \{ (21, 652), (22, 633), \cdots, (50, 197) \} \] We will write code to load data sets from files later.

Model concept

So, how can we mathematically model single linear regression? Since the goal is to find the perfect line, let's start by defining the model (the mathematical description of how predictions will be created) as a line: \[ y'(x, a, b) = ax + b \] where \(x\) is an input, \(y'\) is the prediction for the input \(x\), and \(a\) and \(b\) are model parameters. Note that although this is an equation for a line with \(x\) as the variable, the values of \(a\) and \(b\) determine what specific line it is. To find the best line, we just need to find the best values for \(a\) (the slope) and \(b\) (the y-intercept). For example, the line of best fit for the homicide data above has a slope of about \(a \approx -17.69\) and a y-intercept of \(b \approx 1000\). How we find the magic best values for \(a\) and \(b\) we don't know yet, but once we find them, prediction is easy, since we just use the formula above.

So, how do we find the correct values of the model parameters \(a\) and \(b\)? First, we need a way to define what the "best line" is exactly. To do so, we define a loss function (also called a cost function), which measures how bad a particular choice of \(a\) and \(b\) are. Values of \(a\) and \(b\) that seem poor (a line that does not fit the data set) should result in a large value of the loss function, whereas good values of \(a\) and \(b\) (a line that fits the data set well) should result in small values of the loss function. In other words, the loss function should measure how far the predicted line is from each of the data points, and add this value up for all data points. We can write this as: \[ L(a, b) = \sum_{i=1}^m (y'(x_i, a, b) - y_i)^2 \] Recall that there are \(m\) examples in the data set, \(x_i\) is the i'th input, and \(y_i\) is the i'th desired output. So, \((y'(x_i, a, b) - y_i)^2\) measures how far the i'th prediction is from the i'th desired output. For example, if the prediction \(y'\) is 7, and the correct output \(y\) is 10, then we would get \((7 - 10)^2 = 9.\) Squaring it is important so that it is always positive. Finally, we just add up all of these individual losses. Since the smallest possible values for the squared terms indicate that the line fits the data as closely as possible, the line of best fit (determined by the choice of \(a\) and \(b\)) occurs exactly at the smallest value of \(L(a, b)\). For this reason, the model is also called least squares regression.

Note: The choice to square \(y'(x_i, a, b) - y_i\) is somewhat arbitrary. Though we need to make it positive, we could achieve this in many ways, such as taking the absolute value. In a sense, the choice of models and loss functions is one of the creative aspect of machine learning, and often a certain loss function is chosen simply because it produces satisfying results. Manipulating the loss function to achieve more satisfying results will be done in a later chapter. However, there is also a concrete mathematical reason for why we specifically square it. It's too much to go into here, but it will be explored in bonus chapter 2.8

Creating loss functions (and this exact loss function) will continue to be used throughout this guide, from the most simple to more complex models.

Optimizing the model

At this point, we have fully defined both our model: \[ y'(x, a, b) = ax + b \] and our loss function, into which we can substitute the model: \[ L(a, b) = \sum_{i=1}^m (y'(x_i, a, b) - y_i)^2 = \sum_{i=1}^m (a x_i + b - y_i)^2 \] We crafted \(L(a, b)\) so that it is smallest exactly when each predicted \(y'\) is as close as possible to actual data \(y\). When this happens, since the distance between the data points and predicted line is as small as possible, using \(a\) and \(b\) produces the line of best fit. Therefore, our goal is to find the values of \(a\) and \(b\) that minimize the function \(L(a, b)\). But what does \(L\) really look like? Well, it is essentially a 3D parabola which looks like:

Minimum Plot

The red dot marked on the plot of \(L\) shows where the desired minimum is. We need an algorithm to find this minimum. From calculus, we know that at the minimum \(L\) must be entirely flat, that is the derivatives are both \(0\): \[ \frac{\partial L}{\partial a} = \sum_{i=1}^m 2(ax_i + b - y_i)x_i = 0 \\ \frac{\partial L}{\partial b} = \sum_{i=1}^m 2(ax_i + b - y_i) = 0 \ \] If you need to review this aspect of calculus, I would recommend Khan Academy videos. Now, for this problem it is possible to solve for \(a\) and \(b\) using the equations above, like we would in a typical calculus course. But for more advanced machine learning this is impossible, so instead we will learn to use an algorithm called gradient descent to find the minimum. The idea is intuitive: place a ball at an arbitrary location on the surface of \(L\), and it will naturally roll downhill towards the flat valley of \(L\) and thus find the minimum. We know the direction of "downhill" at any location since we know the derivatives of \(L\): the derivatives are the direction of greatest upward slope (this is known as the gradient), so the opposite (negative) derivatives are the most downhill direction. Therefore, if the ball is currently at location \((a, b)\), we can see where it would go by moving it to location \((a', b')\) like so: \[ a' = a - \alpha \frac{\partial L}{\partial a} \\ b' = b - \alpha \frac{\partial L}{\partial b} \\ \] where \(\alpha\) is a constant called the learning rate, which we will talk about more later. If we repeat this process then the ball will continue to roll downhill into the minimum. An animation of this process looks like:

Gradient Descent Animation

When we run the gradient descent algorithm for long enough, then it will find the optimal location for \((a, b)\). Once we have the optimal values of \(a\) and \(b\), then that's it, we can just use them to predict a rate of homicide deaths given any age, using the model: \[ y'(x) = ax + b \]

Implementation

Let's quickly review what we did when defining the theory of linear regression:

  1. Describe the data set
  2. Define the model
  3. Define the loss function
  4. Run the gradient descent optimization algorithm
  5. Use the optimal model to make predictions
  6. Profit!

When coding this we will follow the exact same steps. So, create a new file single_var_reg.py in the text editor or IDE of your choice (or experiment in the Python REPL by typing python at command line), and download the homicide death rate data set into the same directory.

Importing the data

First, we need to import all the modules we will need:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

We use Pandas to easily load the CSV homicide data, and convert them to a PyTorch tensor:

D = pd.read_csv("homicide.csv")
x_dataset = torch.tensor(D.age.values, dtype=torch.float)
y_dataset = torch.tensor(D.num_homicide_deaths.values, dtype=torch.float)

Note that x_dataset and y_dataset are not single numbers, but are actually what are called tensors. Roughly, a tensor is just an \(n\) dimensional grid of numbers, so a 1-tensor is a vector, i.e. a single list of numbers. Both x_dataset and y_dataset are 1-tensors (vectors). The vectors are each 30 numbers long, since there are 30 data points in the CSV file. So, (x_dataset[0], y_dataset[0]) would be \((x_1, y_1) = (21, 652)\). When we look at multi variable regression later, we will have to work much more with matrices and linear algebra, but for now you only need to worry about vectors.

Whenever possible, I would recommend plotting data, since this helps you verify that you loaded the data set correctly and gain visual intuition about the shape of the data. This is also pretty easy using matplotlib:

plt.plot(x_dataset.numpy(), y_dataset.numpy(), 'x') # The 'x' means that data points will be marked with an x
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Relationship between age and homicide deaths in the US')
plt.show()

To plot the data in the PyTorch tensors, we need to convert them to NumPy arrays (since that is what matplotlib expects). We achieve this with .numpy(). If all goes well, the plot should look like this:

Homicide Plot

You need to close the plot for your code to continue executing.

Defining the model

We have our data prepared and plotted, so now we need to define our model. Recall that the model equation is: \[ y' = ax + b \] Before, we thought of \(x\) and \(y'\) as single numbers. However, we just loaded our data set as vectors (lists of numbers), so it will be much more convenient to define our model using vectors instead of single numbers. If we use the convention that \(x\) and \(y'\) are vectors, then we don't need to change the equation, just our interpretation of it. Multiplying the vector \(x\) by the single number \(a\) just multiplies every number in \(x\) by \(a\), and likewise for adding \(b\). So, the above equation interpreted using vectors is the same thing as: \[ \begin{bmatrix} y_{1}', & y_{2}', & \dots, & y_{m}' \end{bmatrix} = \begin{bmatrix} ax_{1} + b, & ax_{2} + b, & \dots, & ax_{m} + b \end{bmatrix} \]

Fortunately, PyTorch does the work for us of interpreting the simple equation \(y' = ax + b\) as the more complicated looking vector equation. First, we define our model parameters \(a\) and \(b\):

a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

This says that we create a and b to be PyTorch trainable variables. When we define the model as \(y' = ax + b\) there is a crucial point that \(a\) and \(b\) are the variables which we must update during training, while \(x\) is not updated. This means that we want PyTorch to compute the gradients of the loss function with respect to \(a\) and \(b\), but not with respect to \(x\). We have to tell PyTorch to keep track of gradients for a and b by setting requires_grad=True. In short, make sure you use requires_grad=True for any variable that you want to be updated during training.

Lastly, the function torch.randn(1, ...) creates a (and b) to be a vector of length 1 (i.e. just a number), with the initial value chosen randomly from the normal distribution (the precise initial values don't matter).

With the model parameters defined, we can now define \(y'\), the output of our model:

def model(x_input):
    return a * x_input + b

Note that model is a Python function, which will take as input a vector x_input, multiply it by the single number a and add on the single number b. This is where PyTorch does the work for us of reinterpreting our model equation to work using vectors. Finally, the result that is computed is a PyTorch tensor (in this case a vector).

And that's it to define the model!

Defining the loss function

We have the model defined, so now we need to define the loss function. Recall that the loss function is how the model is evaluated (smaller loss values are better), and it is also the function that we need to minimize in terms of \(a\) and \(b\). Previously we said the loss function was: \[ L(a, b) = \sum_{i=1}^m (y'(x_i, a, b) - y_i)^2 \]

However, \(y'\) and \(y\) are now being interpreted as vectors. We can rewrite the loss function as: \[ L(a, b) = \mathrm{sum}((y' - y)^2) \]

Note that since \(y'\) and \(y\) are vectors, \(y' - y\) is also a vector that just contains every number stored in \(y'\) minus every corresponding number in \(y'\). Likewise, \((y' - y)^2\) is also a vector, with every number individually squared. Then, the \(\mathrm{sum}\) function (which I just made up) adds up every number stored in the vector \((y' - y)^2\). This is the same as the original loss function, but is a vector interpretation of it instead. We can code this directly as a Python function:

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

The .sum() function is an operation which adds up all the numbers stored in a vector. With just these two lines of code we have defined our loss function.

Minimizing the loss function with gradient descent

With our model and loss function defined, we are now ready to use the gradient descent algorithm to minimize the loss function, and thus find the optimal \(a\) and \(b\). Fortunately, PyTorch has already implemented the gradient descent algorithm for us, we just need to use it. The algorithm acts almost like a ball rolling downhill into the minimum of the function, but it does so in discrete time steps. PyTorch does not handle this aspect, we need to be responsible for performing each time step of gradient descent. So, roughly in pseudo-code we want to do this:

for t in range(10000):
    # Tell PyTorch to do 1 time step of gradient descent

We can't do this yet, since we don't yet have a way to tell PyTorch to perform 1 time step of gradient descent. To do so, we create an optimizer with a learning rate (\(\alpha)\) of \(0.2\):

optimizer = optim.Adam([a, b], lr=0.2)

The optim.Adam optimizer knows how to perform the gradient descent algorithm for us (actually a faster version of gradient descent). Note that this does not yet minimize \(L\). This code only create an optimizer object which we will use to minimize \(L\). Note that we indicate which variables we want the optimizer to optimize (that is, modify). For an explanation of the lr=0.2 parameter (and the 10000 loop iterations), see the end of this tutorial.

Using our new optimizerobject we are ready to write the optimization loop pseudo-code that we originally wanted. Let's look at the code first, and then break it down:

for t in range(10000):
    optimizer.zero_grad() # 1.
    y_predicted = model(x_dataset) # 2.
    current_loss = loss(y_predicted, y_dataset) # 3.
    current_loss.backward() # 4.
    optimizer.step() # 5.
    print(f"t = {t}, loss = {current_loss}, a = {a}, b = {b}") # 6.

Let's walk through each of the 6 steps to understand all that is happening here:

  1. Under the hood, PyTorch keeps track of a gradient for each variable and step of the model and loss computation. The first thing we do is set all of these stored gradients to 0, so that we don't reuse any previous, old gradient computations.
  2. Using the current values of a and b we compute the predictions of the model.
  3. We then compute the value of our loss function for the predictions we just made.
  4. At this point we have the current value of the loss \(L\). What we want to do is compute \(\frac{\partial L}{\partial a}\) and \(\frac{\partial L}{\partial b}\). We ask PyTorch to do this for us using .backward(). The name comes from the fact that in order to find the derivatives, PyTorch works "backward", starting with the loss and working back to \(a\) and \(b\). However, the details of how PyTorch computes it are not that important right now. What matters is that .backward() does this desired computation, and stores the results somewhere (you can see the results by doing a.grad if you wish).
  5. Crucially though, .backward() does NOT actually update the values of a and b. Instead, we ask the optimizer to update a and b, based on the currently computed gradients.
  6. Finally we just optionally print out some current info so we can observe the training.

What we want to see from the print statements is that the gradient descent algorithm converged, which means that the algorithm stopped making significant progress because it found the minimum location of the loss function. When the last few print outputs look like:

t = 9992, loss = 39295.83984375, a = -17.27234649658203, b = 997.331298828125
t = 9993, loss = 39295.8359375, a = -17.272356033325195, b = 997.3316040039062
t = 9994, loss = 39295.84765625, a = -17.27235984802246, b = 997.3319091796875
t = 9995, loss = 39295.828125, a = -17.272371292114258, b = 997.3322143554688
t = 9996, loss = 39295.8359375, a = -17.272380828857422, b = 997.33251953125
t = 9997, loss = 39295.8359375, a = -17.272382736206055, b = 997.3328247070312
t = 9998, loss = 39295.84375, a = -17.272397994995117, b = 997.3331298828125
t = 9999, loss = 39295.82421875, a = -17.272401809692383, b = 997.3334350585938

then we can tell that we have achieved convergence, and therefore found the best values of \(a\) and \(b\).

Using the trained model to make predictions

At this point we have a fully trained model, and know the best values of \(a\) and \(b\). In fact, the equation of the line of best fit is just: \[ y' = -17.2711x + 997.285 \]

The last remaining thing for this tutorial is to plot the predictions of the model on top of a plot of the data. First, we need to create a bunch of input ages that we will predict the homicide rates for. We could use x_dataset as the input ages, but it is more interesting to create a new vector of input ages, since then we can predict homicide rates even for ages that were not in the data set. Outside of the training for loop, we can use the function linspace to create a bunch of evenly spaced values between 20 and 55:

# x_test_data has values similar to [20.0, 20.1, 20.2, ..., 54.9, 55.0]
x_test_data = torch.linspace(20, 55)

Then, we can compute the model's prediction:

y_test_prediction = model(x_test_data).detach()

Note that we must use .detach() to tell PyTorch not to perform gradient calculations for this computation. Finally, we can plot the original data and the line together:

plt.plot(x_dataset.numpy(), y_dataset.numpy(), 'x')
plt.plot(x_test_data.numpy(), y_test_prediction.numpy())
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Age and homicide death linear regression')
plt.show()

This yields a plot like:

Homicide Linear Regression Plot

Concluding Remarks

Through following this post you have learned two main concepts. First, you learned the general form of supervised machine learning workflows:

  1. Get your data set
  2. Define your model (the mechanism for how outputs will be predicted from inputs)
  3. Define your loss function
  4. Minimize your loss function (usually with a variant of gradient descent, such as optim.Adam)
  5. Once your loss function is minimized, use your trained model to do cool stuff

Second, you learned how to implement linear regression (following the above workflow) using PyTorch. Let's briefly discuss the above 5 steps, and where to go to improve on them.

1. The Data Set

This one is pretty simple: we need data sets that contain both input and output data. However, we need a data set that is large enough to properly train our model. With linear regression this is fairly easy: this data set only had 33 data points, and the results were pretty good. With larger and more complex models that we will look at later, this becomes much more of a challenge.

2. Defining the model

For single variable linear regression we used the model \(y' = ax + b\). Geometrically, this means that the model can only guess lines. Since the homicide data is roughly in the shape of a line, it worked well for this problem. But there are very few problems that are so simple, so soon we will look at more complex models. One other limitation of the current model is it only accepts one input variable. But if our data set had both age and ethnicity, for example, perhaps we could more accurately predict homicide death rate. We will also discuss soon a more complex model that handles multiple input variables.

3. Defining the loss function

For single variable regression, the loss function we used, \(L = \sum (y' - y)^2\), is the standard. However, there are a few considerations: first, this loss functions is suitable for this simple model, but with more advanced models this loss function isn't good enough. We will see why soon. Second, the optimization algorithm converged pretty slowly, needing about \(10000\) iterations. One cause is that the surface of the loss function is almost flat in a certain direction (you can see this in the 3D plot of it). Though this isn't inherently a problem with the formula for the loss function, the problem surfaces in the loss function. We will also see how to address this problem soon, and converge much faster.

4. Minimizing the loss functions

Recall that we created and used the optimizer like so:

optimizer = optim.Adam([a, b], lr=0.2)
for t in range(10000):
    # Run one step of optimizer

You might be wondering what the magic numbers of lr=0.2 (\( \alpha \)) and 10000 are. Let's start with the learning rate. In each iteration, gradient descent (and variants of it) take one small step that is determined by the derivative of the loss function. The learning rate is just the relative size of the step. So to take larger steps, we can use a larger learning rate. A larger learning rate can help us to converge more quickly, since we cover more distance in each iteration. But a learning rate too large can cause gradient descent to diverge that is, it won't reliably find the minimum.

So, once you have chosen a learning rate, then you need to run the optimizer for enough iterations so it actually converges to the minimum. The easiest way to make sure it runs long enough is just to monitor the value of the loss function, as we did in this tutorial.

Lastly, we didn't use normal gradient descent for optimization in this tutorial. Instead we used optim.Adam. With small scale problems like this, there isn't much of a qualitative difference to intuit. In general the Adam optimizer is faster, smarter, and more reliable than vanilla gradient descent, but this comes into play a lot more with harder problems.

5. Use the trained model

Technically, using the trained model is the easiest part of machine learning: with the best parameters \(a\) and \(b\), you can simply plug new age values into \(x\) to predict new homicide rates. However, trusting that these predictions are correct is another matter entirely. Later in this guide we can look at various statistical techniques that can help determine how much we can trust a trained model, but for now consider some oddities with our trained homicide rate model.

One rather weird thing is that it accepts negative ages: according to the model, 1083 people who are -5 years old die from homicide every year in the US. Now, clearly this makes no sense since people don't have negative ages. So perhaps we should only let the model be valid for people with positives ages. Ok, so then 980 people who are 1 year old die from homicide every year. While this isn't impossible, it does seem pretty high compared to the known data of 652 for 21 year olds. It might seem possible (likely even) that fewer homicides occur for 1 year olds than 21 year olds: but we don't have the data for that, and even if we did, our model could not predict it correctly since it only models straight lines. Without more data, we have no basis to conclude that the number of \(1\) year old homicides is even close to 980.

While this might seem like a simple observation in this case, this problem manifests itself continually in machine learning, causing a variety of ethical problems. For example, in 2016 Microsoft released a chatbot on Twitter and it quickly learned to say fairly horrible and racist things. More seriously, machine learning is now being used to predict and guide police in cracking down on crime. While the concept might be well-intentioned, the results are despicable, as shown in an article by The Conversation:

Our recent study, by Human Rights Data Analysis Group’s Kristian Lum and William Isaac, found that predictive policing vendor PredPol’s purportedly race-neutral algorithm targeted black neighborhoods at roughly twice the rate of white neighborhoods when trained on historical drug crime data from Oakland, California. [...] But estimates – created from public health surveys and population models – suggest illicit drug use in Oakland is roughly equal across racial and income groups. If the algorithm were truly race-neutral, it would spread drug-fighting police attention evenly across the city.

With examples like these, we quickly move from a technical discussion about machine learning to a discussion about ethics. While the study of machine learning is traditionally heavily theoretical, I strongly believe that to effectively and fairly apply machine learning in society, we must spend significant effort evaluating the ethics of machine learning models.

This is an open question, and one that I certainly don't have an answer to right now. For the short term we can focus on the problem of not trusting a simple linear regression model to properly predict data outside of what it has been trained on, while in the long term keeping in mind that "with great power comes great responsibility."

Challenge Problems

Feel free to complete as many of these as you wish, to get more practice with single variable linear regression. Note that for different problems you might have to adjust the learning rate and / or the number of training iterations.

  1. Learn how to use PyTorch to generate an artificial data set that is appropriate for single variable linear regression, and then train a model on it. As a hint, for any \(x\) value you could create an artificial \(y\) value like so: \(y = ax + b + \epsilon \), where \(\epsilon\) is a random number that isn't too big, and \(a\) and \(b\) are fixed constants of your choice. If done correctly, your trained model should learn by itself the numbers you chose for \(a\) and \(b\).
  2. Run single variable linear regression on a data set of your choice. You can look at my list of regression data sets for ideas, you can search Kaggle, or you can search online, such as I did for the homicide data set. Many data sets might have multiple input variables, and right now you only know how to do single variable linear regression. We will deal with multiple variables soon, but for now you can always use only 1 of the input variables and ignore the rest.
  3. Experiment with altering the loss function, and observe the effects on the trained model. For example, you could change \((y' - y)^2\) to \(\mid y' - y \mid \) (you will need to lookup PyTorch documentation for the corresponding PyTorch functions), or really anything you can think of.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

# Load the data
D = pd.read_csv("homicide.csv")
x_dataset = torch.tensor(D.age.values, dtype=torch.float)
y_dataset = torch.tensor(D.num_homicide_deaths.values, dtype=torch.float)

# Plot the data
plt.plot(x_dataset.numpy(), y_dataset.numpy(), 'x') # The 'x' means that data points will be marked with an x
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Relationship between age and homicide deaths in the US')
plt.show()



### Model definition ###

# First we define the trainable parameters a and b 
a = torch.randn(1, requires_grad=True) # requires_grad means it is trainable
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return a * x_input + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([a, b], lr=0.2)

# Main optimization loop
for t in range(10000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to a and b.
    current_loss.backward()
    # Update a and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, a = {a.item()}, b = {b.item()}")


### Using the trained model to make predictions ###

# x_test_data has values similar to [20.0, 20.1, 20.2, ..., 54.9, 55.0]
x_test_data = torch.linspace(20, 55)
# Predict the homicide rate for each age in x_test_data
# .detach() tells PyTorch to not find gradients for this computation.
y_test_prediction = model(x_test_data).detach()

# Plot the original data and the prediction line
plt.plot(x_dataset.numpy(), y_dataset.numpy(), 'x')
plt.plot(x_test_data.numpy(), y_test_prediction.numpy())
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Age and homicide death linear regression')
plt.show()

References

1

Centers for Disease Control and Prevention, National Center for Health Statistics. Underlying Cause of Death 1999-2015 on CDC WONDER Online Database, released December, 2016. Data are from the Multiple Cause of Death Files, 1999-2015, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at https://wonder.cdc.gov/ucd-icd10.html on Nov 22, 2017 2:18:46 PM.

Exploring Optimization Convergence

In the previous chapter we created a simple single variable linear regression model to fit a data set. While the Python code was actually fairly short and simple, I did gloss over some details related to the optimization, and I hope to use this short chapter to answer some dangling questions about it. Since this chapter doesn't introduce new models or concepts you can skip it (or come back later) if you prefer. However, getting a feel for optimization is useful for training just about any model, not just the single variable linear regression model, and this chapter should give you insight that is useful for the rest of this guide and just about any machine learning you do.

To explore optimization we are going to exactly copy the code from the previous chapter, and experiment with it. To review, let's look at the part of the code from before that performed the optimization (just look at the previous chapter if you need to review the entire thing):

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([a, b], lr=0.2)

# Main optimization loop
for t in range(10000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to a and b.
    current_loss.backward()
    # Update a and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, a = {a.item()}, b = {b.item()}")

The main unanswered questions to address are:

  1. I said that the learning rate affects how large of steps the optimization algorithm takes in one unit of time. But how do we choose an appropriate value for the learning rate, such as 0.2?
  2. What is this AdamOptimizer exactly, are there other choices for optimizers, and how do they differ?
  3. Currently this code runs for 10000 iterations, and that seems good enough to fully optimize L. But how do we choose this appropriate amount of time for training?

There aren't exact or easy answers to the above questions, but answering these questions is made even harder by the fact that we can not effectively visualize the training progress with the code we have. Currently we have some print(...) statements, which is good enough to see that the training error is decreasing, but not much more than that. Let's start with learning how to visualize training, since this will help us address the other questions and give us a deeper intuition about optimization.

How to visualize training progress

Extracting hyperparameters

One of the simplest ways to visualize training progress is to plot the value of the loss function over time. We could certainly plot the value of the loss function using matplotlib, like we plotted the data set. But PyTorch actually lets us plot training progress conveniently in real time by communicating with a tool called TensorBoard. It's pretty handy, but first we need to do some work to refactor our current code.

The first thing I'd like to do is move hyperparameters to the top of the script. What exactly are hyperparameters? They are parameters that affect how the training of the model proceeds, but are not part of the model itself. For example, \(a\) and \(b\) are parameters, while the learning rate \(\alpha\) is a hyperparameter. The hyperparameters we have are:

  1. The learning rate, currently 0.2.
  2. The number of training iterations, currently 10000.
  3. The choice of optimization routine, currently tf.train.AdamOptimizer (This isn't often thought of as a hyperparameter, but here we'll think of it as one).

So, we can just put these into constants at the top of the code. Also, we are going to change these values to give us a simpler starting point:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Hyperparameters ###

LEARNING_RATE = 0.000001 # Much smaller training rate, to make sure the optimization is at least reliable.
NUM_ITERS = 20000 # Twice as many training iterations, just gives us more room to experiment later.
OPTIMIZER_CONSTRUCTOR = optim.SGD # This is the simplest optimization algorithm.

# .. all the rest of the code, hyperparameter constants properly substituted

Installing TensorBoard

Next we need to actually install TensorBoard on our system. The following command should do it:

pip3 install tb-nightly future

Setting up TensorBoard

Now, we need to add some code to configure TensorBoard for our model. We will need to add a new import:

from torch.utils.tensorboard import SummaryWriter

We want to tell PyTorch where to save a log of all the parameter and loss values. We setup the log location based on the values of the hyperparameters:

# ... above this is where loss is defined

### TensorBoard Writer Setup ###

log_name = f"{LEARNING_RATE}, {OPTIMIZER_CONSTRUCTOR.__name__}"
writer = SummaryWriter(log_dir=f"runs/{log_name}")
print("To see tensorboard, run: tensorboard --logdir=runs/")

# ... below this is where we create the optimizer and session

The point of basing the log location names on the hyperparameter values is so that we can later compare logs that used different hyperparameters.

The last thing we need to do is make PyTorch actually write the values of a, b and the loss to the log as training progresses. In our training code before we had a print statement that showed how training was going, but now we can replace it (or keep it, you choose):

# Main optimization loop
for t in range(NUM_ITERS):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to a and b.
    current_loss.backward()
    # Update a and b accordingly.
    optimizer.step()

    # This is where we write the current values of a, b, and loss to the log.
    # global_step=t tells tensorboard at what step of the training this is.
    writer.add_scalar('a', a, global_step=t)
    writer.add_scalar('b', b, global_step=t)
    writer.add_scalar('L', current_loss, global_step=t)

# After we are done with the writer, we should close the log file.
writer.close()

And as a final touch, we can also delete or comment out the code that plots the data set, since it reduces the amount of clicking we have to do when trying different hyperparameters. The complete modified code should look something like this:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter

### Hyperparameters ###

LEARNING_RATE = 0.000001 # Much smaller training rate, to make sure the optimization is at least reliable.
NUM_ITERS = 20000 # Twice as many training iterations, just gives us more room to experiment later.
OPTIMIZER_CONSTRUCTOR = optim.SGD # This is the simplest optimization algorithm.

# Load the data
D = pd.read_csv("homicide.csv")
x_dataset = torch.tensor(D.age.values, dtype=torch.float)
y_dataset = torch.tensor(D.num_homicide_deaths.values, dtype=torch.float)


### Model definition ###

# First we define the trainable parameters a and b 
a = torch.randn(1, requires_grad=True) # requires_grad means it is trainable
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return a * x_input + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()


### TensorBoard Writer Setup ###

log_name = f"{LEARNING_RATE}, {OPTIMIZER_CONSTRUCTOR.__name__}"
writer = SummaryWriter(log_dir=f"runs/{log_name}")
print("To see tensorboard, run: tensorboard --logdir=runs/")


### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = OPTIMIZER_CONSTRUCTOR([a, b], lr=LEARNING_RATE)

# Main optimization loop
for t in range(NUM_ITERS):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to a and b.
    current_loss.backward()
    # Update a and b accordingly.
    optimizer.step()

    writer.add_scalar('a', a, global_step=t)
    writer.add_scalar('b', b, global_step=t)
    writer.add_scalar('L', current_loss, global_step=t)

writer.close()

Using TensorBoard

With the Python code prepared, we can now run it and use TensorBoard to visualize the training. First, run the Python code as usual, using Terminal, or your IDE. This will write logs to runs/{log_name} (relative to the directory from where you run the code). Now, we can open this up with TensorBoard, by running this command in Terminal:

tensorboard --logdir=runs/

Once you run that command, go to http://localhost:6006 in your web browser, and you should see plots of L, a, and b as the training progressed. With the hyperparameter choices from above, the plots should look like:

TensorBoard 1

Interpreting the TensorBoard plots

Ok, so now we have some plots showing the progression of L, a, and b. What can these plots tell us? Well, with the current choice of LEARNING_RATE = 0.000001, the plot for L clearly continues to decrease continually during training. This is good, since this means that the optim.SGD is doing it's job of decreasing the value of the loss function. However, the loss function continues to decrease quickly even towards the end of training: it is reasonable to expect that the loss function would continue to decrease substantially if we continued to train for more iterations. Therefore, we have not found the minimum of the loss function.

To remedy this, we could do one of 3 things: run the training for more iterations, increase the learning rate, or experiment with another optimization algorithm. We don't want to train for more iterations unless we have to, since that just takes more time, so we will start with increasing the learning rate.

Increasing the learning rate

Currently the learning rate is 0.000001, and is too small. A pretty easy way to try new learning rates is to go roughly by powers of 10. For now, we can try 0.000001, 0.000005, 0.00001. For each of these, you can just tweak the LEARNING_RATE constant, and re-run the code. You don't need to re-run the TensorBoard command (but you can), but make sure to reload the http://localhost:6006 page once you do all the runs. You should get plots that look like:

TensorBoard 2

We can clearly see the improvement by increasing the learning rate. The final loss obtained with a learning rate of 0.00001 is much smaller than our original loss obtained with a learning rate of 0.000001. In addition, we can see that the loss is decreasing more slowly at the end of training. However, the loss function still hasn't converged, as it is still decreasing significantly. Again, to fix this we could train for longer, but as before we can try increasing the learning rate even more. We can try a learning rate of 0.00005, and we get this:

TensorBoard 3

Huh? No plot!?!?? If you mouse over the L plot, TensorBoard will say that the value of L is NaN (not a number). What gives? Well, if the learning rate is too big then optim.SGD can explode: a and b (and consequently L) increase to infinity and become NaN. What "too big" is depends on the specific problem.

At this point there isn't a lot more that we can do by just tweaking the learning rate: it's either too big and causes the optimization to explode, or is too small to achieve convergence in 20000 iterations. We can certainly try more learning rates between 0.00001 and 0.00005, but it won't be a ton better than what we already have.

Using different optimization algorithms

So far we have been using optim.SGD. It is the simplest, classic way to iteratively train machine learning models. As discussed in the previous chapter, it is like moving a ball downhill, according to the current slope (aka the derivative). Generally, gradient descent is part of a class of optimization algorithms called first-order methods, since it uses only information from the first derivative of the loss function, and not higher-order derivatives. First-order methods are currently the dominant way to train most machine learning models.

In fact, there are many first order methods, other than simple gradient descent. Most of them are designed to offer an advantage over other first-order methods via speed to find convergence, reliability, ease of use, etc. For a fairly in-depth exploration, see this blog post. To see the different optimization algorithms that are built-in to PyTorch, see the documentation here. In this list, you can see the optim.Adam that we used before, and the classic optim.SGD that we just tried.

We can experiment with any number of these. Here, I'll demonstrate experimenting with optim.Adagrad, but feel free to play around with any of them. To use optim.Adagrad we just need to change OPTIMIZER_CONSTRUCTOR, and set LEARNING_RATE back to 0.000001 for good measure:

LEARNING_RATE = 0.000001
OPTIMIZER_CONSTRUCTOR = optim.Adagrad

Running this we see:

TensorBoard 4

Well, this is disappointing: L did not seem to decrease at all during training. The problem is that the learning rate used by gradient descent is really an entirely different learning rate from the Adagrad one: conceptually they are similar, but are on entirely different scales numerically. So, we just need to try different learning rates for Adagrad now. Since the value of L stayed constant with this very small learning rate, we expect that we need to try much larger learning rates for Adagrad. In fact, by trying learning rates of 0.5, 1, and 5, we get these plots:

TensorBoard 5

Now this is looking like progress! For the first time we start to get a sense of the loss rapidly decreasing, and then slowing down substantially. In addition, the final value of a is now negative (which we know is correct) compared to previous runs which ended either positive or close to zero. However, by looking at the plots for a and b (and to a lesser degree L) we can see that we still haven't achieved convergence: a and b haven't stopped changing substantially at the end of training. So, time to increase the learning rate even more! Before doing so, I am going to delete the logs of previous runs, except for the Adagrad run with a learning rate of 5, so that we can read the plots more clearly:

# This will delete logs of all the runs
rm -rf runs/
# Or, you can delete a specific run, for example:
rm -rf runs/10,\ Adagrad/

By trying learning rates of 10 and 50, we finally achieve convergence:

TensorBoard 6

Qualitatively, this looks like convergence (with a learning rate of 10, and certainly with a learning rate of 50) since the progress that Adagrad is making on decreasing L (and adjusting a and b) has hit a brick wall: no matter how long we run Adagrad, we can't seem to get a loss function value lower than about \(3.9296 \cdot 10^4 \), and similarly for the values of a and b. We've finally trained our model completely.

Unfortunately, I don't know of an easy way to intuitively understand the differences between Adagrad, Adam, and other first-order methods. This blog post does give some mathematical analysis that explains what each algorithm tries to improve upon, and some reasoning for choosing an algorithm, but it can be tricky to apply to real problems. In general, you can always start with the simplest algorithm (gradient descent), and if it isn't converging quickly enough for you, then you can switch to a more sophisticated algorithm, such as Adagrad, Adam, or others.

Concluding Remarks

The experimental nature of this chapter should illustrate the practicalities of machine learning: a lot of cutting-edge machine learning currently involves running multiple experiments to try to find the best combination of hyperparameters. There isn't a golden rule for choosing the optimization algorithm and hyperparameters, but hopefully this chapter demonstrates how to alter the algorithm and hyperparameters in PyTorch and monitor convergence using TensorBoard. The most important takeaways are:

  1. Learning how to use TensorBoard
  2. Recognizing convergence
  3. Recognizing the symptoms of too small of a learning rate
  4. Recognizing the symptoms of too large of a learning rate

In future chapters I won't include the code specifically for TensorBoard (unless it is important to that chapter) since I don't want it to get in the way of actual models, but I would highly encourage you to insert your own TensorBoard summary code, and monitor plots of convergence in TensorBoard, since it is useful both educationally and practically.

Challenge Problems

  1. Experiment on your own with a few other built-in PyTorch optimization algorithms, and try different learning rates. If you prefer a more focused goal, try to beat my configuration of an Adagrad optimizer with a learning rate of 50, and converge faster. Also note that some optimization algorithms have additional hyperparameters other than the learning rate. See the PyTorch documentation for information about these.
  2. One other cause of slow convergence for the homicide rate linear regression is the somewhat extreme scaling of the problem. The \(y\) variable is a whole order of magnitude greater than the \(x\) variable, and this affects optimization. We will actually look at this problem specifically in chapter 2.4, but for now you can experiment on your own with one solution: instead of using the \(x\) and \(y\) data directly from the data set, modify them first to rescale them. A quick, hacky way is to modify the code that loads the data, so that \(x\) and \(y\) vary between 0 and 1:
# Load the data, and convert to 1x30 vectors
D = pd.read_csv("homicide.csv")
# 21 and 50 are the min and max of x_data
x_data = (torch.tensor(D.age.values, dtype=torch.float) - 21.0) / (50.0 - 21.0)
# 196 and 653 are the min and max of y_data
y_data = (torch.tensor(D.num_homicide_deaths.values, dtype=torch.float) - 196.0) / (653.0 - 196.0)

On your own, add this code and see if you can achieve convergence using only gradient descent. You can also see how quickly you can achieve convergence using a more advanced algorithm such as Adam.

Complete Code

The complete code including TensorBoard summaries, and using Adagrad is available on GitHub and directly below. Note that this code lacks the plotting of the data and the linear regression line:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter

### Hyperparameters ###

LEARNING_RATE = 50 # Much smaller training rate, to make sure the optimization is at least reliable.
NUM_ITERS = 20000 # Twice as many training iterations, just gives us more room to experiment later.
OPTIMIZER_CONSTRUCTOR = optim.Adagrad # This is the simplest optimization algorithm.

# Load the data
D = pd.read_csv("homicide.csv")
x_dataset = torch.tensor(D.age.values, dtype=torch.float)
y_dataset = torch.tensor(D.num_homicide_deaths.values, dtype=torch.float)


### Model definition ###

# First we define the trainable parameters a and b 
a = torch.randn(1, requires_grad=True) # requires_grad means it is trainable
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return a * x_input + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()


### TensorBoard Writer Setup ###
log_name = f"{LEARNING_RATE}, {OPTIMIZER_CONSTRUCTOR.__name__}"
writer = SummaryWriter(log_dir=f"runs/{log_name}")
print("To see tensorboard, run: tensorboard --logdir=runs/")


### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = OPTIMIZER_CONSTRUCTOR([a, b], lr=LEARNING_RATE)

# Main optimization loop
for t in range(NUM_ITERS):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to a and b.
    current_loss.backward()
    # Update a and b accordingly.
    optimizer.step()

    writer.add_scalar('a', a, global_step=t)
    writer.add_scalar('b', b, global_step=t)
    writer.add_scalar('L', current_loss, global_step=t)

writer.close()

Multi Variable Regression

In chapter 2.1 we learned the basics of PyTorch by creating a single variable linear regression model. In this chapter we expand this model to handle multiple variables. Note that less time will be spent explaining the basics of PyTorch: only new concepts will be explained, so feel free to refer to previous chapters as needed.

Motivation

Recall that a single variable linear regression model can learn to predict an output variable \(y\) under these conditions:

  1. There is only one input variable, \(x\)
  2. There is a linear relationship between \(y\) and \(x\), that is, \(y \approx ax + b\)

In practice, the above conditions are very limiting: if you have a simple data set then by all means you should try using single variable linear regression, but in most cases we have significantly more complex data. For example, consider using the following (abbreviated) data from the 1990 census to learn to predict housing prices. Note that each row represents a single housing district:

House Median Age Total Rooms Total Bedrooms ... Median House Value
41.0 880.0 129.0 ... 452600.0
21.0 7099.0 1106.0 ... 358500.0
52.0 1467.0 190.0 ... 352100.0
52.0 1274.0 235.0 ... 341300.0
52.0 1627.0 280.0 ... 342200.0
52.0 919.0 213.0 ... 269700.0
... ... ... ... ...

To predict the values of houses, we have at least 3 real-valued variables (age, number of rooms, number of dedrooms) that could potentially be useful. To analyze this sort of complex, real-world data we need to learn to handle multiple input variables.

One approach to handling multiple variables would be to reduce the number of input variables to only 1 variable, and then training a single variable linear regression model using that. In fact, an important area of research in machine learning (and one that will be covered later) called dimensionality reduction deals with this problem of reducing the number of variables. However, it's important to realize that the number of variables can only be reduced so far, and its extremely rare that you can reduce a data set to only 1 variable. For now you need to take this statement on faith, but in later chapters we will investigate it more thoroughly.

So, it seems that we will have to deal with training models that can handle multiple variables. In this chapter we learn how to allow multiple input variables in our linear regression model. Such a model is called multi variable linear regression, or just linear regression.

Theory

Most of the theory is similar to the theory for single variable linear regression, but we will need to augment and generalize it to handle multiple variables.

Data set format

Previously we defined our data set \(D\) as consisting of many example pairs of \(x\) and \(y\), where \(m\) is the number of examples: \[ D = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m))} \} \]

Note that I have changed the notation compared to before. The notation \(x^{(i)}\) refers to the \(i\)'th \(x\) training example, it does NOT mean \(x\) to the \(i\)'th power, which would be written as \(x^i\). I promise the notation change will be useful shortly.

Alternatively, we can write \(D\) as 2 vectors of shape 1 x \(m\): \[ D_x = \begin{bmatrix} x^{(1)}, x^{(2)}, \dots, x^{(m)} \end{bmatrix} \\ D_y = \begin{bmatrix} y^{(1)}, y^{(2)}, \dots, y^{(m)} \end{bmatrix} \]

But now, we need each \(x^{(i)}\) example to contain multiple numbers, one for each input variable. Let \(n\) be the number of input variables. Then the easiest way to write this is to let each \(x^{(i)}\) be a vector of shape \(n\) x 1. That is, \[ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_j \\ \vdots \\ x^{(i)}_n \end{bmatrix} \] Note that the notation \(x^{(i)}_j\) denotes the \(j\)'th input variable in the \(i\)'th example data.

Since each \(x^{(i)}\) has \(n\) rows, and \(D_x\) has \(m\) columns, each of which is an \(x^{(i)}\), we can write \(D_x\) as a massive \(n \times m\) matrix: \[ D_x = \begin{bmatrix} x^{(1)}, x^{(2)}, \dots, x^{(m)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(2)}_1 & \dots & x^{(i)}_1 & \dots & x^{(m)}_1 \\ x^{(1)}_2 & x^{(2)}_2 & \dots & x^{(i)}_2 & \dots & x^{(m)}_2 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x^{(1)}_j & x^{(2)}_j & \dots & x^{(i)}_j & \dots & x^{(m)}_j \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x^{(1)}_n & x^{(2)}_n & \dots & x^{(i)}_n & \dots & x^{(m)}_n \\ \end{bmatrix} \] So, each column of \(D_x\) represents a single input data example. We don't need to change the 1 x \(m\) vector \(D_y\), since we still only have 1 output variable.

Model concept

So, we now have an input data matrix \(D_x\) with each column vector representing a single input data example, and we have the corresponding \(D_y\) row vector, each entry of which is an output data example. How do we define a model which can linearly estimate the output \(y'^{(i)}\) given the input data vector \(x^{(i)}\)? Let's build it up from simple concepts, and build towards more complex linear algebra.

Since we want \(y'^{(i)}\) to depend linearly on each \(x^{(i)}_j\) for \(1 \leq j \leq n\), we can write: \[ y'^{(i)} = a_1 x^{(i)}_1 + a_2 x^{(i)}_2 + \cdots + a_j x^{(i)}_j + \cdots + a_n x^{(i)}_n + b \]

This is fine mathematically, but it's not very general. Suppose \(n = 100\): then we would have to literally write out 100 terms in our PyTorch code. We can generalize this using linear algebra. Let \(A\) be a row vector of shape 1 x \(n\), containing each \(a_j\): \[ A = \begin{bmatrix} a_1, a_2, \cdots, a_j, \cdots, a_n \end{bmatrix} \]

Now, let's see what happens if we compute \(A x^{(i)}\), as matrix multiplication. Note that \(A\) has shape 1 x \(n\) and \(x^{(i)}\) has shape \(n\) x 1. This is perfect! When performing matrix multiplication, the inner dimensions (in this case \(n\) and \(n\)) have to match, and the outer dimensions (in this case \(1\) and \(1\)) determine the output shape of the multiplication. So \(A x^{(i)}\) will have shape 1 x 1, or in other words, just a single number, in fact it is exactly \(y'^{(i)}\). How does this matrix multiplication exactly work? I'll refer you to this video by Khan Academy, and explain it briefly in this case. Here, it is easier since \(A\) is a row vector, and \(x^{(i)}\) is a column vector. We simply multiply each corresponding entry, and add it all up: \[ A x^{(i)} + b = \begin{bmatrix} a_1, a_2, \cdots, a_j, \cdots, a_n \end{bmatrix} \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_j \\ \vdots \\ x^{(i)}_n \end{bmatrix} + b = a_1 x^{(i)}_1 + a_2 x^{(i)}_2 + \cdots + a_j x^{(i)}_j + \cdots + a_n x^{(i)}_n + b = y'^{(i)} \]

This matrix equation, \(y'(x, A, b) = Ax + b\) is exactly what we want as our model. As one final note, recall that in the actual implementation, we don't want \(x\) and \(y'\) to represent just one input data and predicted output, we want them to represent several. Since \(x\) is a column vector, the natural way to represent multiple input data points is with a matrix, very similar to the matrix \(D_x\), just not necessarily with all the columns of \(D_x\), and \(y'\) should be a row vector. Specifically, \(A\) has shape 1 x \(n\), \(x\) has shape \(n\) x \(m\), and \(y\) has shape 1 x \(m\), where \(m\) is the number of data points.

Now defining the loss function is pretty much the same as before, just using the new model: \[ L(A, b) = \sum_{i=1}^m (y'(x^{(i)}, A, b) - y^{(i)})^2 = \sum_{i=1}^m (A x^{(i)} + b - y^{(i)})^2 \]

To minimize the loss function, we use the same process as before, gradient descent. However, previously the gradient descent was altering 2 variables (\(a\) and \(b\)) so as to minimize the loss function, and so we could plot the loss function and gradient descent progress in terms of \(a\) and \(b\). However, now the optimization needs to alter many more variables, since \(A\) actually contains \(n\) variables, the gradient descent must be performed in \(n+1\) dimensional space, and we don't have an easy way to visualize this.

With the more general linear algebra formulation of linear regression under our belts, let's move on to actually coding stuff.

Implementation

As before, we need to: import data, define the model, define the loss function, run gradient descent, and finally make predictions. Many steps will be similar to the single variable case, but for completeness I will walk through them briefly.

For building and testing the implementation we will use a synthetic data set consisting of \(n=2\) input variables. You can download the synthetic data set here. By synthetic, I mean that I purposefully created a very nicely behaved data set so that we can practice implementing multi variable linear regression, and verify that we converged to the right answer. In fact, the synthetic data is generated as \(y = 2x_1 + 1.3x_2 + 4 + \varepsilon \) where \(\varepsilon\) is random noise. If we implement multi variable linear regression correctly, then we should obtain approximately \(A = \begin{bmatrix} 2, 1.3 \end{bmatrix}, b = 4\). This plot illustrates what the data looks like in 3 dimensions, essentially a plane in 3 dimensions with some random fluctuations:

scatter

Importing the data

As explained above, the input data set can be organized as an \(n \times m\) matrix. Since we will load the entire data set (input and output) from a single CSV file, and we have 2 input variables, the CSV file will contain 3 columns: the first 2 are the input variables, and the last one is the output variable. So, first we load the CSV file into an \(m\) x 3 matrix, and then separate the first 2 columns from the last:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Load the data

# First we load the entire CSV file into an m x 3
D = torch.tensor(pd.read_csv("linreg-multi-synthetic-2.csv", header=None).values, dtype=torch.float)

# We extract all rows and the first 2 columns, and then transpose it
x_dataset = D[:, 0:2].t()

# We extract all rows and the last column, and transpose it
y_dataset = D[:, 2].t()

# And make a convenient variable to remember the number of input columns
n = 2

The syntax D[:, 0:2] might be new, particularly if you haven't worked with Python much before. In the single variable implementation we used Panda's functionality to access the columns by column name. This is a great approach, but sometimes you might need to be more flexible in how you access columns of data.

Note: The basic syntax for subscripting a matrix is: D[3, 6] (for example), which refers to the row at index 3 and the column at index 6 in the matrix D. Note that in Python the row and column indices start at 0! This means that D[0, 0] refers to the top-left entry of matrix D. If you are coming from a pure math background, or have used MATLAB or R before, it is a common error to assume the indices start at 1.

Now for slicing, the : character is used to indicate a range. If it is used by itself, it indicates the entire range of rows / columns. For example, D[:, 42] refers to all rows of D, and the column at index 42. If it is used with indices, then i:j indicates the range of rows / columns at indices i, i+1, ..., j-1, but not including j.

So, D[:, 0:2] means to read the values in D at all rows and at columns with index 0 and 1 (the entire first 2 columns, i.e. the input data columns). Likewise, D[:, 2] means to read the values in D at all rows and at the column of index 2 (the entire last column, i.e. the output data column).

This matrix subscripting and slicing is almost what we want, but not quite. The problem is that D[:, 0:2], which contains our \(D_x\) data, is a matrix of shape \(m \times n\), but earlier we decided that we wanted \(D_x\) to be an \(n \times m\) matrix, so we need to flip it. To do so, we use the transpose of the matrix. Mathematically we write the transpose of a matrix \(A\) as \(A^T\), and in Python we can compute it using A.t(). Essentially, the transpose of a matrix simply flips it along the diagonal, as shown in this animation:

Matrix transpose.gif
By LucasVB - Link

So, D[:, 0:2].t() is a matrix of shape \(n \times m\), and is our correct data input matrix \(D_x\). We save this matrix to the variable x_dataset. Likewise, we also transpose D[:, 2] to correctly compute \(D_y\), and save it in y_dataset.

At this point we have our \(m \times n\) input data matrix x_dataset and our \(m \times 1\) output vector y_dataset loaded. In addition, we conveniently have the number of columns stored in n, so now we can start defining our model.

Defining the model

As shown above, we want our model parameters to consist of a matrix \(A\) of size \(1 \times n\) and a single number \(b\). Then, we define: \[ y'(x, A, b) = Ax + b \]

In order to implement this, we define the trainable variables, the output prediction, and the loss function:

### Model definition ###

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return A.mm(x_input) + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

There are two differences with this code and the previous single variable regression code. First, we we define A, we give it size of (1, n) rather than 1. This means that now A is a matrix of size 1 x n (i.e. a row vector) rather than a scalar as before. Second, when we define the model, we can no longer write A * x because * means scalar multiplication. Instead, we use A.mm(x_input) to indicate matrix multiplication.

Training the model

At this point, we have our model and loss function all setup. The remaining code to train the model is extremely similar to the code for single variable regression, so I'll simply display it here, and then explain the few differences:

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([A, b], lr=0.1)

# Main optimization loop
for t in range(2000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to A and b.
    current_loss.backward()
    # Update A and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, A = {A.detach().numpy()}, b = {b.item()}")

First, we have a different learning rate than the learning rate used in single variable regression. Even though the training algorithm is the same, since this is a different problem than single variable regression, we need find a good learning rate specific to this problem. A great way to do this for your own problems is using TensorBoard, as explained in the chapter Optimization Convergence.

Besides this, the only other conceptual difference is that at each step of the optimizer we are modifying the entire vector A simultaneously (in addition to b), rather than just a single number. However, PyTorch abstracts this away for us, and conceptually we just need to know that we are training the variable A.

The final print statements should output something close to:

t = 1994, loss = 1447991.625, A = [[2.0054889 1.3021096]], b = 3.950239419937134
t = 1995, loss = 1447991.625, A = [[2.0054889 1.3021096]], b = 3.950239419937134
t = 1996, loss = 1447991.625, A = [[2.0054889 1.3021096]], b = 3.950239419937134
t = 1997, loss = 1447991.625, A = [[2.0054889 1.3021096]], b = 3.950239419937134
t = 1998, loss = 1447991.625, A = [[2.0054889 1.3021096]], b = 3.950239419937134
t = 1999, loss = 1447991.625, A = [[2.0054889 1.3021096]], b = 3.950239419937134

At this point we have converged to our approximate solution of \(A \approx \begin{bmatrix} 2.005, 1.302 \end{bmatrix}, b \approx 3.95\). Note that this is not exactly the same as the expected answer of \(A = \begin{bmatrix} 2, 1.3 \end{bmatrix}, b \approx 4\), primarily because some random noise was added to each point in the data set.

The model is fully trained, so now given a new input \(x\) we could now predict the output \(y' = Ax + b\), using the learned information from all input variables.

Concluding Remarks

Linear regression with multiple variables is only slightly different in essence from single variable linear regression. The main difference is abstracting the linear operation \(ax\) where \(a\) and \(x\) are single numbers to the linear operation \(Ax\), where now \(A\) is a matrix, \(x\) is a vector. In addition, at the implementation level we also have to deal with loading data in a more sophisticated manner, but otherwise the code is mostly the same. In later chapters we will use this abstraction we have built to define even more powerful models.

Challenge Problems

So far this chapter has used a synthetic data set, linreg-multi-synthetic-2.csv, for easy demonstration. The exercises are primarily concerned with getting practice at applying this model to real-world data. Note that in real-world data not all columns are useful, and some might not have a linear relationship with the output variable. Including these unhelpful columns in your model might decrease the accuracy of your model. You should try plotting various columns vs. the output column to determine which seem most helpful in predicting the output, and then only include these useful columns as your input.

In addition, many data sets will have so called messy data, which require you to do some manipulation in Python to make sure the data is imported cleanly and properly. For example, some rows might containg missing data: for these your code can not crash or incorrectly import the data. Instead, you need to adopt a strategy to still import the data as best as you can: for example, you can simply ignore any rows that have incomplete data.

Note that we have not discussed how to rigorously evaluate how good a model is yet. For now you can use the value of the loss function, along with some intuition and creating plots. Evaluation will be discussed more in chapter 2.7.

  1. Download this red wine quality data set, and try to predict the quality of the wine (last column) from the physicochemical input data (other columns).
  2. Download this car MPG data set, and try to predict the MPG (first column) based on some of the other columns.
  3. Download this California 1990 Housing Value data set, and try to predict the house values based on various factors.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Load the data

# First we load the entire CSV file into an m x 3
D = torch.tensor(pd.read_csv("linreg-multi-synthetic-2.csv", header=None).values, dtype=torch.float)

# We extract all rows and the first 2 columns, and then transpose it
x_dataset = D[:, 0:2].t()

# We extract all rows and the last column, and transpose it
y_dataset = D[:, 2].t()

# And make a convenient variable to remember the number of input columns
n = 2


### Model definition ###

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return A.mm(x_input) + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([A, b], lr=0.1)

# Main optimization loop
for t in range(2000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to A and b.
    current_loss.backward()
    # Update A and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, A = {A.detach().numpy()}, b = {b.item()}")

Feature Scaling

In chapters 2.1, 2.2, 2.3 we used the gradient descent algorithm (or variants of) to minimize a loss function, and thus achieve a line of best fit. However, it turns out that the optimization in chapter 2.3 was much, much slower than it needed to be. While this isn’t a big problem for these fairly simple linear regression models that we can train in seconds anyways, this inefficiency becomes a much more drastic problem when dealing with large data sets and models.

Example of the Problem

First, let’s look at a concrete example of the problem, by again considering a synthetic data set. Like in chapter 2.3 I generated another simple synthetic data set consisting of 2 independent variables \(x_1\) and \(x_2\), and one dependent variable \(y = 2x_1 + 0.013x_2\). However, note that the range for \(x_1\) is 0 to 10, but the range for \(x_2\) is 0 to 1000. Let's call this data set \(D\). A few sample data points look like this:

\(x_1\) \(x_2\) \(y\)
7.36 1000 34.25
9.47 0 19.24
0.52 315.78 3.50
1.57 315.78 11.02
6.31 263.15 13.93
1.57 526.31 10.21
0.52 105.26 3.41
4.21 842.10 16.27
2.63 894.73 19.04
3.68 210.52 4.60
... ... ...

For simplification I have excluded a constant term from this synthetic data set, and when training models I "cheated" by not training the constant term \(b\) and just setting it to 0, so as to simplify visualization. The model that I used was simply:

def model(x_input):
    y_predicted = A.mm(x_input) # A is a 1x2 matrix

Since this is just a simple application of the ideas from chapter 2.3, one would expect this to work fairly easily. However, when using an optim.SGD optimizer, training goes rather poorly. Here is a plot showing the training progress of A[0] and the loss function side-by-side:

Training progress without scaling

This optimization was performed with a learning rate of 0.0000025, which is about the largest before it would diverge. The loss function quickly decreases at first, but then quickly stalls, and decreases quite slowly. Meanwhile, \(A_0\) is increasing towards the expected value of approximately 2.0, but very slowly. Within 5000 iterations this model fails to finish training.

Now let's do something rather strange: take the data set \(D\), and create a new data set \(D'\) by dividing all the \(x_2\) values by 100. The resulting data set \(D'\) looks roughly like this:

\(x_1'\) \(x_2'\) \(y'\)
7.36 10 34.25
9.47 0 19.24
0.52 3.1578 3.50
1.57 3.1578 11.02
6.31 2.6315 13.93
1.57 5.2631 10.21
0.52 1.0526 3.41
4.21 8.4210 16.27
2.63 8.9473 19.04
3.68 2.1052 4.60
... ... ...

A crucial note is that while \(D'\) is technically different from \(D\), it contains exactly the same information: one can convert between them freely, by dividing or multiplying the second column by 100. In fact, since this transformation is linear, and we are using a linear model, we can train our model on \(D'\) instead. We would just expect to obtain a value of 1.3 for A[1] rather than 0.013. So let's give it a try!

The first interesting observation is that we can use a much larger learning rate. The largest learning rate we could use with \(D\) was 0.0000025, but with \(D'\) we can use a learning rate of 0.01. And when we plot A[0] and the loss function for both \(D\) and \(D'\) we see something pretty crazy:

Training progress with scaling

While training on \(D\) wasn't even close to done after 5000 iterations, training on \(D'\) seems to have completed nearly instantly. If we zoom in on the first 60 iterations, we can see the training more clearly:

Training progress with scaling zoom

So this incredibly simple data set transformation has changed a problem that was untrainable within 5000 iterations to one that can be trained practically instantly with 50 iterations. What is this black magic, and how does it work?

Visualizing Gradient Descent with Level Sets

One of the best ways to gain insight in machine learning is by visualization. As seen in chapter 2.1 we can visualize loss functions using plots. Since we have 2 parameters A[0] and A[1] of the loss function, it would be a 3D plot. We used this for visualization in chapter 2.1, but frankly it's a bit messy looking. Instead, we will use level sets (also called contour plots), which use lines to indicate where the loss function has a constant value. An example is easiest, so here is the contour plot for the loss function for \(D'\), the one that converges quickly:

Contour plot of D'

Each ellipse border is where the loss function has a particular constant value (the innermost ones are labeled), and the red X marks the spot of the optimal values of A[0] and A[1]. By convention the particular constant values of the loss function are evenly spaced, which means that contour lines that are closer together indicate a "steeper" slope. In this plot, the center near the X is quite shallow, while far away is pretty steep. In addition, one diagonal axis of the ellipses is steeper than the other diagonal axis.

We can also plot how A[0] and A[1] evolve during training on the contour plot, to get a feel for how gradient descent is working:

Contour plot of D' with training

Here, we can see that the initial values for A[0] and A[1] start pretty far off, but gradient descent quickly zeros in towards the X. As a side note, the line connecting each pair of dots is perpendicular to the line of the level set: see if you can figure out why.

So if that is what the contour plot looks like for the "good" case of \(D'\), how does it look for \(D\)? Well, it looks rather strange:

Contour plot of D

Don't be fooled: like the previous plot the level sets also form ellipses, but they are so stretched that they are nearly straight lines at this scale. This means that vertically (A[0] is constant and we vary A[1]) there is substantial gradient, but horizontally (A[1] is constant and we vary A[0]) there is practically no slope. Put another way, gradient descent only knows to vary A[1], and (almost) doesn't vary A[0]. We can test this hypothesis by plotting how gradient descent updates A[0] and A[1], like above. Since gradient descent makes such little progress though, we have to zoom in a lot to see what is going on:

Contour plot of D

We can clearly see that gradient descent applies large updates to A[1] (a bit too large, a smaller learning rate would have been a bit better) due to the large gradient in the A[1] direction. But due to the (comparatively) tiny gradient in the A[0] direction very small updates are done to A[0]. Gradient descent quickly converges on the optimal value of A[1], but is very very far away from finding the optimal value of A[0].

Let's take a quick look at what is going on mathematically to see why this happens. The model we are using is: \[ y'(x, A) = Ax = a_1 x_1 + a_2 x_2 \] Here, \(a_1\) is in the role of A[0] and \(a_2\) is A[1]. We substitute this into the loss function to get: \[ L(a_1, a_2) = \sum_{i=1}^m (a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})^2 \] Now if we differentiate \(L\) in the direction of \(a_1\) and \(a_2\) separately, we get: \[ \frac{\partial L}{\partial a_1} = \sum_{i=1}^m 2(a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})x_1^{(i)} \\ \frac{\partial L}{\partial a_2} = \sum_{i=1}^m 2(a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})x_2^{(i)} \] Here we can see the problem: the inner terms of the derivatives are the same, except one is multiplied by \(x_1^{(i)}\) and the other by \(x_2^{(i)}\). If \(x_2\) is on average 100 times bigger than \(x_1\) (which it is in the original data set), then we would expect \(\frac{\partial L}{\partial a_2}\) to be roughly 100 times bigger than \(\frac{\partial L}{\partial a_1}\). It isn't exactly 100 times larger, but with any reasonable data set it should be close. Since the derivatives in the directions of \(a_1\) and \(a_2\) are scaled completely differently, gradient descent fails to update both of them adequately.

The solution is simple: we need to rescale the input features before training. This is exactly what happened when we mysteriously divided by 100: we rescaled \(x_2\) to be comparable to \(x_1\). But we should work out a more methodical way of rescaling, rather than randomly dividing by 100.

Rescaling Features

This wasn't explored above, but there are really two ways that we potentially need to rescale features. Consider an example where \(x_1\) ranges between -1 and 1, but \(x_2\) ranges between 99 and 101: both of these features have (at least approximately) the same standard deviation, but \(x_2\) has a much larger mean. On the other hand, consider an example where \(x_1\) still ranges between -1 and 1, but \(x_2\) ranges between -100 and 100. This time, they have the same mean, but \(x_2\) has a much larger standard deviation. Both of these situations can make gradient descent and related algorithms slower and less reliable. So, our goal is to ensure that all features have the same mean and standard deviation.

Note: There are other methods to measure how different features are, and then subsequently rescale them. For a look at more of them, feel free to consult the Wikipedia article. However, after learning the technique presented here, any other technique is fairly similar and easy to implement if needed.

Without digressing too far into statistics, let's quickly review how to calculate the mean \(\mu\) and standard deviation \(\sigma\). Suppose we have already read in our data set in Python into a Numpy vector / matrix, and have all the values for the feature \(x_j\), for each \(j\). Mathematically, the mean for feature \(x_j\) is just the average: \[ \mu_j = \frac{\sum_{i=1}^m x_j^{(i)}}{m} \] In PyTorch there is a convenient .mean() method we will use.

The standard deviation \(\sigma\) is a bit more tricky. We measure how far each point \(x_j^{(i)}\) is from the mean \(\mu_j\), square this, then take the mean of all of this, and finally square root it: \[ \sigma_j = \sqrt{\frac{\sum_{i=1}^m (x_j^{(i)} - \mu_j)^2 }{m}} \] Again, PyTorch provides a convenient .std() method.

Note: Statistics nerds might point out that in the above equation we should divide by \(m - 1\) instead of \(m\) to obtain an unbiased estimate of the standard deviation. This might well be true, but in this usage it does not matter since we are only using this to rescale features relative to each other, and not make a rigorous statistical inference. To be more precise, doing so would involve multiplying each scaled feature by only a constant factor, and will not change any of their standard deviations or means relative to each other.

Once we have every mean \(\mu_j\) and standard deviation \(\sigma_j\), rescaling is easy: we simply rescale every feature like so:

\[ \begin{equation} \label{eq:scaling} x_j' = \frac{x_j - \mu_j}{\sigma_j} \end{equation} \]

This will force every feature to have a mean of 0 and a standard deviation of 1, and thus be scaled well relative to each other.

Note: Be careful to make sure to perform the rescaling at both training time and prediction time. That is, we first have to perform the rescaling on the whole training data set, and then train the model so as to achieve good training performance. Once the model is trained and we have a new, never before seen input \(x\), we also need to rescale its features to \(x'\) because the trained model only understands inputs that have already been rescaled (because we trained it that way).

Implementation and Experiments

Feature scaling can be applied to just about any multi-variable model. Since the only multi-variable model we have seen so far is multi-variable linear regression, we'll look at implementing feature scaling in that context, but the ideas and the code are pretty much the same for other models. First let's just copy-paste the original multi-variable linear regression code, and modify it to load the new synthetic data set:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Load the data

# First we load the entire CSV file into an m x 3
D = torch.tensor(pd.read_csv("linreg-scaling-synthetic.csv", header=None).values, dtype=torch.float)

# We extract all rows and the first 2 columns, and then transpose it
x_dataset = D[:, 0:2].t()

# We extract all rows and the last column, and transpose it
y_dataset = D[:, 2].t()

# And make a convenient variable to remember the number of input columns
n = 2


### Model definition ###

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    return A.mm(x_input) + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([A, b], lr=0.1)

# Main optimization loop
for t in range(2000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to A and b.
    current_loss.backward()
    # Update A and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, A = {A.detach().numpy()}, b = {b.item()}")

If you run this code right now, you might see output like the following:

...
t = 1995, loss = 0.505296, A = [[1.9920335  0.01292674]], b = 0.089939
t = 1996, loss = 0.5042, A = [[1.9920422  0.01292683]], b = 0.0898404
t = 1997, loss = 0.5031, A = [[1.9920509  0.01292691]], b = 0.0897419
t = 1998, loss = 0.501984, A = [[1.9920596  0.01292699]], b = 0.0896435
t = 1999, loss = 0.50089, A = [[1.9920683  0.01292707]], b = 0.0895451

Note that if you run this code multiple times you will get different results each time due to the random initialization. The synthetic data was generated with the equation \(y = 2x_1 + 0.013x_2 + 0\). So after 2000 iterations of training we are getting close-ish to the correct values, but it's not fully trained. So let's implement feature scaling to fix this.

The first step is to compute the mean and standard deviation for each feature in the training data set. Add the following after x_dataset is loaded:

means = x_dataset.mean(1, keepdim=True)
deviations = x_dataset.std(1, keepdim=True)

The argument of 1 means that we compute the mean for each feature, averaged over all data points. That is, x_dataset is a \(2 \times 400\) matrix, and means is a \(2 \times 1\) matrix (vector). If we had used 0, means would be a \(1 \times 400\) matrix, which is not what we want. We also have to use keepdims=True to tell PyTorch to keep means as an actual \(2 \times 1\) matrix rather than squishing it into an array of length 2.

Now that we know the means and standard deviations of each feature, we have to use them to transform the inputs via Equation \((\ref{eq:scaling})\). We have two options: we could implement the transformation directly by transforming x_dataset before training, or we could include the transformation within the PyTorch model computations. But since we need to do the transformation again when we want to predict output given new inputs, including it in the PyTorch computations will be a bit more convenient.

When we define the model, we want to first transform the input according to Equation \((\ref{eq:scaling})\):

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    x_transformed = (x_input - means) / deviations
    return A.mm(x_transformed) + b

One important concept is that when this line of code runs, means and deviations have already been computed, so to PyTorch they are constants. This means that PyTorch will not compute gradients for them.

Also, note that since x_input and means are compatibly sized matrices, the subtraction (and division) will be done separately for each feature, automatically. So while Equation \((\ref{eq:scaling})\) technically says how to scale one feature individually, this single line of code scales all the features.

Note that when we want to use the model, either for training or later prediction, we should feed unscaled values to the model, since the model already performs the scaling.

That's all that we need to implement for feature scaling, it's really pretty easy. If we run this code now we should see:

...
t = 1995, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1996, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1997, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1998, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1999, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5

The fact that none of the trained weights are updating anymore and that the loss function is very small is a good sign that we have achieved convergence. But the weights are not the values we expect... shouldn't we get A = [[2, 0.013]], b = 0, since that is the equation that generated the synthetic data? Actually, the weights we got are correct because the model is now being trained on the rescaled data set, so we get different weights at the end. See Challenge Problem 3 below to explore this more.

Concluding Remarks

We've seen how to implement feature scaling for a simple multi-variable linear regression model. While this is a fairly simple model, the principles are basically the same for applying feature scaling to more complex models (which we will see very soon). In fact, the code is pretty much identical: just compute the means and standard deviations, apply the formula to x_input to compute x_transformed, and anywhere you use x_input in the model, just use x_transformed instead.

Whether or not feature scaling helps is dependent on the problem and the model. If you have having trouble getting your model to converge, you can always implement feature scaling and see how it affects the training. While it wasn't used in this chapter, using TensorBoard (see chapter 2.2) is a great way to run experiments to compare training with and without feature scaling.

Challenge Problems

  1. Coding: Take one of the challenge problems from the previous chapter, and implement it with and without feature scaling, and then compare how training performs.
  2. Coding: Add TensorBoard support to the code from this chapter, and use it to explore for yourself the effect feature scaling has. Also, you can compare with different optimization algorithms.
  3. Theory: One problem with feature scaling is that we learn different model parameters. In this chapter, the original data set was generated with \(y = 2x_1 + 0.013x_2 + 0\), but the parameters that the model learned were \(a_1' = 6.0697703, a_2' = 3.9453504, b' = 16.5\). Note that I have called these learned parameters \(a_1'\) rather than \(a_1\) to indicate that these learned parameters are for the rescaled data.

    One important use of linear regression is to explore and understand a data set, not just predict outputs. After doing a regression, one can understand through the learned weights how severely each feature affects the output. For example, if we have two features \(A\) and \(B\) which are used to predict rates of cancer, and the learned weight for \(A\) is much larger than the learned weight for \(B\), this indicates that occurrence of \(A\) is more correlated with cancer than occurrence of \(B\) is, which is interesting in its own right. Unfortunately, performing feature scaling destroys this: since we have rescaled the training data, the weight for \(A\) (\(a_1'\)) has lost its relevance to the values of \(A\) in the real world. But all is not lost, since we can actually recover the "not-rescaled" parameters from the learned parameters, which we will derive now.

    First, suppose that we have learned the "rescaled" parameters \(a_1', a_2', b'\). That is, we predict the output \(y\) to be: \[\begin{equation}\label{eq:start}y = a_1' x_1' + a_2' x_2' + b'\end{equation}\]where \(x_1', x_2'\) are the rescaled features. Now, we want a model that uses some (yet unknown) weights \(a_1, a_2, b\) to predict the same output \(y\), but using the not-rescaled features \(x_1, x_2\). That is, we want to find \(a_1, a_2, b\) such that this is true:\[\begin{equation}\label{eq:goal}y = a_1 x_1 + a_2 x_2 + b\end{equation}\]To go about doing this, substitute for \(x_1'\) and \(x_2'\) using Equation \((\ref{eq:scaling})\) into Equation \((\ref{eq:start})\), and then compare it to Equation \((\ref{eq:goal})\) to obtain 3 equations: one to relate each scaled and not-scaled parameter. Finally, you can use these equations to be able to compute the "not-scaled" parameters in terms of the learned parameters. You can check your work by computing the not-scaled parameters in terms of the learned parameters for the synthetic data set, and verify that they match the expected values.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim

### Load the data

# First we load the entire CSV file into an m x 3
D = torch.tensor(pd.read_csv("linreg-scaling-synthetic.csv", header=None).values, dtype=torch.float)

# We extract all rows and the first 2 columns, and then transpose it
x_dataset = D[:, 0:2].t()

# We extract all rows and the last column, and transpose it
y_dataset = D[:, 2].t()

# And make a convenient variable to remember the number of input columns
n = 2


### Feature Scaling computations

# Pre-compute the means and standard deviations of independent variables
means = x_dataset.mean(1, keepdim=True)
deviations = x_dataset.std(1, keepdim=True)


### Model definition ###

# First we define the trainable parameters A and b 
A = torch.randn((1, n), requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Then we define the prediction model
def model(x_input):
    x_transformed = (x_input - means) / deviations
    return A.mm(x_transformed) + b


### Loss function definition ###

def loss(y_predicted, y_target):
    return ((y_predicted - y_target)**2).sum()

### Training the model ###

# Setup the optimizer object, so it optimizes a and b.
optimizer = optim.Adam([A, b], lr=0.1)

# Main optimization loop
for t in range(2000):
    # Set the gradients to 0.
    optimizer.zero_grad()
    # Compute the current predicted y's from x_dataset
    y_predicted = model(x_dataset)
    # See how far off the prediction is
    current_loss = loss(y_predicted, y_dataset)
    # Compute the gradient of the loss with respect to A and b.
    current_loss.backward()
    # Update A and b accordingly.
    optimizer.step()
    print(f"t = {t}, loss = {current_loss}, A = {A.detach().numpy()}, b = {b.item()}")

Nonlinear Regression: content coming soon!

Regularization: content coming soon!

Evaluation: content coming soon!

Why we square errors: content coming soon!