Introduction

Welcome to Machine Learning with TensorFlow, an easy to follow guide / blog about understanding machine learning concepts and their implementation in Python and TensorFlow. My goal is to create an intuitive explanation of machine learning methods, paired with practical knowledge of implementation with TensorFlow. This guide does assume some small working knowledge of programming, but Python specific syntax will be briefly introduced as needed. In addition, some mathematics background will be useful, in particular linear algebra and calculus, but is not needed to follow the intuition and implementation. Resources to review relevant math concepts will be provided as required. Finally, this guide is open source on GitHub, and I would be happy to consider edits, both tiny and large. Feel free to open pull requests with changes to improve this guide!

What exactly is machine learning?

Though the exact definition of machine learning is up for debate, we can roughly define machine learning as "a field of computer science that gives computers the ability to learn without being explicitly programmed" (Wikipedia). Traditionally, programming has been used to great effect to solve problems which have explicit steps. For example, consider the problem of determining if a number \(n\) is prime. We can easily determine if \(n\) is prime using fairly simple steps: check if \(n\) is divisible by \(2\), check if \(n\) is divisible by \(3\), \(\cdots\), check if \(n\) is divisible by \(n-1\). If \(n\) is divisible by any of these, then \(n\) is not prime, and otherwise it is prime. With traditional programming we then need to write these specific steps (an algorithm) in a programming language that our computer can interpret, and we have a working program which can classify numbers as prime much faster than a human ever could.

However, some tasks seem difficult or impossible to solve using specific steps like above. Rather than classifying prime numbers, let us instead try to classify images of dogs as either a Golden Retriever or Poodle.

Golden Retriever Poodle
Golden Retriever Image Poodle Image

Most people would easily be able to differentiate the two breeds above, and accurately dogs of those breeds. But, can you explain how you came to this conclusion, in very specific steps? Perhaps, one might say that the hair of the Golden Retriever seems less curly, but then one would need to explain how to determine that the hair is less curly. There does not seem to exist a natural way to describe the curliness of hair in specific steps. Maybe, the shape of facial features such as nose shape are useful for determining dog breed, but it still seems nearly impossible to describe a precise algorithm that uses this data.

Qualitatively, humans do not seem to comprehend the breed of a dog by performing some precise algorithm, and this suggests that a computer program to solve this problem should not either. By using machine learning, we can use data to help the computer learn to recognize a dog breed, objects in general, human speech, etc., without explicitly programming it.

So what can machine learning accomplish?

Lots! Machine learning is finding applications nearly everywhere, and it's still growing. Machine learning is full of constantly evolving research with new and exciting results, such as:

  1. Image Synthesis with Semantic Manipulation of Images:
  2. Natural Language Translation
  3. Magically Enhancing Image Resolution

For more, there is no shortage of inspiration and useful information on Reddit.

So, what is TensorFlow, and why do we need it?

Without diving into the details of various machine learning methods, I'll generally state that there could be a fairly significant amount of math involved. Not necessarily super difficult math, but some pretty cumbersome math. Roughly, TensorFlow takes care of the calculus grunt work for us: instead of spending a ton of time finding messy derivatives and many lines of code to compute them, TensorFlow lets us focus on the fairly elegant work of defining our machine learning methods, and the more important math necessary to obtain intuition.

So, without further ado, we can get started by installing Python, TensorFlow, and a few other useful tools. Just follow the appropriate directions for your operating system.

macOS Setup

Without a doubt, Python 3 is the future of Python, and thus we will use Python 3 with TensorFlow. While I summarize the instructions to install TensorFlow with Python 3 here, the official TensorFlow documentation should always provide up-to-date and comprehensive instructions for installation as well. If you run into any installation problems, please file a bug here.

Installing Python 3

macOS does not come installed with Python 3 (only Python 2.7), so we need to install it ourselves. The first step is to open terminal (Terminal.app on your Mac), and install the Homebrew package manager for macOS, if you haven't already installed it before. Just paste the appropriate command into terminal, and follow Homebrew's installation prompts:

  • If you use bash, ksh, zsh, csh, etc.
# For bash, ksh, zsh, csh, and other shells:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  • If you are awesome and use fish:
curl -fsSL "https://raw.githubusercontent.com/Homebrew/install/master/install" | /usr/bin/ruby

Now that Homebrew is installed, we need to install Python 3:

brew install python3

You can test that python3 installed correctly by running the command python3. You should see output similar to:

~ $ python3
Python 3.6.3 (default, Nov 20 2017, 14:17:35)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The Python version should be 3.x.y instead of 2.x.y. Once this is confirmed, quit Python by entering the code exit().

Installing TensorFlow and friends

Installing Python 3 (python3) also installs the Python 3 package manager (pip3). We can use pip3 to install a Python utility called virtualenv (stands for virtual environment). Virtualenv is an extremely easy and convenient way to install Python packages in a local and contained manner. The recommended way to install TensorFlow is by using virtualenv, since this ensures that the installation will be self-contained, and will not affect the rest of your system. So, first we need to install virtualenv:

pip3 install --upgrade virtualenv

Now, create a new virtualenv:

# In this command, ~/tensorflow is the destination directory where the virtualenv will be created.
# You can choose a different location if you prefer, but the rest of the installation tutorial will assume ~/tensorflow
python3 -m venv --system-site-packages ~/tensorflow

Then, we need to activate the virtualenv so we can install things inside it:

  • If you are using bash, sh, ksh, zsh:
source ~/tensorflow/bin/activate
  • If you are using csh, tcsh:
source ~/tensorflow/bin/activate.csh
  • If you are cool and use fish:
source ~/tensorflow/bin/activate.fish

After activating the virtualenv, your command prompt should change to look like: (tensorflow)$. At this point, we are finally ready to install TensorFlow:

pip3 install --upgrade tensorflow

In addition, we also install a few other Python packages that are useful for machine learning:

pip3 install --upgrade numpy # Used for linear algebra. Essential for using TensorFlow
pip3 install --upgrade matplotlib # Used for plotting data, which is very useful for machine learning
pip3 install --upgrade pandas # Used for loading data sets

Just so you know, you can always exit the virtualenv using the command deactivate. In the future, to use TensorFlow you need to activate the virtualenv first using the appropriate source command listed above.

Testing the installation

Let's just quickly test the installation, to verify everything is installed and ready to go! If it isn't currently activated, activate your virtualenv with the appropriate source command from above. Then, run the python command, and type the following code into Python:

import tensorflow as tf
import numpy as np
import pandas
import matplotlib.pyplot as plt
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

If everything is installed correctly, you should see the output:

Hello, TensorFlow!

If the above test produces errors, please see this list of common TensorFlow installation problems or Google it, and file a bug here.

Optional: Installing an IDE

At this point, everything we need for writing TensorFlow code is installed. We can write our Python code in a text editor of our choice, and run it using python from terminal. In the future I won't assume use of an IDE specifically, but rather assume that you can create new Python files and run them, either from terminal or an IDE. However, some people may prefer using an IDE, which can reduce the time spent in Terminal, and help provide better syntax and semantic checking of Python code. Personally, I use and recommend PyCharm, which you can download for free here. I won't provide specific installation instructions, but I found it very easy to install and configure to use the ~/tensorflow virtualenv.

Linux (Ubuntu) Setup

Without a doubt, Python 3 is the future of Python, and thus we will use Python 3 with TensorFlow. While I summarize the instructions to install TensorFlow with Python 3 here, the official TensorFlow documentation should always provide up-to-date and comprehensive instructions for installation as well. If you run into any installation problems, please file a bug here. Note that Ubuntu is the only officially supported Linux distribution for TensorFlow, but with a bit of effort TensorFlow can likely be installed on other distributions as well.

Optional: Install GPU Support

On Linux TensorFlow has the ability to run your machine learning models on a GPU, which can make computation significantly faster. However, the speed difference is negligible until we start building extremely large models with massive amounts of data. So if you have a supported GPU, you could choose to install GPU support now, or just wait until later. To see if your GPU is supported and instructions for installing GPU support, see here.

Installing Pip and Virtualenv

First, we need to install pip, and virtualenv. This should be easy using apt-get:

sudo apt-get install python3-pip python3-dev python-virtualenv

Installing TensorFlow and friends

Virtualenv is an extremely easy and convenient way to install Python packages in a local and contained manner. The recommended way to install TensorFlow is by using virtualenv, since this ensures that the installation will be self-contained, and will not affect the rest of your system. Now, we need to create a new virtualenv:

# In this command, ~/tensorflow is the destination directory where the virtualenv will be created.
# You can choose a different location if you prefer, but the rest of the installation tutorial will assume ~/tensorflow
python3 -m venv --system-site-packages ~/tensorflow

Then, we need to activate the virtualenv so we can install things inside it:

  • If you are using bash, sh, ksh, zsh:
source ~/tensorflow/bin/activate
  • If you are using csh, tcsh:
source ~/tensorflow/bin/activate.csh
  • If you are cool and use fish:
source ~/tensorflow/bin/activate.fish

After activating the virtualenv, your command prompt should change to look like: (tensorflow)$. At this point, we are finally ready to install TensorFlow:

pip3 install --upgrade tensorflow # If installing WITHOUT GPU support
pip3 install --upgrade tensorflow-gpu # If installing WITH GPU support

In addition, we also install a few other Python packages that are useful for machine learning:

pip3 install --upgrade numpy # Used for linear algebra. Essential for using TensorFlow
pip3 install --upgrade matplotlib # Used for plotting data, which is very useful for machine learning
pip3 install --upgrade pandas # Used for loading data sets

Once everything is installed, you can exit the virtualenv using the command deactivate.

Testing the installation

Let's just quickly test the installation, to verify everything is installed and ready to go! If it isn't currently activated, activate your virtualenv with the appropriate activate command from above. Then, run the python command, and type the following code into Python:

import tensorflow as tf
import numpy as np
import pandas
import matplotlib.pyplot as plt
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

If everything is installed correctly, you should see the output:

Hello, TensorFlow!

If the above test produces errors, please see this list of common TensorFlow installation problems or Google it, and file a bug here.

Optional: Installing an IDE

At this point, everything we need for writing TensorFlow code is installed. We can write our Python code in a text editor of our choice, and run it using python from terminal. In the future I won't assume use of an IDE specifically, but rather assume that you can create new Python files and run them, either from terminal or an IDE. However, some people may prefer using an IDE, which can reduce the time spent in terminal, and help provide better syntax and semantic checking of Python code. Personally, I use and recommend PyCharm, which you can download for free here. I won't provide specific installation instructions, but I found it very easy to install and configure to use the ~/tensorflow virtualenv.

Windows Setup

To be blunt, I only use macOS and Linux, and I have no experience installing TensorFlow or even Python on Windows. Instead, I will simply direct you to the official documentation for TensorFlow on Windows.

When installing Python, please ensure that you use Python 3, rather than Python 2. Also, install the Python packages NumPy, Pandas and matplotlib in addition to TensorFlow.

Optional: Installing an IDE

At this point, everything we need for writing TensorFlow code in installed. We can write our Python code in a text editor of our choice, and run it using python from terminal. In the future I won't assume use of an IDE specifically, but rather assume that you can create new Python files and run them, either from terminal or an IDE. However, some people may prefer using an IDE, which can reduce the time spent in terminal, and help provide better syntax and semantic checking of Python code. Personally, I use and recommend PyCharm, which you can download for free here. I won't provide specific installation instructions, but I found it very easy to install and configure.

Linear Regression

Roughly, there are two categories of machine learning:

  1. In supervised learning one already has access to a data set of example inputs and desired outputs, and teaches the computer to produce the desired outputs given the example inputs.
  2. In unsupervised learning one only has access to a data set: there is not necessarily any notion of "input" and "output". The goal is to teach the computer to autonomously discover underlying patterns in the data set.

For example, if we have a data set containing students' GPAs and SAT scores, we could attempt to train a supervised machine learning model to predict SAT scores given GPAs. Here, GPA would be the input, and SAT score would be the desired output. On the other hand, instead of trying to predict SAT scores based on GPA, we could use both SAT and GPA as input, and see if the computer can discover interesting patterns. In the plots below, supervised learning can easily predict the output variable based on the input variable via a straight line. A good unsupervised learning algorithm could hopefully discover the two clusters in the right plot.

Supervised Learning Unsupervised Learning
Regression Image Clustering Image

Unsupervised machine learning will be touched on later, but for now we focus on supervised machine learning: in general it is easier, and the objective is clearer. Specifically, there are two main types of supervised learning:

  1. Regression aims to predict a continuous output, given the inputs.
  2. Classification aims to predict a classification, given the inputs.

The GPA and SAT score example from above is an example of regression, since we are trying to predict the continuous output variable of SAT scores. We aren't trying to classify students based on GPAs, we just want to estimate what their SAT score will be. On the other hand, suppose we also have data about UC Davis college admissions: we could then try to predict whether or not a student will be accepted based on their GPA and SAT combined. This would be an example of classification, since the output of the prediction is not continuous, it is a classification (accepted or rejected). In the example plots below, the left plot shows predicting continuous SAT from GPA, while the right plot shows distinguishing UC Davis rejected applicants (marked with 'x') from UC Davis accepted applicants (marked with 'o'). The plot looks similar to the unsupervised plot from above, but the crucial difference is that we know have data about who was accepted and rejected.

Regression Classification
Regression Image Classification Image

In this chapter we look at regression only, and specifically the simplest form: linear regression. We start with the most basic form of linear regression (single variable linear regression), and then move on to consider variants of linear regression. Using these variants we can obtain models that are both fairly simple and powerful.

Single Variable Regression

Since this is the very first tutorial in this guide and no knowledge is assumed about machine learning or TensorFlow, this tutorial is a bit on the long side. This tutorial will give you an overview of how to do machine learning work in general, a mathematical understanding of single variable linear regression, and how to implement it in TensorFlow. If you already feel comfortable with the mathematical concept of linear regression, feel free to skip to the TensorFlow implementation.

Motivation

Single variable linear regression is one of the fundamental tools for any interpretation of data. Using linear regression, we can predict continuous variable outcomes given some data, if the data has a roughly linear shape, i.e. it generally has the shape a line. For example, consider the plot below of 2015 US homicide deaths per age1, and the line of best fit next to it.

Original data Result of single variable linear regression
Homicide Plot Homicide Regression Plot

Visually, it appears that this data is approximated pretty well by a "line of best fit". This is certainly not the only way to approximate this data, but for now it's pretty good. Single variable linear regression is the tool to find this line of best fit. The line of best fit can then be used to guess how many homicide deaths there would be for ages we don't have data on. By the end of this tutorial you can run linear regression on this homicide data, and in fact solve any single variable regression problem.

Theory

Since we don't have any theory yet to understand linear regression, first we need to develop the theory necessary to program it.

Data set format

For regression problems, the goal is to predict a continuous variable output, given some input variables (also called features). For single variable regression, we only have one input variable, called \(x\), and our desired output \(y\). Our data set \(D\) then consists of many examples of \(x\) and \(y\), so: \[ D = \{ (x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m) \} \] where \(m\) is the number of examples in the data set. For a concrete example, the homicide data set plotted above looks like: \[ D = \{ (21, 652), (22, 633), \cdots, (50, 197) \} \] We will write code to load data sets from files later.

Model concept

So, how can we mathematically model single linear regression? Since the goal is to find the perfect line, let's start by defining the model (the mathematical description of how predictions will be created) as a line: \[ y'(x, a, b) = ax + b \] where \(x\) is an input, \(a, b\) are constants, and \(y'\) is the prediction for the input \(x\). Note that although this is an equation for a line with \(x\) as the variable, the values of \(a\) and \(b\) determine what specific line it is. To find the best line, we just need to find the best values for \(a\) (the slope) and \(b\) (the y-intercept). For example, the line of best fit for the homicide data above has a slope of about \(a \approx -17.69\) and a y-intercept of \(b \approx 1000\). How we find the magic best values for \(a\) and \(b\) we don't know yet, but once we find them, prediction is easy, since we just use the formula above.

So, how do we find the correct values of \(a\) and \(b\)? First, we need a way to define what the "best line" is exactly. To do so, we define a loss function (also called a cost function), which measures how bad a particular choice of \(a\) and \(b\) are. Values of \(a\) and \(b\) that seem poor (a line that does not fit the data set) should result in a large value of the loss function, whereas good values of \(a\) and \(b\) (a line that fits the data set well) should result in small values of the loss function. In other words, the loss function should measure how far the predicted line is from each of the data points, and add this value up for all data points. We can write this as: \[ L(a, b) = \sum_{i=1}^m (y'(x_i, a, b) - y_i)^2 \] Recall that there are \(m\) examples in the data set, \(x_i\) is the i'th input, and \(y_i\) is the i'th desired output. So, \((y'(x_i, a, b) - y_i)^2\) measures how far the i'th prediction is from the i'th desired output. For example, if the prediction \(y'\) is 7, and the correct output \(y\) is 10, then we would get \((7 - 10)^2 = 9.\) Squaring it is important so that it is always positive. Finally, we just add up all of these individual losses. Since the smallest possible values for the squared terms indicate that the line fits the data as closely as possible, the line of best fit (determined by the choice of \(a\) and \(b\)) occurs exactly at the smallest value of \(L(a, b)\). For this reason, the model is also called least squares regression.

Note: The choice to square \(y'(x_i, a, b) - y_i\) is somewhat arbitrary. Though we need to make it positive, we could achieve this in many ways, such as taking the absolute value. In a sense, the choice of models and loss functions is one of the creative aspect of machine learning, and often a certain loss function is chosen simply because it produces satisfying results. Manipulating the loss function to achieve more satisfying results will be done in a later chapter. However, there is also a concrete mathematical reason for why we specifically square it. It's too much to go into here, but it will be explored in bonus chapter 2.8

Creating loss functions (and this exact loss function) will continue to be used throughout this guide, from the most simple to more complex models.

Optimizing the model

At this point, we have fully defined both our model: \[ y'(x, a, b) = ax + b \] and our loss function, into which we can substitute the model: \[ L(a, b) = \sum_{i=1}^m (y'(x_i, a, b) - y_i)^2 = \sum_{i=1}^m (a x_i + b - y_i)^2 \] We crafted \(L(a, b)\) so that it is smallest exactly when each predicted \(y'\) is as close as possible to actual data \(y\). When this happens, since the distance between the data points and predicted line is as small as possible, using \(a\) and \(b\) produces the line of best fit. Therefore, our goal is to find the values of \(a\) and \(b\) that minimize the function \(L(a, b)\). But what does \(L\) really look like? Well, it is essentially a 3D parabola which looks like:

Minimum Plot

The red dot marked on the plot of \(L\) shows where the desired minimum is. We need an algorithm to find this minimum. From calculus, we know that at the minimum \(L\) must be entirely flat, that is the derivatives are both \(0\): \[ \frac{\partial L}{\partial a} = \sum_{i=1}^m 2(ax_i + b - y_i)x_i = 0 \\ \frac{\partial L}{\partial b} = \sum_{i=1}^m 2(ax_i + b - y_i) = 0 \ \] If you need to review this aspect of calculus, I would recommend Khan Academy videos. Now, for this problem it is possible to solve for \(a\) and \(b\) using the equations above, like we would in a typical calculus course. But for more advanced machine learning this is impossible, so instead we will learn to use an algorithm called gradient descent to find the minimum. The idea is intuitive: place a ball at an arbitrary location on the surface of \(L\), and it will naturally roll downhill towards the flat valley of \(L\) and thus find the minimum. We know the direction of "downhill" at any location since we know the derivatives of \(L\): the derivatives are the direction of greatest upward slope (this is known as the gradient), so the opposite (negative) derivatives are the most downhill direction. Therefore, if the ball is currently at location \((a, b)\), we can see where it would go by moving it to location \((a', b')\) like so: \[ a' = a - \alpha \frac{\partial L}{\partial a} \\ b' = b - \alpha \frac{\partial L}{\partial b} \\ \] where \(\alpha\) is a constant called the learning rate, which we will talk about more later. If we repeat this process then the ball will continue to roll downhill into the minimum. An animation of this process looks like:

Gradient Descent Animation

When we run the gradient descent algorithm for long enough, then it will find the optimal location for \((a, b)\). Once we have the optimal values of \(a\) and \(b\), then that's it, we can just use them to predict a rate of homicide deaths given any age, using the model: \[ y'(x) = ax + b \]

Implementation

Let's quickly review what we did when defining the theory of linear regression:

  1. Describe the data set
  2. Define the model
  3. Define the loss function
  4. Run the gradient descent optimization algorithm
  5. Use the optimal model to make predictions
  6. Profit!

When coding this we will follow the exact same steps. So, create a new file single_var_reg.py in the text editor or IDE of your choice (or experiment in the Python REPL by typing python at command line), and download the homicide death rate data set into the same directory.

Importing the data

First, we need to import all the modules we will need:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

We use Pandas to easily load the CSV homicide data:

D = pd.read_csv("homicide.csv")
x_data = np.matrix(D.age.values)
y_data = np.matrix(D.num_homicide_deaths.values)

Note that x_data and y_data are not single numbers, but are actually vectors. The vectors are 30 numbers long, since there are 30 data points in the CSV file. So, (x_data[0], y_data[0]) would be \((x_1, y_1) = (21, 652)\). When we look at multi variable regression later, we will have to work much more with vectors, matrices and linear algebra, but for now you can think of x_data and y_data just as lists of numbers. Also, we have to use np.matrix(...) to convert the array of numbers D.age.values to an actual numpy vector (likewise for D.num_homicide_deaths.values).

Whenever possible, I would recommend plotting data, since this helps you verify that you loaded the data set correctly and gain visual intuition about the shape of the data. This is also pretty easy using matplotlib:

plt.plot(x_data.T, y_data.T, 'x') # The 'x' means that data points will be marked with an x
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Relationship between age and homicide deaths in the US')
plt.show()

When we converted the data to vectors using np.matrix(), numpy created vectors with the shape 1 x 30. That is, x_data consists of only 1 row of numbers, and 30 columns. This is actually great for us when working with TensorFlow, but matplotlib wants vectors that have the shape 30 x 1 (30 rows and 1 column). Writing x_data.T calculates the transpose of x_data, which flips it from a 1 x 30 vector to a 30 x 1 vector. It's fine if you don't understand this now, as we will learn more linear algebra later. Anyways, the plot should look like this: Homicide Plot

You need to close the plot for your code to continue executing.

Defining the model

We have our data prepared and plotted, so now we need to define our model. Recall that the model equation is: \[ y' = ax + b \] Before, we thought of \(x\) and \(y'\) as single numbers. However, we just loaded our data set as vectors (lists of numbers), so it will be much more convenient to define our model using vectors instead of single numbers. If we use the convention that \(x\) and \(y'\) are vectors, then we don't need to change the equation, just our interpretation of it. Multiplying the vector \(x\) by the single number \(a\) just multiplies every number in \(x\) by \(a\), and likewise for adding \(b\). So, the above equation interpreted using vectors is the same thing as: \[ \begin{bmatrix} y_{1}', & y_{2}', & \dots, & y_{m}' \end{bmatrix} = \begin{bmatrix} ax_{1} + b, & ax_{2} + b, & \dots, & ax_{m} + b \end{bmatrix} \]

Fortunately, TensorFlow does the work for us of interpreting the simple equation \(y' = ax + b\) as the more complicated looking vector equation. We just have to tell TensorFlow which things are vectors (\(x\) and \(y'\)), and which are not vectors (\(a\) and \(b\)). First, we define \(x\):

x = tf.placeholder(tf.float32, shape=(1, None))

This says that we create a placeholder that stores floating-point numbers, and has a shape of 1 x None. The shape of 1 x None tells TensorFlow that \(x\) is a vector with 1 row, and some unspecified number of columns. Although we don't tell TensorFlow the number of columns, this is enough to tell TensorFlow that \(x\) is a vector.

Secondly, note that we create a tf.placeholder: x does not have a numerical value right now. Instead, we will later feed the values of x_data into x. In short, use a tf.placeholder whenever there are values you wish to fill in later (usually data).

Now, we define \(a\) and \(b\):

a = tf.get_variable("a", shape=())
b = tf.get_variable("b", shape=())

Unlike x, we create a and b to be a variable, instead of a placeholder. The main difference between a variable and a placeholder is that TensorFlow will automatically find the best values of variables by using gradient descent (later). In other words, a placeholder changes values whenever we choose to feed it different numeric values. A variable changes values continually and automatically during gradient descent. Use a variable for something that is trainable, that is, something whose optimal value will be found by an algorithm such as gradient descent. Since the goal of linear regression is to find the best values of \(a\) and \(b\), the (only) TensorFlow variables in our model are a and b. The conceptual difference between a TensorFlow placeholder and variable is crucial to using TensorFlow properly.

The parameters ("a", shape=()) indicate the name of the variable, and that a is a single number, not a vector. In comparison to x, note that a shape of (1, None) indicates a vector, while a shape of () indicates a single number.

With x, a and b defined, we can define \(y'\):

y_predicted = a*x + b

Now, y_predicted is not a TensorFlow variable, since it is not directly trainable, nor is it a TensorFlow placeholder, since we will not directly feed values into it. More generically, we call y_predicted a TensorFlow node.

And that's it to define the model!

Defining the loss function

We have the model defined, so now we need to define the loss function. Recall that the loss function is how the model is evaluated (smaller loss values are better), and it is also the function that we need to minimize in terms of \(a\) and \(b\). Since the loss function compares the model's output \(y'\) to the correct output \(y\) from the data set, we need to define \(y\) in TensorFlow. Since \(y\) consists of outside data (and we don't need to train it), we create it as a tf.placeholder:

y = tf.placeholder(tf.float32, shape=(1, None))

Like x, y is also a vector, since after all y must store the correct output for each value stored in x.

Now, we are ready to setup the loss function. Recall that the loss function is: \[ L(a, b) = \sum_{i=1}^m (y'(x_i, a, b) - y_i)^2 \]

However, \(y'\) and \(y\) are now being interpreted as vectors. We can rewrite the loss function as: \[ L(a, b) = \mathrm{sum}((y' - y)^2) \] Note that since \(y'\) and \(y\) are vectors, \(y' - y\) is also a vector that just contains every number stored in \(y'\) minus every corresponding number in \(y'\). Likewise, \((y' - y)^2\) is also a vector, with every number individually squared. Then, the \(\mathrm{sum}\) function (which I just made up) adds up every number stored in the vector \((y' - y)^2\). This is the same as the original loss function, but is a vector interpretation of it instead. We can code this directly:

L = tf.reduce_sum((y_predicted - y)**2)

The tf.reduce_sum function is an operation which adds up all the numbers stored in a vector. It is called "reduce" since it reduces a large vector down to a single number (the sum). The word "reduce" here has nothing to do with the fact that we will minimize the loss function.

With just these two lines of code we have defined our loss function.

Minimizing the loss function with gradient descent

With our model and loss function defined, we are now ready to use the gradient descent algorithm to minimize the loss function, and thus find the optimal \(a\) and \(b\). Fortunately, TensorFlow as already implemented the gradient descent algorithm for us, we just need to use it. The algorithm acts almost like a ball rolling downhill into the minimum of the function, but it does so in discrete time steps. TensorFlow does not handle this aspect, we need to be responsible for performing each time step of gradient descent. So, roughly in pseudo-code we want to do this:

for t in range(10000):
    # Tell TensorFlow to do 1 time step of gradient descent

We can't do this yet, since we don't yet have a way to tell TensorFlow to perform 1 time step of gradient descent. To do so, we create an optimizer with a learning rate (\(\alpha)\) of \(0.2\):

optimizer = tf.train.AdamOptimizer(learning_rate=0.2).minimize(L)

The tf.train.AdamOptimizer knows how to perform the gradient descent algorithm for us (actually a faster version of gradient descent). Note that this does not yet minimize \(L\). This code only create an optimizer object which we will use later to minimize \(L\). For an explanation of the learning_rate=0.2 parameter (and the 10000 loop iterations), see the end of this tutorial.

The second problem we have is we don't know how to make TensorFlow run actual computations. Everything so far has been only defining things for TensorFlow, not computing things with concrete numbers. To do so, we also need to create a session:

session = tf.Session()

A TensorFlow session is how we always have to perform actual computations with TensorFlow. We actually need to perform a computation right now, before doing gradient descent. Previously, we defined the variables a and b, but they don't have any numeric value right now. They need to have some initial value so gradient descent can work. To solve this, we have TensorFlow initialize a and b with random values:

session.run(tf.global_variables_initializer())

The session.run function is how we always have to run computations with TensorFlow: the parameter is what computation we want to perform.

Finally, we are ready to run the optimization loop pseudo-code that we originally wanted. Using session.run it looks like:

for t in range(10000):
    _, current_loss, current_a, current_b = session.run([optimizer, L, a, b], feed_dict={
        x: x_data,
        y: y_data
    })
    print("t = %g, loss = %g, a = %g, b = %g" % (t, current_loss, current_a, current_b))

Let's break this down. We use session.run, but we pass it an array of computations that we want to perform. Specifically, we want to perform 4 computations:

  1. optimizer: performs 1 time step of gradient descent, and updates a and b
  2. L: returns the current value of the loss function
  3. a: returns the current value of a
  4. b: likewise returns the current value of b.

Of these computations, only the optimizer does not return a value. So, session.run will return 3 values for us, which we store into variables using the syntax:

_, current_loss, current_a, current_b = session.run([optimizer, L, a, b], ...)

Ok, but what is all of this feed_dict stuff? Recall that x and y are placeholders, and have no actual numerical value on their own. To perform 1 time step of gradient descent, we need to "feed" our actual data (x_data and y_data) into the x and y placeholders. So, we use the feed_dict parameter of session.run to feed x_data into x, and y_data into y, by means of a dictionary.

Finally, the last line of the loop prints out the current values of t, L, a and b. We don't need to print these values out, but it is helpful to observe how the training is progressing.

What we want to see from the print statements is that the gradient descent algorithm converged, which means that the algorithm stopped making significant progress because it found the minimum location of the loss function. When the last few print outputs look like:

t = 9992, loss = 39295.8, a = -17.271, b = 997.281
t = 9993, loss = 39295.8, a = -17.271, b = 997.282
t = 9994, loss = 39295.9, a = -17.271, b = 997.282
t = 9995, loss = 39295.9, a = -17.271, b = 997.283
t = 9996, loss = 39295.8, a = -17.2711, b = 997.283
t = 9997, loss = 39295.8, a = -17.2711, b = 997.284
t = 9998, loss = 39295.9, a = -17.2711, b = 997.284
t = 9999, loss = 39295.8, a = -17.2711, b = 997.285

then we can tell that we have achieved convergence, and therefore found the best values of \(a\) and \(b\).

Using the trained model to make predictions

At this point we have a fully trained model, and know the best values of \(a\) and \(b\). In fact, the equation of the line of best fit is just: \[ y' = -17.2711x + 997.285 \]

The last remaining thing for this tutorial is to plot the predictions of the model on top of a plot of the data. First, we need to create a bunch of input ages that we will predict the homicide rates for. We could use x_data as the input ages, but it is more interesting to create a new vector of input ages, since then we can predict homicide rates even for ages that were not in the data set. Outside of the training for loop, we can use the numpy function linspace to create a bunch of evenly spaced values between 20 and 55:

# x_test_data has values similar to [20.0, 20.1, 20.2, ..., 54.9, 55.0]
x_test_data = np.matrix(np.linspace(20, 55))

Then, we can feed x_test_data into the x placeholder, and save the outputs of the prediction in y_test_data:

y_test_data = session.run(y_predicted, feed_dict={
    x: x_test_data
})

Finally, we can plot the original data and the line together:

plt.plot(x_data.T, y_data.T, 'x')
plt.plot(x_test_data.T, y_test_data.T)
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Age and homicide death linear regression')
plt.show()

This yields a plot like:

Homicide Linear Regression Plot

Concluding Remarks

Through following this post you have learned two main concepts. First, you learned the general form of supervised machine learning workflows:

  1. Get your data set
  2. Define your model (the mechanism for how outputs will be predicted from inputs)
  3. Define your loss function
  4. Minimize your loss function (usually with a variant of gradient descent, such as tf.train.AdamOptimizer)
  5. Once your loss function is minimized, use your trained model to do cool stuff

Second, you learned how to implement linear regression (following the above workflow) using TensorFlow. Let's briefly discuss the above 5 steps, and where to go to improve on them.

1. The Data Set

This one is pretty simple: we need data sets that contain both input and output data. However, we need a data set that is large enough to properly train our model. With linear regression this is fairly easy: this data set only had 33 data points, and the results were pretty good. With larger and more complex models that we will look at later, this becomes much more of a challenge.

2. Defining the model

For single variable linear regression we used the model \(y' = ax + b\). Geometrically, this means that the model can only guess lines. Since the homicide data is roughly in the shape of a line, it worked well for this problem. But there are very few problems that are so simple, so soon we will look at more complex models. One other limitation of the current model is it only accepts one input variable. But if our data set had both age and ethnicity, for example, perhaps we could more accurately predict homicide death rate. We will also discuss soon a more complex model that handles multiple input variables.

3. Defining the loss function

For single variable regression, the loss function we used, \(L = \sum (y' - y)^2\), is the standard. However, there are a few considerations: first, this loss functions is suitable for this simple model, but with more advanced models this loss function isn't good enough. We will see why soon. Second, the optimization algorithm converged pretty slowly, needing about \(10000\) iterations. One cause is that the surface of the loss function is almost flat in a certain direction (you can see this in the 3D plot of it). Though this isn't inherently a problem with the formula for the loss function, the problem surfaces in the loss function. We will also see how to address this problem soon, and converge muster faster.

4. Minimizing the loss functions

Recall that we created and used the optimizer like so:

optimizer = tf.train.AdamOptimizer(learning_rate=0.2).minimize(L)
for t in range(10000):
    # Run one step of optimizer

You might be wondering what the magic numbers of learning_rate=0.2 (\( \alpha \)) and 10000 are. Let's start with the learning rate. In each iteration, gradient descent (and variants of it) take one small step that is determined by the derivative of the loss function. The learning rate is just the relative size of the step. So to take larger steps, we can use a larger learning rate. A larger learning rate can help us to converge more quickly, since we cover more distance in each iteration. But a learning rate too large can cause gradient descent to diverge that is, it won't reliably find the minimum.

So, once you have chosen a learning rate, then you need to run the optimizer for enough iterations so it actually converges to the minimum. The easiest way to make sure it runs long enough is just to monitor the value of the loss function, as we did in this tutorial.

Lastly, we didn't use normal gradient descent for optimization in this tutorial. Instead we used tf.train.AdamOptimizer. With small scale problems like this, there isn't much of a qualitative difference to intuit. In general the Adam optimizer is faster, smarter, and more reliable than vanilla gradient descent, but this comes into play a lot more with harder problems.

5. Use the trained model

Technically, using the trained model is the easiest part of machine learning: with the best parameters \(a\) and \(b\), you can simply plug new age values into \(x\) to predict new homicide rates. However, trusting that these predictions are correct is another matter entirely. Later in this guide we can look at various statistical techniques that can help determine how much we can trust a trained model, but for now consider some oddities with our trained homicide rate model.

One rather weird thing is that it accepts negative ages: according to the model, 1083 people who are -5 years old die from homicide every year in the US. Now, clearly this makes no sense since people don't have negative ages. So perhaps we should only let the model be valid for people with positives ages. Ok, so then 980 people who are 1 year old die from homicide every year. While this isn't impossible, it does seem pretty high compared to the known data of 652 for 21 year olds. It might seem possible (likely even) that fewer homicides occur for 1 year olds than 21 year olds: but we don't have the data for that, and even if we did, our model could not predict it correctly since it only models straight lines. Without more data, we have no basis to conclude that the number of \(1\) year old homicides is even close to 980.

While this might seem like a simple observation in this case, this problem manifests itself continually in machine learning, causing a variety of ethical problems. For example, in 2016 Microsoft released a chatbot on Twitter and it quickly learned to say fairly horrible and racist things. More seriously, machine learning is now being used to predict and guide police in cracking down on crime. While the concept might be well-intentioned, the results are despicable, as shown in an article by The Conversation:

Our recent study, by Human Rights Data Analysis Group’s Kristian Lum and William Isaac, found that predictive policing vendor PredPol’s purportedly race-neutral algorithm targeted black neighborhoods at roughly twice the rate of white neighborhoods when trained on historical drug crime data from Oakland, California. [...] But estimates – created from public health surveys and population models – suggest illicit drug use in Oakland is roughly equal across racial and income groups. If the algorithm were truly race-neutral, it would spread drug-fighting police attention evenly across the city.

With examples like these, we quickly move from a technical discussion about machine learning to a discussion about ethics. While the study of machine learning is traditionally heavily theoretical, I strongly believe that to effectively and fairly apply machine learning in society, we must spend significant effort evaluating the ethics of machine learning models.

This is an open question, and one that I certainly don't have an answer to right now. For the short term we can focus on the problem of not trusting a simple linear regression model to properly predict data outside of what it has been trained on, while in the long term keeping in mind that "with great power comes great responsibility."

Challenge Problems

Feel free to complete as many of these as you wish, to get more practice with single variable linear regression. Note that for different problems you might have to adjust the learning rate and / or the number of training iterations.

  1. Learn how to use numpy to generate an artificial data set that is appropriate for single variable linear regression, and then train a model on it. As a hint, for any \(x\) value you could create an artificial \(y\) value like so: \(y = ax + b + \epsilon \), where \(\epsilon\) is a random number that isn't too big, and \(a\) and \(b\) are fixed constants of your choice. If done correctly, your trained model should learn by itself the numbers you chose for \(a\) and \(b\).
  2. Run single variable linear regression on a data set of your choice. You can look at my list of regression data sets for ideas, you can search Kaggle, or you can search online, such as I did for the homicide data set. Many data sets might have multiple input variables, and right now you only know how to do single variable linear regression. We will deal with multiple variables soon, but for now you can always use only 1 of the input variables and ignore the rest.
  3. Experiment with altering the loss function, and observe the effects on the trained model. For example, you could change \((y' - y)^2\) to \(\mid y' - y \mid \) (use tf.abs(...)), or really anything you can think of.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# Load the data, and convert to 1x30 vectors
D = pd.read_csv("homicide.csv")
x_data = np.matrix(D.age.values)
y_data = np.matrix(D.num_homicide_deaths.values)

# Plot the data
plt.plot(x_data.T, y_data.T, 'x')
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Relationship between age and homicide deaths in the US')
plt.show()



### Model definition ###

# Define x (input data) placeholder
x = tf.placeholder(tf.float32, shape=(1, None))

# Define the trainable variables
a = tf.get_variable("a", shape=())
b = tf.get_variable("b", shape=())

# Define the prediction model
y_predicted = a*x + b


### Loss function definition ###

# Define y (correct data) placeholder
y = tf.placeholder(tf.float32, shape=(1, None))

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)


### Training the model ###

# Define optimizer object
optimizer = tf.train.AdamOptimizer(learning_rate=0.2).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(10000):
    _, current_loss, current_a, current_b = session.run([optimizer, L, a, b], feed_dict={
        x: x_data,
        y: y_data
    })
    print("t = %g, loss = %g, a = %g, b = %g" % (t, current_loss, current_a, current_b))


### Using the trained model to make predictions ###

# x_test_data has values similar to [20.0, 20.1, 20.2, ..., 79.9, 80.0]
x_test_data = np.matrix(np.linspace(20, 55))

# Predict the homicide rate for each age in x_test_data
y_test_data = session.run(y_predicted, feed_dict={
    x: x_test_data
})

# Plot the original data and the prediction line
plt.plot(x_data.T, y_data.T, 'x')
plt.plot(x_test_data.T, y_test_data.T)
plt.xlabel('Age')
plt.ylabel('US Homicide Deaths in 2015')
plt.title('Age and homicide death linear regression')
plt.show()

References

1

Centers for Disease Control and Prevention, National Center for Health Statistics. Underlying Cause of Death 1999-2015 on CDC WONDER Online Database, released December, 2016. Data are from the Multiple Cause of Death Files, 1999-2015, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at https://wonder.cdc.gov/ucd-icd10.html on Nov 22, 2017 2:18:46 PM.

Exploring Optimization Convergence

In the previous chapter we created a simple single variable linear regression model to fit a data set. While the Python code was actually fairly short and simple, I did gloss over some details related to the optimization, and I hope to use this short chapter to answer some dangling questions about it. Since this chapter doesn't introduce new models or concepts you can skip it (or come back later) if you prefer. However, getting a feel for optimization is useful for training just about any model, not just the single variable linear regression model, and this chapter should give you insight that is useful for the rest of this guide and just about any machine learning you do.

To explore optimization we are going to exactly copy the code from the previous chapter, and experiment with it. To review, let's look at the part of the code from before that performed the optimization (just look at the previous chapter if you need to review the entire thing):

# Define optimizer object
# L is what we want to minimize
optimizer = tf.train.AdamOptimizer(learning_rate=0.2).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(10000):
    _, current_loss, current_a, current_b = session.run([optimizer, L, a, b], feed_dict={
        x: x_data,
        y: y_data
    })
    print("t = %g, loss = %g, a = %g, b = %g" % (t, current_loss, current_a, current_b))

The main unanswered questions to address are:

  1. I said that the learning rate affects how large of steps the optimization algorithm takes in one unit of time. But how do we choose an appropriate value for the learning rate, such as 0.2?
  2. What is this AdamOptimizer exactly, are there other choices for optimizers, and how do they differ?
  3. Currently this code runs for 10000 iterations, and that seems good enough to fully optimize L. But how do we choose this appropriate amount of time for training?

There aren't exact or easy answers to the above questions, but answering these questions is made even harder by the fact that we can not effectively visualize the training progress with the code we have. Currently we have some print(...) statements, which is good enough to see that the training error is decreasing, but not much more than that. Let's start with learning how to visualize training, since this will help us address the other questions and give us a deeper intuition about optimization.

How to visualize training progress

Extracting hyperparameters

One of the simplest ways to visualize training progress is to plot the value of the loss function over time. We could certainly plot the value of the loss function using matplotlib, like we plotted the data set. But TensorFlow actually has a tool built-in for plotting training progress, called TensorBoard. It's pretty handy, but first we need to do some work to refactor our current code.

The first thing I'd like to do is move hyperparameters to the top of the script. What exactly are hyperparameters? They are parameters that affect how the training of the model proceeds, but are not part of the model itself. For example, \(a\) and \(b\) are parameters, while the learning rate \(\alpha\) is a hyperparameter. The hyperparameters we have are:

  1. The learning rate, currently 0.2.
  2. The number of training iterations, currently 10000.
  3. The choice of optimization routine, currently tf.train.AdamOptimizer (This isn't often thought of as a hyperparameter, but here will think of it as one).

So, we can just put these into constants at the top of the code. Also, we are going to change these values to give us a simpler starting point:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

### Hyperparameters ###

LEARNING_RATE = 0.000001 # Much smaller training rate, to make sure the optimization is at least reliable.
NUM_ITERS = 20000 # Twice as many training iterations, just gives us more room to experiment later.
OPTIMIZER_CONSTRUCTOR = tf.train.GradientDescentOptimizer # This is the simplest optimization algorithm.

# .. all the rest of the code, hyperparameter constants properly substituted

Setting up TensorBoard

Now, we need to add some code to configure TensorBoard for our model. We can setup TensorFlow to automatically keep track of the training progress of a, b, and L by using a tf.summary.scalar. Since we need to setup summaries for a, b, and L, but before we actually start training, insert this code after L is defined, but before we start training the model:

# ... above this is where L is defined

### Summary setup ###

tf.summary.scalar('a', a)
tf.summary.scalar('b', b)
tf.summary.scalar('L', L)
summary_node = tf.summary.merge_all()

log_name = "%g, %s" % (LEARNING_RATE, OPTIMIZER_CONSTRUCTOR.__name__)
summary_writer = tf.summary.FileWriter('/tmp/tensorflow/single_var_reg/' + log_name)
print("Open /tmp/tensorflow/single_var_reg/ with tensorboard")

# ... below this is where we create the optimizer and session

The first thing this code does is setup tf.summary.scalar nodes for a, b, and L, and then merge these 3 summaries into a single TensorFlow node, summary_node. Later we will need to run summary_node with our TensorFlow session.

The second thing this code does is setup where TensorFlow will write the logs to. We create a pretty log name by combining LEARNING_RATE and OPTIMIZER_CONSTRUCTOR (__name__ gets the name of the Python class), so that we can later compare logs that used different hyperparameters. Then, we create a tf.summary.FileWriter using the path /tmp/tensorflow/single_var_reg/ concatenated with the log name. This summary_writer is what we will use to write the summary_node to log files. Finally we print out a useful message so we remember where the logs are.

The last thing we need to do is make TensorFlow actually write the summary (summary_node) to the log (using summary_writer) as training progresses. In our training code before we had a print statement that showed how training was going, but now we can replace it (or keep it, you choose):

# Main optimization loop
for t in range(NUM_ITERS):
    # We don't need to run L, a, b, just the summary_node.
    # The output of summary_node is stored in summary.
    _, summary = session.run([optimizer, summary_node], feed_dict={
        x: x_data,
        y: y_data
    })
    # We write this summary to the log file (print statement was here).
    summary_writer.add_summary(summary, t)

And as a final touch, we can also delete or comment out the code that plots the data set, since it reduces the amount of clicking we have to do when trying different hyperparameters. The complete modified code should look something like this:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

### Hyperparameters ###

LEARNING_RATE = 0.000001
OPTIMIZER_CONSTRUCTOR = tf.train.GradientDescentOptimizer
NUM_ITERS = 20000

# Load the data, and convert to 1x30 vectors
D = pd.read_csv("homicide.csv")
x_data = np.matrix(D.age.values)
y_data = np.matrix(D.num_homicide_deaths.values)


### Model definition ###

# Define x (input data) placeholder
x = tf.placeholder(tf.float32, shape=(1, None))

# Define the trainable variables
a = tf.get_variable("a", shape=())
b = tf.get_variable("b", shape=())

# Define the prediction model
y_predicted = a*x + b


### Loss function definition ###

# Define y (correct data) placeholder
y = tf.placeholder(tf.float32, shape=(1, None))

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)


### Summary setup ###

log_name = "%g, %s" % (LEARNING_RATE, OPTIMIZER_CONSTRUCTOR.__name__)
tf.summary.scalar('a', a)
tf.summary.scalar('b', b)
tf.summary.scalar('L', L)
summary_node = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter('/tmp/tensorflow/single_var_reg/' + log_name)
print("Open /tmp/tensorflow/single_var_reg/ with tensorboard")


### Training the model ###

# Define optimizer object
optimizer = OPTIMIZER_CONSTRUCTOR(learning_rate=LEARNING_RATE).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(NUM_ITERS):
    _, summary = session.run([optimizer, summary_node], feed_dict={
        x: x_data,
        y: y_data
    })
    summary_writer.add_summary(summary, t)

Using TensorBoard

With the Python code prepared, we can now run it and use TensorBoard to visualize the training. First, run the Python code as usual, using Terminal, or your IDE. This will write logs to /tmp/tensorflow/single_var_reg/{log_name}. Now, we can open this up with TensorBoard, by running this command in Terminal (make sure to activate the virtual environment first):

python -m tensorboard.main --logdir=/tmp/tensorflow/single_var_reg/

Once you run that command, go to http://localhost:6006 in your web browser, and you should see plots of L, a, and b as the training progressed. With the hyperparameter choices from above, the plots should look like:

TensorBoard 1

Interpreting the TensorBoard plots

Ok, so now we have some plots showing the progression of L, a, and b. What can these plots tell us? Well, with the current choice of LEARNING_RATE = 0.000001, the plot for L clearly continues to decrease continually during training. This is good, since this means that the tf.train.GradientDescentOptimizer is doing it's job of decreasing the value of the loss function. However, the loss function continues to decrease quickly even towards the end of training: it is reasonable to expect that the loss function would continue to decrease substantially if we continued to train for more iterations. Therefore, we have not found the minimum of the loss function.

To remedy this, we could do one of 3 things: run the training for more iterations, increase the learning rate, or experiment with another optimization algorithm. We don't want to train for more iterations unless we have to, since that just takes more time, so we will start with increasing the learning rate.

Increasing the learning rate

Currently the learning rate is 0.000001, and is too small. A pretty easy way to try new learning rates is to go roughly by powers of 10. For now, we can try 0.000001, 0.000005, 0.00001. For each of these, you can just tweak the LEARNING_RATE constant, and re-run the code. You don't need to re-run the TensorBoard command (but you can), but make sure to reload the http://localhost:6006 page once you do all the runs. You should get plots that look like:

TensorBoard 2

We can clearly see the improvement by increasing the learning rate. The final loss obtained with a learning rate of 0.00001 is much smaller than our original loss obtained with a learning rate of 0.000001. In addition, we can see that the loss is decreasing more slowly at the end of training. However, the loss function still hasn't converged, as it is still decreasing significantly. Again, to fix this we could train for longer, but as before we can try increasing the learning rate even more. We can try a learning rate of 0.00005, and we get this:

TensorBoard 3

Huh? No plot!?!?? If you mouse over the L plot, TensorBoard will say that the value of L is NaN (not a number). What gives? Well, if the learning rate is too big then tf.train.GradientDescentOptimizer can explode: a and b (and consequently L) increase to infinity and become NaN. What "too big" is depends on the specific problem.

At this point there isn't a lot more that we can do by just tweaking the learning rate: it's either too big and causes the optimization to explode, or is too small to achieve convergence in 20000 iterations. We can certainly try more learning rates between 0.00001 and 0.00005, but it won't be a ton better than what we already have.

Using different optimization algorithms

So far we have been using tf.train.GradientDescentOptimizer. It is the simplest, classic way to iteratively train machine learning models. As discussed in the previous chapter, it is like moving a ball downhill, according to the current slope (aka the derivative). Generally, gradient descent is part of a class of optimization algorithms called first-order methods, since it uses only information from the first derivative of the loss function, and not higher-order derivatives. First-order methods are currently the dominant way to train most machine learning models.

In fact, there are many first order methods, other than simple gradient descent. Most of them are designed to offer an advantage over other first-order methods via speed to find convergence, reliability, ease of use, etc. For a fairly in-depth exploration, see this blog post. To see the different optimization algorithms that are built-in to TensorFlow, see the documentation here. In this list, you can see the tf.train.AdamOptimizer that we used before, and the classic tf.train.GradientDescentOptimizer.

We can experiment with any number of these. Here, I''ll demonstrate experimenting with tf.train.AdagradOptimizer, but feel free to play around with any of them. To use tf.train.AdagradOptimizer we just need to change OPTIMIZER_CONSTRUCTOR, and set LEARNING_RATE back to 0.000001 for good measure:

LEARNING_RATE = 0.000001
OPTIMIZER_CONSTRUCTOR = tf.train.AdagradOptimizer

Running this we see:

TensorBoard 4

Well, this is disappointing: L did not seem to decrease at all during training. The problem is that the learning rate used by gradient descent is really an entirely different learning rate from the Adagrad one: conceptually they are similar, but are on entirely different scales numerically. So, we just need to try different learning rates for Adagrad now. Since the value of L stayed constant with this very small learning rate, we expect that we need to try much larger learning rates for Adagrad. In fact, by trying learning rates of 0.5, 1, and 5, we get these plots:

TensorBoard 5

Now this is looking like progress! For the first time we start to get a sense of the loss rapidly decreasing, and then slowing down substantially. In addition, the final value of a is now negative (which we know is correct) compared to previous runs which ended either positive or close to zero. However, by looking at the plots for a and b (and to a lesser degree L) we can see that we still haven't achieved convergence: a and b haven't stopped changing substantially at the end of training. So, time to increase the learning rate even more! Before doing so, I am going to delete the logs of previous runs, except for the Adagrad run with a learning rate of 5, so that we can read the plots more clearly:

# This will delete logs of all the runs
rm -rf /tmp/tensorflow/single_var_reg/
# Or, you can delete a specific run, for example:
rm -rf /tmp/tensorflow/single_var_reg/5e-05,\ GradientDescentOptimizer/

By trying learning rates of 10 and 50, we finally achieve convergence:

TensorBoard 6

Qualitatively, this looks like convergence (with a learning rate of 10, and certainly with a learning rate of 50) since the progress that Adagrad is making on decreasing L (and adjusting a and b) has hit a brick wall: no matter how long we run Adagrad, we can't seem to get a loss function value lower than about \(3.9296 \cdot 10^4 \), and similarly for the values of a and b. We've finally trained our model completely.

Unfortunately, I don't know of an easy way to intuitively understand the differences between Adagrad, Adam, and other first-order methods. This blog post does give some mathematical analysis that explains what each algorithm tries to improve upon, and some reasoning for choosing an algorithm, but it can be tricky to apply to real problems. In general, you can always start with the simplest algorithm (gradient descent), and if it isn't converging quickly enough for you, then you can switch to a more sophisticated algorithm, such as Adagrad, Adam, or others.

Concluding Remarks

The experimental nature of this chapter should illustrate the practicalities of machine learning: a lot of cutting-edge machine learning currently involves running multiple experiments to try to find the best combination of hyperparameters. There isn't a golden rule for choosing the optimization algorithm and hyperparameters, but hopefully this chapter demonstrates how to alter the algorithm and hyperparameters in TensorFlow and monitor convergence using TensorBoard. The most important takeaways are:

  1. Learning how to use TensorBoard
  2. Recognizing convergence
  3. Recognizing the symptoms of too small of a learning rate
  4. Recognizing the symptoms of too large of a learning rate

In future chapters I won't include the code specifically for TensorBoard (unless it is important to that chapter) since I don't want it to get in the way of actual models, but I would highly encourage you to insert your own TensorBoard summary code, and monitor plots of convergence in TensorBoard, since it is useful both educationally and practically.

Challenge Problems

  1. Experiment on your own with a few other built-in TensorFlow optimization algorithms, and try different learning rates. If you prefer a more focused goal, try to beat my configuration of an Adagrad optimizer with a learning rate of 50, and converge faster. Also note that some optimization algorithms have additional hyperparameters other than the learning rate. See the TensorFlow documentation for information about these.
  2. One other cause of slow convergence for the homicide rate linear regression is the somewhat extreme scaling of the problem. The \(y\) variable is a whole order of magnitude greater than the \(x\) variable, and this affects optimization. We will actually look at this problem specifically in chapter 2.4, but for now you can experiment on your own with one solution: instead of using the \(x\) and \(y\) data directly from the data set, modify them first to rescale them. A quick, hacky way is to modify the code that loads the data, so that \(x\) and \(y\) vary between 0 and 1:
# Load the data, and convert to 1x30 vectors
D = pd.read_csv("homicide.csv")
# 21 and 50 are the min and max of x_data
x_data = (np.matrix(D.age.values) - 21.0) / (50.0 - 21.0)
# 196 and 653 are the min and max of y_data
y_data = (np.matrix(D.num_homicide_deaths.values) - 196.0) / (653.0 - 196.0)

On your own, add this code and see if you can achieve convergence using only gradient descent. You can also see how quickly you can achieve convergence using a more advanced algorithm such as Adam.

Complete Code

The complete code including TensorBoard summaries, and using Adagrad is available on GitHub and directly below. Note that this code lacks the plotting of the data and the linear regression line:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

### Hyperparameters ###

LEARNING_RATE = 50.0
OPTIMIZER_CONSTRUCTOR = tf.train.AdagradOptimizer
NUM_ITERS = 20000

# Load the data, and convert to 1x30 vectors
D = pd.read_csv("homicide.csv")
x_data = np.matrix(D.age.values)
y_data = np.matrix(D.num_homicide_deaths.values)


### Model definition ###

# Define x (input data) placeholder
x = tf.placeholder(tf.float32, shape=(1, None))

# Define the trainable variables
a = tf.get_variable("a", shape=())
b = tf.get_variable("b", shape=())

# Define the prediction model
y_predicted = a*x + b


### Loss function definition ###

# Define y (correct data) placeholder
y = tf.placeholder(tf.float32, shape=(1, None))

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)


### Summary setup ###

log_name = "%g, %s" % (LEARNING_RATE, OPTIMIZER_CONSTRUCTOR.__name__)
tf.summary.scalar('a', a)
tf.summary.scalar('b', b)
tf.summary.scalar('L', L)
summary_node = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter('/tmp/tensorflow/single_var_reg/' + log_name)
print("Open /tmp/tensorflow/single_var_reg/ with tensorboard")


### Training the model ###

# Define optimizer object
optimizer = OPTIMIZER_CONSTRUCTOR(learning_rate=LEARNING_RATE).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(NUM_ITERS):
    _, summary = session.run([optimizer, summary_node], feed_dict={
        x: x_data,
        y: y_data
    })
    summary_writer.add_summary(summary, t)
    # print("t = %g, loss = %g, a = %g, b = %g" % (t, current_loss, current_a, current_b))

Multi Variable Regression

In chapter 2.1 we learned the basics of TensorFlow by creating a single variable linear regression model. In this chapter we expand this model to handle multiple variables. Note that less time will be spent explaining the basics of TensorFlow: only new concepts will be explained, so feel free to refer to previous chapters as needed.

Motivation

Recall that a single variable linear regression model can learn to predict an output variable \(y\) under these conditions:

  1. There is only one input variable, \(x\)
  2. There is a linear relationship between \(y\) and \(x\), that is, \(y \approx ax + b\)

In practice, the above conditions are very limiting: if you have a simple data set then by all means you should try using single variable linear regression, but in most cases we have significantly more complex data. For example, consider using the following (abbreviated) data from the 1990 census to learn to predict housing prices. Note that each row represents a single housing district:

House Median Age Total Rooms Total Bedrooms ... Median House Value
41.0 880.0 129.0 ... 452600.0
21.0 7099.0 1106.0 ... 358500.0
52.0 1467.0 190.0 ... 352100.0
52.0 1274.0 235.0 ... 341300.0
52.0 1627.0 280.0 ... 342200.0
52.0 919.0 213.0 ... 269700.0
... ... ... ... ...

To predict the values of houses, we have at least 3 real-valued variables (age, number of rooms, number of dedrooms) that could potentially be useful. To analyze this sort of complex, real-world data we need to learn to handle multiple input variables.

One approach to handling multiple variables would be to reduce the number of input variables to only 1 variable, and then training a single variable linear regression model using that. In fact, an important area of research in machine learning (and one that will be covered later) called dimensionality reduction deals with this problem of reducing the number of variables. However, it's important to realize that the number of variables can only be reduced so far, and its extremely rare that you can reduce a data set to only 1 variable. For now you need to take this statement on faith, but in later chapters we will investigate it more thoroughly.

So, it seems that we will have to deal with training models that can handle multiple variables. In this chapter we learn how to allow multiple input variables in our linear regression model. Such a model is called multi variable linear regression, or just linear regression.

Theory

Most of the theory is similar to the theory for single variable linear regression, but we will need to augment and generalize it to handle multiple variables.

Data set format

Previously we defined our data set \(D\) as consisting of many example pairs of \(x\) and \(y\), where \(m\) is the number of examples: \[ D = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m))} \} \]

Note that I have changed the notation compared to before. The notation \(x^{(i)}\) refers to the \(i\)'th \(x\) training example, it does NOT mean \(x\) to the \(i\)'th power, which would be written as \(x^i\). I promise the notation change will be useful shortly.

Alternatively, we can write \(D\) as 2 vectors of shape 1 x \(m\): \[ D_x = \begin{bmatrix} x^{(1)}, x^{(2)}, \dots, x^{(m)} \end{bmatrix} \\ D_y = \begin{bmatrix} y^{(1)}, y^{(2)}, \dots, y^{(m)} \end{bmatrix} \]

But now, we need each \(x^{(i)}\) example to contain multiple numbers, one for each input variable. Let \(n\) be the number of input variables. Then the easiest way to write this is to let each \(x^{(i)}\) be a vector of shape \(n\) x 1. That is, \[ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_j \\ \vdots \\ x^{(i)}_n \end{bmatrix} \] Note that the notation \(x^{(i)}_j\) denotes the \(j\)'th input variable in the \(i\)'th example data.

Since each \(x^{(i)}\) has \(n\) rows, and \(D_x\) has \(m\) columns, each of which is an \(x^{(i)}\), we can write \(D_x\) as a massive \(n \times m\) matrix: \[ D_x = \begin{bmatrix} x^{(1)}, x^{(2)}, \dots, x^{(m)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(2)}_1 & \dots & x^{(i)}_1 & \dots & x^{(m)}_1 \\ x^{(1)}_2 & x^{(2)}_2 & \dots & x^{(i)}_2 & \dots & x^{(m)}_2 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x^{(1)}_j & x^{(2)}_j & \dots & x^{(i)}_j & \dots & x^{(m)}_j \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x^{(1)}_n & x^{(2)}_n & \dots & x^{(i)}_n & \dots & x^{(m)}_n \\ \end{bmatrix} \] So, each column of \(D_x\) represents a single input data example. We don't need to change the 1 x \(m\) vector \(D_y\), since we still only have 1 output variable.

Model concept

So, we now have an input data matrix \(D_x\) with each column vector representing a single input data example, and we have the corresponding \(D_y\) row vector, each entry of which is an output data example. How do we define a model which can linearly estimate the output \(y'^{(i)}\) given the input data vector \(x^{(i)}\)? Let's build it up from simple concepts, and build towards more complex linear algebra.

Since we want \(y'^{(i)}\) to depend linearly on each \(x^{(i)}_j\) for \(1 \leq j \leq n\), we can write: \[ y'^{(i)} = a_1 x^{(i)}_1 + a_2 x^{(i)}_2 + \cdots + a_j x^{(i)}_j + \cdots + a_n x^{(i)}_n + b \]

This is fine mathematically, but it's not very general. Suppose \(n = 100\): then we would have to literally write out 100 terms in our TensorFlow code. We can generalize this using linear algebra. Let \(A\) be a row vector of shape 1 x \(n\), containing each \(a_j\): \[ A = \begin{bmatrix} a_1, a_2, \cdots, a_j, \cdots, a_n \end{bmatrix} \]

Now, let's see what happens if we compute \(A x^{(i)}\), as matrix multiplication. Note that \(A\) has shape 1 x \(n\) and \(x^{(i)}\) has shape \(n\) x 1. This is perfect! When performing matrix multiplication, the inner dimensions (in this case \(n\) and \(n\)) have to match, and the outer dimensions (in this case \(1\) and \(1\)) determine the output shape of the multiplication. So \(A x^{(i)}\) will have shape 1 x 1, or in other words, just a single number, in fact it is exactly \(y'^{(i)}\). How does this matrix multiplication exactly work? I'll refer you to this video by Khan Academy, and explain it briefly in this case. Here, it is easier since \(A\) is a row vector, and \(x^{(i)}\) is a column vector. We simply multiply each corresponding entry, and add it all up: \[ A x^{(i)} + b = \begin{bmatrix} a_1, a_2, \cdots, a_j, \cdots, a_n \end{bmatrix} \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_j \\ \vdots \\ x^{(i)}_n \end{bmatrix} + b = a_1 x^{(i)}_1 + a_2 x^{(i)}_2 + \cdots + a_j x^{(i)}_j + \cdots + a_n x^{(i)}_n + b = y'^{(i)} \]

This matrix equation, \(y'(x, A, b) = Ax + b\) is exactly what we want as our model. As one final note, recall that in the actual implementation, we don't want \(x\) and \(y'\) to represent just one input data and predicted output, we want them to represent several. Since \(x\) is a column vector, the natural way to represent multiple input data points is with a matrix, very similar to the matrix \(D_x\), just not necessarily with all the columns of \(D_x\), and \(y'\) should be a row vector. Specifically, \(A\) has shape 1 x \(n\), \(x\) has shape \(n\) x None, and \(y\) has shape 1 x None, using the TensorFlow convention that None represents a yet-to-be-determined matrix size.

Now defining the loss function is pretty much the same as before, just using the new model: \[ L(A, b) = \sum_{i=1}^m (y'(x^{(i)}, A, b) - y^{(i)})^2 = \sum_{i=1}^m (A x^{(i)} + b - y^{(i)})^2 \]

To minimize the loss function, we use the same process as before, gradient descent. However, previously the gradient descent was altering 2 variables (\(a\) and \(b\)) so as to minimize the loss function, and so we could plot the loss function and gradient descent progress in terms of \(a\) and \(b\). However, now the optimization needs to alter many more variables, since \(A\) actually contains \(n\) variables, the gradient descent must be performed in \(n+1\) dimensional space, and we don't have an easy way to visualize this.

With the more general linear algebra formulation of linear regression under our belts, let's move on to actually coding stuff.

Implementation

As before, we need to: import data, define the model, define the loss function, run gradient descent, and finally make predictions. Many steps will be similar to the single variable case, but for completeness I will walk through them briefly.

For building and testing the implementation we will use a synthetic data set consisting of \(n=2\) input variables. You can download the synthetic data set here. By synthetic, I mean that I purposefully created a very nicely behaved data set so that we can practice implementing multi variable linear regression, and verify that we converged to the right answer. In fact, the synthetic data is generated as \(y = 2x_1 + 1.3x_2 + 4 + \varepsilon \) where \(\varepsilon\) is random noise. If we implement multi variable linear regression correctly, then we should obtain approximately \(A = \begin{bmatrix} 2, 1.3 \end{bmatrix}, b = 4\). This plot illustrates what the data looks like in 3 dimensions, essentially a plane in 3 dimensions with some random fluctuations:

scatter

Importing the data

As explained above, the input data set can be organized as an \(n \times m\) matrix. Since we will load the entire data set (input and output) from a single CSV file, and we have 2 input variables, the CSV file will contain 3 columns: the first 2 are the input variables, and the last one is the output variable. So, first we load the CSV file into an \(m\) x 3 matrix, and then separate the first 2 columns from the last:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# First we load the entire CSV file into an m x 3
D = np.matrix(pd.read_csv("linreg-multi-synthetic-2.csv", header=None).values)

# We extract all rows and the first 2 columns into X_data
# Then we flip it
X_data = D[:, 0:2].transpose()

# We extract all rows and the last column into y_data
# Then we flip it
y_data = D[:, 2].transpose()

# And make a convenient variable to remember the number of input columns
n = 2

The syntax D[:, 0:2] might be new, particularly if you haven't worked with NumPy before. In the single variable implementation we used Panda's functionality to access the columns by column name. This is a great approach, but sometimes you might need to be more flexible in how you access columns of data.

Note: The basic syntax for subscripting a matrix is: D[3, 6] (for example), which refers to the row at index 3 and the column at index 6 in the matrix D. Note that in numpy the row and column indices start at 0! This means that D[0, 0] refers to the top-left entry of matrix D. If you are coming from a pure math background, or have used MATLAB before, it is a common error to assume the indices start at 1.

Now for slicing, the : character is used to indicate a range. If it is used by itself, it indicates the entire range of rows / columns. For example, D[:, 42] refers to all rows of D, and the column at index 42. If it is used with indices, then i:j indicates the range of rows / columns at indices i, i+1, ..., j-1, but not including j.

So, D[:, 0:2] means to read the values in D at all rows and at columns with index 0 and 1 (the entire first 2 columns, i.e. the input data columns). Likewise, D[:, 2] means to read the values in D at all rows and at the column of index 2 (the entire last column, i.e. the output data column).

This matrix subscripting and slicing is almost what we want, but not quite. The problem is that D[:, 0:2], which contains our \(D_x\) data, is a matrix of shape \(m \times n\), but earlier we decided that we wanted \(D_x\) to be an \(n \times m\) matrix, so we need to flip it. To do so, we use the transpose of the matrix. Mathematically we write the transpose of a matrix \(A\) as \(A^T\), and in Python we can compute it using A.transpose(). Essentially, the transpose of a matrix simply flips it along the diagonal, as shown in this animation:

Matrix transpose.gif
By LucasVB - Link

So, D[:, 0:2].transpose() is a matrix of shape \(n \times m\), and is our correct data input matrix \(D_x\). We save this matrix to the variable X_data. Likewise, we also transpose D[:, 2] to correctly compute \(D_y\), and save it in y_data.

At this point we have our \(m \times n\) input data matrix X_data and our \(m \times 1\) output vector y_data loaded. In addition, we conveniently have the number of columns stored in n, so now we can start defining our model.

Defining the model

As shown above, we want our model parameters to consist of a matrix \(A\) of size \(1 \times n\) and a single number \(b\). Then, we define: \[ y'(x, A, b) = Ax + b \]

First, we can define the input and correct output placeholders:

# Define data placeholders
x = tf.placeholder(tf.float32, shape=(n, None))
y = tf.placeholder(tf.float32, shape=(1, None))

And then we can define the trainable variables, the output prediction, and the loss function:

# Define trainable variables
A = tf.get_variable("A", shape=(1, n))
b = tf.get_variable("b", shape=())

# Define model output
y_predicted = tf.matmul(A, x) + b

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)

Training the model

At this point, we have a 1 dimensional output y_predicted which we compare against y using L to train the model, which is exactly the same situation as single variable linear regression. The remaining code to train the model is extremely similar, so I'll simply display it here, and then explain the few differences:

# Define optimizer object
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(2000):
    _, current_loss, current_A, current_b = session.run([optimizer, L, A, b], feed_dict={
        x: X_data,
        y: y_data
    })
    print("t = %g, loss = %g, A = %s, b = %g" % (t, current_loss, str(current_A), current_b))

First, we have a different learning rate than the learning rate used in single variable regression. Even though the training algorithm is the same, since this is a different problem than single variable regression, we need find a good learning rate specific to this problem. A great way to do this for your own problems is using TensorBoard, as explained in the chapter Optimization Convergence.

Besides this, the only other conceptual difference is that at each step of the optimizer we are modifying the entire vector A (in addition to b), rather than just a single number. However, TensorFlow abstracts this away for us, and conceptually we just need to know that we are training the variable A.

The final print statements should output something close to:

t = 1994, loss = 1.44798e+06, A = [[ 2.00547647  1.3020972 ]], b = 3.95038
t = 1995, loss = 1.44798e+06, A = [[ 2.00547647  1.3020972 ]], b = 3.95038
t = 1996, loss = 1.44798e+06, A = [[ 2.00547647  1.3020972 ]], b = 3.95038
t = 1997, loss = 1.44798e+06, A = [[ 2.00547647  1.3020972 ]], b = 3.95038
t = 1998, loss = 1.44798e+06, A = [[ 2.00547647  1.3020972 ]], b = 3.95038
t = 1999, loss = 1.44798e+06, A = [[ 2.00547647  1.3020972 ]], b = 3.95038

At this point we have converged to our approximate solution of \(A \approx \begin{bmatrix} 2.005, 1.302 \end{bmatrix}, b \approx 3.95\). Note that this is not exactly the same as the expected answer of \(A = \begin{bmatrix} 2, 1.3 \end{bmatrix}, b \approx 4\), primarily because some random noise was added to each point in the data set.

The model is fully trained, so now given a new input \(x\) we could now predict the output \(y' = Ax + b\), using all the learned information from all input variables.

Concluding Remarks

Linear regression with multiple variables is only slightly different in essence from single variable linear regression. The main difference is abstracting the linear operation \(ax\) where \(a\) and \(x\) are single numbers to the linear operation \(Ax\), where now \(A\) is a matrix, \(x\) is a vector. In addition, at the implementation level we also have to deal with loading data in a more sophisticated manner, but otherwise the code is mostly the same. In later chapters we will use this abstraction we have built to define even more powerful models.

Challenge Problems

So far this chapter has used a synthetic data set, linreg-multi-synthetic-2.csv, for easy demonstration. The exercises are primarily concerned with getting practice at applying this model to real-world data. Note that in real-world data not all columns are useful, and some might not have a linear relationship with the MPG. Including these unhelpful columns in your model might decrease the accuracy of your model. You should try plotting various columns vs. the output column to determine which seem most helpful in predicting the output, and then only include these useful columns as your input.

In addition, many data sets will have so called messy data, which require you to do some manipulation in Python to make sure the data is imported cleanly and properly. For example, some rows might containg missing data: for these your code can not crash or incorrectly import the data. Instead, you need to adopt a strategy to still import the data as best as you can: for example, you can simply ignore any rows that have incomplete data.

Note that we have not discussed how to rigorously evaluate how good a model is yet. For now you can use the value of the loss function, along with some intuition and creating plots. Evaluation will be discussed more in chapter 2.7.

  1. Download this red wine quality data set, and try to predict the quality of the wine (last column) from the physicochemical input data (other columns).
  2. Download this car MPG data set, and try to predict the MPG (first column) based on some of the other columns.
  3. Download this California 1990 Housing Value data set, and try to predict the house values based on various factors.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# First we load the entire CSV file into an m x 3
D = np.matrix(pd.read_csv("linreg-multi-synthetic-2.csv", header=None).values)

# We extract all rows and the first 2 columns into X_data
# Then we flip it
X_data = D[:, 0:2].transpose()

# We extract all rows and the last column into y_data
# Then we flip it
y_data = D[:, 2].transpose()

# And make a convenient variable to remember the number of input columns
n = 2

# Define data placeholders
x = tf.placeholder(tf.float32, shape=(n, None))
y = tf.placeholder(tf.float32, shape=(1, None))

# Define trainable variables
A = tf.get_variable("A", shape=(1, n))
b = tf.get_variable("b", shape=())

# Define model output
y_predicted = tf.matmul(A, x) + b

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)

# Define optimizer object
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(2000):
    _, current_loss, current_A, current_b = session.run([optimizer, L, A, b], feed_dict={
        x: X_data,
        y: y_data
    })
    print("t = %g, loss = %g, A = %s, b = %g" % (t, current_loss, str(current_A), current_b))

Feature Scaling

In chapters 2.1, 2.2, 2.3 we used the gradient descent algorithm (or variants of) to minimize a loss function, and thus achieve a line of best fit. However, it turns out that the optimization in chapter 2.3 was much, much slower than it needed to be. While this isn’t a big problem for these fairly simple linear regression models that we can train in seconds anyways, this inefficiency becomes a much more drastic problem when dealing with large data sets and models.

Example of the Problem

First, let’s look at a concrete example of the problem, by again considering a synthetic data set. Like in chapter 2.3 I generated a simple synthetic data set consisting of 2 independent variables \(x_1\) and \(x_2\), and one dependent variable \(y = 2x_1 + 0.013x_2\). However, note that the range for \(x_1\) is 0 to 10, but the range for \(x_2\) is 0 to 1000. Let's call this data set \(D\). A few sample data points look like this:

\(x_1\) \(x_2\) \(y\)
7.36 1000 34.25
9.47 0 19.24
0.52 315.78 3.50
1.57 315.78 11.02
6.31 263.15 13.93
1.57 526.31 10.21
0.52 105.26 3.41
4.21 842.10 16.27
2.63 894.73 19.04
3.68 210.52 4.60
... ... ...

For simplification I have excluded a constant term from this synthetic data set, and when training models I "cheated" by not training the constant term \(b\) and just setting it to 0, so as to simplify visualization. The model that I used was simply:

y_predicted = tf.matmul(A, x) # A is a 1x2 matrix

Since this is just a simple application of the ideas from chapter 2.3, one would expect this to work fairly easily. However, when using a GradientDescentOptimizer, training goes rather poorly. Here is a plot showing the training progress of A[0] and the loss function side-by-side:

Training progress without scaling

This optimization was performed with a learning rate of 0.0000025, which is about the largest before it would diverge. The loss function quickly decreases at first, but then quickly stalls, and decreases quite slowly. Meanwhile, \(A_0\) is increasing towards the expected value of approximately 2.0, but very slowly. Within 5000 iterations this model fails to finish training.

Now let's do something rather strange: take the data set \(D\), and create a new data set \(D'\) by dividing all the \(x_2\) values by 100. The resulting data set \(D'\) looks roughly like this:

\(x_1'\) \(x_2'\) \(y'\)
7.36 10 34.25
9.47 0 19.24
0.52 3.1578 3.50
1.57 3.1578 11.02
6.31 2.6315 13.93
1.57 5.2631 10.21
0.52 1.0526 3.41
4.21 8.4210 16.27
2.63 8.9473 19.04
3.68 2.1052 4.60
... ... ...

A crucial note is that while \(D'\) is technically different from \(D\), it contains exactly the same information: one can convert between them freely, by dividing or multiplying the second column by 100. In fact, since this transformation is linear, and we are using a linear model, we can train our model on \(D'\) instead. We would just expect to obtain a value of 1.3 for A[1] rather than 0.013. So let's give it a try!

The first interesting observation is that we can use a much larger learning rate. The largest learning rate we could use with \(D\) was 0.0000025, but with \(D'\) we can use a learning rate of 0.01. And when we plot A[0] and the loss function for both \(D\) and \(D'\) we see something pretty crazy:

Training progress with scaling

While training on \(D\) wasn't even close to done after 5000 iterations, training on \(D'\) seems to have completed nearly instantly. If we zoom in on the first 60 iterations, we can see the training more clearly:

Training progress with scaling zoom

So this incredibly simple data set transformation has changed a problem that was untrainable within 5000 iterations to one that can be trained practically instantly with 50 iterations. What is this black magic, and how does it work?

Visualizing Gradient Descent with Level Sets

One of the best ways to gain insight in machine learning is by visualization. As seen in chapter 2.1 we can visualize loss functions using plots. Since we have 2 parameters A[0] and A[1] of the loss function, it would be a 3D plot. We used this for visualization in chapter 2.1, but frankly it's a bit messy looking. Instead, we will use level sets (also called contour plots), which use lines to indicate where the loss function has a constant value. An example is easiest, so here is the contour plot for the loss function for \(D'\), the one that converges quickly:

Contour plot of D'

Each ellipse border is where the loss function has a particular constant value (the innermost ones are labeled), and the red X marks the spot of the optimal values of A[0] and A[1]. By convention the particular constant values of the loss function are evenly spaced, which means that contour lines that are closer together indicate a "steeper" slope. In this plot, the center near the X is quite shallow, while far away is pretty steep. In addition, one diagonal axis of the ellipses is steeper than the other diagonal axis.

We can also plot how A[0] and A[1] evolve during training on the contour plot, to get a feel for how gradient descent is working:

Contour plot of D' with training

Here, we can see that the initial values for A[0] and A[1] start pretty far off, but gradient descent quickly zeros in towards the X. As a side note, the line connecting each pair of dots is perpendicular to the line of the level set: see if you can figure out why.

So if that is what the contour plot looks like for the "good" case of \(D'\), how does it look for \(D\)? Well, it looks rather strange:

Contour plot of D

Don't be fooled: like the previous plot the level sets also form ellipses, but they are so stretched that they are nearly straight lines at this scale. This means that vertically (A[0] is constant and we vary A[1]) there is substantial gradient, but horizontally (A[1] is constant and we vary A[0]) there is practically no slope. Put another way, gradient descent only knows to vary A[1], and (almost) doesn't vary A[0]. We can test this hypothesis by plotting how gradient descent updates A[0] and A[1], like above. Since gradient descent makes such little progress though, we have to zoom in a lot to see what is going on:

Contour plot of D

We can clearly see that gradient descent applies large updates to A[1] (a bit too large, a smaller learning rate would have been a bit better) due to the large gradient in the A[1] direction. But due to the (comparatively) tiny gradient in the A[0] direction very small updates are done to A[0]. Gradient descent quickly converges on the optimal value of A[1], but is very very far away from finding the optimal value of A[0].

Let's take a quick look at what is going on mathematically to see why this happens. The model we are using is: \[ y'(x, A) = Ax = a_1 x_1 + a_2 x_2 \] Here, \(a_1\) is in the role of A[0] and \(a_2\) is A[1]. We substitute this into the loss function to get: \[ L(a_1, a_2) = \sum_{i=1}^m (a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})^2 \] Now if we differentiate \(L\) in the direction of \(a_1\) and \(a_2\) separately, we get: \[ \frac{\partial L}{\partial a_1} = \sum_{i=1}^m 2(a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})x_1^{(i)} \\ \frac{\partial L}{\partial a_2} = \sum_{i=1}^m 2(a_1 x_1^{(i)} + a_2 x_2^{(i)} - y^{(i)})x_2^{(i)} \] Here we can see the problem: the inner terms of the derivatives are the same, except one is multiplied by \(x_1^{(i)}\) and the other by \(x_2^{(i)}\). If \(x_2\) is on average 100 times bigger than \(x_1\) (which it is in the original data set), then we would expect \(\frac{\partial L}{\partial a_2}\) to be roughly 100 times bigger than \(\frac{\partial L}{\partial a_1}\). It isn't exactly 100 times larger, but with any reasonable data set it should be close. Since the derivatives in the directions of \(a_1\) and \(a_2\) are scaled completely differently, gradient descent fails to update both of them adequately.

The solution is simple: we need to rescale the input features before training. This is exactly what happened when we mysteriously divided by 100: we rescaled \(x_2\) to be comparable to \(x_1\). But we should work out a more methodical way of rescaling, rather than randomly dividing by 100.

Rescaling Features

This wasn't explored above, but there are really two ways that we potentially need to rescale features. Consider an example where \(x_1\) ranges between -1 and 1, but \(x_2\) ranges between 99 and 101: both of these features have (at least approximately) the same standard deviation, but \(x_2\) has a much larger mean. On the other hand, consider an example where \(x_1\) still ranges between -1 and 1, but \(x_2\) ranges between -100 and 100. This time, they have the same mean, but \(x_2\) has a much larger standard deviation. Both of these situations can make gradient descent and related algorithms slower and less reliable. So, our goal is to ensure that all features have the same mean and standard deviation.

Note: There are other methods to measure how different features are, and then subsequently rescale them. For a look at more of them, feel free to consult the Wikipedia article. However, after implementing the technique presented here, any other technique is fairly similar and easy to implement if needed.

Without digressing too far into statistics, let's quickly review how to calculate the mean \(\mu\) and standard deviation \(\sigma\). Suppose we have already read in our data set in Python into a Numpy vector / matrix, and have all the values for the feature \(x_j\), for each \(j\). Mathematically, the mean for feature \(x_j\) is just the average: \[ \mu_j = \frac{\sum_{i=1}^m x_j^{(i)}}{m} \] In Numpy there is a convenient np.mean() function we will use.

The standard deviation \(\sigma\) is a bit more tricky. We measure how far each point \(x_j^{(i)}\) is from the mean \(\mu_j\), square this, then take the mean of all of this, and finally square root it: \[ \sigma_j = \sqrt{\frac{\sum_{i=1}^m (x_j^{(i)} - \mu_j)^2 }{m}} \] Again, Numpy provides a convenient np.std() function.

Note: Statistics nerds might point out that in the above equation we should divide by \(m - 1\) instead of \(m\) to obtain an unbiased estimate of the standard deviation. This might well be true, but in this usage it does not matter since we are only using this to rescale features relative to each other, and not make a rigorous statistical inference. To be more precise, doing so would involve multiplying each scaled feature by only a constant factor, and will not change any of their standard deviations or means relative to each other.

Once we have every mean \(\mu_j\) and standard deviation \(\sigma_j\), rescaling is easy: we simply rescale every feature like so:

\[ \begin{equation} \label{eq:scaling} x_j' = \frac{x_j - \mu_j}{\sigma_j} \end{equation} \]

This will force every feature to have a mean of 0 and a standard deviation of 1, and thus be scaled well relative to each other.

Note: Be careful to make sure to perform the rescaling at both training time and prediction time. That is, we first have to perform the rescaling on the whole training data set, and then train the model so as to achieve good training performance. Once the model is trained and we have a new, never before seen input \(x\), we also need to rescale its features to \(x'\) because the trained model only understands inputs that have already been rescaled (because we trained it that way).

Implementation and Experiments

Feature scaling can be applied to just about any multi-variable model. Since the only multi-variable model we have seen so far is multi-variable linear regression, we'll look at implementing feature scaling in that context, but the ideas and the code are pretty much the same for other models. First let's just copy-paste the original multi-variable linear regression code, and modify it to load the new synthetic data set:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# First we load the entire CSV file into an m x n matrix
D = np.matrix(pd.read_csv("linreg-scaling-synthetic.csv", header=None).values)

# Make a convenient variable to remember the number of input columns
n = 2

# We extract all rows and the first n columns into X_data
# Then we flip it
X_data = D[:, 0:n].transpose()

# We extract all rows and the last column into y_data
# Then we flip it
y_data = D[:, n].transpose()



# Define data placeholders
x = tf.placeholder(tf.float32, shape=(n, None))
y = tf.placeholder(tf.float32, shape=(1, None))

# Define trainable variables
A = tf.get_variable("A", shape=(1, n))
b = tf.get_variable("b", shape=())

# Define model output
y_predicted = tf.matmul(A, x) + b

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)

# Define optimizer object
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(2000):
    _, current_loss, current_A, current_b = session.run([optimizer, L, A, b], feed_dict={
        x: X_data,
        y: y_data
    })
    print("t = %g, loss = %g, A = %s, b = %g" % (t, current_loss, str(current_A), current_b))

If you run this code right now, you might see output like the following:

...
t = 1995, loss = 0.505296, A = [[1.9920335  0.01292674]], b = 0.089939
t = 1996, loss = 0.5042, A = [[1.9920422  0.01292683]], b = 0.0898404
t = 1997, loss = 0.5031, A = [[1.9920509  0.01292691]], b = 0.0897419
t = 1998, loss = 0.501984, A = [[1.9920596  0.01292699]], b = 0.0896435
t = 1999, loss = 0.50089, A = [[1.9920683  0.01292707]], b = 0.0895451

Note that if you run this code multiple times you will get different results each time due to the random initialization. The synthetic data was generated with the equation \(y = 2x_1 + 0.013x_2 + 0\). So after 2000 iterations of training we are getting close-ish to the correct values, but it's not fully trained. So let's implement feature scaling to fix this.

The first step is to compute the mean and standard deviation for each feature in the training data set. Add the following after X_data is loaded:

means = X_data.mean(axis=1)
deviations = X_data.std(axis=1)

Using axis=1 means that we compute the mean for each feature, averaged over all data points. That is, X_data is a \(2 \times 400\) matrix, means is a \(2 \times 1\) matrix (vector). If we had used axis=0, means would be a \(1 \times 400\) matrix, which is not what we want.

Now that we know the means and standard deviations of each feature, we have to use them to transform the inputs via Equation \((\ref{eq:scaling})\). We have two options: we could implement the transformation directly using numpy to transform X_data before training, or we could include the transformation within the TensorFlow computations. But since we need to do the transformation again when we want to predict output given new inputs, including it in the TensorFlow computations will be a bit more convenient.

Just like before we setup x as a placeholder, so this code is identical:

# Define data placeholders
x = tf.placeholder(tf.float32, shape=(n, None))

Now we want to transform x according to Equation \((\ref{eq:scaling})\). We can define a new TensorFlow value x_scaled:

# Apply the rescaling
x_scaled = (x - means) / deviations

One important concept is that when this line of code runs, means and deviations have already been computed: they are actual matrices. So to TensorFlow they are constants: they are neither trainable variables that will be updated during optimization, nor are they placeholders that need to be fed values later. They have already been computed, and TensorFlow just uses the already computed values directly.

Also, note that since x and means are compatibly sized matrices, the subtraction (and division) will be done separately for each feature, automatically. So while Equation \((\ref{eq:scaling})\) technically says how to scale one feature individually, this single line of code scales all the features.

Now, everywhere that we use x in the model we should now use x_scaled instead. For us the only code that needs to change is in defining the model's prediction:

# Define model output, using the scaled features
y_predicted = tf.matmul(A, x_scaled) + b

Note that when we run the session we do not feed values to x_scaled, because we want to feed unscaled values to x which will then automatically get scaled when x_scaled is computed based on what is fed to x. We instead continue to feed values to x as before.

That's all that we need to implement for feature scaling, it's really pretty easy. If we run this code now we should see:

...
t = 1995, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1996, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1997, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1998, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5
t = 1999, loss = 1.17733e-07, A = [[6.0697703 3.9453504]], b = 16.5

The fact that none of the trained weights are updating anymore and that the loss function is very small is a good sign that we have achieved convergence. But the weights are not the values we expect... shouldn't we get A = [[2, 0.013]], b = 0, since that is the equation that generated the synthetic data? Actually, the weights we got are correct because the model is now being trained on the rescaled data set, so we get different weights at the end. See Challenge Problem 3 below to explore this more.

Concluding Remarks

We've seen how to implement feature scaling for a simple multi-variable linear regression model. While this is a fairly simple model, the principles are basically the same for applying feature scaling to more complex models (which we will see very soon). In fact, the code is pretty much identical: just compute the means and standard deviations, apply the formula to x to compute x_scaled, and anywhere you use x in the model, just use x_scaled instead.

Whether or not feature scaling helps is dependent on the problem and the model. If you have having trouble getting your model to converge, you can always implement feature scaling and see how it affects the training. While it wasn't used in this chapter, using TensorBoard (see chapter 2.2) is a great way to run experiments to compare training with and without feature scaling.

Challenge Problems

  1. Coding: Take one of the challenge problems from the previous chapter, and implement it with and without feature scaling, and then compare how training performs.
  2. Coding: Add TensorBoard support to the code from this chapter, and use it to explore for yourself the effect feature scaling has. Also, you can compare with different optimization algorithms.
  3. Theory: One problem with feature scaling is that we learn different model parameters. In this chapter, the original data set was generated with \(y = 2x_1 + 0.013x_2 + 0\), but the parameters that the model learned were \(a_1' = 6.0697703, a_2' = 3.9453504, b' = 16.5\). Note that I have called these learned parameters \(a_1'\) rather than \(a_1\) to indicate that these learned parameters are for the rescaled data.

    One important use of linear regression is to explore and understand a data set, not just predict outputs. After doing a regression, one can understand through the learned weights how severely each feature affects the output. For example, if we have two features \(A\) and \(B\) which are used to predict rates of cancer, and the learned weight for \(A\) is much larger than the learned weight for \(B\), this indicates that occurrence of \(A\) is more correlated with cancer than occurrence of \(B\) is, which is interesting in its own right. Unfortunately, performing feature scaling destroys this: since we have rescaled the training data, the weight for \(A\) (\(a_1'\)) has lost its relevance to the values of \(A\) in the real world. But all is not lost, since we can actually recover the "not-rescaled" parameters from the learned parameters, which we will derive now.

    First, suppose that we have learned the "rescaled" parameters \(a_1', a_2', b'\). That is, we predict the output \(y\) to be: \[\begin{equation}\label{eq:start}y = a_1' x_1' + a_2' x_2' + b'\end{equation}\]where \(x_1', x_2'\) are the rescaled features. Now, we want a model that uses some (yet unknown) weights \(a_1, a_2, b\) to predict the same output \(y\), but using the not-rescaled features \(x_1, x_2\). That is, we want to find \(a_1, a_2, b\) such that this is true:\[\begin{equation}\label{eq:goal}y = a_1 x_1 + a_2 x_2 + b\end{equation}\]To go about doing this, substitute for \(x_1'\) and \(x_2'\) using Equation \((\ref{eq:scaling})\) into Equation \((\ref{eq:start})\), and then compare it to Equation \((\ref{eq:goal})\) to obtain 3 equations: one to relate each scaled and not-scaled parameter. Finally, you can use these equations to be able to compute the "not-scaled" parameters in terms of the learned parameters. You can check your work by computing the not-scaled parameters in terms of the learned parameters for the synthetic data set, and verify that they match the expected values.

Complete Code

The complete example code is available on GitHub, as well as directly here:

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# First we load the entire CSV file into an m x n matrix
D = np.matrix(pd.read_csv("linreg-scaling-synthetic.csv", header=None).values)

# Make a convenient variable to remember the number of input columns
n = 2

# We extract all rows and the first n columns into X_data
# Then we flip it
X_data = D[:, 0:n].transpose()

# We extract all rows and the last column into y_data
# Then we flip it
y_data = D[:, n].transpose()

# We compute the mean and standard deviation of each feature
means = X_data.mean(axis=1)
deviations = X_data.std(axis=1)

print(means)
print(deviations)

# Define data placeholders
x = tf.placeholder(tf.float32, shape=(n, None))
y = tf.placeholder(tf.float32, shape=(1, None))

# Apply the rescaling
x_scaled = (x - means) / deviations

# Define trainable variables
A = tf.get_variable("A", shape=(1, n))
b = tf.get_variable("b", shape=())

# Define model output, using the scaled features
y_predicted = tf.matmul(A, x_scaled) + b

# Define the loss function
L = tf.reduce_sum((y_predicted - y)**2)

# Define optimizer object
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(L)

# Create a session and initialize variables
session = tf.Session()
session.run(tf.global_variables_initializer())

# Main optimization loop
for t in range(2000):
    _, current_loss, current_A, current_b = session.run([optimizer, L, A, b], feed_dict={
        x: X_data,
        y: y_data
    })
    print("t = %g, loss = %g, A = %s, b = %g" % (t, current_loss, str(current_A), current_b))

Nonlinear Regression: content coming soon!

Regularization: content coming soon!

Evaluation: content coming soon!

Why we square errors: content coming soon!