
# Multi-layer perceptrons (feed-forward nets), gradient descent, and back propagation

Let's start with a quick summary of the perceptron.

There are a number of variations we could have made in our procedure. I arbitrarily set the initial weights and biases to zero. In fact, this isn't a very good idea, because it gives too much symmetry to the initial state. It turns out that if you do this with the AND function, you can get into a situation where you cycle through the same sets of weights without converging. However, if you start with small random weights and biases, this breaks the symmetry and it works fine. I didn't do this in our example calculation because I wanted to keep the arithmetic simple.

Another parameter is the learning rate. If it is too small, it can take a long time to converge. If it is too big, you might continually jump over the optimum weight values and fail to converge. Another variation is to sum the changes in weights and biases over a set of four patterns, without applying them to modify the weights until all four patterns have been presented. This is called "training by epoch" as contrasted with "training by pattern". It can often give a little more stability to the process and prevent wild fluctuations in weight values. Obviously, trying out all these variations by hand can be tedious, or nearly impossible for a large network of perceptrons. This is where computer simulations come in. I think that any one of you could write a very simple computer program to explore the perceptron learning algorithm for problems involving a single perceptron with two inputs and a bias. In the next lecture, I'll give you a demo of a simulator program for more complicated networks.
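Such a program might look like the minimal Python sketch below, which trains a single perceptron with two inputs and a bias on the OR truth table. The learning rate and the small fixed initial weights (chosen by hand to break the symmetry) are illustrative choices, not the values from our hand calculation.

```python
# A minimal single-perceptron trainer for the OR truth table,
# training by pattern with a step output function.

def step(u):
    """Threshold output: 1 if the net input is positive, else 0."""
    return 1 if u > 0 else 0

def train_perceptron(patterns, eta=0.25, max_epochs=100):
    # Small fixed non-zero initial values to break the symmetry
    w1, w2, bias = 0.1, -0.1, 0.05
    for epoch in range(max_epochs):
        converged = True
        for v1, v2, target in patterns:
            out = step(w1 * v1 + w2 * v2 + bias)
            error = target - out
            if error != 0:
                converged = False
                # Perceptron learning rule, applied after each pattern
                w1 += eta * error * v1
                w2 += eta * error * v2
                bias += eta * error  # bias = weight from a unit with activation 1
        if converged:
            return w1, w2, bias, epoch
    return w1, w2, bias, max_epochs

OR_PATTERNS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
w1, w2, bias, epochs = train_perceptron(OR_PATTERNS)
print(f"learned weights {w1:.2f}, {w2:.2f}, bias {bias:.2f} after {epochs} epochs")
```

Since OR is learnable by a single perceptron, the loop converges after a handful of epochs; switching the pattern set to XOR would make it run until `max_epochs` without converging.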

## The XOR saga

The perceptron learning rule was a great advance. Our simple example of learning how to generate the truth table for the logical OR may not sound impressive, but we can imagine a perceptron with many inputs solving a much more complex problem. Rosenblatt was able to prove that the perceptron was able to learn any mapping that it could represent. Unfortunately, he made some exaggerated claims for the representational capabilities of the perceptron model. (Note the distinction between being able to represent a mapping and being able to learn this representation.) This attracted the attention of Marvin Minsky and his colleague Seymour Papert, who published a devastating critique of the perceptron model in their book "Perceptrons" (1969). In this book, they said:

```
"Perceptrons have been widely publicized as 'pattern
recognition' or 'learning machines' and as such have been
discussed in a large number of books, journal articles, and
voluminous 'reports'.  Most of this writing ... is without
scientific value .."

and

"Appalled at the persistent influence of perceptrons (and
similar ways of thinking) on practical pattern recognition, we
determined to set out our work as a book."
```

In the book, they pointed out that there is a major class of problems that can't be represented by the perceptron. A very simple example is the exclusive OR (XOR). They gave a very simple and compelling proof of the impossibility of finding a set of weights that would let a single-layer perceptron give the correct outputs for the XOR truth table. (If we had chosen this for our example, we would have been at it for a long time. The weights would change back and forth, but it would never converge to a final result.)

Although they were aware of the fact that other neural network architectures could produce an XOR (like the McCulloch-Pitts network), they felt (incorrectly, it turned out) that there was no way to extend the perceptron learning rule to deal with these sorts of networks. This put a damper on enthusiasm for neural network research, bringing it to a virtual halt for much of the 1970's. Initially, this had the effect of draining off funding for research in neural nets and diverting it toward symbolic AI. But then, the lack of success of AI became apparent and there were "Dark Ages" for neural nets, paralleled by the "Winter of AI". Both of these were brought on by the disillusionment with over-optimistic claims.

## The Perceptron and the XOR

Why do we care about the XOR? It is a hard representation for a neural net to learn, yet it is simple enough for us to understand in detail, because of the small number of variables. In order for us to understand why there is no set of weights that will allow a perceptron to generate the correct outputs for the XOR truth table, let's generalize things a bit: let the output be a continuous value between 0 and 1, and use a steep sigmoid instead of a step function for the output function, V3 = f(u3). Then V3 is approximately 1 if u3 > 0, and approximately 0 if u3 < 0. Summing the inputs, we have u3 = W31V1 + W32V2 + θ3, where θ3 is the bias.

The possible inputs V1 and V2 form a 2-D space. We want to draw a line separating the region where V3 is approximately equal to 1 from the region where it is approximately zero. For the OR, which we just treated, what is the equation of this line? We can find it by setting u3 = 0 in the equation above, and get:

V2 = -(W31 / W32)V1 - θ3 / W32

There are many values of the three variables W31, W32, and θ3 that will give such a line. What about the XOR? We want the output to be 1 (TRUE) if only one input is 1, and the output to be 0 if neither or both inputs are 1. This means we need two lines to partition the space of possible inputs.

We don't have enough weight and bias parameters to define two lines. Minsky and Papert identified a number of significant problems which fall into this category of not being "linearly separable". With two inputs, a linearly separable problem is one in which you can separate the two different outputs with a straight line. With three inputs (3-D), you need a plane. In higher dimensions, it would be a hyperplane.
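One way to see that the XOR is not linearly separable is a brute-force search (an illustrative sketch, not a proof): scan a grid of weight and bias values and check whether any combination classifies every pattern correctly. The grid range and spacing here are arbitrary choices. The search finds separating lines for OR but none for XOR.

```python
# Brute-force check for a separating line u3 = w1*V1 + w2*V2 + theta = 0.
import itertools

def separable(truth_table, grid):
    """True if some (w1, w2, theta) on the grid classifies every pattern."""
    for w1, w2, theta in itertools.product(grid, repeat=3):
        if all((w1 * v1 + w2 * v2 + theta > 0) == bool(t)
               for v1, v2, t in truth_table):
            return True
    return False

grid = [x / 4 for x in range(-8, 9)]  # -2 to 2 in steps of 0.25
OR_TABLE  = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
XOR_TABLE = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print("OR separable: ", separable(OR_TABLE, grid))
print("XOR separable:", separable(XOR_TABLE, grid))
```

The failure for XOR is not a shortcoming of the grid: the four inequalities it would have to satisfy are contradictory for any real weights, which is essentially Minsky and Papert's argument.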

The way around this problem is fairly obvious: add more neurons to form a "hidden layer" which bridges the input and output units. This will give the extra parameters needed to divide up the space of possible inputs. This is what we did with the McCulloch-Pitts network that I showed for the XOR. Minsky and Papert knew this, but couldn't think of a learning rule to deal with the hidden units, and suspected that one didn't exist. Here's another condensed quote from "Perceptrons":

```
The perceptron ... has many features that attract
attention:  its linearity, its intriguing learning theorem
...

There is no reason to suppose that any of these virtues
carry over to the many-layered version.  Nevertheless, we
consider it an important research problem to elucidate (or
reject) our intuitive judgment that the extension is sterile.
```

A number of people have since independently discovered the learning rule for a multi-layered perceptron network. (The reason for this duplication and lack of communication between researchers is that the study of neural networks is an interdisciplinary field. Until recently, there were no neural networks journals; the results were published in a wide variety of math, physics, biology, psychology and engineering journals.) Paul Werbos (1974 Harvard Ph.D. thesis) is possibly the first one to discover what is now known as the generalized delta rule, or backpropagation algorithm.

Before I describe the training algorithm, I'll show you two different networks that are capable of representing the solution to the XOR problem.

The network described in PDP Chapter 8 looks like this: we have a single hidden unit, and the input units have connections to both the hidden unit and the output unit. (The hidden unit is hidden in the sense that it doesn't have a direct output to the outside world.) The biases are shown inside the hidden and output units, and the weights are shown beside the connections. We say that the hidden unit forms an "internal representation" of the problem. Can you tell what it is? What problem is being solved by the hidden unit with these weights and biases? [You should be able to show that it functions as an AND unit.] What does the output unit do? [It works like an OR, but with a strong inhibitory input from the AND.]
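A forward pass through this net can be sketched in a few lines of Python. The weights and biases below are illustrative values of my own choosing (PDP Chapter 8 uses different numbers); what matters is the structure: the hidden unit computes an AND of the inputs, and the output unit computes an OR of the inputs with a strong inhibitory connection from the AND.

```python
# XOR from a single hidden unit: OR of the inputs, inhibited by their AND.
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def xor_net(v1, v2):
    # Hidden unit: steep sigmoid that fires only when both inputs are on (AND)
    hidden = sigmoid(10 * v1 + 10 * v2 - 15)
    # Output unit: an OR of the inputs, strongly inhibited by the AND
    return sigmoid(10 * v1 + 10 * v2 - 20 * hidden - 5)

for v1, v2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(v1, v2, "->", round(xor_net(v1, v2), 3))
```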

The net described in Chapter 5 of "Explorations in Parallel Distributed Processing" has this architecture:

```
        OUT
       /   \
      /     \
     H1     H2
     | \   / |
     |  \ /  |
     |   X   |
     |  / \  |
    IN1   IN2
```

It is a two-layer feed-forward net with two hidden units. (We don't count the input layer, because it doesn't really do anything.) This gives it a total of 6 weights and 3 biases to use to separate the region that gives a 0 output from the region that gives an output of 1.

For both of these nets, there are infinitely many solutions for the weights and biases that will solve the problem. There are also lots of other possible architectures. At least two hidden units are needed if the input units connect only to the hidden layer, but we might wonder if there are any advantages to using three or more. Does the net learn the weights faster, or display more robust behavior with noisy inputs which aren't quite 0 or 1?

Finding the optimum network architecture or set of learning parameters for a back propagation calculation is still something of an art. There are many questions that we would like to answer with mathematical proofs. We may have to settle for a body of results from simulations which suggest certain general patterns of behavior. Sometimes these results may suggest the need for a theory that explains them, just as experimental results often provide the direction for theoretical analysis in other branches of science. In some cases, as when it is clear that the internal representation should be a binary number, it is easy to determine the minimum number of hidden units needed. Even then, it isn't clear what the OPTIMUM number of hidden units is for quick learning, tolerance for "noisy" input data, or capability to generalize (the ability to properly treat inputs which were not explicitly present in the training set). Computer simulations give us a way to answer some of these questions by experiment.

## Training a feed-forward net - Backpropagation

(See the handout "Summary of the Generalized Delta Rule")

Define ti = target (desired) output of unit i, and ai = the actual output. (We have been calling this Vi, but I'm now switching to the notation used in "Explorations in PDP".) This output (the "activation") is calculated from the net input ui using a sigmoid function:

ai = f(ui) = 1 / (1 + e^(-ui))    (1)

As with the single layer perceptron, we calculate the net input from the weighted sum of the activations feeding into unit i:

ui = Σj Wij aj    (2)

The total sum of squared errors ("tss") is calculated from the "target activations" ti:

E = Σi (ti - ai)²

The sum is over the output units, because we don't know what the target activations should be for the hidden units.

We want to adjust the weights in a way that will minimize E. This is curve fitting in a high dimensional space. How can we make E --> 0? E is implicitly a function of all the weights and biases. One way to do it is to adjust the weights in the direction of the negative gradient of E, so that we make a change in each weight:

delta-Wij ∝ -dE/dWij

Here is an example in two dimensions for the function f(x,y). It is like a topographical map with lines of constant height. We want to find our way to the minimum in the center.

The gradient of f (grad f(x,y)) is a vector that is perpendicular to the lines of constant f, headed uphill. So, to minimize f(x,y), we want to follow the negative gradient. If we take small steps, we will follow the path a. This technique is then called "gradient descent", or "steepest descent". (It might seem obvious that this is the optimum way to minimize the error. Actually, it isn't. There are better ways with names like "conjugate gradient" and "quasi-newton" methods. But, this is a fairly good technique, and is certainly the simplest.)

We can use the chain rule to show that the resulting weight change is:

delta-Wij ∝ 2 (ti - ai) (dfi/dui) aj

The derivative on the right hand side can be evaluated using a nice feature of the sigmoidal activation function. From Eq. (1) you should be able to verify that:

dfi / dui = fi (1 - fi) = ai (1 - ai)
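You can also convince yourself of this identity numerically, by comparing the analytic expression f(u)(1 - f(u)) with a central-difference estimate of the derivative:

```python
# Numerical check of df/du = f(u) * (1 - f(u)) for the sigmoid.
import math

def f(u):
    return 1.0 / (1.0 + math.exp(-u))

h = 1e-6
for u in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    numeric = (f(u + h) - f(u - h)) / (2 * h)  # central difference
    analytic = f(u) * (1 - f(u))
    print(f"u = {u:+.1f}  numeric = {numeric:.6f}  analytic = {analytic:.6f}")
```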

(There are some intermediate steps left out for you to fill in.)

Putting in a constant of proportionality η that absorbs the factor of two, and adding another term that I will explain in a minute, we have:

delta-Wij(n+1) = η δi aj + α delta-Wij(n)    (3)

where n is the iteration number, and

δi = (ti - ai) ai (1 - ai)    (4)

Except for the last term, Eq. (3) is like the perceptron learning rule, but with a different expression for delta. Here η is the learning rate. If it is too large, we may jump back and forth over the path along the gradient, following path b, and may not reach the minimum. The final term in Eq. (3) is an added variation in the algorithm that prevents radical changes in the weights due to the use of gradient descent. This term gives our trajectory in weight space some "momentum" (the parameter α), in order to preserve some memory of the direction it was going. To do this, we add a fraction of the weight change from the nth iteration to the weight change that we are calculating for the (n + 1)th iteration. This is illustrated in path c.

So far, we have something that is similar to the learning rule for single layer perceptrons, often called the "delta rule", and it works fine for calculating the changes for the weights to the output layer. But, it isn't back propagation yet. We have a problem with the hidden layers, because we don't know the target activations ti for the hidden units. The trick, derived using the chain rule in PDP Chapter 8, is to use a different expression for the delta when unit i is a hidden unit instead of an output unit:

δi = (Σk δk Wki) ai (1 - ai)    (5)

The first factor in parentheses, involving the sum over k, is an approximation to (ti - ai) for the hidden layers when we don't know ti. It makes use of the deltas that have been calculated for the layer above. Note that the sum over k is the sum over the units that receive input from the ith unit.

```
    O       O     unit k
     \     /
      \   /    Wki
       \ /
        O         hidden unit i
```

The procedure for adjusting the weights is:

1. Present inputs for the first pattern to the input layer
2. Sum the weighted inputs to the next layer and calculate their activations [Eqs. (2) and (1)]
3. Present activations to the next layer, repeating (2) until the activations of the output layer are known
4. Compare output activations to the target values for the pattern and calculate deltas for the output layer [Eq. (4)]
5. Propagate error backwards by using the output layer deltas to calculate the deltas for the previous layer [Eq. (5)]
6. Use these deltas to calculate those of the previous layer, repeating until the first layer is reached
7. Calculate the weight changes for all weights and biases (treat biases as weights from a unit having an activation of 1) [Eq. (3)]
8. If training by pattern, update all the weights and biases, else repeat the cycle for all patterns, summing the changes and applying at the end of the epoch
9. Repeat the entire procedure until the total sum of squared errors is less than a specified criterion

Now we see why it is called back propagation. We start with a forward pass, presenting an input pattern, and calculate the activations of each layer from those of the preceding layer, using the current values of the weights. When we get to the output layer, we can compare the output activations to the target values for the given pattern, and calculate the delta values for the output layer. Now we propagate the error backwards by using these delta values to calculate the deltas for the preceding layer. If we are training by epoch, we present another pattern and sum the delta-W's over the set of patterns, updating the weights at the end of each epoch. Then we iterate the whole procedure until the error is reduced to an "acceptable" value. As we did with the single layer perceptron, we modify the bias terms by treating them just like the weights from a unit that always has an activation of 1.
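The whole procedure can be sketched as a short program that trains the two-hidden-unit XOR net by epoch with momentum. The learning rate, momentum, error criterion, weight initialization range, and the use of a few random restarts are all illustrative choices of my own, not values from the text; gradient descent can land in a local minimum, which is why the sketch retries from different random starting weights.

```python
# Backpropagation (steps 1-9) for the two-hidden-unit XOR net,
# training by epoch with momentum.
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def train_xor(seed, eta=0.5, alpha=0.9, criterion=0.01, max_epochs=20000):
    rng = random.Random(seed)
    # Each hidden row is [w_from_IN1, w_from_IN2, bias]; biases are treated
    # as weights from a unit whose activation is always 1.
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    W_o = [rng.uniform(-0.5, 0.5) for _ in range(3)]   # [w_H1, w_H2, bias]
    dW_h_prev = [[0.0] * 3 for _ in range(2)]
    dW_o_prev = [0.0] * 3
    patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def forward(v1, v2):
        h = [sigmoid(w[0] * v1 + w[1] * v2 + w[2]) for w in W_h]
        return h, sigmoid(W_o[0] * h[0] + W_o[1] * h[1] + W_o[2])

    for epoch in range(max_epochs):
        dW_h = [[0.0] * 3 for _ in range(2)]
        dW_o = [0.0] * 3
        tss = 0.0
        for (v1, v2), t in patterns:                  # forward pass
            h, a = forward(v1, v2)
            tss += (t - a) ** 2
            delta_o = (t - a) * a * (1 - a)           # output delta [Eq. (4)]
            for j, inp in enumerate([h[0], h[1], 1.0]):
                dW_o[j] += eta * delta_o * inp
            for i in range(2):                        # hidden deltas [Eq. (5)]
                delta_h = delta_o * W_o[i] * h[i] * (1 - h[i])
                for j, inp in enumerate([v1, v2, 1.0]):
                    dW_h[i][j] += eta * delta_h * inp
        if tss < criterion:
            return forward, tss, epoch
        # End of epoch: apply summed changes plus momentum [Eq. (3)]
        for j in range(3):
            dW_o[j] += alpha * dW_o_prev[j]
            W_o[j] += dW_o[j]
            for i in range(2):
                dW_h[i][j] += alpha * dW_h_prev[i][j]
                W_h[i][j] += dW_h[i][j]
        dW_o_prev, dW_h_prev = dW_o, dW_h
    return forward, tss, max_epochs

# Retry from a few random starts in case one run gets stuck.
for seed in range(5):
    forward, tss, epochs = train_xor(seed)
    if tss < 0.01:
        break

print(f"seed {seed}: tss = {tss:.4f} after {epochs} epochs")
for v1, v2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(v1, v2, "->", round(forward(v1, v2)[1], 2))
```

Changing `criterion`, `eta`, or `alpha` and watching the epoch count is a cheap way to see the stability trade-offs discussed above.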

Next time, we will demonstrate some backpropagation simulations, using the neural net simulation software from "Explorations in PDP".


Dave Beeman, University of Colorado
dbeeman "at" dogstar "dot" colorado "dot" edu
Thu Nov 1 16:06:15 MST 2001