[<--] Return to the list of AI and ANN lectures

Let's have a quick summary of the perceptron.

There are a number of variations we could have made in our procedure. I arbitrarily set the initial weights and biases to zero. In fact, this isn't a very good idea, because it gives too much symmetry to the initial state. It turns out that if you do this with the AND function, you can get into a situation where you cycle through the same sets of weights without converging. However, if you start with small random weights and biases, this breaks the symmetry and it works fine. I didn't do this in our example calculation because I wanted to keep the arithmetic simple.

Another parameter is the learning rate. If it is too small, it can take a long time to converge. If it is too big, you might continually jump over the optimum weight values and fail to converge. Another variation is to sum the changes in weights and biases over a set of four patterns, without applying them to modify the weights until all four patterns have been presented. This is called "training by epoch" as contrasted with "training by pattern". It can often give a little more stability to the process and prevent wild fluctuations in weight values. Obviously, trying out all these variations by hand can be tedious, or nearly impossible for a large network of perceptrons. This is where computer simulations come in. I think that any one of you could write a very simple computer program to explore the perceptron learning algorithm for problems involving a single perceptron with two inputs and a bias. In the next lecture, I'll give you a demo of a simulator program for more complicated networks.
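As a sketch of what such a program might look like (the function names and the learning rate here are my choices, not anything prescribed in the lecture), a single two-input perceptron learning the OR function could be coded like this:

```python
# A minimal single-perceptron trainer for the two-input OR problem.
# Function names and the learning rate are illustrative choices.

def step(u):
    """Threshold output: 1 if the net input is positive, else 0."""
    return 1 if u > 0 else 0

def train_perceptron(patterns, rate=0.25, max_epochs=100):
    """Train one perceptron by pattern; stop after an error-free epoch."""
    w1, w2, b = 0.0, 0.0, 0.0   # zero start is fine for OR (though not for AND)
    for epoch in range(max_epochs):
        errors = 0
        for (x1, x2), target in patterns:
            delta = target - step(w1 * x1 + w2 * x2 + b)   # +1, 0, or -1
            if delta != 0:
                errors += 1
                w1 += rate * delta * x1
                w2 += rate * delta * x2
                b += rate * delta   # bias: a weight from a unit fixed at 1
        if errors == 0:
            return (w1, w2, b), epoch + 1   # converged
    return (w1, w2, b), max_epochs

OR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

Since OR is learnable by a perceptron, the loop is guaranteed to stop well before the epoch limit; switching `patterns` to AND or XOR is an easy way to explore the behavior discussed above.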

The perceptron learning rule was a great advance. Our simple example of learning how to generate the truth table for the logical OR may not sound impressive, but we can imagine a perceptron with many inputs solving a much more complex problem. Rosenblatt was able to prove that the perceptron was able to learn any mapping that it could represent. Unfortunately, he made some exaggerated claims for the representational capabilities of the perceptron model. (Note the distinction between being able to represent a mapping and being able to learn this representation.) This attracted the attention of Marvin Minsky and his colleague Seymour Papert, who published a devastating critique of the perceptron model in their book "Perceptrons" (1969). In this book, they said:

"Perceptrons have been widely publicized as 'pattern recognition' or 'learning machines' and as such have been discussed in a large number of books, journal articles, and voluminous 'reports'. Most of this writing ... is without scientific value .." and "Appalled at the persistent influence of perceptrons (and similar ways of thinking) on practical pattern recognition, we determined to set out our work as a book."

In the book, they pointed out that there is a major class of problems that can't be represented by the perceptron. A very simple example is the exclusive OR (XOR). They gave a very simple and compelling proof of the impossibility of finding a set of weights that would let a single-layer perceptron give the correct outputs for the XOR truth table. (If we had chosen this for our example, we would have been at it for a long time. The weights would change back and forth, but it would never converge to a final result.)

Although they were aware of the fact that other neural network architectures could produce an XOR (like the McCulloch-Pitts network), they felt (incorrectly, it turned out) that there was no way to extend the perceptron learning rule to deal with these sorts of networks. This put a damper on enthusiasm for neural network research, bringing it to a virtual halt for much of the 1970s. Initially, this had the effect of draining off funding for research in neural nets and diverting it toward symbolic AI. But then the lack of success of AI became apparent as well, and there were "Dark Ages" for neural nets, paralleled by a "Winter of AI". Both were brought on by disillusionment with over-optimistic claims.

Why do we care about the XOR? It is a hard representation for a
neural net to learn, yet it is simple enough for us to understand in
detail, because of the small number of variables. In order for us to
understand why there is no set of weights that will allow a perceptron
to generate the correct outputs for the XOR truth table, let's
generalize things a bit and let the output be a continuous value between
0 and 1, and use a steep sigmoid instead of a step function for the
output function that produces V_{3}.

Let's call V_{3} approximately 1 if u_{3} > 0, and
V_{3} approximately 0 if u_{3} < 0. Summing the inputs,
we have:

u_{3} = W_{31}V_{1} + W_{32}V_{2} + b_{3}

where b_{3} is the bias of the output unit.

The possible inputs V_{1} and V_{2} form a 2-D space:

We want to draw a line separating the region where V_{3} is approximately
equal to 1 from the region where it is approximately zero. For the OR which
we just treated, what is the equation of this line? We can find it by
setting u_{3} = 0 in the equation above, and get:

V_{2} = -(W_{31} / W_{32}) V_{1} - b_{3} / W_{32}

There are many values of the three variables
W_{31}, W_{32}, and b_{3} that
will give such a line. What about the XOR? We want the output to be 1
(TRUE) if only one input is 1, and the output to be 0 if neither or both are
1. This means we need two lines to partition the space of possible inputs.

We don't have enough weight and bias parameters to define two lines. Minsky and Papert identified a number of significant problems which fell into this category of not being "linearly separable". With two inputs, a linearly separable problem is one in which you can separate the two different solutions with a straight line. With three inputs (3-D), you need a plane. In higher dimensions it would be a hyperplane.
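One way to see this concretely is a brute-force search over a grid of weight and bias settings. This is an illustration, not a proof (the grid and the helper names below are mine), but it shows the contrast: some setting reproduces the OR truth table, while none reproduces XOR.

```python
# Brute-force illustration (not a proof) of linear separability: search a
# grid of (w1, w2, bias) values for a single step-function perceptron.
# Some grid point solves OR; none solves XOR, because no single line can
# separate the XOR outputs in the input plane.

def solves(table, w1, w2, b):
    """True if this weight/bias setting reproduces every row of the table."""
    return all((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == t
               for (x1, x2), t in table)

def any_solution(table, grid):
    return any(solves(table, w1, w2, b)
               for w1 in grid for w2 in grid for b in grid)

GRID = [k / 2.0 for k in range(-6, 7)]    # -3.0, -2.5, ..., 3.0
OR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

For OR, a setting such as w1 = w2 = 1, b = -0.5 works; the search confirms that nothing on the grid works for XOR.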

The way around this problem is fairly obvious: add more neurons to form a "hidden layer" which bridges the input and output units. This will give the extra parameters needed to divide up the space of possible inputs. This is what we did with the McCulloch-Pitts network that I showed for the XOR. Minsky and Papert knew this, but couldn't think of a learning rule to deal with the hidden units, and suspected that one didn't exist. Here's another condensed quote from "Perceptrons":

The perceptron ... has many features that attract attention: its linearity, its intriguing learning theorem ... There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.

A number of people have since independently discovered the learning rule for a multi-layered perceptron network. (The reason for this duplication and lack of communication between researchers is that the study of neural networks is an interdisciplinary field. Until recently, there were no neural networks journals; the results were published in a wide variety of math, physics, biology, psychology and engineering journals.) Paul Werbos (1974 Harvard Ph.D. thesis) is possibly the first one to discover what is now known as the generalized delta rule, or backpropagation algorithm.

Before I describe the training algorithm, I'll show you two different networks that are capable of representing the solution to the XOR problem.

The network described in PDP Chapter 8 looks like this:

We have a single hidden unit, with input connections to both the hidden and output layers. (It is hidden in the sense that it doesn't have a direct output to the outside world.) The biases are shown inside the hidden and output units, and the weights are shown beside the connections. We say that the hidden unit forms an "internal representation" of the problem. Can you tell what it is? What problem is being solved by the hidden unit with these weights and biases? [You should be able to show that it functions as an AND unit.] What does the output unit do? [It works like an OR, but with a strong inhibitory input from the AND.]
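To make those bracketed answers concrete, here is a sketch of the network using step units in place of steep sigmoids. The particular weights and biases are an illustrative choice of mine (one of infinitely many), not necessarily the exact values in the PDP figure:

```python
# XOR from a single hidden unit.  Step units stand in for steep sigmoids;
# these weights and biases are one illustrative choice, not necessarily
# the values from PDP Chapter 8.

def step(u):
    return 1 if u > 0 else 0

def xor_net(x1, x2):
    hidden = step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only for (1, 1): an AND
    # Output: an OR of the two inputs, with a strong inhibitory weight (-2)
    # from the hidden AND unit that vetoes the (1, 1) case.
    return step(1.0 * x1 + 1.0 * x2 - 2.0 * hidden - 0.5)
```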

The net described in Chapter 5 of "Explorations in Parallel Distributed Processing" has this architecture:

          OUT
         /   \
        /     \
       /       \
      H1       H2
      | \     / |
      |  \   /  |
      |    X    |
      |  /   \  |
     IN1       IN2

It is a two-layer feed-forward net with two hidden units. (We don't count the input layer, because it doesn't really do anything.) This gives it a total of 6 weights and 3 biases to use to separate the regions that give a 0 output from the region that gives an output of 1.

For both of these nets, there are an infinite number of solutions for the weights and biases that will solve the problem. Also, there are lots of other architectures. At least two hidden units are needed if the input units only connect to the output layer, but we might wonder if there are any advantages to using three or more. Does the net learn the weights faster, or display more robust behavior with noisy inputs which aren't quite 0 or 1?

Finding the optimum network architecture or set of learning parameters for a back propagation calculation is still something of an art. There are many questions that we would like to answer with mathematical proofs. We may have to settle for a body of results from simulations which suggest certain general patterns of behavior. Sometimes these results may suggest the need for a theory that explains them, just as experimental results often provide the direction for theoretical analysis in other branches of science. In some cases, as when it is clear that the internal representation should be a binary number, it is easy to determine the minimum number of hidden units needed. Even then, it isn't clear what the OPTIMUM number of hidden units is for quick learning, tolerance for "noisy" input data, or capability to generalize (the ability to properly treat inputs which were not explicitly present in the training set). Computer simulations give us a way to answer some of these questions by experiment.

(See the handout "Summary of the Generalized Delta Rule")

Define t_{i} = target (desired) output of unit *i*, and
a_{i} = the actual output. (We have been calling this V_{i},
but I'm now switching to the notation used in "Explorations in PDP".)
This output (the "activation") is calculated from the net input u_{i} using
a sigmoid function:

a_{i} = f(u_{i}) = 1 / (1 + e^{-u_{i}})    (1)

As with the single-layer perceptron, we calculate the net input from the weighted sum:

u_{i} = Σ_{j} w_{ij} a_{j} + b_{i}    (2)

where the sum is over the units j that send input to unit i, and b_{i} is the bias.

The total sum of squared errors ("tss") is calculated from the "target
activation" t_{i}:

E = Σ_{i} (t_{i} - a_{i})^{2}

The sum is over the output units, because we don't know what the target activations should be for the hidden units.

We want to adjust the weights in a way that will minimize E. This is curve fitting in a high-dimensional space. How can we make E --> 0? E is implicitly a function of all these weights and biases. One way to do it is to adjust the weights in the direction of the negative gradient of E, so that we make a change in each weight:

Δw_{ij} ∝ - ∂E / ∂w_{ij}

Here is an example in two dimensions for the function f(x,y).

It is like a topographical map with lines of constant height. We want to find our way to the minimum in the center.

The gradient of f (grad f(x,y)) is a vector that is perpendicular to the
lines of constant f, headed uphill. So, to minimize f(x,y), we want to follow
the negative gradient. If we take small steps, we will follow the path
**a**. This technique is then called "gradient descent", or "steepest
descent". (It might seem obvious that this is the optimum way to minimize the
error. Actually, it isn't. There are better ways with names like "conjugate
gradient" and "quasi-Newton" methods. But this is a fairly good technique,
and is certainly the simplest.)
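A tiny numerical illustration of this, using an assumed bowl-shaped function f(x, y) = x^{2} + y^{2} (my example, not from the lecture), whose gradient is (2x, 2y):

```python
# Steepest descent on f(x, y) = x**2 + y**2 (an illustrative example
# function), whose gradient is (2x, 2y).  Small steps against the gradient
# walk in toward the minimum at the origin, like path a; too large a step
# overshoots and runs away from the minimum, like path b.

def descend(x, y, rate, steps=50):
    for _ in range(steps):
        gx, gy = 2.0 * x, 2.0 * y         # gradient of f at (x, y)
        x, y = x - rate * gx, y - rate * gy
    return x, y
```

With rate = 0.1 from (1, 1), each step multiplies the coordinates by 0.8, so the iterates land very near the origin; with rate = 1.1 each step multiplies them by -1.2, and the iterates jump back and forth with growing amplitude.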

We can use the chain rule to show that the weight change is:

Δw_{ij} ∝ (t_{i} - a_{i}) a_{i} (1 - a_{i}) a_{j}

The expression on the right hand side arises from a nice feature of the sigmoidal activation function. From Eq. (1) you should be able to verify that:

df_{i} / du_{i} = f_{i} (1 - f_{i}) = a_{i} (1 - a_{i})

(There are some intermediate steps left out for you to fill in.)
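You can also check the result numerically. Here is a quick sketch (names are mine) comparing a centered-difference estimate of df_{i}/du_{i} with a_{i}(1 - a_{i}):

```python
# Numerical check that df/du = f(u)(1 - f(u)) for the sigmoid
# f(u) = 1 / (1 + exp(-u)).
import math

def f(u):
    return 1.0 / (1.0 + math.exp(-u))

def derivative_gap(u, h=1e-6):
    """Difference between a centered-difference estimate and f(1 - f)."""
    numeric = (f(u + h) - f(u - h)) / (2.0 * h)
    analytic = f(u) * (1.0 - f(u))
    return abs(numeric - analytic)
```

The gap is tiny at any u you try, which is good evidence that the algebra works out.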

Putting in a constant of proportionality η that absorbs the factor of two, and adding another term that I will explain in a minute, we have:

Δw_{ij}(n + 1) = η δ_{i} a_{j} + α Δw_{ij}(n)    (3)

where *n* is the iteration number, and

δ_{i} = (t_{i} - a_{i}) a_{i} (1 - a_{i})    (4)

Except for the last term, Eq. (3) is like the perceptron learning rule,
but with a different expression for delta.
Here η is the learning
rate. If it is too large, we may jump back and forth over the path
along the gradient, following path **b**, and may not reach the minimum.
The final term in Eq. (3) is an added variation in the
algorithm that prevents radical changes in the weights due to the use
of gradient descent. This term gives our trajectory in weight space
some "momentum" (the parameter α)
in order to preserve some memory of the direction it was going. To do this,
we add some of the weight change from the *n*th iteration to the
weight change that we are calculating for the (*n* + 1)th iteration.
This is illustrated in path **c**.

So far, we have something that is similar to the learning rule for
single layer perceptrons, often called the "delta rule", and it works
fine for calculating the changes for the weights to the output layer.
But, it isn't back propagation, yet. We have a problem with the
hidden layers, because we don't know the target activations t_{i} for
the hidden units. The trick, derived using the chain rule in PDP
Chapter 8, is to use a different expression for the delta when unit *i*
is a hidden unit instead of an output unit:

δ_{i} = (Σ_{k} w_{ki} δ_{k}) a_{i} (1 - a_{i})    (5)

The first factor in parentheses involving the sum over *k* is an
approximation to (t_{i} - a_{i}) for the hidden layers when
we don't know t_{i}. It makes use of the deltas that have been
calculated for the layer above.
Note that the sum over *k* is the sum
over the units that receive input from the *i*th unit.

     O     O      units k
      \   /
       \ /        w_{ki}
        O         hidden unit i

The procedure for adjusting the weights is:

- Present inputs for the first pattern to the input layer
- Sum the weighted inputs to the next layer and calculate their activations [Eqs. (2) and (1)]
- Present activations to the next layer, repeating (2) until the activations of the output layer are known
- Compare output activations to the target values for the pattern and calculate deltas for the output layer [Eq. (4)]
- Propagate error backwards by using the output layer deltas to calculate the deltas for the previous layer [Eq. (5)]
- Use these deltas to calculate those of the previous layer, repeating until the first layer is reached
- Calculate the weight changes for all weights and biases (treat biases as weights from a unit having an activation of 1) [Eq. (3)]
- If training by pattern, update all the weights and biases, else repeat the cycle for all patterns, summing the changes and applying at the end of the epoch
- Repeat the entire procedure until the total sum of squared errors is less than a specified criterion
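The whole procedure can be sketched in code for the two-hidden-unit XOR network shown above, training by pattern with a momentum term. The learning rate, momentum, and small random starting weights below are illustrative choices of mine, not prescribed values:

```python
# A sketch of the back propagation procedure for the 2-2-1 XOR network,
# training by pattern with momentum.  The learning rate, momentum, and
# small random starting weights are illustrative choices.
import math
import random

def f(u):
    """Sigmoid activation; input clamped to guard against overflow."""
    u = max(-60.0, min(60.0, u))
    return 1.0 / (1.0 + math.exp(-u))

random.seed(1)
w = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
hb = [random.uniform(-0.5, 0.5) for _ in range(2)]   # hidden biases
wo = [random.uniform(-0.5, 0.5) for _ in range(2)]   # hidden -> output weights
ob = random.uniform(-0.5, 0.5)                       # output bias
# Previous weight changes, kept for the momentum term of Eq. (3).
dw = [[0.0, 0.0], [0.0, 0.0]]
dhb, dwo, dob = [0.0, 0.0], [0.0, 0.0], 0.0

PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
eta, alpha = 0.5, 0.9     # learning rate and momentum

def forward(x):
    """Forward pass: hidden activations, then the output activation."""
    h = [f(w[i][0] * x[0] + w[i][1] * x[1] + hb[i]) for i in range(2)]
    return h, f(wo[0] * h[0] + wo[1] * h[1] + ob)

def tss():
    """Total sum of squared errors over the four patterns."""
    return sum((t - forward(x)[1]) ** 2 for x, t in PATTERNS)

errors = [tss()]
for epoch in range(2000):
    for x, t in PATTERNS:
        h, out = forward(x)
        d_out = (t - out) * out * (1.0 - out)           # output delta, Eq. (4)
        d_hid = [wo[i] * d_out * h[i] * (1.0 - h[i])    # hidden deltas, Eq. (5)
                 for i in range(2)]
        for i in range(2):                  # weight changes, Eq. (3), by pattern
            dwo[i] = eta * d_out * h[i] + alpha * dwo[i]
            wo[i] += dwo[i]
            dhb[i] = eta * d_hid[i] + alpha * dhb[i]
            hb[i] += dhb[i]
            for j in range(2):
                dw[i][j] = eta * d_hid[i] * x[j] + alpha * dw[i][j]
                w[i][j] += dw[i][j]
        dob = eta * d_out + alpha * dob
        ob += dob
    errors.append(tss())
```

Whether, and how fast, the tss falls below a given criterion depends on the seed, η, and α, which is exactly the kind of question a simulator lets you explore by experiment.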

Now we see why it is called back propagation. We start with a forward
pass, presenting an input pattern, and calculate the activations of
each layer from those of the preceding layer, using the current values
of the weights. When we get to the output layer, we can compare the
output activations to the target values for the given pattern, and
calculate the delta values for the output layer. Now we propagate the
error *backwards* by using these delta values to calculate the
deltas for the preceding layer. If we are training by epoch, we
present another pattern and sum the delta-W's over the set of
patterns, updating the weights at the end of each epoch. Then we
iterate the whole procedure until the error is reduced to an
"acceptable" value. As we did with the single layer perceptron, we
modify the bias terms by treating them just like the weights from a
unit that always has an activation of 1.

Next time, we will demonstrate some backpropagation simulations, using the neural net simulation software from "Explorations in PDP".


dbeeman "at" dogstar "dot" colorado "dot" edu

Thu Nov 1 16:06:15 MST 2001