

Neural Networks

This lecture roughly follows the beginning of Bishop Chapter 3.

Linear Models: an overview.

Why Linear Models?

We study linear models because:

Kinds of Learning in a linear model.

  1. Supervised Learning:

    \includegraphics {supervised.eps}

    For each input, the agent has a target output. The actual output is compared with the target output, and the agent is altered to bring them closer together.

  2. Unsupervised Learning: \includegraphics {unsup.eps}

    The job of this algorithm is to determine the structure of the inputs.

  3. Reinforcement Learning: \includegraphics {reinforce.eps}

    The world receives the action of the agent. Under certain circumstances, the agent receives a reward. The agent adjusts its actions to maximize the reward.

Some linear analysis in historical order.

11
Widrow-Hoff

11.1
Definitions

\includegraphics {widrow.eps}

This is the simplest non-trivial model of a neuron.

If we have several trials (indexed by $ t$, which could be time, but almost never is in these cases), then there are input vectors $ {\mathbf x}_{t}$ and outputs $ y_{t}$, as well as desired outputs $ d_{t}$.

Our Goal:
Find an incremental way of computing the optimal $ {\mathbf w}$, that is, the values of $ w_{i}$ that minimize the error functional:

$\displaystyle E = \frac{1}{2}\sum_{t} (y_{t} - d_{t})^{2}$

11.2
Method: Gradient Descent, or Widrow-Hoff LMS

Since $ y_{t}$ depends on $ {\mathbf w}$, we treat $ E$ as a function of $ {\mathbf w}$ and seek out the $ {\mathbf w}$ yielding minimal $ E$. Since the error function is quadratic, it has a single minimum, and following the negative gradient in sufficiently small steps will lead us to it.

We let $ {\mathbf w}$ change by $ \Delta {\mathbf w} := -\eta
\nabla_{\mathbf w} E$, where $ \eta$ is precisely described as ``small''.

The $ i^{\rm th}$ coordinate of $ \Delta {\mathbf w}$ for the $ t^{\rm th}$ trial is computed via:

$\displaystyle (\Delta {\mathbf w})_{i} = (-\eta \nabla_{\mathbf w} E_{t})_{i} = - \eta (y-d)\, x_{i} $
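
For completeness, here is the chain-rule step behind this expression (a short derivation, assuming the linear output $ y_{t} = {\mathbf w}^{T}{\mathbf x}_{t}$ suggested by the figure):

$\displaystyle \frac{\partial E_{t}}{\partial w_{i}} = (y_{t} - d_{t})\,\frac{\partial y_{t}}{\partial w_{i}} = (y_{t} - d_{t})\, x_{t,i}, \qquad (\Delta {\mathbf w})_{i} = -\eta\,\frac{\partial E_{t}}{\partial w_{i}}$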

The update rule is therefore given by (1):

$\displaystyle w_{i}^{\rm new} = w_{i}^{\rm old} - \eta(y - d) x_{i}$ (1)
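
As an illustration, here is a minimal sketch of this incremental rule in Python/NumPy (a hypothetical example: the data, step size, and variable names are made up, not taken from the lecture):

import numpy as np

def lms_step(w, x, d, eta):
    """One Widrow-Hoff LMS update: w_new = w_old - eta * (y - d) * x."""
    y = w @ x                           # linear neuron output
    return w - eta * (y - d) * x

# Tiny made-up example: recover a known weight vector from noiseless trials.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # inputs x_t
true_w = np.array([1.0, -2.0, 0.5])
D = X @ true_w                          # desired outputs d_t
w = np.zeros(3)
for x, d in zip(X, D):                  # one pass over the trials
    w = lms_step(w, x, d, eta=0.05)
print(w)                                # should end up close to true_w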

11.3
Analysis

Whenever we analyze one of these methods, we are concerned with the following questions:

  1. Does it converge?
  2. How fast does it converge?
  3. How do we set the learning parameter (in this case, $ \eta$)?

When we select $ \eta$, we have to keep in mind that choosing an $ \eta$ that is too small may cause slow convergence, while choosing an $ \eta$ that is too large may cause us to skip over the minimum point (see Figure 1).

Figure 1: Contour map of $ E$: the extreme point is the minimum.

\includegraphics {param.eps}

We want to choose $ \eta$ so that the successive approximations converge to some position in weight space.
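
To make the effect of $ \eta$ concrete, one can run a single LMS pass with a few different step sizes and compare the resulting error (a rough numerical experiment, not from the lecture; the data and step sizes are made up):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))           # trials x_t
true_w = np.array([2.0, -1.0])
D = X @ true_w                          # desired outputs d_t

def total_error(w):
    return 0.5 * np.sum((X @ w - D) ** 2)

for eta in (0.001, 0.05, 1.5):          # too small, reasonable, too large
    w = np.zeros(2)
    for x, d in zip(X, D):
        w = w - eta * (w @ x - d) * x   # one Widrow-Hoff step per trial
    print(f"eta={eta}: E = {total_error(w):.3g}")

# Typically the smallest eta still leaves a sizable error after one pass
# (slow convergence), the middle one drives E close to zero, and the largest
# one overshoots so that E grows rather than shrinks (it may even overflow).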

If we teach the neuron with several trials, the total error is $ E = \sum_{t} E_{t}$, the sum of the per-trial errors (proportional to their average).

It is important to include several trials since, if $ x_{i} = 0$ in some trial, $ w_{i}$ doesn't affect that trial, and we can't determine a value for it from that trial alone.

Widrow-Hoff LMS was first used in adaptive radar beamforming applications in the late 1950s and early 1960s.

12
Rosenblatt's Perceptron

Description

Figure 2: Rosenblatt's Perceptron

The perceptron problem includes the following:

12.1
The Goal:

Given inputs $ {\mathbf x} \in {\Bbb R}^{n}$ and desired outputs $ d$ in $ \{0,1\}$, find $ {\mathbf w} \in {\Bbb R}^{n}$ (and a threshold $ \theta$) such that the output $ z = d$ on every trial.

12.2
The Method:

The standard method for finding $ {\mathbf w}$ is called linear discrimination. Given $ {\mathbf w}$, the whole of $ {\mathbf x}$ space is divided into the $ {\mathbf x}$ that yield output 0 and the $ {\mathbf x}$ that yield output 1 (see Figure 3).

Figure 3: Linear Discrimination.

Given a putative $ {\mathbf w}$, the output switches exactly where $ {\mathbf w}^{T}{\mathbf x} = \theta$. The set $ \{{\mathbf x} : {\mathbf w}^{T}{\mathbf x} = \theta\}$ is a hyperplane in $ {\mathbf x}$ space of dimension $ n-1$ with normal $ {\mathbf w}$. One side of this hyperplane consists of all $ {\mathbf x}$ that yield the on position, and the other side consists of all $ {\mathbf x}$ that yield the off position.

Since our trials tell us which $ {\mathbf x}$ values actually yield on and off, choosing $ {\mathbf w}$ and $ \theta$ is a matter of choosing a hyperplane that puts the on and off points on the correct sides (see Figure 3).
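
In code, such a linear discriminator is just a thresholded dot product (a minimal sketch, assuming the convention that the unit is on when $ {\mathbf w}^{T}{\mathbf x} > \theta$ and off otherwise; the numbers are only illustrative):

import numpy as np

def discriminate(w, x, theta=0.0):
    # On (1) on one side of the hyperplane w^T x = theta, off (0) on the other.
    return 1 if w @ x > theta else 0

w = np.array([1.0, 1.0])                                   # normal to the hyperplane x1 + x2 = 1
print(discriminate(w, np.array([2.0, 0.5]), theta=1.0))    # 1: on side
print(discriminate(w, np.array([0.1, 0.2]), theta=1.0))    # 0: off side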

12.3
Specific Linear Programming problem.

For the sake of simplicity, let $ \theta = 0$ for the moment. It is easy to take care of $ \theta$ afterwards.

We want to change $ {\mathbf w}$ as follows:

$ w_{i}^{\rm new} = w_{i}^{\rm old} - \eta (z-d) x_{i}$

Let $ \delta = z - d$.

Set $ \eta$ to the smallest positive value that moves $ {\mathbf x}$ onto the decision boundary. That is, we would like to guarantee that $ ({\mathbf w}^{\rm new})^{T}{\mathbf x} = 0$, so that

$ ({\mathbf w}^{\rm old} - \eta \delta {\mathbf x})^{T} {\mathbf x} =
0$.

This means that:

$ ({\mathbf w}^{\rm old})^{T} {\mathbf x} = \eta \delta \vert\vert {\mathbf x}\vert\vert^{2}$.

So that:

$ \eta = \frac{({\mathbf w}^{\rm old})^{T} {\mathbf x}}{\delta \vert\vert {\mathbf x}\vert\vert^{2}}$

The resulting weight change is

$ \Delta {\mathbf w} = -\eta \delta {\mathbf x} = -\frac{({\mathbf w}^{\rm old})^{T}{\mathbf x}}{\vert\vert {\mathbf x}\vert\vert^{2}}\, {\mathbf x} + \epsilon$,

where $ \epsilon$ is just barely large enough to force $ \delta$ to change sign. (The $ \eta$ above is the largest value that does not change the sign of the output, so we step just slightly past it.)
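
A sketch of this correction step might look as follows (assuming $ \theta = 0$, that the update is applied only to misclassified trials so that $ \delta = z - d$ is $ \pm 1$, and a hypothetical small constant eps standing in for $ \epsilon$; not the lecture's own code):

import numpy as np

def perceptron_correction(w, x, z, d, eps=1e-3):
    # Step just far enough past the boundary w^T x = 0 that the output flips.
    delta = z - d                       # +1 or -1 on a misclassified trial
    if delta == 0:                      # already correct: leave w alone
        return w
    y = w @ x                           # pre-threshold output
    eta = y / (delta * np.dot(x, x))    # eta that lands exactly on the boundary
    return w - (eta + eps) * delta * x  # eps pushes the sign of w^T x over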

12.4
Evaluation and batch solution

We still want to know the answers to the analysis questions above: does the procedure converge, how fast, and how should the learning parameter be set?

Here we formulate the perceptron learning problem as a batch problem:

Given trials indexed (again) by $ t$, we have $ {\mathbf x}(t)$, $ d(t)$, and $ y(t) = {\mathbf w}^{T}{\mathbf x}(t)$. For some $ t$, $ {\mathbf w}^{T}{\mathbf x}(t) < 0$, and for others, $ {\mathbf w}^{T}{\mathbf x}(t) > 0$. Taking the targets as $ d(t) \in \{-1,+1\}$ (rather than $ \{0,1\}$), in order for each of the trials to match the desired value we need $ d(t)\, {\mathbf w}^{T} {\mathbf x}(t) > 0$ for all $ t$. This is a linear programming problem. Every $ t$ defines a half space in $ {\mathbf w}$-coordinates, $ H(t) = \{{\mathbf w} : d(t)\, {\mathbf w}^{T} {\mathbf x}(t) > 0\}$, and the possible $ {\mathbf w}$'s must lie in the intersection of all these half spaces.
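
As an aside, this feasibility problem can be handed directly to an off-the-shelf LP solver (a sketch using scipy.optimize.linprog, with made-up separable data, the $ \pm 1$ target coding, and a hypothetical margin replacing the strict inequality):

import numpy as np
from scipy.optimize import linprog

# Made-up trials: x(t) in R^2 with targets d(t) in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
d = np.array([1, 1, -1, -1])

# Feasibility: find w with d(t) * w^T x(t) >= margin for all t.
# linprog expects A_ub @ w <= b_ub, so we use -d(t) * x(t)^T w <= -margin.
margin = 1.0
A_ub = -(d[:, None] * X)
b_ub = -margin * np.ones(len(d))
res = linprog(c=np.zeros(2), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None)])
print(res.status, res.x)                # status 0 means a feasible w was found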
