

Neural Networks

This lecture roughly follows the beginning of Bishop Chapter 3.

Linear Models: an overview.

Why Linear Models?

We study linear models because:

Kinds of Learning in a linear model.

  1. Supervised Learning:

    \includegraphics {supervised.eps}

    For each input, the agent has a target output. The actual output is compared with the target output, and the agent is altered to bring them closer together.

  2. Unsupervised Learning: \includegraphics {unsup.eps}

    The job of this algorithm is to determine the structure of the inputs.

  3. Reinforcement Learning: \includegraphics {reinforce.eps}

    The world receives the action of the agent. Under certain circumstances, the agent receives a reward. The agent adjusts its actions to maximize the reward.

Some linear analysis in historical order.

11
Widrow-Hoff

11.1
Definitions

\includegraphics {widrow.eps}

This is the simplest non-trivial model of a neuron.

If we have several trials (indexed by $ t$, which could be time, but almost never is in these cases), then there are input vectors $ {\mathbf x}_{t}$ and outputs $ y_{t}$, as well as desired outputs $ d_{t}$.

Our Goal:
Find an incremental way of computing the optimal $ {\mathbf w}$, that is, the values of $ w_{i}$ that minimize the error functional:

$\displaystyle E = \frac{1}{2}\sum_{t} (y_{t} - d_{t})^{2}$

11.2
Method: Gradient Descent, or Widrow-Hoff LMS

Since $ y_{t}$ depends on $ {\mathbf w}$, we treat $ E$ as a function of $ {\mathbf w}$ and seek out the $ {\mathbf w}$ yielding minimal $ E$. Since the error function is quadratic, it has a single minimum, and following the negative gradient in sufficiently small steps will lead us to it.

We let $ {\mathbf w}$ change by $ \Delta {\mathbf w} := -\eta
\nabla_{\mathbf w} E$, where $ \eta$ is precisely described as ``small''.

The $ i^{\rm th}$ coordinate of $ \Delta {\mathbf w}$ for the $ t^{\rm th}$ trial is computed via:

$\displaystyle (\Delta {\mathbf w})_{i} = (-\eta \nabla_{\mathbf w} E_{t})_{i} = - \eta (y-d)\, x_{i} $
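
For completeness, here is the chain-rule step behind this expression (a short derivation, assuming the linear output $ y_{t} = {\mathbf w}^{T}{\mathbf x}_{t}$ suggested by the figure):

$\displaystyle \frac{\partial E_{t}}{\partial w_{i}} = (y_{t} - d_{t})\,\frac{\partial y_{t}}{\partial w_{i}} = (y_{t} - d_{t})\, x_{t,i}, \qquad (\Delta {\mathbf w})_{i} = -\eta\,\frac{\partial E_{t}}{\partial w_{i}}$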

The update rule is therefore given by (1):

$\displaystyle w_{i}^{\rm new} = w_{i}^{\rm old} - \eta(y - d) x_{i}$ (1)
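
As an illustration, here is a minimal sketch of this incremental rule in Python/NumPy (a hypothetical example: the data, step size, and variable names are made up, not taken from the lecture):

import numpy as np

def lms_step(w, x, d, eta):
    """One Widrow-Hoff LMS update: w_new = w_old - eta * (y - d) * x."""
    y = w @ x                           # linear neuron output
    return w - eta * (y - d) * x

# Tiny made-up example: recover a known weight vector from noiseless trials.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # inputs x_t
true_w = np.array([1.0, -2.0, 0.5])
D = X @ true_w                          # desired outputs d_t
w = np.zeros(3)
for x, d in zip(X, D):                  # one pass over the trials
    w = lms_step(w, x, d, eta=0.05)
print(w)                                # should end up close to true_w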

11.3
Analysis

Whenever we analyze one of these methods, we are concerned with the following questions:

  1. Does it converge?
  2. How fast does it converge?
  3. How do we set the learning parameter (in this case, $ \eta$)?

When we select $ \eta$, we have to keep in mind that choosing an $ \eta$ that is too small may cause slow convergence, while choosing an $ \eta$ that is too large may cause us to skip over the minimum point (see Figure 1).

Figure 1: Contour map of $ E$: the extreme point is the minimum.

\includegraphics {param.eps}

We want to choose $ \eta$ so that the successive approximations converge to some position in weight space.
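
To make the effect of $ \eta$ concrete, one can run a single LMS pass with a few different step sizes and compare the resulting error (a rough numerical experiment, not from the lecture; the data and step sizes are made up):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))           # trials x_t
true_w = np.array([2.0, -1.0])
D = X @ true_w                          # desired outputs d_t

def total_error(w):
    return 0.5 * np.sum((X @ w - D) ** 2)

for eta in (0.001, 0.05, 1.5):          # too small, reasonable, too large
    w = np.zeros(2)
    for x, d in zip(X, D):
        w = w - eta * (w @ x - d) * x   # one Widrow-Hoff step per trial
    print(f"eta={eta}: E = {total_error(w):.3g}")

# Typically the smallest eta still leaves a sizable error after one pass
# (slow convergence), the middle one drives E close to zero, and the largest
# one overshoots so that E grows rather than shrinks (it may even overflow).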

If we teach the neuron with several trials, the total error is $ E = \sum_{t} E_{t}$, the sum of the per-trial errors (proportional to their average).

It is important to include several trials since, if $ x_{i} = 0$ in some trial, $ w_{i}$ doesn't affect that trial, and we can't determine a value for it from that trial alone.

Widrow-Hoff LMS was first used in adaptive radar beamforming applications in the late 1950s and early 1960s.

12
Rosenblatt's Perceptron

Description

Figure 2: Rosenblatt's Perceptron

The perceptron problem includes the following:

12.1
The Goal:

Given inputs $ {\mathbf x} \in {\Bbb R}^{n}$ and desired outputs $ d$ in $ \{0,1\}$, find $ {\mathbf w} \in {\Bbb R}^{n}$ (and a threshold $ \theta$) such that the output $ z = d$ on every trial.

12.2
The Method:

The standard method for finding $ {\mathbf w}$ is called linear discrimination. Given $ {\mathbf w}$, the whole of $ {\mathbf x}$ space is divided into the $ {\mathbf x}$ that yield output 0 and the $ {\mathbf x}$ that yield output 1 (see Figure 3).

Figure 3: Linear Discrimination.

Given a putative $ {\mathbf w}$, the output switches exactly where $ {\mathbf w}^{T}{\mathbf x} = \theta$. The set $ \{{\mathbf x} : {\mathbf w}^{T}{\mathbf x} = \theta\}$ is a hyperplane in $ {\mathbf x}$ space of dimension $ n-1$ with normal $ {\mathbf w}$. One side of this hyperplane consists of all $ {\mathbf x}$ that yield the on position, and the other side consists of all $ {\mathbf x}$ that yield the off position.

Since our trials tell us which $ {\mathbf x}$ values actually yield on and off, choosing $ {\mathbf w}$ and $ \theta$ is a matter of choosing a hyperplane that puts the on and off points on the correct sides (see Figure 3).
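
In code, such a linear discriminator is just a thresholded dot product (a minimal sketch, assuming the convention that the unit is on when $ {\mathbf w}^{T}{\mathbf x} > \theta$ and off otherwise; the numbers are only illustrative):

import numpy as np

def discriminate(w, x, theta=0.0):
    # On (1) on one side of the hyperplane w^T x = theta, off (0) on the other.
    return 1 if w @ x > theta else 0

w = np.array([1.0, 1.0])                                   # normal to the hyperplane x1 + x2 = 1
print(discriminate(w, np.array([2.0, 0.5]), theta=1.0))    # 1: on side
print(discriminate(w, np.array([0.1, 0.2]), theta=1.0))    # 0: off side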

12.3
Specific Linear Programming problem.

For the sake of simplicity, let $ \theta = 0$ for the moment. It is easy to take care of $ \theta$ afterwards.

We want to change $ {\mathbf w}$ as follows:

$ w_{i}^{\rm new} = w_{i}^{\rm old} - \eta (z-d) x_{i}$

Let $ \delta = z - d$.

Set $ \eta$ to the smallest positive value that moves $ {\mathbf x}$ onto the decision boundary. That is, we would like to guarantee that $ ({\mathbf w}^{\rm new})^{T}{\mathbf x} = 0$, so that

$ ({\mathbf w}^{\rm old} - \eta \delta {\mathbf x})^{T} {\mathbf x} =
0$.

This means that:

$ ({\mathbf w}^{\rm old})^{T} {\mathbf x} = \eta \delta \vert\vert {\mathbf x}\vert\vert^{2}$.

So that:

$ \eta = \frac{({\mathbf w}^{\rm old})^{T} {\mathbf x}}{\delta \vert\vert {\mathbf x}\vert\vert^{2}}$

The resulting weight change is

$ \Delta {\mathbf w} = -\eta \delta {\mathbf x} = -\frac{({\mathbf w}^{\rm old})^{T}{\mathbf x}}{\vert\vert {\mathbf x}\vert\vert^{2}}\, {\mathbf x} + \epsilon$,

where $ \epsilon$ is just barely large enough to force $ \delta$ to change sign. (The $ \eta$ above is the largest value that does not change the sign of the output, so we step just slightly past it.)
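
A sketch of this correction step might look as follows (assuming $ \theta = 0$, that the update is applied only to misclassified trials so that $ \delta = z - d$ is $ \pm 1$, and a hypothetical small constant eps standing in for $ \epsilon$; not the lecture's own code):

import numpy as np

def perceptron_correction(w, x, z, d, eps=1e-3):
    # Step just far enough past the boundary w^T x = 0 that the output flips.
    delta = z - d                       # +1 or -1 on a misclassified trial
    if delta == 0:                      # already correct: leave w alone
        return w
    y = w @ x                           # pre-threshold output
    eta = y / (delta * np.dot(x, x))    # eta that lands exactly on the boundary
    return w - (eta + eps) * delta * x  # eps pushes the sign of w^T x over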

12.4
Evaluation and batch solution

We still want to know the answers to the analysis questions above: does the procedure converge, how fast, and how should the learning parameter be set?

Here we formulate the perceptron learning problem as a batch problem:

Given trials indexed (again) by $ t$, we have $ {\mathbf x}(t)$, $ d(t)$, and $ y(t) = {\mathbf w}^{T}{\mathbf x}(t)$. For some $ t$, $ {\mathbf w}^{T}{\mathbf x}(t) < 0$, and for others, $ {\mathbf w}^{T}{\mathbf x}(t) > 0$. Taking the targets as $ d(t) \in \{-1,+1\}$ (rather than $ \{0,1\}$), in order for each of the trials to match the desired value we need $ d(t)\, {\mathbf w}^{T} {\mathbf x}(t) > 0$ for all $ t$. This is a linear programming problem. Every $ t$ defines a half space in $ {\mathbf w}$-coordinates, $ H(t) = \{{\mathbf w} : d(t)\, {\mathbf w}^{T} {\mathbf x}(t) > 0\}$, and the possible $ {\mathbf w}$'s must lie in the intersection of all these half spaces.
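
As an aside, this feasibility problem can be handed directly to an off-the-shelf LP solver (a sketch using scipy.optimize.linprog, with made-up separable data, the $ \pm 1$ target coding, and a hypothetical margin replacing the strict inequality):

import numpy as np
from scipy.optimize import linprog

# Made-up trials: x(t) in R^2 with targets d(t) in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
d = np.array([1, 1, -1, -1])

# Feasibility: find w with d(t) * w^T x(t) >= margin for all t.
# linprog expects A_ub @ w <= b_ub, so we use -d(t) * x(t)^T w <= -margin.
margin = 1.0
A_ub = -(d[:, None] * X)
b_ub = -margin * np.ones(len(d))
res = linprog(c=np.zeros(2), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None)])
print(res.status, res.x)                # status 0 means a feasible w was found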
