Misc Definitions, and Intro to Linear Classifiers (Perceptron)

Definitions

Machine Learning
 - think "learning a machine", not "learning by a machine"

Types of Learning

 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning

Examples

 Supervised:
  map bitmap of handwritten digit to digit class {0,1,...,9,not-a-digit}.
  map credit card transaction authorization requests to Pr(fraudulent)
  map patch of lung X-Ray to Pr(cancerous)

 Unsupervised:
  figure out the different letters in some unknown alphabet
  figure out the 3D structure of the world from seeing lots of movies
  figure out the structure of English from hearing many spoken sentences

 Reinforcement Learning:
  figure out which moves are good (ie lead to winning) in Chess

Success Stories of Machine Learning

 PAPnet - multi-billion dollar success - saved tens of thousands of
 women's lives - maps image of plated-out cell from stained Pap smear
 slide to rating of abnormality, top-ranked abnormal cells in slide
 presented to human operator

 Handwritten digit recognition for postal zip-code sorting

 Handwriting recognition for PDAs

 Speech recognition - used to be hand-crafted rules, now dominated by
 machine learning techniques, especially HMMs (Hidden Markov Models).
 Allows replacement of phone menu trees, hands-free phone dialing,
 word-based searching of corpus of audio data, etc.

 Fraud detection for credit cards

Example Learning Theorem!

 Simple example of theory in machine learning: we will define a very
 simple learning situation and then analyze it.  Typically we are
 concerned with performance on new (unseen) inputs.  Let us assume
 that inputs x are drawn iid from some unknown distribution p(x) and
 that future inputs will be drawn from the same distribution.

 One sided error: assume there is some threshold x_c, that all
 x <= x_c should be classified "okay", and that all x > x_c should be
 classified "not okay".  Imagine that incorrectly classifying an x
 that is really okay as not okay (a false negative) is annoying but
 not very costly, while incorrectly classifying an x that is actually
 not okay as okay (a false positive) is a major disaster.  (Think of
 classifying fruits as to whether they
 are poisonous.  Acceptable to throw away non-poisonous fruit, but
 disastrous to eat poisonous fruit.)  Consider the following "online"
 learning system.

 At each time step we get a fruit, with a 1D measurement x(t).  We
 need to decide whether x(t) is okay.  After making our decision, we
 are told the correct classification.  We would like to come up with a
 simple algorithm that (a) never makes a false positive error, (b)
 minimizes the number of false negative errors, (c) has a provable
 bound on the number of false negative errors it will make.

 Consider the following algorithm.  We maintain a threshold
 \hat{x}_c(t), initialized at \hat{x}_c(1) = -\infty, and we classify
 x(t) as okay iff x(t) <= \hat{x}_c(t).  If we make a mistake (which
 had better be a false negative) we raise the threshold:
 \hat{x}_c(t+1) = x(t); otherwise \hat{x}_c(t+1) = \hat{x}_c(t).
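 The online rule above can be sketched in Python (a hypothetical
 implementation; the function and variable names are mine, not from
 the notes):

```python
import random

def run_online_learner(xs, true_cutoff):
    """Run the conservative online threshold learner on a sequence xs,
    where x <= true_cutoff means "okay".  Returns mistake counts."""
    # Estimated cutoff starts at -infinity, so the default answer is
    # "not okay" until the feedback corrects us.
    est = float("-inf")
    false_negatives = 0
    false_positives = 0
    for x in xs:
        predict_okay = (x <= est)
        truly_okay = (x <= true_cutoff)
        if truly_okay and not predict_okay:
            false_negatives += 1   # the only kind of mistake possible
            est = x                # raise the threshold to x(t)
        elif predict_okay and not truly_okay:
            false_positives += 1   # provably never happens: est <= true_cutoff
    return false_negatives, false_positives

random.seed(0)
xs = [random.random() for _ in range(2000)]
fn, fp = run_online_learner(xs, true_cutoff=0.5)
```

 Because est only ever takes on values of truly-okay inputs, it never
 exceeds the true cutoff, so the learner makes no false positives; the
 false-negative count grows only logarithmically.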

 Theorem: the expected number of false negatives made among the first
 T true positive (okay) exemplars is approximately log(T).

 Proof: let T(t) be the time step at which we see the t-th positive
 (okay) exemplar.  If we get x(T(t)) wrong, it can only be because we
 have never seen such a large positive exemplar.  Since the inputs are
 drawn iid, each of the first t positive exemplars is equally likely
 to be the largest, so the probability that the t-th one is the
 largest so far is 1/t.  Thus the probability of getting the t-th
 positive exemplar wrong is 1/t, and summing, the expected number
 wrong up to the t-th is 1/1 + 1/2 + 1/3 + ... + 1/t, which is
 approximately log(t).  (Meaning the natural logarithm.)
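 The harmonic-sum argument can be checked numerically (my sketch, not
 from the notes): a false negative occurs exactly when an okay
 exemplar sets a new record maximum, and the average number of records
 among t iid draws matches H_t = 1/1 + 1/2 + ... + 1/t.

```python
import random

def count_records(xs):
    # Number of times a new maximum ("record") appears in the sequence;
    # each record corresponds to one false negative of the learner.
    best = float("-inf")
    records = 0
    for x in xs:
        if x > best:
            records += 1
            best = x
    return records

def harmonic(t):
    # H_t = 1/1 + 1/2 + ... + 1/t, approximately log(t)
    return sum(1.0 / k for k in range(1, t + 1))

random.seed(1)
t, trials = 100, 2000
avg_records = sum(count_records([random.random() for _ in range(t)])
                  for _ in range(trials)) / trials
```

 With t = 100 the Monte Carlo average lands close to H_100, which is
 about 5.19 (while log(100) is about 4.61).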

We start with supervised learning.
Let us define some terminology.

 We are learning a machine that looks like this:

   y = h(x; w)

 where x is the input (eg sensory input), w are the free parameters of
 the machine we need to set properly to make it match the training
 data, and y is the output of the machine given input x and parameter
 setting w.

 We need a name for the actual output of the machine given input x_p
 (which we call y_p) and the desired output of the machine given input
 x_p, which we call d_p.

 Input Space X, x \in X

 Output Space Y, y \in Y.
  distinguish between space of possible machine outputs and space of
  desired (target) outputs: these may differ.  So y \in Y but d \in D.

 Exemplars
  examples of inputs with corresponding outputs.  We say
    x_p \mapsto d_p
  is an "exemplar", or more colloquially a "training pattern."

 Weight Space
  can be continuous, discrete, or both in potentially complicated ways.
  w \in W.

 Hypothesis Space H.  Parametrized by w, perhaps over-parametrized.
  h_w \in H \subset (X \rightarrow Y)

 H is that subset of (X \rightarrow Y) that can actually be
 constructed by setting w appropriately.  So

  H = { the mappings x \mapsto h(x; w) such that w \in W }

 We use standard variable names for members of these spaces.
 Sometimes this standardization conflicts with other sorts of
 standardization, in which case we punt in the interest of clarity.

Linear Classifier (aka Perceptron)

 input space is R^n
 output space is {-1,+1}, corresponding to no/yes dichotomy

  (why {-1,+1} instead of {0,1} ?  The symmetry will prove useful in
  simplifying some formulas later.)

 weight space is R^(n+1)
 hypothesis space is linear decision surface
 if w.x > theta then y = +1 else y = -1
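 This decision rule is a one-liner; a minimal sketch (the helper names
 are mine):

```python
def dot(w, x):
    # inner product w.x
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, theta, x):
    # if w.x > theta then y = +1 else y = -1
    return 1 if dot(w, x) > theta else -1
```

 For example, with w = (2, 3) and theta = 4, the input (1, 1) gives
 w.x = 5 > 4 and is classified +1, while the origin is classified -1.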

 division between +1 region and -1 region is w.x = theta.
 (the boundary, in input space, between different outputs is called
 the separation surface.)
 note that changing theta keeps the direction of the separation plane
 the same, just translating it.
 And note that if we set theta=0 then the separating plane goes
 through the origin, and is perpendicular to w if w is regarded as a
 vector in input space.  This means that even for nonzero theta, the
 surface is perpendicular to the vector w.

 Note that we can switch the "sense" of the separation surface, ie
 make the +1 region into the -1 region and vice-versa, by changing the
 sign of w, ie the sign of all entries of w.  (Exercise: does theta
 also need to be adjusted?)

 Note that the mapping from (w,theta) to a "hypothesis" is
 many-to-one, as for instance in 2D
  (w1=2, w2=3, theta=4)
 and
  (w1=4, w2=6, theta=8)
 give the same mapping from x to y.  In fact, scaling w and
 theta by some alpha>0 will leave the hypothesis unchanged.
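 This many-to-one property is easy to check numerically (sketch; the
 helper and the sample points are my choices): the two parameter
 settings above, and more generally any positive scaling, classify
 every input identically.

```python
def classify(w, theta, x):
    # if w.x > theta then y = +1 else y = -1
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > theta else -1

points = [(-1.0, 2.0), (0.5, 1.0), (3.0, -2.0), (0.0, 0.0), (2.0, 0.0)]
# Scaling (w, theta) by alpha > 0 preserves w.x > theta, so every
# decision is unchanged.
agree = all(
    classify((2.0, 3.0), 4.0, p) == classify((2.0 * a, 3.0 * a), 4.0 * a, p)
    for p in points
    for a in (0.5, 1.0, 2.0, 10.0)
)
```

 The a = 2.0 case is exactly the (w1=4, w2=6, theta=8) example above.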

Perceptron References
  1. Wikipedia Perceptron entry
  2. Perceptron Learning Rule chapter from the textbook Neural Network Design by Martin T. Hagan, Howard B. Demuth, and Mark H. Beale, ISBN 0-9717321-0-8