DETERMINISTIC OPTIMIZATION
Conjugate Gradient
The Conjugate Gradient algorithm is probably the best deterministic
algorithm for high-dimensional problems.
The following is a very popular book. *Never* use any code from it!!
***NOT EVEN ONCE***!!!! However, the text is not too bad, although it
does contain the occasional blooper and a great deal of omitted
material. The chapter on ordinary differential equations is
particularly outdated, wedged in the '60s.
@Book {PRESS-ETAL88A,
author = "William H. Press and Brian P. Flannery and Saul A. Teukolsky
and William T. Vetterling",
title = "Numerical Recipes in {C}",
publisher= CAMBRIDGE,
year = 1988
}
Here is a seminal reference.
@book {JEA82,
author = "Kang Chang Jea",
title = "Generalized conjugate gradient acceleration of
iterative methods",
publisher= Uof#"Texas at Austin, Center for Numerical Analysis",
year = 1982,
note = "PhD thesis",
}
And here is a *great* paper, which really explains everything, and is
accessible even if you didn't finish an advanced numerical analysis
course just last week.
@TechReport {SHEWCHUK94A,
author = "Jonathan Richard Shewchuk",
title = "An Introduction to the Conjugate Gradient Method Without the
Agonizing Pain",
institution= CMU#" School of Computer Science",
year = 1994,
note = "http://www.cs.cmu.edu/afs/cs/project/quake/public/papers/painless-conjugate-gradient.ps
or
ftp://warp.cs.cmu.edu/quake-papers/painless-conjugate-gradient.ps",
annote = "For figures suitable for transparencies and lectures, see
ftp://warp.cs.cmu.edu/quake-papers/painless-conjugate-gradient-pics.ps"
}
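Since none of the above actually shows code you're allowed to use, here is a from-scratch Python sketch of plain linear CG for Ax = b with A symmetric positive definite (the quadratic setting Shewchuk's tutorial builds everything on). The test problem is mine, purely illustrative.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A by linear CG."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient of
    d = r.copy()                   # 0.5 x'Ax - b'x; first search direction
    rs = r @ r
    for _ in range(len(b)):        # exact in at most n steps (in theory)
        Ad = A @ d
        alpha = rs / (d @ Ad)      # exact minimization along d
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d  # new direction, A-conjugate to the old
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```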
Newton's Method
For full second-order methods, variants of Levenberg-Marquardt remain
state of the art.
Here are seminal refs.
@Article {LEVENBERG44A,
author = "K. Levenberg",
title = "A method for the solution of certain problems in least
squares",
journal = "Quart. Appl. Math",
year = 1944,
volume = 2,
pages = "164--168"
}
@Article {MARQUARDT63A,
author = "D. Marquardt",
title = "An algorithm for least-squares estimation of nonlinear
parameters",
journal = "SIAM J. Appl. Math",
year = 1963,
volume = 11,
pages = "431--441"
}
You wouldn't actually want to read those. Instead, look it up in any
numerical analysis textbook, e.g.
@InBook {MORE77A,
author = "J. J. Mor{\'e}",
title = "The {L}evenberg-{M}arquardt algorithm: implementation and
theory",
pages = "105--116",
booktitle= "Numerical Analysis",
editor = "G. A. Watson",
series = "Lecture Notes in Mathematics 630",
year = 1977,
publisher= SPRINGER,
}
And here is a paper that gives an efficient algorithm for calculating
the diagonal of the Hessian of a deterministic gradient system in O(n)
time, and proposes an optimization algorithm that makes use of this
information.
@inproceedings {BECKER-LECUN88,
crossref= "CMSS88",
author = "Sue Becker and Yann LeCun",
title = "Improving the Convergence of Back-Propagation Learning with
Second Order Methods",
pages = "29--37",
note = "Also published as Technical Report CRG-TR-88-5, "#Dof#CS#
", "#Uof#"Toronto",
}
STOCHASTIC OPTIMIZATION
Fundamental results.
@article {ROBINS-MONRO51,
author = "H. Robbins and S. Monro",
title = "A Stochastic Approximation Method",
journal = "Annals of Mathematical Statistics",
volume = 22,
pages = "400--407",
year = 1951,
}
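The Robbins-Monro iteration itself is tiny. A Python sketch (the toy root-finding problem is mine, not from the paper): x is nudged by -a_n * f(x) with noisy f, using step sizes a_n = a/n, so that the a_n sum diverges while their squares sum converges.

```python
import random

def robbins_monro(noisy_f, x0, steps=20000, a=1.0):
    """Robbins-Monro iteration x <- x - a_n * f(x) for finding the root
    of M(x) = E[f(x)] from noisy evaluations.  Step sizes a_n = a/n
    satisfy the classic conditions: sum a_n diverges, sum a_n**2 converges."""
    x = x0
    for n in range(1, steps + 1):
        x -= (a / n) * noisy_f(x)
    return x

# Toy problem: M(x) = x - 3 observed through unit Gaussian noise; root is 3.
random.seed(0)
root = robbins_monro(lambda x: (x - 3.0) + random.gauss(0.0, 1.0), x0=0.0)
```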
(Russian mathematicians refer to some Russian guy who published the
same thing twenty years earlier in Russian. I don't have the ref
handy. They say the same thing about the FFT though. Who knows...)
Theoretical demolition of momentum.
@inproceedings {SHYNK-ROY88,
author = "J. J. Shynk and S. Roy",
title = "The {LMS} algorithm with momentum updating",
booktitle= "Proceedings of the IEEE International Symposium on
Circuits and Systems",
location= "Helsinki, Finland",
month = jun#"~6--9",
year = 1988,
pages = "2651--2654",
}
@article {TUGAY-TANIK89,
author = "Mehmet Ali Tu\v{g}ay and Yal\c{c}in Tanik",
title = "Properties of the momentum {LMS} algorithm",
journal = "Signal Processing",
publisher= "Elsevier Science Publishers B. V.",
volume = 18,
number = 2,
year = 1989,
month = oct,
pages = "117--127",
}
@article {SHYNK-ROY90,
author = "John J. Shynk and Sumit Roy",
title = "Convergence properties and stationary points of a Perceptron
learning algorithm",
journal = "Proceedings of the IEEE",
volume = 78,
number = 10,
pages = "1599--1604",
year = 1990,
month = oct,
abstract= "If a Gaussian random vector is input into a Perceptron, the
algorithm's stationary points are not unique and the step size mu
and the momentum constant alpha determine the algorithm's
behavior near convergence. A Perceptron is an adaptive linear
neuron that produces one of two discrete values and is used as
the basis for multilayer, feedforward neural networks. The
least-mean-square adaptive algorithm is used to train a
single-layer Perceptron by adjusting internal weights.",
}
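For concreteness, here is a Python sketch of the momentum LMS update the above papers analyze: the usual w <- w + mu*e*x plus alpha times the previous weight change. The channel-identification toy problem and constants are mine.

```python
import random

def lms_momentum(xs, ds, mu=0.05, alpha=0.5, n_taps=2):
    """LMS adaptive filter with a momentum term:
    w <- w + mu * e * x + alpha * (w - w_previous)."""
    w = [0.0] * n_taps
    w_prev = list(w)
    for n in range(n_taps - 1, len(xs)):
        x = xs[n - n_taps + 1 : n + 1][::-1]   # newest sample first
        e = ds[n] - sum(wi * xi for wi, xi in zip(w, x))  # instantaneous error
        w_new = [wi + mu * e * xi + alpha * (wi - pi)
                 for wi, xi, pi in zip(w, x, w_prev)]
        w_prev, w = w, w_new
    return w

# Identify a known 2-tap FIR channel [0.8, -0.3] from white-noise input.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(5000)]
ds = [0.0] + [0.8 * xs[n] - 0.3 * xs[n - 1] for n in range(1, len(xs))]
w = lms_momentum(xs, ds)
```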
A new analytical approach which simplifies the math amazingly, and
immediately extends many of the traditional results.
@Article {SHARMA-ETAL96A,
author = "Rajesh Sharma and William A. Sethares and James A. Bucklew",
title = "Asymptotic analysis of stochastic gradient-based adaptive
filtering algorithms with general cost functions",
journal = ieeetrSP,
year = 1996,
volume = 44,
number = 9,
pages = "2186--2194",
}
Here is some modern stochastic gradient stuff. These resurrect
momentum, using a more fine-grained analysis.
@InProceedings {LEEN-ORR94A,
author = "Todd K. Leen and Genevieve B. Orr",
title = "Momentum and Optimal Stochastic Search",
crossref= "NIPS6",
url = "ftp://speech.cse.ogi.edu/pub/neural/papers/leenOrr93.AdaptMom.ps.Z"
}
@InProceedings {ORR-LEEN93A,
author = "G. B. Orr and T. K. Leen",
title = "Weight Space Probability Densities in Stochastic Learning:
II. Transients and Basin Hopping Times",
crossref= "NIPS5",
url = "ftp://speech.cse.ogi.edu/pub/neural/papers/leenOrr92.stochastic.ps.Z"
}
@PhdThesis {ORR95A,
author = "Genevieve B. Orr",
title = "Dynamics and Algorithms for Stochastic Search",
school = "Oregon Graduate Institute",
year = 1995,
address = "Portland, OR",
url = "ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z,
ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch6-9.ps.Z"
}
And here is a paper that gives an efficient algorithm for quickly
finding the product of the Hessian with an arbitrary vector. It is
used a bit in the above.
@Article {PEARLMUTTER94A,
author = "Barak A. Pearlmutter",
title = "Fast Exact Multiplication by the {H}essian",
journal = NeuComp,
year = 1994,
volume = 6,
number = 1,
pages = "147--160",
FTPnote = "Ftp archive.cis.ohio-state.edu:
/pub/neuroprose/pearlmutter.hessian.ps.Z"
}
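Note for the impatient: the cheap way to get Hv without ever forming H is a directional finite difference of the gradient; Pearlmutter's paper shows how to compute the same quantity *exactly* at about the cost of one extra gradient evaluation. Here is a Python sketch of the finite-difference version (not the paper's exact method; the quadratic test function is mine):

```python
import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-6):
    """Approximate H v in O(n) time by a central finite difference of
    the gradient along v:  H v ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    Pearlmutter's R-operator computes this quantity exactly, also at
    roughly the cost of one extra gradient-like pass."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# Check on a quadratic f(w) = 0.5 w'Aw, whose Hessian is exactly A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda w: A @ w
v = np.array([1.0, -1.0])
hv = hessian_vector_product(grad, np.array([0.3, 0.7]), v)
```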
And here are some more papers that use the above algorithm for
optimization. Sorry for the overkill...
@Article {MOLLER93A,
author = "Martin M{\o}ller",
title = "A Scaled Conjugate Gradient Algorithm for Fast Supervised
Learning",
journal = "Neural Networks",
year = 1993,
month = jun,
volume = 6,
number = 4,
pages = "525--533",
}
@Article {MOLLER93B,
author = "Martin M{\o}ller",
title = "Supervised Learning on Large Redundant Training Sets",
journal = "International Journal of Neural Systems",
year = 1993,
volume = 4,
number = 1,
pages = "15--25",
}
@InBook {MOLLER94B,
crossref= "MOLLER94A",
title = "Adaptive preconditioning of the {H}essian matrix",
pages = "127--149"
}
@PhdThesis {MOLLER94A,
author = "Martin M{\o}ller",
title = "Efficient Training of Feed-Forward Neural Networks",
school = "Comp. Sci. Dept., Aarhus University",
year = 1994,
address = "Denmark",
note = "DAIMI PB-464"
}
@PhdThesis {LEHR96A,
author = "Michael A. Lehr",
title = "Scaled Stochastic Methods for Training Neural Networks",
school = "Department of Electrical Engineering, Stanford University",
year = 1996,
note = "Available from http://www-isl.stanford.edu/people/lehr/ or
ftp://simoon.stanford.edu/pub/lehr/thesis/.",
annote = "contains different derivation of R{backprop} Hessian stuff"
}
This paper proposed adapting the learning rates. Unfortunately it is
formulated in a batch context; ah well. It's the basis for the more
advanced papers that follow it.
@article {JACOBS88,
author = "Robert A. Jacobs",
title = "Increased rates of convergence through learning
rate adaptation",
journal = NNets,
publisher= "Pergamon Press",
year = 1988,
volume = 1,
number = 4,
pages = "295--307"
}
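The rule itself is simple: each weight gets its own learning rate, bumped up additively when the current gradient agrees in sign with an exponential trace of past gradients, and cut multiplicatively when it disagrees. A Python sketch (the constants and the toy quadratic are my own, not Jacobs' exact settings):

```python
def delta_bar_delta_step(w, lr, bar, grad, kappa=0.01, phi=0.5, theta=0.7):
    """One step of a delta-bar-delta-style rule.  Each weight keeps its
    own rate: increased additively by kappa when the current gradient
    agrees in sign with the trace of past gradients, decreased
    multiplicatively by phi when it disagrees."""
    new_w, new_lr, new_bar = [], [], []
    for wi, li, bi, gi in zip(w, lr, bar, grad):
        if bi * gi > 0:
            li += kappa            # consistent sign: speed up
        elif bi * gi < 0:
            li *= phi              # sign flip: slow down
        new_lr.append(li)
        new_w.append(wi - li * gi)
        new_bar.append((1 - theta) * gi + theta * bi)  # gradient trace
    return new_w, new_lr, new_bar

# Minimize f(w) = (w0 - 1)^2 + 10*(w1 + 2)^2 with per-weight rates.
w, lr, bar = [0.0, 0.0], [0.01, 0.01], [0.0, 0.0]
for _ in range(200):
    g = [2 * (w[0] - 1), 20 * (w[1] + 2)]
    w, lr, bar = delta_bar_delta_step(w, lr, bar, g)
```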
This paper takes a control-theoretic approach to stochastic
optimization.
@article {DABIS-MOIR91,
author = "H. S. Dabis and T. J. Moir",
title = "Least mean squares as a control system",
journal = IJC,
publisher= "Taylor and Francis Ltd",
year = 1991,
volume = 54,
number = 2,
pages = "321--335",
}
I've saved the best for last. Here is the current best linear
complexity stochastic optimization algorithm. It is apparently
competitive with recursive least squares, the best second-order
quadratic complexity stochastic optimization algorithm!
@InProceedings {SUTTON92B,
crossref= "YWALS7",
author = "Richard S. Sutton",
title = "Gain Adaptation Beats Least Squares?",
pages = "161--166",
keywords= "delta-bar-delta",
url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92b.ps.gz"
}
@InProceedings {SUTTON92C,
crossref= "AAAI92",
author = "Richard S. Sutton",
title = "Adapting Bias by Gradient Descent: An Incremental Version of
Delta-Bar-Delta",
url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92a.ps.gz"
}
@InProceedings {GLUCK92A,
crossref= "COGSCI92",
author = "Mark A. Gluck and Paul T. Glauthier and Richard S. Sutton",
title = "Adaptation of Cue-Specific Learning Rates in Network Models of
Human Category Learning",
keywords= "delta-bar-delta"
}
EM
Seminal ref.
@article {DEMPSTER76,
author = "A. P. Dempster and N. M. Laird and D. B. Rubin",
title = "Maximum likelihood from incomplete data via the {EM}
algorithm",
journal = "Journal of the Royal Statistical Society, Series {B}",
year = 1977,
volume = 39,
number = 1,
pages = "1--38"
}
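To see the E-step/M-step alternation concretely, here is a minimal Python sketch of EM for a two-component 1-D Gaussian mixture with unit variances held fixed (the data and initialization are my own illustration):

```python
import math, random

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture, unit variances held
    fixed, so only the means and mixing weights are learned."""
    mu = [min(xs), max(xs)]        # crude but effective initialization
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of component 1 for each point
        r = [pi[1] * math.exp(-0.5 * (x - mu[1]) ** 2) /
             (pi[0] * math.exp(-0.5 * (x - mu[0]) ** 2) +
              pi[1] * math.exp(-0.5 * (x - mu[1]) ** 2))
             for x in xs]
        # M-step: re-estimate mixing weights and means
        n1 = sum(r)
        n0 = len(xs) - n1
        pi = [n0 / len(xs), n1 / len(xs)]
        mu = [sum((1 - ri) * x for ri, x in zip(r, xs)) / n0,
              sum(ri * x for ri, x in zip(r, xs)) / n1]
    return pi, mu

random.seed(1)
xs = ([random.gauss(-2.0, 1.0) for _ in range(300)]
      + [random.gauss(3.0, 1.0) for _ in range(300)])
pi, mu = em_gmm_1d(xs)
```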
For more modern stuff, look up some papers from Mike Jordan's group
via http://www.ai.mit.edu/projects/jordan.html, their intros and refs
are excellent hooks into the literature.
Hidden Markov Models (HMMs)
@Article {LEVINSON-ETAL83A,
author = "S. E. Levinson and L. R. Rabiner and M. M. Sondhi",
title = "An Introduction to the Application of the Theory of
Probabilistic Functions of a Markov Process to Automatic
Speech Recognition",
journal = "Bell System Technical Journal",
volume = 62,
number = 4,
month = apr,
year = 1983,
pages = "1035--1074",
comment = "Analysis of implementation of Hidden Markov models and the
Baum-Welch algorithm",
}
See also Peter Brown's PhD thesis (CMU) and Kai-Fu Lee's PhD thesis
(also CMU). The latter has a particularly good introductory chapter on
HMMs, and a lot of stuff on tricks that allowed his system to beat
similar efforts by much larger, much more experienced groups.
(Sorry, I don't have URLs or ISBNs. I do have copies on my shelves...)
For fun, nothing beats this very provocative paper which views French
as English passed through a strange communication channel, uses a
large corpus (and the EM algorithm) to build a model of the channel,
and then does translation by finding the input to the channel most
likely to have produced the observed output. It even mostly works!
@article {BROWN-COCKE-ET-AL90,
author = "Peter Brown and John Cocke and Stephen Della Pietra
and Vincent J. Della Pietra and Fredrick Jelinek and
John D. Lafferty and Robert L. Mercer and Paul S. Roossin",
title = "A Statistical Approach to Machine Translation",
journal = "Computational Linguistics",
pages = "79--85",
volume = 16,
number = 2,
month = jun,
year = 1990,
}
The Kalman Filter
@Article {KALMAN60A,
author = "R. E. Kalman",
title = "A New Approach to Linear Filtering and Prediction Problems",
journal = "Trans. ASME "#Jof#"Basic Engineering",
year = 1960,
month = mar,
volume = 82,
number = 1,
pages = "35--45"
}
The Extended Kalman Filter
@Article {MEHRA70A,
author = "R. K. Mehra",
title = "On the Identification of Variances and Adaptive {K}alman
Filtering",
journal = ieeetrAC,
year = 1970,
volume = "AC-15",
number = 2,
pages = "175--184",
month = apr,
}
And an application thereof.
@Article {PUSKORIUS-FELDKAMP94A,
author = "G. V. Puskorius and L. A. Feldkamp",
title = "Neurocontrol of Nonlinear Dynamical Systems with {K}alman
Filter-Trained Recurrent Networks",
journal = IEEEtrNN,
volume = 5,
number = 2,
pages = "279--297",
year = 1994
}
MODEL BASED RECOGNITION
vs
DISCRIMINATION BASED RECOGNITION
Motor theory of speech. A good row in the journals exposes the seamy
underbelly of psychology!
@article {LANE65,
author = "Lane, H. L.",
year = 1965,
title = "The motor theory of speech perception: A critical review",
journal = "Psychological Review",
volume = 72,
pages = "275--309"
}
@article {STUDDERT-KENNEDY-LIBERMAN-HARRIS-COOPER70,
author = "Studdert-Kennedy, M. and Liberman, A. M. and Harris, K. S. and
Cooper, F. S.",
year = 1970,
title = "Motor Theory of Speech Perception: A Reply to Lane's Critical
Review",
journal = PsychRev,
volume = 77,
pages = "234--249"
}
@Proceedings {MOTORHEAR91,
title = "Modularity and the Motor Theory of Speech Perception:
Proceedings of a Conference to Honor Alvin M. Liberman",
booktitle= "Modularity and the Motor Theory of Speech Perception:
Proceedings of a Conference to Honor Alvin M. Liberman",
year = 1991,
editor = "Ignatius G. Mattingly and Michael Studdert-Kennedy",
publisher= ERLBAUM
}
N-ARMED BANDITS
These are excellent texts on bandits.
@Book {BERRY-FRISTEDT85A,
author = "Donald A. Berry and Bert Fristedt",
title = "Bandit Problems: Sequential Allocation of Experiments",
publisher= CHAPMAN-HALL,
year = 1985,
series = "Monographs on statistics and applied probability"
}
@book {NARENDRA-THATHACHAR89A,
author = "K. S. Narendra and M. A. L. Thathachar",
title = "Learning Automata: an Introduction",
year = 1989,
publisher= PRENTICE,
}
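If you've never seen one, the bandit setting is easy to play with. Here is a Python sketch of sample-average epsilon-greedy play on a 3-armed Gaussian bandit (the arm means are made up):

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, eps=0.1, seed=0):
    """Sample-average epsilon-greedy play of an n-armed bandit with
    unit-variance Gaussian payoffs."""
    random.seed(seed)
    n = len(true_means)
    counts = [0] * n
    values = [0.0] * n            # running estimate of each arm's mean
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(n)                        # explore
        else:
            arm = max(range(n), key=lambda i: values[i])     # exploit
        reward = random.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

counts, values = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```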
I've also heard that this is good, but have not seen it myself.
@Book {PRESMAN-SONIN90A,
author = "Ernst Lvovich Presman and Isaak Mikhailovich Sonin",
title = "Sequential control with incomplete information: the Bayesian
approach to multi-armed bandit problems",
publisher= ACADEMIC,
year = 1990,
annote = "translated from the Russian"
}
This is a little dated, but has a slant of interest to people working
on genetic algorithms.
@TechReport {SIVAPALAN-ETAL87A,
author = "T. Sivapalan and David E. Goldberg",
title = "The Two-Armed Bandit Problem: A Bibliography, 1952--Present",
institution= "University of Alabama, Dept. of Engineering Mechanics,
Clearinghouse for Genetic Algorithms",
year = 1987,
number = "TCGA 87002",
address = "Tuscaloosa, Alabama",
month = nov
}
This is off topic, but one of the funniest books I've ever read.
@Book {WEISBECKER86A,
author = "Alan C. Weisbecker",
title = "Cosmic Banditos: a Contrabandista's Quest for the Meaning of
Life",
publisher= "Vintage Books",
year = 1986,
address = "New York",
quote = "Mr Quark: Please leave me and my family alone."
}
These view neurons as little bandit automata. Considered classics.
@article {BARTO83,
author = "Barto, A. G. and Sutton, R. S. and Anderson, C. W.",
title = "Neuron-like Adaptive Elements that Can Solve Difficult
Learning Control Problems",
journal = ieeetrSMC,
volume = 13,
pages = "835--846",
year = 1983,
annote = "Reprinted in {\it Neurocomputing: Foundations of Research},
J. A. Anderson and E. Rosenfeld (Eds.), Cambridge, MA: The MIT
Press, 1988, pp. 537--549"
}
@article {BARTO85,
author = "Barto, A. G. and P. Anandan",
title = "Pattern Recognizing Stochastic Learning Automata",
pages = "360--375",
journal = ieeetrSMC,
volume = 15,
year = 1985
}
REINFORCEMENT LEARNING SYSTEMS
The classic ad-hoc bucket-brigade self-learning checkers player.
@Article {SAMUEL60A,
author = "A. L. Samuel",
title = "Programming Computers to Play Games",
journal = "Advances in Computers",
year = 1960,
volume = 1,
pages = "165--192"
}
@article {SAMUEL67,
author = "Samuel, A. L.",
title = "Some studies in machine learning using the game of
checkers {II}---Recent progress",
journal = ibmjrd,
year = 1967,
volume = 11
}
More bucket-brigades. This book has a lot in it, but the lack of a
systematic and rigorous approach makes it difficult to separate what
worked from what he thought was cool.
@book {HOLLAND75,
author = "Holland, J. H.",
title = "Adaptation in Natural and Artificial Systems",
publisher= Uof#"Michigan Press",
year = 1975
}
There are a couple important modern algorithms. Aside from Bandit
algorithms, the method of temporal differences is probably most
important. It is not strictly a reinforcement learning algorithm, as
it only estimates the expected return of a state, but I'm putting it
here anyway. (It is related to the bucket brigade, but has had some
serious practical impact.)
@Article {SUTTON88A,
author = "R. S. Sutton",
title = "Learning To Predict by the Methods of Temporal Differences",
journal = "Machine Learning",
year = 1988,
volume = 3,
pages = "9--44",
url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-88.ps.gz"
}
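TD(0) in its entirety, as a Python sketch on the bounded random walk used as the running example in this paper (true values are s/6; the step size and episode count are my choices):

```python
import random

def td0_random_walk(episodes=20000, alpha=0.02, seed=0):
    """Tabular TD(0) on the bounded random walk: states 1..5, absorbing
    barriers at 0 and 6, reward 1 only for exiting on the right.
    The true value of state s under the random policy is s/6."""
    random.seed(seed)
    V = [0.0] * 7                        # V[0] and V[6] are terminal
    for _ in range(episodes):
        s = 3                            # every walk starts in the middle
        while 0 < s < 6:
            s2 = s + random.choice((-1, 1))
            r = 1.0 if s2 == 6 else 0.0
            bootstrap = V[s2] if 0 < s2 < 6 else 0.0
            V[s] += alpha * (r + bootstrap - V[s])   # TD(0) update
            s = s2
    return V[1:6]

V = td0_random_walk()
```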
Here is a fun paper, which embodies a lot of different ideas in a
single system called DYNA.
@InBook {SUTTON90A,
crossref= "NNfC",
author = "Richard S. Sutton",
title = "First Results with {DYNA}, an Integrated Architecture for
Learning, Planning and Reacting",
chapter = 8,
pages = "179--189",
annote = "see also ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-91b.ps.gz"
}
@InProceedings {SUTTON90B,
author = "Richard S. Sutton",
title = "Integrated Architectures for Learning, Planning, and Reacting
Based on Approximating Dynamic Programming",
crossref= "ICML7",
url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-90.ps.gz"
}
Some more TD stuff.
@InProceedings {TESAURO92A,
author = "G. J. Tesauro",
title = "Practical Issues in Temporal Difference Learning",
crossref= "NIPS4",
pages = "260--266"
}
@InProceedings {SUTTON-SINGH94A,
author = "Richard S. Sutton and Satinder P. Singh",
title = "On Bias and Step-Size in Temporal-Difference Learning",
crossref= "YWALS9",
pages = "91--96",
url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-singh-94.ps.gz"
}
@InBook {SUTTON-BARTO90B,
author = "Richard. S. Sutton and Andrew. G. Barto",
title = "A Time-Derivative Theory of Pavlovian Conditioning",
crossref= "LCNS",
pages = "497--537"
}
The seminal Q-learning reference
@PhdThesis {WATKINS89,
author = "C. J. C. H. Watkins",
title = "Learning from Delayed Rewards",
school = "Cambridge University",
year = "1989",
address = "Cambridge, England",
}
Something a little easier to get ahold of
@Article {WATKINS-DAYAN92A,
author = "Watkins, C. J. C. H. and Dayan, P.",
title = "Technical Note: Q-Learning",
journal = "Machine Learning",
year = 1992,
volume = 8,
number = "3--4",
pages = "279--292"
}
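And the Q-learning update itself, as a Python sketch on a made-up deterministic chain: reward 1 only for stepping right out of the last state, so with discount gamma the optimal Q(s, right) is gamma^(n-1-s).

```python
import random

def q_learning_chain(n=5, episodes=2000, alpha=0.1, gamma=0.9,
                     eps=0.2, seed=0):
    """Tabular Q-learning on a deterministic chain: states 0..n-1,
    action 0 = left, action 1 = right.  Stepping right out of the last
    state pays reward 1 and ends the episode; everything else pays 0."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(episodes):
        s = 0
        while True:
            if random.random() < eps:
                a = random.randrange(2)              # explore
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1    # exploit (ties go right)
            if s == n - 1 and a == 1:                # goal transition, terminal
                Q[s][a] += alpha * (1.0 - Q[s][a])
                break
            s2 = s + 1 if a == 1 else max(0, s - 1)
            # off-policy target: reward 0 plus discounted best next value
            Q[s][a] += alpha * (gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning_chain()
```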
A sort of manifesto attempting to unify two disciplines.
@InProceedings {SUTTON-ETAL91,
crossref= "ACC91",
author = "R. S. Sutton and A. G. Barto and R. J. Williams",
title = "Reinforcement Learning is Direct Adaptive Optimal Control",
pages = "2143--2146"
}
A nice collection.
@Book {SUTTON92A,
title = "Reinforcement Learning",
booktitle= "Reinforcement Learning",
publisher= KLUWER,
year = 1992,
editor = "Richard S. Sutton",
series = "SECS",
volume = 173,
note = "Reprinted from volume 8(3--4) (1992) of Machine Learning"
}
More Q-learning
(boltzmann-distribution paper which shows convergence)
On the convergence of stochastic iterative dynamic programming algorithms
Tommi Jaakkola, Michael Jordan, Satinder Singh.
http://web.mit.edu/~tommi/Public/conv.html
ftp://psyche.mit.edu/pub/tommi/jaak-convergence.ps.gz
POMDP (partially observable Markov decision process) policy learning
Reinforcement learning algorithm for partially observable Markov decision problems
Jaakkola, Singh, and Jordan / NIPS-7
http://web.mit.edu/~tommi/Public/nips7.html
ftp://psyche.mit.edu/pub/tommi/jaak-nips7.ps.gz
******* to add (not in my bib file, urk) *******
Gerry Tesauro's world champion TD-gammon. Neural Computation.
http://www.research.ibm.com/massdist/tdl.html
Hans Berliner's BKG 9.2 article, Scientific American. Blemishes by
Clever Hans.
peter brown thesis
kai-fu lee thesis
BILL, the othello program. (eval function tuned w/ reinforcement
learning, sort of.)
Hierarchical reinforcement learning
hierarchical EM stuff
the bad lieutenant's stuff (Harmon):
sutton's extended delta-bar-delta -> nonlinear
advantage learning
reinforcement learning tutorial
http://ace.aa.wpafb.af.mil:80/~harmonme/pubs.html
Kenji Doya, continuous-time TD, NIPS*9x
that bandit H-something races paper from NIPS a few years ago.
Nice formalism.
the recent COLT community work on adaptive approximations to optimal
balanced portfolios.
add textbooks.
backpropagation & automatic differentiation
helmholtz machines
boltzmann machines
Dayan & Hinton NIPS paper on hierarchical reinforcement learning
section on generalization
PAC learning
An Introduction to Computational Learning Theory
Michael J. Kearns and Umesh V. Vazirani
1994
MIT Press
ISBN 0-262-11193-4
http://www-mitpress.mit.edu/mitp/recent-books/comp/kearns.html
combining learners
Boosting, especially the AdaBoost algorithm
http://www.research.att.com/~schapire/boost.html
Exponentiated Gradient Descent