DETERMINISTIC OPTIMIZATION Conjugate Gradient The Conjugate Gradient algorithm is probably the best deterministic algorithm for high-dimensional problems. The following is very popular book. *Never* use any code from it!! ***NOT EVEN ONCE***!!!! However, the text is not too bad, although it does contain the occasional blooper and a great deal of omitted material. The chapter on ordinary differential equations is particularly outdated, wedged in the '60s. @Book {PRESS-ETAL88A, author = "William H. Press and Brian P. Flannery and Saul A. Teukolsky and William T. Verrerling", title = "Numerical Recipes in {C}", publisher= CAMBRIDGE, year = 1988 } Here is a seminal reference. @book {JEA82, author = "Kang Chang Jea", title = "Generalized conjugate gradient acceleration of iterative methods", publisher= Uof#"Texas at Austin, Center for Numerical Analysis", year = 1982, note = "PhD thesis", } And here is a *great* paper, which really explains everything, and is accessible even if you didn't finish an advanced numeric analysis course just last week. @TechReport {SHEWCHUK94A, author = "Jonathan Richard Shewchuk", title = "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain", institution= CMU#" School of Computer Science", year = 1994, note = "http://www.cs.cmu.edu/afs/cs/project/quake/public/papers/painless-conjugate-gradient.ps or ftp://warp.cs.cmu.edu/quake-papers/painless-conjugate-gradient.ps", annote = "For figures suitable for transparencies and lectures, see ftp://warp.cs.cmu.edu/quake-papers/painless-conjugate-gradient-pics.ps" } Newton's Method For full second-order methods, variants of Levenberg-Marquat remain state of the art. Here are seminal refs. @Article {LEVENBERG44A, author = "K. Levenberg", title = "A method for the solution of certain problems in least squares", journal = "Quart. Appl. Math", year = 1944, volume = 2, pages = "164--168" } @Article {MARQUARDT63A, author = "D. Marquardt", title = "An algorithm for least-squares estimation of nonlinear parameters", journal = "SIAM J. Appl. Math", year = 1963, volume = 11, pages = "431--441" } You wouldn't actually want to read those. Instead look it up in any numerical analysis textbook. Eg. @InBook {MORE77A, author = "J. J. More'", title = "The {L}evenberg-{M}arquardt algorithm: implementation and theory", pages = "105--116", booktitle= "Numerical Analysis", editor = "G. A. Watson", series = "Lecture Notes in Mathematics 630", year = 1977, publisher= SPRINGER, } And here is a paper that gives an efficient algorithm for calculating the diagonal of the Hessian of a determinist gradient system in O(n) time, and proposes an optimization algorithm that makes use of this information. @inproceedings {BECKER-LECUN88, crossref= "CMSS88", author = "Sue Becker and Yann LeCun", title = "Improving the Convergence of Back-Propagation Learning with Second Order Methods", pages = "29--37", note = "Also published as Technical Report CRG-TR-88-5, "#Dof#CS# ", "#Uof#"Toronto", } STOCHASTIC OPTIMIZATION Fundamental results. @article {ROBINS-MONRO51, author = "H. Robbins and S. Monro", title = "A Stochastic Approximation Method", journal = "Annals of Mathematical Statistics", volume = 22, pages = "400--407", year = 1951, } (Russian mathematicians refer to some Russian guy who published the same thing twenty years earlier in Russian. I don't have the ref handy. They say the same thing about the FFT though. Who knows...) Theoretical demolishing of momentum. @inproceedings {SHYNK-ROY88, author = "J. J. Shynk and S. Roy", title = "The {LMS} algorithm with momentum updating", booktitle= "Proceedings of the IEEE International Symposium on Circuits and Systems", location= "Helsinki, Finland", month = jun#"~6--9", year = 1988, pages = "2651--2654", } @article {TUGAY-TANIK89, author = "Mehmet Ali Tu\v{g}ay and Yal\c{c}in Tanik", title = "Properties of the momentum {LMS} algorithm", journal = "Signal Processing", publisher= "Elsevier Science Publishers B. V.", volume = 18, number = 2, year = 1989, month = oct, pages = "117--127", } @article {SHYNK-ROY90, author = "John J. Shynk and Sumit Roy", title = "Convergence properties and stationary points of a Perceptron learning algorithm", journal = "Proceedings of the IEEE", volume = 78, number = 10, pages = "1599--1604", year = 1990, month = oct, abstract= "If a Gaussian random vector is input into a Perceptron, the algorithm's stationary points are not unique and the step size mu and the momentum constant alpha determine the algorithm's behavior near convergence. A Perceptron is an adaptive linear neuron that produces one of two discrete values and is used as the basis for multilayer, feedforward neural networks. The least-mean-square adaptive algorithm is used to train a single-layer Perceptron by adjusting internal weights.", } A new analytical approach which simplifies the math amazingly, and immediately extends many of the traditional results. @Article {SHARMA-ETAL96A, author = "Rajesh Sharma and William A. Sethares and James A. Bucklew", title = "Asymptotic analysis of stochastic gradient-based adaptive filtering algorithms with general cost functions", journal = ieeetrSP, year = 1996, volume = 44, number = 9, pages = "2186--2194", } Here is some modern stochastic gradient stuff. These resurect momentum, using a more fine-grained analysis. @InProceedings {LEEN-ORR94A, author = "Todd K. Leen and Genevieve B. Orr", title = "Momentum and Optimal Stochastic Search", crossref= "NIPS6", url = "ftp://speech.cse.ogi.edu/pub/neural/papers/leenOrr93.AdaptMom.ps.Z" } @InProceedings {ORR-LEEN93A, author = "G. B. Orr and T. K. Leen", title = "Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times", crossref= "NIPS5", url = "ftp://speech.cse.ogi.edu/pub/neural/papers/leenOrr92.stochastic.ps.Z" } @PhdThesis {ORR95A, author = "Genevieve B. Orr", title = "Dynamics and Algorithms for Stochastic Search", school = "Oregon Graduate Institute", year = 1995, address = "Portland, OR", url = "ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z, ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch6-9.ps.Z" } And here is a paper that gives an efficient algorithm for quickly finding the product of the Hessian with an arbitrary vector. It is used a bit in the above. @Article {PEARLMUTTER94A, author = "Barak A. Pearlmutter", title = "Fast Exact Multiplication by the {H}essian", journal = NeuComp, year = 1994, volume = 6, number = 1, pages = "147--160", FTPnote = "Ftp archive.cis.ohio-state.edu: /pub/neuroprose/pearlmutter.hessian.ps.Z" } And here are some more papers that uses the above algorithm for optimization. Sorry for the overkill... @Article {MOLLER93A, author = "Martin M{\o}ller", title = "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning", journal = "Neural Networks", year = 1993, month = jun, volume = 6, number = 4, pages = "525--533", } @Article {MOLLER93B, author = "Martin M{\o}ller", title = "Supervised Learning on Large Redundant Training Sets", journal = "International Journal of Neural Systems", year = 1993, volume = 4, number = 1, pages = "15--25", } @InBook {MOLLER94B, crossref= "MOLLER94A", title = "Adaptive preconditioning of the {H}essian matrix", pages = "127--149" } @PhdThesis {MOLLER94A, author = "Martin M{\o}ller", title = "Efficient Training of Feed-Forward Neural Networks", school = "Comp. Sci. Dept., Aarhus University", year = 1994, address = "Denmark", note = "DAIMI PB-464" } @PhdThesis {LEHR96A, author = "Michael A. Lehr", title = "Scaled Stochastic Methods for Training Neural Networks", school = "Department of Electrical Engineering, Stanford University", year = 1996, note = "Available from http://www-isl.stanford.edu/people/lehr/ or ftp://simoon.stanford.edu/pub/lehr/thesis/.", annote = "contains different derivation of R{backprop} Hessian stuff" } This paper proposed adapting the learning parameters. Unfortunately, in a batch context, ah well. It's the basis for the more advanced papers that follow it. @article {JACOBS88, author = "Robert A. Jacobs", title = "Increased rates of convergence through learning rate adaptation", journal = NNets, publisher= "Pergamon Press", year = 1988, volume = 1, number = 4, pages = "295--307" } This paper takes a control-theoretic approach to stochastic optimization. @article {DABIS-MOIR91, author = "H. S. Dabis and T. J. Moir", title = "Least mean squares as a control system", journal = IJC, publisher= "Taylor and Francis Ltd", year = 1991, volume = 54, number = 2, pages = "321--335", } I've saved the best for last. Here is the current best linear complexity stochastic optimization algorithm. It is apparently competitive with recursive least squares, the best second-order quadratic complexity stochstic optimization algorithm! @InProceedings {SUTTON92B, crossref= "YWALS7", author = "Richard S. Sutton", title = "Gain Adaptation Beats Least Squares?", pages = "161--166", keywords= "delta-bar-delta", url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92b.ps.gz" } @InProceedings {SUTTON92C, crossref= "AAAI92", author = "Richard S. Sutton", title = "Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta", url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92a.ps.gz" } @InProceedings {GLUCK92A, crossref= "COGSCI92", author = "Mark A. Gluck and Paul T. Glauthier and Richard S. Sutton", title = "Adaptation of Cue-Specific Learning Rates in Network Models of Human Category Learning", keywords= "delta-bar-delta" } EM Seminal ref. @article {DEMPSTER76, author = "A. P. Dempster and N. M. Laird and D. B. Rubin", title = "Maximum likelihood from incomplete data via the {EM} algorithm", journal = "Proceedings of the Royal Statistical Society", year = 1976, pages = "1--38" } For more modern stuff, look up some papers from Mike Jordan's group via http://www.ai.mit.edu/projects/jordan.html, their intros and refs are excellent hooks into the literature. Hidden Markov Models (HMMs) @Article {LEVINSON-ETAL83A, author = "S. E. Levinson and L. R. Rabiner and M. M. Sondhi", title = "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", journal = "Bell System Technical Journal", volume = 62, number = 4, month = apr, year = 1983, pages = "1035--1074", comment = "Analysis of implementation of Hidden Markov models and the Baum-Welch algorithm", } See also Peter Brown's PhD thesis (CMU) and Kai-Fu Lee's PhD thesis (also CMU). The latter has a particularly good intro to HMMs chapter, and a lot of stuff on tricks that allowed his system to beat similar efforts by much much larger much much more experienced groups. (Sorry, I don't have url's or ISBNs. I do have copies on my shelves...) For fun, nothing beats this very provocative paper which views French as English passed through a strange communication channel, uses a large corpus (and the EM algorithm) to build a model of the channel, and then does translation by finding the input to the channel most likely to have produced the observed output. It even mostly works! @article {BROWN-COCKE-ET-AL90, author = "Peter Brown and John Cocke and Stephen Della Pietra and Vincent J. Della Pietra and Fredrick Jelinek and John D. Lafferty and Robert L. Mercer and Paul S. Roossin", title = "A Statistical Approach to Machine Translation", journal = "Computational Linguistics", pages = "79--85", volume = 16, number = 2, month = jun, year = 1990, } the Kalman Filter @Article {KALMAN60A, author = "R. E. Kalman", title = "A New Approach to Linear Filtering and Prediction Problems", journal = "Trans. ASME "#Jof#"Basic Engineering", year = 1960, month = mar, volume = 82, number = 1, pages = "35--45" } The Extended Kalman Filter @Article {MEHRA70A, author = "R. K. Mahra", title = "On the Identification of Variances and Adaptive {K}alman Filtering", journal = ieeetrAC, year = 1970, volume = "AC-15", number = 2, pages = "175--184", month = apr, } And an application thereof. @Article {PUSKORIUS-FELDKAMP94A, author = "G. V. Puskorius and L. A. Feldkamp", title = "Neurocontrol of Nonlinear Dynamical Systems with {K}alman Filter-Trained Recurrent Networks", journal = IEEEtrNN, volume = 5, number = 2, pages = "279--297", year = 1994 } MODEL BASED RECOGNITION vs DISCRIMINATION BASED RECOGNITION Motor theory of speech. A good row in the journals exposes the seamy underbelly of psychology! @article {LANE65, author = "Lane, H. L.", year = 1965, title = "The motor theory of speech perception: A critical review", journal = "Psychology Review", volume = 72, pages = "275--309" } @article {STUDDERT-KENNEDY-LIBERMAN-HARRIS-COOPER70, author = "Studdert-Kennedy, M. and Liberman, A. M and Harris, K. S. and Cooper, F. S.", year = 1970, title = "Motor Theory of Speech Perception: A Reply to Lane's Critical Review", journal = PsychRev, volume = 77, pages = "234--249" } @Proceedings {MOTORHEAR91, title = "Modularity and the Motor Theory of Speech Perception: Proceedings of a Conference to Honor Alvin M. Liberman", booktitle= "Modularity and the Motor Theory of Speech Perception: Proceedings of a Conference to Honor Alvin M. Liberman", year = 1991, editor = "Ignatius G. Mattingly and Michael Studdert-Kennedy", publisher= ERLBAUM } N-ARMED BANDITS These are excellent texts on bandits. @Book {BERRY-FRISTEDT85A, author = "Donald A. Berry and Bert Fristedt", title = "Bandit Problems: Sequential Allocation of Experiments", publisher= CHAPMAN-HALL, year = 1985, series = "Monographs on statistics and applied probability" } @book {NARENDRA-THATHACHAR89A, author = "K. S. Narendra and M. A. L. Thathachar", title = "Learning Automata: an Introduction", year = 1989, publisher= PRENTICE, } I've also heard that this is good, but have not seen it myself. @Book {PRESMAN-SONIN90A, author = "Ernst Lvovich Presman and Isaak Mikhailovich Sonin", title = "Sequential control with incomplete information: the Bayesian approach to multi-armed bandit problems", publisher= ACADEMIC, year = 1990, annote = "translated from the Russian" } This is a little dated, but has a slant of interest to people working on genetic algorithms. @TechReport {SIVAPALAN-ETAL87A, author = "T. Sivapalan and David E. Goldberg", title = "The Two-Armed Bandit Problem: A Bibliography, 1952-Present", institution= "University of Alabama, Dept. of Engineering Mechanics, Clearinghouse for Genetic Algorithms", year = 1987, number = "TCGA 87002", address = "Tuscaloosa, Alabama", month = nov } This is off topic, but one of the funniest books I've ever read. @Book {WEISBECKER86A, author = "Alan C. Weisbecker", title = "Cosmic Banditos: a Contrabandista's Quest for the Meaning of Life", publisher= "Vintage Books", year = 1986, address = "New York", quote = "Mr Quark: Please leave me and my family alone." } These view neurons as little bandit automata. Considered classics. @article {BARTO83, author = "Barto, A. G. and Sutton, R. S. and Anderson, C. W.", title = "Neuron-like Adaptive Elements that Can Solve Difficult Learning Control Problems", journal = ieeetrSMC, volume = 13, pages = "835--846", year = 1983, annote = "Reprinted in {\it Neurocomputing: Foundations of Research}, J. A. Anderson and E. Rosenfeld (Eds.), Cambridge, MA: The MIT Press, 1988, pp. 537--549" } @article {BARTO85, author = "Barto, A. G. and P. Anandan", title = "Pattern Recognizing Stochastic Learning Automata", pages = "360--375", journal = ieeetrSMC, volume = 15, year = 1985 } REINFORCEMENT LEARNING SYSTEMS The classic ad-hoc bucket-brigade self-learning checkers player. @Article {SAMUEL60A, author = "A. L. Samuel", title = "Programming Computers to Play Games", journal = "Advances in Computers", year = 1960, volume = 1, pages = "165--192" } @article {SAMUEL67, author = "Samuel, A. L.", title = "Some studies in machine learning using the game of checkers {II}---Recent progress", journal = ibmjrd, year = 1967, volume = 11 } More bucket-brigades. This book has a lot in it, but the lack of a systematic and rigorous approach makes it difficult to separate what worked from what he thought was cool. @book {HOLLAND75, author = "Holland, J. H.", title = "Adaptation in Natural and Artificial Systems", publisher= Uof#"Michigan Press", year = 1975 } There are a couple important modern algorithms. Aside from Bandit algorithms, the method of tempral differences is probably most important. It is not stricly a reinforcement learning algorithm, as it only estimates the expected return of a state, but I'm putting it here anyway. (It is related to the bucket brigade, but has had some serious practical impact.) @Article {SUTTON88A, author = "R. S. Sutton", title = "Learning To Predict by the Methods of Temporal Differences", journal = "Machine Learning", year = 1988, volume = 3, pages = "9--44", url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-88.ps.gz" } Here is a fun paper, which embodies a lot of different ideas in a single system called DYNA. @InBook {SUTTON90A, crossref= "NNfC", author = "Richard S. Sutton", title = "First Results with {DYNA}, an Integrated Architecture for Learning, Planning and Reacting", chapter = 8, pages = "179--189", annote = "see also ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-91b.ps.gz" } @InProceedings {SUTTON90B, author = "Richard S. Sutton", title = "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming", crossref= "ICML7", url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-90.ps.gz" } Some more TD stuff. @InProceedings {TESAURO92A, author = "G. J. Tesauro", title = "Practical Issues in Temporal Difference Learning", crossref= "NIPS4", pages = "260--266" } @InProceedings {SUTTON-SINGH94A, author = "Richard S. Sutton and Satinder P. Singh", title = "On Bias and Step-Size in Temporal-Difference Learning", crossref= "YWALS9", pages = "91--96", url = "ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-singh-94.ps.gz" } @InBook {SUTTON-BARTO90B, author = "Richard. S. Sutton and Andrew. G. Barto", title = "A Time-Derivative Theory of Pavlovian Conditioning", crossref= "LCNS", pages = "497--537" } The seminal Q-learning reference @PhdThesis {WATKINS89, author = "C. J. C. H. Watkins", title = "Learning from Delayed Rewards", school = "Cambridge University", year = "1989", address = "Cambridge, England", } Something a little easier to get ahold of @Article {WATKINS-DAYAN92A, author = "Watkins, J. C. H. and Dayan, P.", title = "Technical Note: Q-Learning", journal = "Machine Learning", year = 1992, volume = 8, number = "279-292" } A sort of manifesto attempting to unify two disciplines. @InProceedings {SUTTON-ETAL91, crossref= "ACC91", author = "R. S. Sutton and A. G. Barto and R. J. Williams", title = "Reinforcement Learning is Direct Adaptive Optimal Control", pages = "2143--2146" } A nice collection. @Book {SUTTON92A, title = "Reinforcement Learning", booktitle= "Reinforcement Learning", publisher= KLUWER, year = 1992, editor = "Richard S. Sutton", series = "SECS", volume = 173, note = "Reprinted from volume 8(3--4) (1992) of Machine Learning" } More Q-learning (boltzmann-distribution paper which shows convergence) On the convergence of stochastic iterative dynamic programming algorithms Tomi Jaakola, Mike Jordan, Satinder Singh. http://web.mit.edu/~tommi/Public/conv.html ftp://psyche.mit.edu/pub/tommi/jaak-convergence.ps.gz POMDP (partially observable markov decision process) policy learning Reinforcement learning algorithm for partially observable decision problems Jaakkola, Singh, and Jordan / NIPS-7 http://web.mit.edu/~tommi/Public/nips7.html ftp://psyche.mit.edu/pub/tommi/jaak-nips7.ps.gz ******* to add (not in my bib file, urk) ******* Gerry Tesauro's world champion TD-gammon. Neural Computation. http://www.research.ibm.com/massdist/tdl.html Hans Berliner's BKG 9.2 article, scientific american. blemishes by clever Hans. peter brown thesis kai-fu lee thesis BILL, the othello program. (eval function tuned w/ reinforcement learning, sort of.) Heirarchical reinforcement learning heirarchical EM stuff the bad lieutenant's stuff (Harmon): sutton's extended delta-bar-delta -> nonlinear advantage learning reinforcement learning tutorial http://ace.aa.wpafb.af.mil:80/~harmonme/pubs.html Kenji Doya, continuous-time TD, NIPS*9x that bandit H-something races paper from NIPS a few years ago. Nice formalism. the recent COLT community work on adaptive approximations to optimal balanced portfolios. add textbooks. backpropagation & automatic differentiation helmholtz machines boltzmann machines Dayan & Hinton NIPS paper on hierarchical reinforcement learning section on generalization PAC learning An Introduction to Computational Learning Theory Michael J. Kearns and Umesh V. Vazirani 1994 MIT Press ISBN 0-262-11193-4 http://www-mitpress.mit.edu/mitp/recent-books/comp/kearns.html combining learners Boosting, especially the AdaBoost algorithm http://www.research.att.com/~schapire/boost.html Exponentiated Gradient Descent