{"title": "Globally Optimal On-line Learning Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 322, "page_last": 328, "abstract": "", "full_text": "Globally Optimal On-line Learning Rules \n\nMagnus Rattray*and David Saadt \n\nDepartment of Computer Science & Applied Mathematics, \n\nAston University, Birmingham B4 7ET, UK. \n\nAbstract \n\nWe present a method for determining the globally optimal on-line \nlearning rule for a soft committee machine under a statistical me(cid:173)\nchanics framework. This work complements previous results on \nlocally optimal rules, where only the rate of change in general(cid:173)\nization error was considered. We maximize the total reduction in \ngeneralization error over the whole learning process and show how \nthe resulting rule can significantly outperform the locally optimal \nrule. \n\n1 \n\nIntroduction \n\nWe consider a learning scenario in which a feed-forward neural network model (the \nstudent) emulates an unknown mapping (the teacher), given a set of training exam(cid:173)\nples produced by the teacher. The performance of the student network is typically \nmeasured by its generalization error, which is the expected error on an unseen ex(cid:173)\nample. The aim of training is to reduce the generalization error by adapting the \nstudent network's parameters appropriately. \nA common form of training is on-line learning, where training patterns are pre(cid:173)\nsented sequentially and independently to the network at each learning step. This \nform of training can be beneficial in terms of both storage and computation time, \nespecially for large systems. A frequently used on-line training method for networks \nwith continuous nodes is that of stochastic gradient descent, since a differentiable \nerror measure can be defined in this case. The stochasticity is a consequence of \nthe training error being determined according to only the latest, randomly cho(cid:173)\nsen, training example. 
This is to be contrasted with batch learning, where all the training examples would be used to determine the training error, leading to a deterministic algorithm. Finding an effective algorithm for discrete networks is less straightforward, as the error measure is not differentiable.

* rattraym@aston.ac.uk
† saadd@aston.ac.uk

Often it is possible to improve on the basic stochastic gradient descent algorithm, and a number of modifications have been suggested in the literature. At late times one can use on-line estimates of second-order information (the Hessian or its eigenvalues) to ensure asymptotically optimal performance (e.g., [1, 2]). A number of heuristics also exist which attempt to improve performance during the transient phase of learning (for a review, see [3]). However, these heuristics all require the careful setting of parameters, which can be critical to their performance. Moreover, it would be desirable to have principled and theoretically well-motivated algorithms which do not rely on heuristic arguments.

Statistical mechanics allows a compact description for a number of on-line learning scenarios in the limit of large input dimension, which we have recently employed to propose a method for determining globally optimal learning rates for on-line gradient descent [4]. This method will be generalized here to determine globally optimal on-line learning rules for both discrete and continuous machines, that is, rules which provide the maximum reduction in generalization error over the whole learning process. This provides a natural extension to work on locally optimal learning rules [5, 6], where only the rate of change in generalization error is optimized. In fact, for simple systems we sometimes find that the locally optimal rule is also globally optimal.
However, global optimization seems to be rather important in more complex systems, which are characterized by more degrees of freedom and often require broken permutation symmetries to learn perfectly. We will outline our general formalism and consider two simple and tractable learning scenarios to demonstrate the method.

It should be pointed out that the optimal rules derived here will often require knowledge of macroscopic properties related to the teacher's structure which would not be known in general. In this sense these rules do not provide practical algorithms as they stand, although some of the required macroscopic properties may be evaluated or estimated on the basis of data gathered as the learning progresses. In any case these rules provide an upper bound on the performance one could expect from a real algorithm and may be instrumental in designing practical training algorithms.

2 The statistical mechanics framework

For calculating the optimal on-line learning rule we employ the statistical mechanics description of the learning process. Under this framework, which may be employed for both smooth [7, 8] and discrete systems (e.g., [9]), the learning process is captured by a small number of self-averaging statistics whose trajectory is deterministic in the limit of large input dimension. In this analysis the relevant statistics are overlaps between weight vectors associated with different nodes of the student and teacher networks. The equations of motion for the evolution of these overlaps can be written in closed form and can be integrated numerically to describe the dynamics.

We will consider a general two-layer soft committee machine. The desired teacher mapping is from an N-dimensional input space ξ ∈ R^N onto a scalar ζ ∈ R, which the student models through a map σ(J, ξ) = Σ_{i=1}^{K} g(J_i · ξ), where g(x) is the activation function for the hidden layer and J ≡ {J_i}
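The overlap statistics mentioned above can be sketched directly. This is an illustration under assumed conventions: the names Q, R, and T for the student-student, student-teacher, and teacher-teacher overlaps follow the usual notation in this framework, and the 1/sqrt(N) weight normalization (unit-length vectors on average) is an assumption chosen so the overlaps are of order one. For large N these quantities are self-averaging, which is why the dynamics can be tracked through them alone.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 1000, 2, 2   # input dimension, student and teacher hidden units (illustrative)

# Random student and teacher weight vectors, scaled so each has length ~1.
J = rng.standard_normal((K, N)) / np.sqrt(N)   # student vectors J_i
B = rng.standard_normal((M, N)) / np.sqrt(N)   # teacher vectors B_n

# The order parameters of the large-N description are the mutual overlaps
# between the weight vectors of the different hidden nodes.
Q = J @ J.T   # student-student overlaps: Q_ik = J_i . J_k
R = J @ B.T   # student-teacher overlaps: R_in = J_i . B_n
T = B @ B.T   # teacher-teacher overlaps: T_nm = B_n . B_m
```

For random initial vectors, Q and T concentrate near the identity and R near zero, with fluctuations of order 1/sqrt(N); the learning dynamics is then a deterministic trajectory of (Q, R) toward the perfectly learned configuration Q = R = T.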