Monday, October 25, 2021

Our understanding of why deep learning networks are so effective is lacking: according to sample complexity in statistics and to nonconvex optimization theory, the empirical results should not be possible

The unreasonable effectiveness of deep learning in artificial intelligence. Terrence J. Sejnowski. Proceedings of the National Academy of Sciences, December 1, 2020, 117(48): 30033–30038; https://doi.org/10.1073/pnas.1907373117

Abstract: Deep learning networks have been trained to recognize speech, caption photographs, and translate text between languages at high levels of performance. Although applications of deep learning networks to real-world problems have become ubiquitous, our understanding of why they are so effective is lacking. These empirical results should not be possible according to sample complexity in statistics and nonconvex optimization theory. However, paradoxes in the training and effectiveness of deep learning networks are being investigated and insights are being found in the geometry of high-dimensional spaces. A mathematical theory of deep learning would illuminate how they function, allow us to assess the strengths and weaknesses of different network architectures, and lead to major improvements. Deep learning has provided natural ways for humans to communicate with digital devices and is foundational for building artificial general intelligence. Deep learning was inspired by the architecture of the cerebral cortex and insights into autonomy and general intelligence may be found in other brain regions that are essential for planning and survival, but major breakthroughs will be needed to achieve these goals.

Keywords: deep learning, artificial intelligence, neural networks

Lost in Parameter Space

The network models in the 1980s rarely had more than one layer of hidden units between the inputs and outputs, but they were already highly overparameterized by the standards of statistical learning. Empirical studies uncovered a number of paradoxes that could not be explained at the time. Even though the networks were tiny by today’s standards, they had orders of magnitude more parameters than traditional statistical models. According to bounds from theorems in statistics, generalization should not be possible with the relatively small training sets that were available. However, even simple methods for regularization, such as weight decay, led to models with surprisingly good generalization.
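To make the weight-decay point concrete, here is a minimal PyTorch sketch of my own (not from the paper): a one-hidden-layer network with far more parameters than training points, regularized only by the `weight_decay` (L2 penalty) term of plain SGD. The architecture, data, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not from the paper): an overparameterized one-hidden-layer
# network fit to a tiny dataset, with weight decay as the only regularizer.
# Sizes, learning rate, and seed are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny training set: 50 points from a noisy 1-D function.
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)

# One hidden layer with far more parameters (~1,500) than training examples,
# echoing the 1980s-style overparameterized networks discussed above.
model = nn.Sequential(nn.Linear(1, 512), nn.Tanh(), nn.Linear(512, 1))

# weight_decay adds an L2 penalty on the weights at every update step.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-3)
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# Despite having many more parameters than examples, the weight-decayed fit
# tends to interpolate smoothly rather than memorize the noise.
x_test = torch.linspace(-1, 1, 200).unsqueeze(1)
with torch.no_grad():
    print("train MSE:", loss_fn(model(x), y).item())
    print("test MSE vs. noiseless target:",
          loss_fn(model(x_test), torch.sin(3 * x_test)).item())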

Even more surprising, stochastic gradient descent of nonconvex loss functions was rarely trapped in local minima. There were long plateaus on the way down when the error hardly changed, followed by sharp drops. Something about these network models and the geometry of their high-dimensional parameter spaces allowed them to navigate efficiently to solutions and achieve good generalization, contrary to the failures predicted by conventional intuition.
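As a toy illustration of that behavior (my sketch, not the paper's), here is single-example stochastic gradient descent on the classic XOR task. The loss surface of even this tiny tanh network is nonconvex, yet training from a random start routinely reaches a low-loss solution; the printed trace often shows near-flat stretches followed by drops, though how pronounced they are depends on the seed. Width, learning rate, and step count are illustrative choices.

```python
# Toy illustration (mine, not from the paper): stochastic gradient descent on
# the nonconvex loss of a tiny network trained to compute XOR, sampling one
# example per step.
import torch
import torch.nn as nn

torch.manual_seed(1)

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])  # XOR targets

model = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1), nn.Sigmoid())
opt = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.BCELoss()

for step in range(10001):
    i = torch.randint(0, 4, (1,))          # stochastic: one random example per step
    opt.zero_grad()
    loss_fn(model(X[i]), Y[i]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            full = loss_fn(model(X), Y).item()
        print(f"step {step:5d}  full-batch loss {full:.4f}")
```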

Network models are high-dimensional dynamical systems that learn how to map input spaces into output spaces. These functions have special mathematical properties that we are just beginning to understand. Local minima during learning are rare because in the high-dimensional parameter space most critical points are saddle points (11). Another reason why good solutions can be found so easily by stochastic gradient descent is that, unlike low-dimensional models where a unique solution is sought, different networks with good performance converge from random starting points in parameter space. Because of overparameterization (12), the degeneracy of solutions changes the nature of the problem from finding a needle in a haystack to a haystack of needles.
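The saddle-point claim has a quick back-of-the-envelope illustration (my sketch, not the analysis in ref. 11): if the Hessian at a random critical point is modeled as a random symmetric matrix, the probability that every eigenvalue is positive, i.e. that the point is a local minimum rather than a saddle, collapses as the dimension grows. Real loss-surface Hessians are of course not random matrices, but the qualitative message carries over: in very high dimensions, almost every critical point has descent directions.

```python
# Back-of-envelope illustration (mine, not from ref. 11): model the Hessian at
# a random critical point as a random symmetric matrix and estimate how often
# all of its eigenvalues are positive (a local minimum) rather than mixed in
# sign (a saddle point).
import numpy as np

rng = np.random.default_rng(0)

def fraction_of_minima(dim, trials=2000):
    """Fraction of random symmetric matrices whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        h = (a + a.T) / 2.0                      # symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):    # all positive => local minimum
            count += 1
    return count / trials

for dim in [1, 2, 3, 5, 8]:
    print(f"dim={dim:2d}  P(all eigenvalues > 0) ~ {fraction_of_minima(dim):.3f}")
```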

Many questions are left unanswered. Why is it possible to generalize from so few examples and so many parameters? Why is stochastic gradient descent so effective at finding useful functions compared to other optimization methods? How large is the set of all good solutions to a problem? Are good solutions related to each other in some way? What are the relationships between architectural features and inductive bias that can improve generalization? The answers to these questions will help us design better network architectures and more efficient learning algorithms.

[...]

In his essay “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” Eugene Wigner marveled that the mathematical structure of a physical theory often reveals deep insights into that theory that lead to empirical predictions (38).

---

Not only in AI...

38. E. P. Wigner, The unreasonable effectiveness of mathematics in the natural sciences. Richard Courant Lecture in Mathematical Sciences delivered at New York University, May 11, 1959. Commun. Pure Appl. Math. 13, 1–14 (1960):

The first example is the oft quoted one of planetary motion. The laws of falling bodies became rather well established as a result of experiments carried out principally in Italy. These experiments could not be very accurate in the sense in which we understand accuracy today partly because of the effect of air resistance and partly because of the impossibility, at that time, to measure short time intervals. Nevertheless, it is not surprising that as a result of their studies, the Italian natural scientists acquired a familiarity with the ways in which objects travel through the atmosphere. It was Newton who then brought the law of freely falling objects into relation with the motion of the moon, noted that the parabola of the thrown rock’s path on the earth, and the circle of the moon’s path in the sky, are particular cases of the same mathematical object of an ellipse and postulated the universal law of gravitation, on the basis of a single, and at that time very approximate, numerical coincidence. Philosophically, the law of gravitation as formulated by Newton was repugnant to his time and to himself. Empirically, it was based on very scanty observations. The mathematical language in which it was formulated contained the concept of a second derivative and those of us who have tried to draw an osculating circle to a curve know that the second derivative is not a very immediate concept. The law of gravity which Newton reluctantly established and which he could verify with an accuracy of about 4 % has proved to be accurate to less than a ten thousandth of a per cent and became so closely associated with the idea of absolute accuracy that only recently did physicists become again bold enough to inquire into the limitations of its accuracy.

[...] Certainly, the example of Newton’s law, quoted over and over again, must be mentioned first as a monumental example of a law, formulated in terms which appear simple to the mathematician, which has proved accurate beyond all reasonable expectation. Let us just recapitulate our thesis on this example: first, the law, particularly since a second derivative appears in it, is simple only to the mathematician, not to common sense or to non-mathematically-minded freshmen; second, it is a conditional law of very limited scope. It explains nothing about the earth which attracts Galileo’s rocks, or about the circular form of the moon’s orbit, or about the planets of the sun. The explanation of these initial conditions is left to the geologist and the astronomer, and they have a hard time with them.

The second example is that of ordinary, elementary quantum mechanics. This originated when Max Born noticed that some rules of computation, given by Heisenberg, were formally identical with the rules of computation with matrices, established a long time before by mathematicians. Born, Jordan and Heisenberg then proposed to replace by matrices the position and momentum variables of the equations of classical mechanics. They applied the rules of matrix mechanics to a few highly idealized problems and the results were quite satisfactory. However, there was, at that time, no rational evidence that their matrix mechanics would prove correct under more realistic conditions. Indeed, they say “if the mechanics as here proposed should already be correct in its essential traits”. As a matter of fact, the first application of their mechanics to a realistic problem, that of the hydrogen atom, was given several months later, by Pauli. This application gave results in agreement with experience. This was satisfactory but still understandable because Heisenberg’s rules of calculation were abstracted from problems which included the old theory of the hydrogen atom. The miracle occurred only when matrix mechanics, or a mathematically equivalent theory, was applied to problems for which Heisenberg’s calculating rules were meaningless. Heisenberg’s rules presupposed that the classical equations of motion had solutions with certain periodicity properties; and the equations of motion of the two electrons of the helium atom, or of the even greater number of electrons of heavier atoms, simply do not have these properties, so that Heisenberg’s rules cannot be applied to these cases. Nevertheless, the calculation of the lowest energy level of helium, as carried out a few months ago by Kinoshita at Cornell and by Bazley at the Bureau of Standards, agrees with the experimental data within the accuracy of the observations, which is one part in ten millions. Surely in this case we “got something out” of the equations that we did not put in.

[...]

Considered from this point of view, the fact that some of the theories which we know to be false give such amazingly accurate results, is an adverse factor. Had we somewhat less knowledge, the group of phenomena which these “false” theories explain, would appear to us to be large enough to “prove” these theories. However, these theories are considered to be “false” by us just for the reason that they are, in ultimate analysis, incompatible with more encompassing pictures and, if sufficiently many such false theories are discovered, they are bound to prove also to be in conflict with each other. Similarly, it is possible that the theories, which we consider to be “proved” by a number of numerical agreements which appears to be large enough for us, are false because they are in conflict with a possible more encompassing theory which is beyond our means of discovery. If this were true, we would have to expect conflicts between our theories as soon as their number grows beyond a certain point and as soon as they cover a sufficiently large number of groups of phenomena. In contrast to the article of faith of the theoretical physicist mentioned before, this is the nightmare of the theorist.

