Many discussions of neural networks unnecessarily introduce a large vocabulary of specialized terms. In fact, a neural network is a simple system of recursive equations that can be optimized by conventional means. Any good book on optimization provides all the necessary tools [1]. We need only write down the equations unambiguously, identify the objective function, and calculate a few derivatives.
Neural-network equations map vectors of known data $\mathbf{d}_i$ to corresponding known vectors of parameters $\mathbf{m}_i$, indexed by $i$. The vector of parameters might be actual physical properties or statistical probabilities. Data can be physical measurements or numbers that index possible events. The equations contain unknown coefficients, or weights, that the user adjusts until known parameters are predicted accurately. Then it is hoped that unknown parameters can be estimated from new measurements of data.
Let's look at how simple these equations are. Intermediate vectors $\mathbf{y}_n$ are expressed recursively as a function of other vectors $\mathbf{y}_{n-1}$, usually with a different dimensionality. The first vector $\mathbf{y}_1$ corresponds to data $\mathbf{d}$, and the final vector $\mathbf{y}_N$ corresponds to the parameters $\mathbf{m}$ to be estimated:

$$ y_n(j) = \sum_k w_n(j,k) \, f[\, y_{n-1}(k) \,], \quad \text{for } n = 2, \ldots, N, \tag{1} $$

$$ \mathbf{y}_1 \equiv \mathbf{d}, \tag{2} $$

$$ \mathbf{y}_N \equiv \mathbf{m}, \tag{3} $$

$$ \text{where, for example, } f(x) = \tanh(x). \tag{4} $$

First, a pre-established non-linear transform $f$ is applied to each element of a vector. Then linear transforms calculate each sample of the output vector as a weighted sum of samples from the input vector, with weights $w_n(j,k)$.
I could also have reversed the order of the linear and non-linear operators. A reversed order is equivalent to setting the first non-linear transform and the last linear transform to identity operations. This form (1) is somewhat easier to manipulate.
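To make the recursion concrete, here is a minimal sketch of equations (1)-(4) in Python with NumPy. The layer sizes, the random weights, and the choice of $\tanh$ for $f$ are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

def forward(weights, d, f=np.tanh):
    """Evaluate the recursion (1)-(3): y_1 = d, then y_n = W_n f(y_{n-1})."""
    y = d
    for W in weights:
        y = W @ f(y)  # non-linear transform first, then a weighted sum
    return y          # y_N, the estimated parameters m

# Two layers mapping 4 data samples to 2 parameters via 3 intermediate samples.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
d = rng.standard_normal(4)
m = forward(weights, d)
print(m)
```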
This nonlinear recursion can be abbreviated as

$$ \mathbf{m} = \mathbf{F}(\mathbf{w}) \circ \mathbf{d}, \tag{5} $$

where the vector $\mathbf{w}$ contains all the weights $w_n(j,k)$. To choose the weights, minimize a least-squares objective function, summed over all known pairs of data and parameters:

$$ E(\mathbf{w}) = \sum_i \left\| \mathbf{m}_i - \mathbf{F}(\mathbf{w}) \circ \mathbf{d}_i \right\|^2 . \tag{6} $$

A perturbation $\Delta\mathbf{w}$ of the weights will result in a linearized perturbation of the objective function:

$$ E(\mathbf{w} + \Delta\mathbf{w}) \approx \sum_i \left\| \mathbf{m}_i - \mathbf{F}(\mathbf{w}) \circ \mathbf{d}_i - \Delta\mathbf{m}_i(\Delta\mathbf{w}) \right\|^2 , \tag{7} $$

$$ \text{where } \Delta\mathbf{m} \equiv \Delta\mathbf{y}_N , \tag{8} $$

$$ \text{and } \Delta y_n(j) = \sum_k \Delta w_n(j,k) \, f[\, y_{n-1}(k) \,] + \sum_k w_n(j,k) \, f'[\, y_{n-1}(k) \,] \, \Delta y_{n-1}(k), \quad \text{with } \Delta\mathbf{y}_1 = \mathbf{0}. \tag{9} $$

The perturbations (9) of the vector elements $\Delta y_n(j)$ are expressed as a linear function of the perturbed weights $\Delta w_n(j,k)$. We can abbreviate the recursive equations (9) as a single linear transform $\Delta\mathbf{m} = \mathbf{L}(\mathbf{w}) \, \Delta\mathbf{w}$.
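The linear transform $\mathbf{L}(\mathbf{w})$ never needs to be formed explicitly: the recursion (9) can be applied directly. A minimal sketch, reusing the assumed tanh non-linearity from the previous example:

```python
import numpy as np

f, df = np.tanh, lambda x: 1.0 - np.tanh(x) ** 2

def perturb(weights, d_weights, d):
    """Apply recursion (9): propagate dy_n alongside y_n, starting
    from dy_1 = 0, and return dm = L(w) dw for a perturbation dw."""
    y, dy = d, np.zeros_like(d)
    for W, dW in zip(weights, d_weights):
        fy = f(y)
        dy = dW @ fy + W @ (df(y) * dy)  # the two terms of (9)
        y = W @ fy                       # the original recursion (1)
    return dy                            # dm = dy_N

rng = np.random.default_rng(1)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
d_weights = [1e-6 * rng.standard_normal(W.shape) for W in weights]
d = rng.standard_normal(4)
print(perturb(weights, d_weights, d))   # linearized change in m
```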
The simplest neural-network "training" algorithm adjusts the previous choice of weights by a scaled gradient:

1. Begin with an arbitrary (perhaps random) choice of weights $\mathbf{w}$.
2. Calculate the gradient of the objective function (6) with respect to the weights.
3. Scale the gradient by a factor $\alpha$.
4. Subtract the scaled gradient from the weights, and return to step 2.

This recursive algorithm is called back-propagation.
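A minimal sketch of this loop, assuming a single known pair $(\mathbf{d}, \mathbf{m})$, a tanh non-linearity, and a fixed scale factor (all illustrative choices). The backward pass applies the adjoint of the linearization $\mathbf{L}(\mathbf{w})$ to the residual to obtain the gradient:

```python
import numpy as np

f, df = np.tanh, lambda x: 1.0 - np.tanh(x) ** 2

def gradient(weights, d, m_obs):
    """Gradient of 0.5*||F(w) o d - m_obs||^2 by back-propagation:
    the adjoint of the linearization L(w) applied to the residual."""
    ys = [d]
    for W in weights:                 # forward pass, saving every y_n
        ys.append(W @ f(ys[-1]))
    lam = ys[-1] - m_obs              # dE/dy_N: the residual
    grads = [None] * len(weights)
    for n in reversed(range(len(weights))):
        grads[n] = np.outer(lam, f(ys[n]))      # dE/dW_n
        lam = df(ys[n]) * (weights[n].T @ lam)  # dE/dy_{n-1}
    return grads

rng = np.random.default_rng(2)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
d, m_obs = rng.standard_normal(4), rng.standard_normal(2)
alpha = 0.05                          # fixed scale factor (step 3)
for _ in range(200):                  # steps 2-4, repeated
    for W, g in zip(weights, gradient(weights, d, m_obs)):
        W -= alpha * g
print(0.5 * np.sum((weights[1] @ f(weights[0] @ f(d)) - m_obs) ** 2))
```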
Many steepest-descent algorithms would replace step 3 by a line search to find an optimum scale factor.
A good scale factor can be estimated without an explicit line search.
When the linearization is a good approximation (particularly in the vicinity of a minimum), the scale factor will be very accurate. Each iteration will be much more efficient than a line search. If the scaled perturbation increases the objective function in early non-linear iterations, then a line search can be used until the linearization improves.
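One standard estimate consistent with this claim: for a residual $\mathbf{r}$ and gradient direction $\mathbf{g} = \mathbf{L}^{\top}\mathbf{r}$, minimizing the linearized objective $\|\mathbf{r} - \alpha\,\mathbf{L}\,\mathbf{g}\|^2$ over the scale factor $\alpha$ gives

$$ \alpha = \frac{\mathbf{g}^{\top}\mathbf{L}^{\top}\mathbf{r}}{\|\mathbf{L}\,\mathbf{g}\|^{2}} = \frac{\|\mathbf{g}\|^{2}}{\|\mathbf{L}\,\mathbf{g}\|^{2}} , $$

which costs only one extra application of the linearization $\mathbf{L}$ per iteration.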
The convergence of these steepest-descent algorithms can be improved by a PARTAN or Fletcher-Reeves algorithm, which retains and combines consecutive gradients. Neural-network references use the word "momentum" to describe similar strategies, usually with fixed scale factors.
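To sketch the distinction, here is the Fletcher-Reeves combination on a toy quadratic objective; the matrix A is an assumed stand-in for the network objective, purely for illustration:

```python
import numpy as np

# Toy quadratic objective E(w) = 0.5 * w.T @ A @ w with gradient A @ w,
# standing in for the network objective (6).
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

w = np.array([5.0, 1.0])
g = grad(w)
p = -g                                   # first direction: steepest descent
for _ in range(10):
    alpha = -(g @ p) / (p @ (A @ p))     # exact line minimization (quadratic)
    w = w + alpha * p
    g_new = grad(w)
    beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves ratio
    p = -g_new + beta * p                # combine consecutive gradients
    g = g_new
print(w)  # converges to the minimum at the origin

# "Momentum" keeps the same update but replaces the recomputed ratio
# beta with a fixed constant such as 0.9.
```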
An even more efficient alternative, a Gauss-Newton approach, is to solve the linearized least-squares problem (7) fully in each iteration:

$$ \min_{\Delta\mathbf{w}} \; \sum_i \left\| \mathbf{m}_i - \mathbf{F}(\mathbf{w}) \circ \mathbf{d}_i - \mathbf{L}(\mathbf{w}) \, \Delta\mathbf{w} \right\|^2 . $$
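A minimal sketch of such an iteration, where a finite-difference Jacobian stands in for the recursion (9) only to keep the example short; the network shapes and data are assumed for illustration:

```python
import numpy as np

f = np.tanh

def forward(w_flat, shapes, d):
    """The network (5), with all weights packed into one flat vector."""
    y, i = d, 0
    for rows, cols in shapes:
        W = w_flat[i:i + rows * cols].reshape(rows, cols)
        i += rows * cols
        y = W @ f(y)
    return y

def jacobian(w_flat, shapes, d, eps=1e-6):
    """Columns of L(w) by finite differences, a stand-in for recursion (9)."""
    m0 = forward(w_flat, shapes, d)
    L = np.empty((m0.size, w_flat.size))
    for j in range(w_flat.size):
        wp = w_flat.copy()
        wp[j] += eps
        L[:, j] = (forward(wp, shapes, d) - m0) / eps
    return m0, L

rng = np.random.default_rng(3)
shapes = [(3, 4), (2, 3)]
d, m_obs = rng.standard_normal(4), rng.standard_normal(2)
w = rng.standard_normal(sum(r * c for r, c in shapes))
for _ in range(20):
    m0, L = jacobian(w, shapes, d)
    dw, *_ = np.linalg.lstsq(L, m_obs - m0, rcond=None)  # full linear solve
    w = w + dw
print(np.abs(m_obs - forward(w, shapes, d)).max())       # near zero
```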
If a steepest descent algorithm were applied with an unlimited number of infinitesimal perturbations, then the changing weights would continuously decrease the objective function until a local minimum was reached. Visualize following a path that is directly downhill at every point. The path cannot ascend, no matter what may be beyond the next rise. Practical descent algorithms take large step sizes and may step over a small local increase in the objective function, to find another minimum beyond.
To increase the chances of finding the global minimum of the pseudo-quadratic objective function (6), we could restart the optimization with different initial weights and discard suboptimum local minima. If different initial weights consistently lead to the same solution, then the objective function is probably convex and has only one global minimum.
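A sketch of this restart strategy, with a toy non-convex objective and SciPy's general-purpose minimizer standing in for the training loop above (both illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# A toy non-convex objective standing in for (6): optimize from several
# random initializations and keep the best minimum found.
E = lambda w: np.sum((np.tanh(w) - 0.5) ** 2) + 0.1 * np.sum(np.cos(5 * w))

rng = np.random.default_rng(4)
results = [minimize(E, rng.standard_normal(3)) for _ in range(10)]
best = min(results, key=lambda res: res.fun)
print(best.fun)  # suboptimal local minima are discarded
```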