About

Algorithmic overview

See the paper at [INSERT ARXIV LINK] for a complete description of the algorithms behind this software.

Multilevel methods, originally developed as iterative methods for solving linear systems of equations, work by constructing a sequence of increasingly "coarse" analogues of your problem of interest, each coarse analogue a smaller (fewer-variable) version of the immediately previous one. This sequence is typically called the multilevel hierarchy.

Multilevel methods proceed by performing a bit of work on the original problem (called smoothing), then passing learned information to the next-coarser problem in the hierarchy (passing information to a coarser level is called restriction), performing a bit of work on that one (smoothing again), then passing information to the next-coarser problem, and so on, until reaching the coarsest problem in the hierarchy. The method then reverses the process: perform a bit of work (smoothing) on the coarsest problem, then pass learned information to the next-finer problem (passing information to a finer level is called prolongation) where a bit of work is performed, and so on, until reaching the original problem again. This entire sequence is called a V-Cycle.

Simple code for an iterative V-Cycle solver might look like the following:

for _ in range(num_cycles):
    # Downward sweep: smooth each level, then restrict to the next-coarser level.
    for level in hierarchy[:-1]:
        level.smooth(problem_data)
        level.restrict(level.next_coarser_level)

    # Work on the coarsest level.
    hierarchy[-1].smooth(problem_data)

    # Upward sweep: prolongate from the next-coarser level, then smooth.
    for level in reversed(hierarchy[:-1]):
        level.prolongate(level.next_coarser_level)
        level.smooth(problem_data)

The intuition behind this is that the "big idea" can be efficiently learned at the coarse levels, with the details refined at the fine levels. We have found that, in the case of neural networks, the value of multilevel methods is not necessarily computational speed, but regularization: Training an entire hierarchy of neural networks together forces a neural network to be "similar" to those above and/or below it in the hierarchy. Therefore, it learns to be a refined version of a smaller network that may capture the big ideas but not the details, and/or to be a better-regularized version of a bigger network that may be prone to overfitting. MTNN does not replace existing regularization methods, which remain crucial, but can regularize in ways that existing methods do not, and so is a useful addition to our toolbox of regularization methods.

Multilevel methods are not single algorithms but algorithmic frameworks, in which one must make individual algorithmic choices for the various components. We have provided a number of choices here, and designed the code in a modular way so that you can (hopefully) easily write and insert your own algorithmic components.

Some algorithmic component choices

Smoothing

Smoothing is typically just a traditional neural network optimizer applied to that level's neural network. We've used SGD here, but one could certainly use other optimizers.
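
As a rough sketch (not the package's exact interface), a smoothing pass on one level can be just a handful of SGD steps on that level's network; the `level.net` attribute, the cross-entropy loss, and the step count below are assumptions made for illustration.

```python
import torch

def smooth(level, data_iter, num_steps=4, lr=1e-2, momentum=0.9):
    """Hypothetical smoothing pass: a few SGD steps on this level's network."""
    optimizer = torch.optim.SGD(level.net.parameters(), lr=lr, momentum=momentum)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(num_steps):
        inputs, targets = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(level.net(inputs), targets)
        loss.backward()
        optimizer.step()
```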

Restriction

We have provided one implementation of a restriction method here. In our restriction, a coarse neural network has the same architecture as the original, but with fewer neurons per layer. In the case of CNNs, the coarse network has fewer channels per layer.

We use a Heavy Edge Matching procedure to identify pairs of similar neurons (or channels) in a layer. Upon restriction, we calculate a weighted average of the parameters of the two neurons to produce a single "coarse neuron."
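
The matching idea can be sketched as follows; the cosine-similarity measure and the greedy matching order are illustrative assumptions, not necessarily the exact Heavy Edge Matching procedure used in the code.

```python
import numpy as np

def greedy_pair_neurons(W):
    """Greedily pair rows of W (one row of weights per neuron) by similarity.
    Returns a list of (i, j) index pairs; any leftover neuron stays unpaired."""
    n = W.shape[0]
    normed = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    similarity = normed @ normed.T          # cosine similarity between neurons
    np.fill_diagonal(similarity, -np.inf)   # never pair a neuron with itself
    unmatched, pairs = set(range(n)), []
    # Visit neuron pairs from most to least similar, using each neuron at most once.
    for flat_idx in np.argsort(-similarity, axis=None):
        i, j = divmod(int(flat_idx), n)
        if i in unmatched and j in unmatched:
            pairs.append((i, j))
            unmatched -= {i, j}
    return pairs
```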

The weighted averaging is linear in the parameters of the network. Thus, the operation can be represented as a matrix $R$, and restriction computes a set of coarse network parameters $N_c$ from an original network $N$ as

$$N_c = R\,N$$
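
For a single layer this can be made concrete: each row of $R$ averages one matched pair of fine neurons. The uniform 1/2 weights and the function name below are assumptions for the sketch.

```python
import numpy as np

def build_restriction_matrix(pairs, n_fine, pair_weights=None):
    """Build the (n_coarse x n_fine) matrix R whose rows average paired neurons."""
    R = np.zeros((len(pairs), n_fine))
    for row, (i, j) in enumerate(pairs):
        w_i, w_j = pair_weights[row] if pair_weights is not None else (0.5, 0.5)
        R[row, i], R[row, j] = w_i, w_j
    return R

# Coarse parameters for this layer (one fine neuron's parameters per row of N):
# N_coarse = build_restriction_matrix(pairs, N.shape[0]) @ N
```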

Prolongation

Prolongation is simply the reverse operation of restriction, with a linear operator $P$ converting coarse networks to fine networks. However, the image of $P$ lies only in a subspace of the original parameter space, and we want to preserve the fine network's component in the orthogonal complement of that subspace as it was prior to restriction. Therefore, we update as follows:

$$N \leftarrow N + P\left(N_c - R\,N\right)$$
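
A minimal sketch of that update, assuming $R$ and $P$ are available as NumPy arrays for one layer (the names are illustrative):

```python
def prolongate_and_correct(N_fine, N_coarse_trained, R, P):
    """Fine-level update N <- N + P (N_c - R N).
    The correction P (N_c - R N) lies in the range of P, so components of the
    fine parameters orthogonal to that subspace are left unchanged."""
    return N_fine + P @ (N_coarse_trained - R @ N_fine)
```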

Tau correction

We leverage here a variant of the multilevel framework known as the Full Approximation Scheme, originally developed for solving nonlinear systems of equations. The tau correction is a linear modification to the coarse loss:

$$L_c^{\tau}(N_c) = L_c(N_c) - \langle \tau,\, N_c \rangle,$$

where $L_c$ is the coarse level's unmodified loss and $N_c$ denotes the coarse network's parameters.

As we show in our paper, the linear modifier $\tau$ is chosen to replace the coarse gradient with a restricted version of the fine gradient. This has the effect of making coarse-level training proceed more like fine-level training, at least initially, before the higher-order derivatives of the coarse loss take over.
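
A toy sketch of that choice of $\tau$, with the fine and coarse losses' gradients supplied as plain functions (the names and the use of $R$ to restrict the fine gradient are assumptions made for illustration; see the paper for the precise construction):

```python
def make_tau(grad_fine, grad_coarse, R, N):
    """Choose tau so that the gradient of the modified coarse loss
    L_c(N_c) - <tau, N_c>, evaluated at N_c = R N, equals the restricted
    fine gradient R grad_fine(N)."""
    N_c0 = R @ N
    return grad_coarse(N_c0) - R @ grad_fine(N)

def tau_corrected_loss(loss_coarse, tau):
    """Return the tau-corrected coarse objective N_c -> L_c(N_c) - <tau, N_c>."""
    return lambda N_c: loss_coarse(N_c) - tau @ N_c
```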

We provide a few different tau correction options:

- `none`: Does not use a tau correction. This forces the training procedure to seek out a point at which all neural networks in the hierarchy have small gradients and a positive-definite Hessian.

- `wholeset` or `minibatch`: Two provided variants that use a tau correction. Because the tau correction replaces the coarse gradient, this is weaker regularization than `none`: it seeks out a point at which all neural networks in the hierarchy have positive-definite Hessians, but only requires that the finest network have a small gradient.

Note that `wholeset` and `minibatch` should only be used when the network you ultimately intend to use is the finest in the hierarchy; the loss modifications at the coarser levels tend to shift those networks into modes that are suboptimal if they were used on their own. Using no tau correction (`none`) produces an entire hierarchy of potentially-useful neural networks of varying sizes.