See the paper at [INSERT ARXIV LINK] for a complete description of the algorithms behind this software.
Multilevel methods, originally developed as iterative methods for solving linear systems of equations, work by constructing a sequence of increasingly "coarse" analogues of your problem of interest, each coarse analogue being a smaller (fewer-variable) version of the one before it. This sequence is typically called the multilevel hierarchy.
Multilevel methods proceed by performing a bit of work on the original problem (called smoothing), then passing learned information to the next-coarser problem in the hierarchy (passing information to a coarser level is called restriction), performing a bit of work on that one (smoothing again), then passing information to the next-coarser problem, and so on, until reaching the coarsest problem in the hierarchy. The method then reverses the process: perform a bit of work (smoothing) on the coarsest problem, then pass learned information to the next-finer problem (passing information to a finer level is called prolongation) where a bit of work is performed, and so on, until reaching the original problem again. This entire sequence is called a V-Cycle.
Simple code for an iterative V-Cycle solver might look like the following:
    for j in range(num_cycles):
        # Downward sweep: smooth at each level, then restrict to the next-coarser level
        for level in hierarchy[:-1]:
            level.smooth(problem_data)
            level.restrict(level.next_coarser_level)
        # Work on the coarsest problem
        hierarchy[-1].smooth(problem_data)
        # Upward sweep: prolongate from the next-coarser level, then smooth
        for level in reversed(hierarchy[:-1]):
            level.prolongate(level.next_coarser_level)
            level.smooth(problem_data)
The intuition behind this is that the "big idea" can be efficiently learned at the coarse levels, with the details refined at the fine levels. We have found that, in the case of neural networks, the value of multilevel methods is not necessarily computational speed, but regularization: Training an entire hierarchy of neural networks together forces a neural network to be "similar" to those above and/or below it in the hierarchy. Therefore, it learns to be a refined version of a smaller network that may capture the big ideas but not the details, and/or to be a better-regularized version of a bigger network that may be prone to overfitting. MTNN does not replace existing regularization methods, which remain crucial, but can regularize in ways that existing methods do not, and so is a useful addition to our toolbox of regularization methods.
Multilevel methods are not algorithms but algorithmic frameworks, in which one must make individual algorithmic choices for the various components; we have provided a number of choices here, and have designed the code in a modular way so that you can (hopefully) easily write and insert your own algorithmic components.
Smoothing is typically just a traditional neural network optimizer applied to that level's neural network. We've used SGD here, but one could certainly use other optimizers.
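As a rough sketch of what a smoother amounts to (assuming a PyTorch setting; the class name, arguments, and step count below are illustrative, not MTNN's actual API), smoothing a level simply means running a handful of optimizer steps on that level's network:

    import itertools
    import torch

    class SGDSmoother:
        """Illustrative smoother: a few SGD steps on one level's network."""
        def __init__(self, net, lr=0.01, momentum=0.9, num_steps=4):
            self.net = net
            self.optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=momentum)
            self.num_steps = num_steps

        def smooth(self, dataloader, loss_fn):
            # One "smoothing pass" = num_steps minibatch SGD updates.
            for inputs, targets in itertools.islice(dataloader, self.num_steps):
                self.optimizer.zero_grad()
                loss = loss_fn(self.net(inputs), targets)
                loss.backward()
                self.optimizer.step()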
We have provided one implementation of a restriction method here. In our restriction, a coarse neural network has the same architecture as the original, but with fewer neurons per layer. In the case of CNNs, the coarse network has fewer channels per layer.
We use a Heavy Edge Matching procedure to identify pairs of similar neurons (or channels) in a layer. Upon restriction, we calculate a weighted average of the parameters of the two neurons to produce a single "coarse neuron."
The weighted averaging is linear in the parameters of the network, so the restriction operation can be represented as a matrix. Prolongation is simply the reverse operation of restriction, and is likewise a linear operator.
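To make the linear-operator view concrete, here is a minimal sketch using numpy (the function and variable names are illustrative, and it assumes the neuron pairs have already been chosen by heavy edge matching): a restriction matrix built from pairwise weighted averages maps a fine layer's weights to the coarse layer's weights.

    import numpy as np

    def build_restriction_matrix(pairs, num_fine_neurons, pair_weights=None):
        """Build a (num_coarse x num_fine) restriction matrix.

        pairs        : list of (i, j) neuron index pairs from heavy edge matching
        pair_weights : optional list of (w_i, w_j) averaging weights; defaults to (0.5, 0.5)
        """
        R = np.zeros((len(pairs), num_fine_neurons))
        for c, (i, j) in enumerate(pairs):
            w_i, w_j = pair_weights[c] if pair_weights is not None else (0.5, 0.5)
            R[c, i], R[c, j] = w_i, w_j
        return R

    # Example: a layer with 4 neurons coarsened to 2 by averaging pairs (0, 2) and (1, 3)
    W_fine = np.random.randn(4, 10)                      # fine weights: neurons x inputs
    R = build_restriction_matrix([(0, 2), (1, 3)], num_fine_neurons=4)
    W_coarse = R @ W_fine                                # coarse weights via weighted averaging

A matching prolongation operator could then, for instance, map each coarse neuron back to its two fine counterparts.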
We leverage here a variant of the multilevel framework known as the Full Approximation Scheme, originally developed for solving nonlinear systems of equations. The tau correction is a linear modification to the coarse loss. As we show in our paper, this linear modifier replaces the gradient of the coarse loss with gradient information restricted from the fine level.
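As a rough illustration of the idea, here is a sketch following the standard FAS/MG-OPT form of the tau correction (see the paper for the exact definition used by MTNN; the function names and the restrict_gradients helper below are hypothetical). The coarse loss is modified by subtracting a linear term whose coefficient is the gap between the coarse gradient at the restricted parameters and the restricted fine gradient:

    import torch

    def compute_tau(coarse_net, fine_net, loss_fn, inputs, targets, restrict_gradients):
        """tau = grad L_c(R theta_f) - R grad L_f(theta_f), one tensor per coarse parameter.
        Assumes coarse_net's parameters have already been set to the restricted
        fine parameters R theta_f."""
        fine_loss = loss_fn(fine_net(inputs), targets)
        fine_grads = torch.autograd.grad(fine_loss, list(fine_net.parameters()))

        coarse_loss = loss_fn(coarse_net(inputs), targets)
        coarse_grads = torch.autograd.grad(coarse_loss, list(coarse_net.parameters()))

        restricted = restrict_gradients(fine_grads)  # hypothetical: apply R to each gradient
        return [gc - rg for gc, rg in zip(coarse_grads, restricted)]

    def coarse_loss_with_tau(coarse_net, loss_fn, inputs, targets, tau):
        """Modified coarse loss: L_c(theta_c) minus the linear term <tau, theta_c>."""
        loss = loss_fn(coarse_net(inputs), targets)
        for p, t in zip(coarse_net.parameters(), tau):
            loss = loss - (t * p).sum()
        return loss

With this modification, the gradient of the coarse loss at the restricted fine parameters equals the restricted fine gradient, which is what couples the levels together.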
We provide a few different tau correction options:
- `none`: Does not use a tau correction. This forces the training procedure to seek out a point at which all neural networks in the hierarchy have small gradients and positive-definite Hessians.
- `wholeset` and `minibatch`: Two provided variants of using a tau correction. Because the tau correction replaces the coarse gradient, this is weaker regularization than `none`: it seeks out a point at which all neural networks in the hierarchy have positive-definite Hessians, but only requires that the finest network have a small gradient.
Note that `wholeset` and `minibatch` are only to be used when the network you ultimately intend to use is the finest in the hierarchy; the loss modifications at the coarser levels tend to shift those networks into modes that are suboptimal if one were to use them on their own. Using no tau correction (`none`) produces an entire hierarchy of potentially useful neural networks of varying sizes.