It turns out that yes, it is sometimes possible to get high-accuracy solutions from low-precision training—and here we'll describe a new variant of stochastic gradient descent (SGD) called high-accuracy low-precision (HALP) that can do it. HALP can do better than previous algorithms because it reduces the two sources of noise that limit the accuracy of low-precision SGD: gradient variance and round-off error.

First, to set the stage: we want to solve training problems of the form
$$\text{minimize } f(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w) \quad \text{over } w \in \mathbb{R}^d.$$

This is the classic empirical risk minimization problem used to train many machine learning models, including deep neural networks. One standard way of solving this is with stochastic gradient descent, which is an iterative algorithm that approaches the optimum by running
$$w_{t+1} = w_t - \alpha \nabla f_{i_t}(w_t)$$
where $i_t$ is an index randomly chosen from $\{1, \ldots, N\}$ at each iteration.
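The update rule above can be sketched in a few lines of NumPy. This is plain full-precision SGD on a least-squares objective (an illustrative stand-in, not the HALP algorithm itself); the data, step size, and iteration count are all arbitrary choices for the example.

```python
import numpy as np

# Illustrative SGD sketch: f_i(w) = 0.5 * (x_i . w - y_i)^2, so that
# f(w) = (1/N) * sum_i f_i(w) is a least-squares objective.
rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true                          # noiseless labels, so w* = w_true

w = np.zeros(d)                         # iterate w_t
alpha = 0.01                            # step size
for t in range(5000):
    i = rng.integers(N)                 # random index i_t
    grad = (X[i] @ w - y[i]) * X[i]     # stochastic gradient of f_{i_t}
    w = w - alpha * grad                # SGD update: w_{t+1} = w_t - alpha * grad
```

After enough iterations the iterate `w` approaches the optimum `w_true`; the gap that remains is governed by the gradient variance and the step size, which is exactly the noise HALP is designed to attack.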

We want to run an algorithm like this, but make the iterates $w_t$ low-precision. That is, we want them to use fixed-point arithmetic with a small number of bits, typically 8 or 16 bits (this is small compared with the 32-bit or 64-bit floating-point numbers that are standard for these algorithms).
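One common way to realize such a fixed-point format is to store each number as a signed integer times a fixed scale factor. The sketch below is an assumed round-to-nearest scheme for illustration; the bit width, scale `delta`, and function names are choices made here, not part of any particular library.

```python
import numpy as np

def quantize(v, bits=8, delta=0.5):
    """Map real values to b-bit signed integers: q = round(v / delta),
    clipped to the representable range, e.g. [-128, 127] for 8 bits."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(v / delta), lo, hi).astype(np.int8)

def dequantize(q, delta=0.5):
    """Recover the real value represented by the stored integer."""
    return q.astype(np.float64) * delta

w = np.array([0.3, -1.7, 3.2])
w_hat = dequantize(quantize(w))
# For values inside the representable range, the round-off error
# per entry is at most delta / 2 = 0.25 under this scheme.
```

Note the two failure modes this exposes: values outside the representable range get clipped (overflow), and values inside it are perturbed by up to half a quantization step (round-off error).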

But when this is done directly to the SGD update rule, we run into a representation problem: the solution $w^*$ may not be representable in the chosen fixed-point format. For example, if we use an 8-bit fixed-point representation that can store only the integers $\{-128, -127, \ldots, 127\}$, and the true solution is $w^* = 100.5$, then we can never get closer than a distance of $0.5$ to the solution, since we can't even represent non-integers.
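This representation limit is easy to check directly: among all 256 values the integer-only format can store, none is closer than 0.5 to the example optimum.

```python
# With an integer-only 8-bit representation {-128, ..., 127}, the closest
# representable value to w* = 100.5 is 100 (or 101), at distance 0.5.
w_star = 100.5
representable = range(-128, 128)
closest = min(representable, key=lambda q: abs(q - w_star))
gap = abs(closest - w_star)
```

No matter how the low-precision iterates move, they can never land closer to $w^*$ than this `gap`, which is why a fixed, static quantization grid caps the achievable accuracy.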
