
With optimizers like SGD with momentum and Nesterov accelerated gradient, we are able to adapt our updates to the slope of the error function and thereby speed up SGD. We would also like to adapt our updates to each individual parameter, i.e. to perform larger or smaller updates depending on a parameter's importance.

Motivation

Consider a dataset that has both dense and sparse features. During training, if the learning rate is fixed to some value (say $\eta$), then every parameter is updated with that same $\eta$ across the dataset. As a result, weights associated with dense features get frequent updates while weights associated with sparse features get infrequent ones. Overall, this leads to slower convergence, which is undesirable for the following reasons:

  • Many features are irrelevant
  • Rare features are often very informative

Adagrad (Adaptive Gradient) is an algorithm for gradient-based optimization that does just this - adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for sparse data.

Need for an Adaptive Learning Rate

Let’s say we have a very simple perceptron (with no non-linearity), with a single feature $x$ (weight $w$, bias $b$) paired with a single target $y$.

```mermaid
graph LR
x --> |w| A(( ))
1 --> |b| A
A --> O[y<sup>'</sup>]

style A stroke:#bbf,stroke-width:4px
```

The prediction is $y' = wx + b$. If we compute the derivative of the MSE loss $\mathcal{L} = \frac{1}{2}(y' - y)^2$ w.r.t. the $w$ and $b$ parameters, we get:

$$\frac{\partial \mathcal{L}}{\partial w} = (y' - y)\,x \qquad\qquad \frac{\partial \mathcal{L}}{\partial b} = y' - y$$

One can note the following:

  • The derivative w.r.t $w$ contains the term $x$
    • If there were several features $x_1, \ldots, x_n$ then we would have corresponding params $w_1, \ldots, w_n$, and $\frac{\partial \mathcal{L}}{\partial w_i}$ would contain the term $x_i$
  • If a feature $x_i$ is sparse, i.e. mostly zeros, then $\frac{\partial \mathcal{L}}{\partial w_i} = 0$ for most samples
  • According to the weight update rule $w_i \leftarrow w_i - \eta \frac{\partial \mathcal{L}}{\partial w_i}$, weights corresponding to sparse features will get very few updates, compared to dense features
    • These uneven updates can bias the trajectory of descent towards the dense feature dimensions, thereby slowing convergence

Intuitively, since the updates to a dense feature's weight (say $w_d$) are so frequent, $w_d$ reaches a good value earlier than the others. At that point, changing $w_d$ further no longer helps, and the only way to reach the minimum is to change the values of the remaining weights. Overall, this long, lopsided trajectory takes more epochs to converge.
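This effect can be simulated with a short sketch (a toy example of my own, not from the original text; the dataset sizes, sparsity level and learning rate are arbitrary). Plain SGD with one shared learning rate fits a noiseless linear target that has one dense and one sparse feature; the sparse feature's weight ends up much farther from its true value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: x0 is dense (always active), x1 is sparse (active ~5% of the time).
n = 200
dense = rng.normal(size=n)
sparse = rng.normal(size=n) * (rng.random(n) < 0.05)
X = np.stack([dense, sparse], axis=1)
true_w = np.array([2.0, 5.0])
y = X @ true_w                      # noiseless linear target

# Plain SGD: a single shared learning rate for both weights.
w = np.zeros(2)
eta = 0.05
for xi, yi in zip(X, y):
    grad = (xi @ w - yi) * xi       # dL/dw for the squared error (up to a constant)
    w -= eta * grad

# w[0] (dense) ends up near 2.0; w[1] (sparse) is still far from 5.0,
# because it received a nonzero gradient only on the few samples where x1 != 0.
print(w)
```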

Idea

We extend the vanilla weight update (using gradient descent, GD) by scaling down the learning rate of each parameter in proportion to that parameter's update history. Mathematically, for a parameter $w_i$ with gradient $g_{t,i} = \frac{\partial \mathcal{L}}{\partial w_{t,i}}$ at time step $t$, we have the new update rule as:

$$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i} \qquad\text{where}\qquad G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

where $G_{t,i}$ is a gradient accumulator, $\epsilon$ is a smoothing term that avoids division by zero (usually a small value on the order of $10^{-8}$), and the other terms mean the same as in regular GD. Interestingly, the square root operation turns out to be very important; without it the algorithm performs much worse.
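A minimal single-parameter sketch of this rule (variable names and the values of eta/eps are illustrative, not from the text):

```python
# Single-parameter Adagrad step (illustrative sketch).
def adagrad_step(w, g, G, eta=0.1, eps=1e-8):
    """Accumulate the squared gradient, then take a scaled-down step."""
    G = G + g ** 2                          # G_{t,i}: running sum of squared gradients
    w = w - eta / (G + eps) ** 0.5 * g      # effective lr = eta / sqrt(G + eps)
    return w, G

# With a constant gradient of 1, the step sizes are eta/sqrt(1), eta/sqrt(2), ...
w, G = 0.0, 0.0
history = []
for _ in range(4):
    w_new, G = adagrad_step(w, 1.0, G)
    history.append(w - w_new)               # size of each step, shrinking over time
    w = w_new
```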

Intuition

  • For the features which have received a lot of updates, the denominator $\sqrt{G_{t,i} + \epsilon}$ will be large, so the effective learning rate becomes smaller than the learning rate for sparse features (which received few updates)
    • Moreover, for sparse features, the learning rate is even boosted when the accumulated updates are so small that $\sqrt{G_{t,i} + \epsilon} < 1$
  • Another thing to note is that as $t$ increases, $G_{t,i}$ increases too, as it’s an accumulator. This causes a decaying effect, since we divide by $\sqrt{G_{t,i} + \epsilon}$ in the update. Effectively, this prevents $w_i$ from overshooting

Vectorization

Another way to write the equation is as:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$

where $\theta_t$ is the vector of parameters and $G_t$ is a diagonal matrix in which each diagonal element $G_{t,ii}$ contains the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$ (like the $G_{t,i}$ term seen earlier). With $\operatorname{diag}(G_t)$, we only take the diagonal elements in the form of a vector, so the square root and division apply element-wise. Also, $\odot$ denotes the Hadamard product, a.k.a. the element-wise product between vectors/matrices of the same dimension.
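In code, the vectorized update reduces to element-wise NumPy operations, storing only $\operatorname{diag}(G_t)$ as a vector (a sketch; function name and defaults are my own):

```python
import numpy as np

def adagrad_update(theta, g, G_diag, eta=0.01, eps=1e-8):
    """One vectorized Adagrad step.

    G_diag stores diag(G_t) as a vector; element-wise division implements
    the Hadamard-product form of the update.
    """
    G_diag = G_diag + g ** 2
    theta = theta - eta / np.sqrt(G_diag + eps) * g
    return theta, G_diag

theta = np.zeros(3)
G_diag = np.zeros(3)
g = np.array([1.0, 0.0, 2.0])      # the middle parameter gets no gradient at all
theta, G_diag = adagrad_update(theta, g, G_diag)
```

Note that a parameter with zero gradient is left untouched: $\epsilon$ only guards against division by zero, it does not inject an update.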

Pros:

  • Well-suited for sparse data
  • Eliminates the need to manually tune the learning rate (a default value, e.g. 0.01 in many implementations, often works)

Cons:

  • The accumulation of squared gradients in the denominator causes the effective learning rate to keep shrinking over time, eventually becoming vanishingly small, at which point learning stops too early
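This shrinkage is easy to see numerically (a hypothetical constant-gradient run, not from the text): the effective step size $\eta / \sqrt{G_t}$ decays like $1/\sqrt{t}$ and never recovers, because the accumulator only ever grows:

```python
import math

# Feed Adagrad a constant gradient of 1 and watch the effective step size decay.
eta, G = 0.1, 0.0
steps = []
for t in range(1, 10001):
    G += 1.0 ** 2                   # the accumulator only ever grows
    steps.append(eta / math.sqrt(G))

# Step size after 1, 100 and 10000 updates: 0.1, 0.01, 0.001.
print(steps[0], steps[99], steps[-1])
```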