
Notes on Gradient Descent

Intro:

Gradient descent is a first-order optimisation algorithm used for finding a local minimum of a real-valued function, $\min_{x} f(x)$, with respect to the variable $x$.

Usually the functions f are multivariate i.e. functions that take multiple input variables and produce a single output, and have the form:

$$f: \mathbb{R}^n \to \mathbb{R}$$

where $\mathbb{R}^n$ denotes the $n$-dimensional real space, and the function $f$ maps an $n$-dimensional vector of real numbers to a single real number.

Examples of Multivariate Functions

  1. Two Variables: $f(x, y) = x^2 + y^2$. This function takes two input variables, $x$ and $y$, and produces a single output which is the sum of the squares of the inputs.

  2. Three Variables: $g(x, y, z) = xy + yz + zx$. This function takes three input variables, $x$, $y$, and $z$, and produces a single output which is the sum of the products of the pairs of inputs.
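The two example functions above can be written directly in Python (the names `f` and `g` mirror the text):

```python
def f(x, y):
    """Two variables: the sum of the squares of the inputs."""
    return x**2 + y**2

def g(x, y, z):
    """Three variables: the sum of the pairwise products of the inputs."""
    return x*y + y*z + z*x

print(f(1.0, 2.0))       # 5.0
print(g(1.0, 2.0, 3.0))  # 11.0
```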

The gradient at a point points in the direction of steepest ascent.1

Gradient descent relies on the fact that $f(x_0)$ decreases fastest if we move from $x_0$ in the direction of the negative gradient, $-((\nabla f)(x_0))^\top$. Note that we use the transpose of the gradient; otherwise the vector dimensions will not work out.
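This steepest-descent property can be checked numerically. The sketch below uses the example function $f(x, y) = x^2 + y^2$ from earlier and compares a small step along the normalised negative gradient against equally long steps in random directions (the step length and sample count are arbitrary choices):

```python
import numpy as np

def f(v):
    return v[0]**2 + v[1]**2

def grad_f(v):
    return np.array([2*v[0], 2*v[1]])

rng = np.random.default_rng(0)
x0 = np.array([1.0, 1.0])
step = 1e-3

# Value after a small step along the normalised negative gradient.
d_neg = -grad_f(x0) / np.linalg.norm(grad_f(x0))
best = f(x0 + step * d_neg)

# No random unit direction of the same length does better
# (up to numerical noise).
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    assert f(x0 + step * d) >= best - 1e-12
```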

Detailed Explanation

  • $\nabla f$: This symbol represents the gradient of the function $f$. The gradient is a vector of partial derivatives of $f$ with respect to its input variables. If $f$ is a function of $n$ variables, $\nabla f$ is a vector with $n$ components.

  • $(\nabla f)(x_0)$: This means that the gradient $\nabla f$ is evaluated at the specific point $x_0$. The point $x_0$ is in the domain of $f$.

  • $((\nabla f)(x_0))^\top$: The symbol $\top$ denotes the transpose of the gradient vector. If the gradient $(\nabla f)(x_0)$ is originally a column vector, its transpose will be a row vector, and vice versa.

  • $-((\nabla f)(x_0))^\top$: The minus sign indicates that we are considering the negative of the transposed gradient vector evaluated at $x_0$.
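As a sanity check, the gradient can also be approximated with central finite differences and compared against the analytic partial derivatives (the helper `numerical_gradient` below is a hypothetical illustration, not something from the cited book):

```python
import numpy as np

def f(v):
    return v[0]**2 + v[1]**2

def numerical_gradient(func, v, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        grad[i] = (func(v + e) - func(v - e)) / (2 * h)
    return grad

x0 = np.array([1.0, 1.0])
print(numerical_gradient(f, x0))  # close to [2. 2.]
```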

Example

Consider a function f(x,y)=x2+y2.

  1. Gradient Calculation: $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (2x, 2y)$

  2. Evaluating at $x_0 = (1, 1)$: $(\nabla f)(1, 1) = (2 \cdot 1, 2 \cdot 1) = (2, 2)$

  3. Transposing the Gradient: Here $(2, 2)$ is written as a row vector, so its transpose $(2, 2)^\top$ is the corresponding column vector. The entries are unchanged; only the orientation of the vector flips.

  4. Negative Transposed Gradient: $-((\nabla f)(1, 1))^\top = (-2, -2)^\top$

This negative gradient is used to update the current point x0 in the gradient descent method to find the function’s minimum.
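Putting the steps together, a minimal gradient descent loop for this example might look as follows (the step-size of 0.1 and the 100 iterations are assumptions for illustration, not values from the post):

```python
import numpy as np

def grad_f(v):
    """Analytic gradient of f(x, y) = x^2 + y^2."""
    return np.array([2*v[0], 2*v[1]])

x = np.array([1.0, 1.0])  # starting point x0
step_size = 0.1           # assumed small, fixed step-size

for _ in range(100):
    # Move against the gradient: x <- x - step_size * grad_f(x)
    x = x - step_size * grad_f(x)

print(x)  # close to the minimum at [0. 0.]
```

With this step-size each iteration shrinks both coordinates by a factor of 0.8, so the iterates converge geometrically to the minimiser at the origin.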

Then, for a small step-size $\gamma \geq 0$, the update $x_1 = x_0 - \gamma ((\nabla f)(x_0))^\top$ gives $f(x_1) \leq f(x_0)$.

  1. Mathematics for Machine Learning, Section 5.1, https://yung-web.github.io/home/courses/mathml.html

This post is licensed under CC BY 4.0 by the author.