
Notes on Gradient Descent

Intro:

Gradient descent is a first-order optimisation algorithm used for finding a local minimum of a real-valued function, $\min_{x} f(x)$, with respect to the variable $x$.

Usually the functions f are multivariate i.e. functions that take multiple input variables and produce a single output, and have the form:

$$f: \mathbb{R}^n \to \mathbb{R}$$

where $\mathbb{R}^n$ denotes the $n$-dimensional real space, and the function $f$ maps an $n$-dimensional vector of real numbers to a single real number.

Examples of Multivariate Functions

  1. Two Variables: $f(x, y) = x^2 + y^2$. This function takes two input variables, $x$ and $y$, and produces a single output which is the sum of the squares of the inputs.

  2. Three Variables: $g(x, y, z) = xy + yz + zx$. This function takes three input variables, $x$, $y$, and $z$, and produces a single output which is the sum of the products of the pairs of inputs.
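The two example functions above can be written directly in Python (the names `f` and `g` mirror the text):

```python
def f(x, y):
    """Two variables: the sum of the squares of the inputs."""
    return x**2 + y**2

def g(x, y, z):
    """Three variables: the sum of the pairwise products of the inputs."""
    return x*y + y*z + z*x

print(f(1.0, 2.0))       # 5.0
print(g(1.0, 2.0, 3.0))  # 11.0
```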

The gradient at a point points in the direction of steepest ascent.1

Gradient descent relies on the fact that $f(x_0)$ decreases fastest if we move from $x_0$ in the direction of the negative gradient, $-((\nabla f)(x_0))^\top$. Note that we use the transpose of the gradient; otherwise the vector dimensions will not work out.
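This steepest-descent property can be checked numerically. The sketch below uses the example function $f(x, y) = x^2 + y^2$ from earlier and compares a small step along the normalised negative gradient against equally long steps in random directions (the step length and sample count are arbitrary choices):

```python
import numpy as np

def f(v):
    return v[0]**2 + v[1]**2

def grad_f(v):
    return np.array([2*v[0], 2*v[1]])

rng = np.random.default_rng(0)
x0 = np.array([1.0, 1.0])
step = 1e-3

# Value after a small step along the normalised negative gradient.
d_neg = -grad_f(x0) / np.linalg.norm(grad_f(x0))
best = f(x0 + step * d_neg)

# No random unit direction of the same length does better
# (up to numerical noise).
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    assert f(x0 + step * d) >= best - 1e-12
```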

Detailed Explanation

  • $\nabla f$: This symbol represents the gradient of the function $f$. The gradient is a vector of partial derivatives of $f$ with respect to its input variables. If $f$ is a function of $n$ variables, $\nabla f$ is a vector with $n$ components.

  • $(\nabla f)(x_0)$: This means that the gradient $\nabla f$ is evaluated at the specific point $x_0$. The point $x_0$ is in the domain of $f$.

  • $((\nabla f)(x_0))^\top$: The symbol $\top$ denotes the transpose of the gradient vector. If the gradient $(\nabla f)(x_0)$ is originally a column vector, its transpose will be a row vector, and vice versa.

  • $-((\nabla f)(x_0))^\top$: The minus sign indicates that we are considering the negative of the transposed gradient vector evaluated at $x_0$.
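As a sanity check, the gradient can also be approximated with central finite differences and compared against the analytic partial derivatives (the helper `numerical_gradient` below is a hypothetical illustration, not something from the cited book):

```python
import numpy as np

def f(v):
    return v[0]**2 + v[1]**2

def numerical_gradient(func, v, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        grad[i] = (func(v + e) - func(v - e)) / (2 * h)
    return grad

x0 = np.array([1.0, 1.0])
print(numerical_gradient(f, x0))  # close to [2. 2.]
```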

Example

Consider a function f(x,y)=x2+y2.

  1. Gradient Calculation: $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (2x, 2y)$

  2. Evaluating at $x_0 = (1, 1)$: $(\nabla f)(1, 1) = (2 \cdot 1, 2 \cdot 1) = (2, 2)$

  3. Transposing the Gradient: Here $(2, 2)$ is written as a row vector, so its transpose $(2, 2)^\top$ is the corresponding column vector. The entries are unchanged; only the orientation of the vector flips.

  4. Negative Transposed Gradient: $-((\nabla f)(1, 1))^\top = (-2, -2)^\top$

This negative gradient is used to update the current point x0 in the gradient descent method to find the function’s minimum.
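Putting the steps together, a minimal gradient descent loop for this example might look as follows (the step-size of 0.1 and the 100 iterations are assumptions for illustration, not values from the post):

```python
import numpy as np

def grad_f(v):
    """Analytic gradient of f(x, y) = x^2 + y^2."""
    return np.array([2*v[0], 2*v[1]])

x = np.array([1.0, 1.0])  # starting point x0
step_size = 0.1           # assumed small, fixed step-size

for _ in range(100):
    # Move against the gradient: x <- x - step_size * grad_f(x)
    x = x - step_size * grad_f(x)

print(x)  # close to the minimum at [0. 0.]
```

With this step-size each iteration shrinks both coordinates by a factor of 0.8, so the iterates converge geometrically to the minimiser at the origin.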

Then, for a small step-size $\gamma \geq 0$, the update $x_1 = x_0 - \gamma ((\nabla f)(x_0))^\top$ gives $f(x_1) \leq f(x_0)$.

  1. Mathematics for Machine Learning, Section 5.1, https://yung-web.github.io/home/courses/mathml.html

This post is licensed under CC BY 4.0 by the author.