RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://nbviewer.org/github/gibiansky/IHaskell/blob/master/notebooks/Gradient-Descent.ipynb below:

Jupyter Notebook Viewer

In supervised learning algorithms, we generally have some model (such as a neural network) with a set of parameters (the weights), a data set, and an error function which measures how well our parameters and model fit the data. With many models, the way to train the model and fit the parameters is through an iterative minimization procedure which uses the gradient of the error to find the local minimum in parameter space. This notebook will not go into the details of neural networks or how to compute the derivatives of error functions, but instead focuses on some of the simple minimization methods one can employ. The goal of this notebook is to develop a simple yet flexible framework in Haskell in which we can develop gradient descent methods.

Although you should keep in mind that the goal of these algorithms (for our purposes) is to train neural networks, for now we will just discuss some abstract function $f(\vec x)$ for which we can compute all partial derivatives. $\newcommand\vector[1]{\langle #1 \rangle}\newcommand\p[2]{\frac{\partial #1}{\partial #2}}$

Gradient Descent¶

The simplest algorithm for iterative minimization of differentiable functions is known as just gradient descent. Recall that the gradient of a function is defined as the vector of partial derivatives:

$$\nabla f(x) = \vector{\p{f}{x_1}, \p{f}{x_2}, \ldots, \p{f}{x_n}}$$

and that the gradient of a function always points towards the direction of maximal increase at that point.

Equivalently, it points away from the direction of maximum decrease - thus, if we start at any point, and keep moving in the direction of the negative gradient, we will eventually reach a local minimum.

This simple insight leads to the Gradient Descent algorithm. Outlined algorithmically, it looks like this:

Pick a point $x_0$ as your initial guess.
Compute the gradient at your current guess:

$v_i = \nabla f(x_i)$ 3. Move by $\alpha$ (your step size) in the direction of that gradient: $x_{i+1} = x_i + \alpha v_i$ 4. Repeat steps 1-3 until your function is close enough to zero (until $f(x_i) < \varepsilon$ for some small tolerance $\varepsilon$)

Note that the step size, $\alpha$, is simply a parameter of the algorithm and has to be fixed in advance.

Though this algorithm is simple, it will be a bit of a challenge to formalize it into executable Haskell code that we can later extend to other algorithms. First, note that gradient descent requires two things:

Something to optimize (a function)
What to optimize over (the parameters)
A way to compute partials of the function

Note that we don't actually need to call the function itself - only the partial derivatives are necessary.

We're going to define a single class for things on which we can run gradient descent. Although later we may want to modify this class, this serves as a beginning:

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4