A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://www.geeksforgeeks.org/machine-learning/optimization-in-neural-networks-and-newtons-method/ below:

Optimization in Neural Networks and Newton's Method

In machine learning, optimizers and loss functions are two components that help improve the performance of the model. A loss function measures the performance of a model by measuring the difference between the output expected from the model and the actual output obtained from the model. Mean square loss, hinge loss, and log loss are some examples of loss functions. The optimizer helps improve the model by adjusting its parameters to minimize the loss function value. SGD, ADAM, RMSProp, and Newton's method are some examples of optimizers. The role of the optimizer is to find the best set of parameters (weights and biases) of the neural network that allow it to make accurate predictions.

The computational method for iterative optimization technique can be broadly divided in three types

In this article we will focus on the Newton method for optimization and how it can be used for training neural networks. Let us first compare it with gradient descent.

Gradient Descent Vs Newton Method

Gradient Descent and the Newton Method are two well-known optimization techniques for training neural networks each method has its advantages and disadvantages. The choice between them depends on the problem at hand, the complexity of the neural network, and available computation resources.

WORKING

It relies only on first order derivative of the loss function to update model parameters

It relies on both first-order (gradient) and second-order derivatives (hessian ) to update model parameters

APPLICABILITY

It is relatively easy to implement and is widely used in practice due to its simplicity.

The Newton method may not be practical or suitable for very large neural networks or when the Hessian matrix is computationally expensive to compute or invert

Convergence

Gradient Descent typically converges to a local minimum but can be sensitive to the choice of learning rate, may get stuck in saddle points, and may have slower convergence in some cases.

The Newton method often converges much faster than Gradient Descent.

Learning Rate

Highly Sensitive

Less sensitive

Basic Concept of Newton Method

The genesis of the Newton's method lies in calculus for finding the roots of a differentiable function F, which are solutions to the equation F (x) = 0.

Let us understand Newton method for root finding , then we will see how it can be used for optimization

X_{t+1} = X_t - f(X_t)/f'(X_t)

X_{t+1} = X_t - f(X_t)/f'(X_t) .

So using this we get,

f'(x) = (f(x_t)-0 )/ (X_t-X_{t+1} ) . Rearranging the above equation we get

X_{t+1} = X_t - f(X_t)/f'(X_t)

Newton Method for Optimization

The above method we discussed is used to find the roots of a function i.e. f(x) = 0. Now we know that derivative of the function is zero at its critical point(minimum/maximum/saddle). So instead of finding roots of f(x) we can find the roots of f'(x). By doing so we find the critical point of f(x).

Newton Method For Optimization


Let us first get an intuition behind Newton method using a single variable case . Then we will extend it to multivariate case.
Consider the above graph in which a single step of Newton method has been shown.

Here, we want to find the minimum of the function f(X):

  1. Here we start with a initial guess at X_0
  2. We approximate the function at X_0 using Taylor series.
  3. Since the derivative of a function is zero at the point of minima/maxima , we find the minimum of this approximation by differentiating and equating to zero. Note this is the minimum of Taylor series which is approximate at X_0
  4. Repeat steps 3 and 4 until convergence is reached (e.g., when the change in x is sufficiently small)

We can extend the above method to multivariate case.

The multivariable quadratic approximation to f(x1,…,xn) can be written.

f(x_0) + (x-x_0)^{T}\nabla f(x_0) + \frac{1}{2}(x -x_0)^{T}H(x-x_0)

Here,

\overrightarrow{\nabla}f(\overrightarrow{x}) =\begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots \\ \frac{\partial f}{\partial x_n}\end{bmatrix}

Multivariable equivalent of the second derivative : Instead of a vector, we now need a matrix. This matrix Hf(x ) is called the Hessian, and is a matrix of the second derivatives of f(x) with respect to each pair of parameters.

Hf(\overrightarrow{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial^{2} x_{1}}& \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & ... & \frac{\partial^2 }{\partial x_1 \partial x_n}\\ \frac{\partial^2 }{\partial x_2 \partial x_1}& & & \frac{\partial^2 }{\partial x_2 \partial x_n}\\ \vdots & & & \vdots \\ \frac{\partial^2 }{\partial x_n \partial x_1} & ... & ... & \frac{\partial^2 f}{\partial^{2} x_{n}} \end{bmatrix}

Hessian Matrix

The update rule in Newton's method uses the Hessian matrix and the gradient to calculate the step size, which is essentially solving a linear system of equations involving the Hessian and the gradient.

Significance of Hessian Matrix
Utilization of Second Derivative

The geometric explanation of Newton's method involves an iterative process where, at each step, it approximates the graph of the function f(x) near the trial value by fitting a parabola. Newton's method makes use of the first derivation (gradient) and second derivative (the Hessian matrix) of the objective function . The parabola is designed to have the same slope and curvature as the function's graph at that particular point. Utilizing second derivative provides a more accurate and informative characterization of the local curvature of the function than the first derivative which only provides the slope information. Once the parabola is constructed, the method proceeds to determine the maximum or minimum of this parabolic approximation.

Gradient descent, on the other hand, relies solely on the first derivative (the gradient). Incorporating second derivative information allows Newton's method to take more precise and efficient steps towards the minimum. Newton methods can converge faster than first-order methods like gradient descent because they consider both the direction and curvature of the loss landscape.

How to use Newton's Method for Optimization?

Basic outline of implementing newtons method for neural network is given below:

Step-by-Step Guide of Newton Method in Python

We will train a custom neural network for image classification using pytorch and LBFGS solver as optimizer. We will use the famous CIFAR10 dataset.

Import Necessary Libraries

Torch is the main library for implementing pytorch based neural network

Python3
import torch
from torch import nn,optim
import torch.nn.functional as F
import pandas as pd
import numpy as np


Initialize Device

We will initialize our device variable to cuda if GPU is available otherwise to CPU.

Python3
device = "cuda" if torch.cuda.is_available() else "cpu"


Create Model

Here we have designed a simple CNN network for classification. It consists of 2 layer of convolution followed by 2 linear layers. We are using ReLU activation and maxpool after convolution layer. We use dropout layer so that all weights are trained. We use Xavier initialization for initializing the weights of the two convolution layer of the model.

Python3
# Creating our own ConvNet
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 1000)
        self.fc2 = nn.Linear(1000, 10)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x) 
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)

        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = ConvNet().to(device)
torch.nn.init.xavier_uniform_(model.conv1.weight)
torch.nn.init.xavier_uniform_(model.conv2.weight)


Load the dataset

Here we load the dataset using the built in class CIFAR10. We use transform class to make sure that the data is in tensor format . We split our data in train and test set. We create a dataloader object to create a iterator that can be used by our model.

Python3
from torchvision import transforms
from torchvision.datasets import CIFAR10
train_transforms = transforms.Compose([ transforms.ToTensor()])
dataset = CIFAR10(root="./test/", train=False, download=True, transform=train_transforms)


train_set, test_set = torch.utils.data.random_split(dataset, [8000, 2000])

trainloader = torch.utils.data.DataLoader( train_set, batch_size=1024,shuffle=True)
testloader = torch.utils.data.DataLoader( test_set,batch_size=1024, shuffle=True)


Loss function and optimizer

We define cross entropy loss as it is a classification problem. We use the LBFGS optimizer. It is based on Newton method that optimizes memory and uses an estimate of the inverse Hessian matrix to save computation resources.

Python3
criterion = nn.CrossEntropyLoss()
optimizer = optim.LBFGS(model.parameters(),lr=0.39,max_iter=2,history_size=10,line_search_fn='strong_wolfe')


Model Training

We train the model . Below is a standard pytorch training code used for neural network. The only difference is use of closure function. This is as per requirement of pytorch implementation of LBFGS.

Python3
N_EPOCHS = 50
for epoch in range(N_EPOCHS):
  epoch_loss = 0.0
  
  model.train()
  for inputs, labels in trainloader:
    inputs = inputs.to(device)
    labels = labels.to(device)
 
    def closure():
      optimizer.zero_grad()
      outputs = model(inputs)
      loss = criterion(outputs, labels)
      loss.backward(retain_graph=True)

      return loss
    
    optimizer.step(closure)

    outputs = model(inputs)
    loss = criterion(outputs, labels)
    epoch_loss += loss.item()

  val_loss = 0.0
  model.eval()
  for inputs, labels in testloader:
    inputs = inputs.to(device) 
    labels = labels.to(device)
  
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    val_loss += loss.item()
  print("Epoch: {} Train Loss: {} Val Loss: {} ".format(epoch, epoch_loss/len(trainloader), val_loss/len(testloader)))

Output:

Epoch: 0 Train Loss: 2.302582234144211 Val Loss: 2.302242159843445 
Epoch: 1 Train Loss: 2.302254170179367 Val Loss: 2.302234172821045
Epoch: 2 Train Loss: 2.301358014345169 Val Loss: 2.3020946979522705
Epoch: 3 Train Loss: 2.302258551120758 Val Loss: 2.3020176887512207
Epoch: 4 Train Loss: 2.300609827041626 Val Loss: 2.298862934112549
Epoch: 5 Train Loss: 2.199757754802704 Val Loss: 2.0873111486434937

After training for 50 epochs we get loss of 1.6 for both train and validation set.

Lets randomly check the output of our model

Python3
y_pred = model(test_set[901][0].unsqueeze(0).to(device))
print(f" Model predicted label : {np.argmax(y_pred.detach().cpu().numpy())}")
print(f" Actual label : {test_set[901][1]}")

Output:

Model predicted label : 1
Actual label : 1


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4