Last Updated : 05 Jul, 2025
Gradient Descent is an optimization algorithm in machine learning used to determine the optimal parameters such as weights and bias for models. The idea is to minimize the model's error by iteratively updating the parameters in the direction of the steepest descent as determined by the gradient of the loss function.
Depending on how much data is used to compute the gradient during each update, gradient descent comes in three main variants:
- Batch Gradient Descent: uses the entire training dataset for every update.
- Stochastic Gradient Descent (SGD): uses a single training example per update.
- Mini-Batch Gradient Descent: uses a small subset (mini-batch) of training examples per update.
Each variant has its own strengths and trade-offs in terms of speed, stability and convergence behavior.
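To make the distinction concrete, here is a minimal NumPy sketch (a toy example with made-up data, not part of this tutorial's code) showing a single parameter update under each variant:

import numpy as np

# Toy data: 100 examples, 3 features, targets from a known linear relation plus noise
X = np.random.randn(100, 3)
y = X @ np.array([[1.0], [2.0], [3.0]]) + 0.1 * np.random.randn(100, 1)
theta = np.zeros((3, 1))
lr = 0.1

def grad(X_part, y_part, theta):
    # Average squared-error gradient over whatever slice of data is passed in
    return X_part.T @ (X_part @ theta - y_part) / len(X_part)

# Batch gradient descent: one update per pass, gradient over ALL examples
theta -= lr * grad(X, y, theta)

# Stochastic gradient descent: one update per single example
i = np.random.randint(len(X))
theta -= lr * grad(X[i:i+1], y[i:i+1], theta)

# Mini-batch gradient descent: one update per small random subset (e.g. 32 examples)
idx = np.random.choice(len(X), size=32, replace=False)
theta -= lr * grad(X[idx], y[idx], theta)

The only thing that changes between variants is how much data feeds each gradient computation; the update rule itself stays the same.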
[Figure: Convergence in BGD, SGD & MBGD]
Working of Mini-Batch Gradient Descent
Mini-batch gradient descent is an optimization method that updates model parameters using small subsets of the training data called mini-batches. It offers a middle path between the high variance of stochastic gradient descent and the high computational cost of batch gradient descent: because only a small batch is used for each update, training is faster and more memory-efficient, while the averaging within each batch helps stabilize convergence and still introduces beneficial randomness during learning.
It is often preferred in modern machine learning applications because it combines the benefits of both batch and stochastic approaches.
Key advantages of mini-batch gradient descent:
- Faster updates than batch gradient descent, since only a subset of the data is processed per step.
- Lower memory requirements, because the full dataset never enters a single gradient computation.
- Smoother, more stable convergence than pure stochastic gradient descent.
- Efficient use of vectorized operations on modern hardware.
Let max_iters = number of epochs.

For itr = 1, 2, 3, …, max_iters:
  For each mini-batch (X_{mini}, y_{mini}):
1. Forward Pass on the batch X_mini:
Make predictions on the mini-batch
\hat{y} = f(X_{\text{mini}},\ \theta)
Compute error in predictions J(θ) with the current values of the parameters
J(θ)=L(\hat{y},y_{mini})
2. Backward Pass:
Compute gradient:
\nabla_{\theta} J(\theta) = \frac{\partial J(\theta)}{\partial \theta}
3. Update parameters:
Gradient descent rule:
\theta = \theta - \eta \nabla_{\theta} J(\theta)
where \eta is the learning rate.
Python Implementation
Here we will use Mini-Batch Gradient Descent for Linear Regression.
1. Importing Libraries
We begin by importing the required libraries, NumPy and matplotlib.pyplot.
import numpy as np
import matplotlib.pyplot as plt
2. Generating Synthetic 2D Data
Here, we generate 8000 two-dimensional data points sampled from a multivariate normal distribution:
The cov matrix defines the variance of and the correlation between the features. The off-diagonal value of 0.95 indicates a strong positive correlation between the two features.
# Mean of the 2D Gaussian
mean = np.array([5.0, 6.0])
# Covariance matrix: off-diagonal 0.95 gives strongly correlated features
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
# Draw 8000 correlated 2D samples
data = np.random.multivariate_normal(mean, cov, 8000)
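As a quick, optional sanity check (not in the original tutorial), we can inspect the empirical correlation of the generated features; with the covariance above it should come out near 0.95 / sqrt(1.0 * 1.2) ≈ 0.87, varying slightly from run to run:

# Empirical Pearson correlation between the two generated features
print("Correlation:", np.corrcoef(data[:, 0], data[:, 1])[0, 1])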
3. Visualizing Generated Data
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.title("Scatter Plot of First 500 Samples")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
Output:
4. Splitting Data
We first prepend a column of ones to serve as the bias (intercept) term, which changes the data shape from (8000, 2) to (8000, 3). We then split the data into training (90%) and testing (10%) sets:
data = np.hstack((np.ones((data.shape[0], 1)), data))  # shape: (8000, 3)
split_factor = 0.90
split = int(split_factor * data.shape[0])

# Inputs: bias column + feature 1; target: feature 2
X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))
5. Displaying Datasets
print("Number of examples in training set = %d" % X_train.shape[0])
print("Number of examples in testing set = %d" % X_test.shape[0])
Output:
Number of examples in training set = 7200
Number of examples in testing set = 800

6. Defining Core Functions of Linear Regression
We define the hypothesis function, the gradient of the cost and the mean squared error cost used during training:
# Hypothesis function
def hypothesis(X, theta):
    return np.dot(X, theta)

# Gradient of the cost function
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    grad = np.dot(X.T, (h - y))
    return grad

# Mean squared error cost
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).T, (h - y)) / 2
    return J[0]
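As an optional check (not part of the original article), the analytic gradient can be compared against a finite-difference approximation; the two should agree closely and the check should typically print True:

# Optional: compare the analytic gradient with a numerical approximation
theta0 = np.random.randn(X_train.shape[1], 1)
eps = 1e-5
num_grad = np.zeros_like(theta0)
for j in range(theta0.shape[0]):
    t_plus, t_minus = theta0.copy(), theta0.copy()
    t_plus[j] += eps
    t_minus[j] -= eps
    num_grad[j] = (cost(X_train, y_train, t_plus) - cost(X_train, y_train, t_minus)) / (2 * eps)
print(np.allclose(num_grad, gradient(X_train, y_train, theta0), rtol=1e-4))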
7. Creating Mini-Batches for Training
This function divides the dataset into random mini-batches used during training:
# Create mini-batches from the dataset
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    # Full-size mini-batches
    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))

    # Remaining examples that do not fill a complete batch
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))
    return mini_batches
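A quick usage check (not part of the original tutorial) confirms the number of batches and their shapes; with 7200 training rows and a batch size of 32 we expect 225 full batches:

batches = create_mini_batches(X_train, y_train, batch_size=32)
print(len(batches))                               # 225 for 7200 examples and batch_size=32
print(batches[0][0].shape, batches[0][1].shape)   # (32, 2) (32, 1)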
8. Mini-Batch Gradient Descent Function
This function performs mini-batch gradient descent to train the linear regression model:
- The parameters theta are initialized to zeros and an empty list error_list tracks the cost over time.
- In each epoch (up to max_iters), the dataset is divided into mini-batches.
- For every mini-batch, the function updates theta to reduce the cost and records the current error for tracking training progress.
# Mini-batch gradient descent
def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    max_iters = 3
    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for X_mini, y_mini in mini_batches:
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))
    return theta, error_list
9. Training and Visualization
The model is trained using gradientDescent() on the training data. After training:
- The learned bias and coefficients are printed.
- The cost recorded after each mini-batch update is plotted.
This provides visual and quantitative insight into how well mini-batch gradient descent is optimizing the regression model.
theta, error_list = gradientDescent(X_train, y_train)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])
# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()
Output:
[Figure: Mini-Batch over Regression model]
10. Final Prediction and Evaluation
- Prediction: The hypothesis() function is used to compute predicted values for the test set.
- Visualization: The predicted line is plotted against the actual test targets for the first feature.
- Evaluation: The mean absolute error between predictions and actual values is computed.
# Predicting output for X_test
y_pred = hypothesis(X_test, theta)
# Visualizing predictions vs actual values
plt.scatter(X_test[:, 1], y_test, marker='.', label='Actual')
plt.plot(X_test[:, 1], y_pred, color='orange', label='Predicted')
plt.xlabel("Feature 1")
plt.ylabel("Target")
plt.title("Model Predictions vs Actual Values")
plt.legend()
plt.grid(True)
plt.show()
# Calculating mean absolute error
error = np.sum(np.abs(y_test - y_pred)) / y_test.shape[0]
print("Mean Absolute Error =", error)
Output:
[Figure: Model prediction and Actual values]
The orange line represents the final hypothesis function learned by the model, i.e. predicted y = θ[0] + θ[1] * X_test[:, 1], where:
- θ[0] is the bias (intercept)
- θ[1] is the weight for the input feature (Feature 1)
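As an optional cross-check (not part of the original article), the learned parameters can be compared against the closed-form least squares solution. After only three epochs with a small learning rate the two will not match exactly, but gradient descent should be heading toward the closed-form values:

# Closed-form least squares fit for comparison with the learned theta
theta_exact, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print("Gradient descent:", theta.ravel())
print("Least squares:   ", theta_exact.ravel())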
Let's look at a quick comparison of Batch Gradient Descent, Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent.
Batch Gradient Descent
Update Strategy: Updates parameters after computing the gradient using the entire training dataset
Speed & Efficiency: Slow, as it processes the full dataset before each update
Noise in Updates: Smooth and stable

Stochastic Gradient Descent (SGD)
Update Strategy: Updates parameters after computing the gradient using one training example
Speed & Efficiency: Faster updates, but cannot fully utilize vectorized computations
Noise in Updates: Highly noisy updates

Mini-Batch Gradient Descent
Update Strategy: Updates parameters using a small batch (subset) of training examples
Speed & Efficiency: Efficient; leverages vectorization for faster computation
Noise in Updates: Moderate noise, dependent on batch size
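To make the comparison concrete, the gradientDescent() function defined above can emulate all three variants simply by changing batch_size (a rough sketch; runtimes and noise levels will differ). Note that because gradient() sums over the batch rather than averaging, the effective step grows with the batch size, so a much smaller learning rate is used for the full-batch case:

# Batch GD: the whole training set forms a single batch; shrink the learning rate for stability
theta_bgd, _ = gradientDescent(X_train, y_train, learning_rate=1e-6, batch_size=X_train.shape[0])

# SGD: each example is its own batch (slow in pure Python loops, noisy updates)
theta_sgd, _ = gradientDescent(X_train, y_train, learning_rate=0.001, batch_size=1)

# Mini-batch GD: the compromise used throughout this article
theta_mbgd, _ = gradientDescent(X_train, y_train, learning_rate=0.001, batch_size=32)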