Learning Model Building in Scikit-learn

Last Updated : 27 May, 2025

Building machine learning models from scratch can be complex and time-consuming. Scikit-learn is an open-source Python library that makes machine learning more accessible. It provides a straightforward, consistent interface for a variety of tasks like classification, regression, clustering, data preprocessing and model evaluation. Whether we're new to machine learning or have some experience, it makes it easy to build reliable models quickly. In this article, we'll see the important features and steps to get started with Scikit-learn.

Installing and Using Scikit-learn

Before we start building models, we need to install Scikit-learn. It requires Python 3.8 or newer and depends on two important libraries: NumPy and SciPy. Make sure these are installed first.

To install Scikit-learn run the following command:

pip install -U scikit-learn

This will download and install the latest version of Scikit-learn along with its dependencies.
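Once the install finishes, a quick way to confirm that Scikit-learn is available is to import it and print its version:

Python
import sklearn

# Print the installed version to confirm the setup worked
print(sklearn.__version__)

With the library in place, let's walk through the steps involved in building a model with Scikit-learn.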

Step 1: Loading a Dataset

A dataset is a collection of data used to train and test machine learning models. It has two main parts: the features (the input variables, often called X) and the labels or target (the output we want to predict, often called y).

Scikit-learn includes some ready-to-use example datasets like the Iris and Digits datasets for classification tasks and the Diabetes dataset for regression tasks. Here we will be using the Iris dataset.

Python
from sklearn.datasets import load_iris

# Load the built-in Iris dataset
iris = load_iris()

# Feature matrix and target vector
X = iris.data
y = iris.target

# Human-readable names for the features and target classes
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)

print("\nType of X is:", type(X))

print("\nFirst 5 rows of X:\n", X[:5])

Output:

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Type of X is: <class 'numpy.ndarray'>

First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Sometimes we need to work with our own custom data, in which case we load an external dataset. For this we can use the pandas library, which makes it easy to load and manipulate datasets.

For this you can refer to our article on How to import csv file in pandas?
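As a minimal sketch, assuming a hypothetical file data.csv with a label column named target (both names are placeholders for illustration), loading it and separating features from labels might look like this:

Python
import pandas as pd

# 'data.csv' and the 'target' column are hypothetical placeholders
df = pd.read_csv("data.csv")

# Separate the feature matrix from the target column
X = df.drop(columns=["target"])
y = df["target"]

print(X.shape, y.shape)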

Step 2: Splitting the Dataset

When working with machine learning models, handling large datasets can be computationally expensive. To make training efficient and to evaluate model performance fairly, we split the data into two parts: the training set and the testing set.

The training set is used to teach the model to recognize patterns while the testing set helps us check how well the model performs on new, unseen data. This separation helps in preventing overfitting and gives a more accurate measure of how the model will work in real-world situations. In Scikit-learn the train_test_split function from the sklearn.model_selection module makes this easy.

Here we are splitting the Iris dataset so that 60% of the data is used for training and 40% for testing by setting test_size=0.4. Setting random_state=1 ensures that the split remains the same every time we run the code, which is helpful for reproducibility.

After splitting, we get four subsets: X_train and y_train (the features and labels used for training) and X_test and y_test (the features and labels used for testing).

Python
from sklearn.model_selection import train_test_split

# 60% of the data for training, 40% for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

Now let's check the shapes of the split data to ensure that both sets have the correct proportions of data, avoiding any potential errors in model evaluation or training.

Python
print("X_train Shape:",  X_train.shape)
print("X_test Shape:", X_test.shape)
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)

Output:

X_train Shape: (90, 4)
X_test Shape: (60, 4)
Y_train Shape: (90,)
Y_test Shape: (60,)

Step 3: Handling Categorical Data

Machine learning algorithms require numerical input, so handling categorical data correctly is important. If categorical variables are left as text, the algorithms may misinterpret their meaning, which leads to poor results. To avoid this we convert categorical data into numerical form using encoding techniques such as the following:

1. Label Encoding: It converts each category into a unique integer. For example, in a column with categories like 'cat', 'dog' and 'bird', LabelEncoder assigns the integers alphabetically, so 'bird' becomes 0, 'cat' becomes 1 and 'dog' becomes 2. This method works well when the categories have a meaningful order such as "Low", "Medium" and "High".

Python
from sklearn.preprocessing import LabelEncoder

categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']

# Fit the encoder and transform the categories into integers
encoder = LabelEncoder()
encoded_feature = encoder.fit_transform(categorical_feature)

print("Encoded feature:", encoded_feature)

Output:

Encoded feature: [1 2 2 1 0]
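Continuing the snippet above, the fitted encoder remembers the mapping: encoder.classes_ lists the categories in the order of their assigned integers, and inverse_transform converts the integers back into the original labels:

Python
# Categories in the order of their integer codes: ['bird' 'cat' 'dog']
print("Classes:", encoder.classes_)

# Convert the integer codes back to the original category labels
print("Decoded:", encoder.inverse_transform(encoded_feature))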

2. One-Hot Encoding: It creates binary columns for each category, where each column represents a category. For example, if we have a column with values 'cat', 'dog' and 'bird', it will create three new columns, one for each category, where each row has 1 in the column corresponding to its category and 0s in the others. This method is useful for categorical variables without any order, ensuring that no numeric relationships are implied between the categories.

Python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']

# OneHotEncoder expects a 2D array, so reshape to a single column
categorical_feature = np.array(categorical_feature).reshape(-1, 1)

# sparse_output=False returns a dense NumPy array instead of a sparse matrix
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(categorical_feature)

print("OneHotEncoded feature:\n", encoded_feature)

Output:

OneHotEncoded feature:
 [[0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Besides Label Encoding and One-Hot Encoding, there are other techniques like Mean (Target) Encoding, which replaces each category with the average target value observed for that category.
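As a rough illustration of mean encoding, here is a minimal pandas sketch on a toy DataFrame invented for this example; each category is replaced by the mean of the target for that category:

Python
import pandas as pd

# Toy data invented for illustration
df = pd.DataFrame({
    'animal': ['cat', 'dog', 'dog', 'cat', 'bird'],
    'target': [1, 0, 1, 1, 0]
})

# Mean of the target for each category
means = df.groupby('animal')['target'].mean()

# Replace each category with its target mean
df['animal_encoded'] = df['animal'].map(means)

print(df)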

Step 4: Training the Model

Now that our data is ready, it’s time to train a machine learning model. Scikit-learn has many algorithms with a consistent interface for training, prediction and evaluation. Here we’ll use Logistic Regression as an example.

Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only. 

Python
from sklearn.linear_model import LogisticRegression

# max_iter=200 gives the solver enough iterations to converge on this dataset
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
Step 5: Make Predictions

Once trained, we use the model to make predictions on the test data X_test by calling the predict method. This returns the predicted labels y_pred.

Python
# Predict class labels for the unseen test features
y_pred = log_reg.predict(X_test)
Step 6: Evaluating Model Accuracy

Let's check how well our model is performing by comparing y_test with y_pred. Here we use the accuracy_score function from the metrics module.

Python
from sklearn import metrics
print("Logistic Regression model accuracy:", metrics.accuracy_score(y_test, y_pred))

Output:

Logistic Regression model accuracy: 0.9666666666666667
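Accuracy gives a single overall score. For a per-class breakdown of precision and recall, the same metrics module also provides classification_report, shown here as an optional extra check:

Python
print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))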

Now we want our model to make predictions on new sample data. The sample input can simply be passed in the same way as any feature matrix. Here we use sample = [[3, 5, 4, 2], [2, 3, 5, 4]].

Python
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = log_reg.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output: 

Predictions: [np.str_('virginica'), np.str_('virginica')]

Features of Scikit-learn

Scikit-learn is widely used because it makes building machine learning models straightforward and efficient. Here are some important reasons:

  1. Ready-to-Use Tools: It provides built-in functions for common tasks like data preprocessing, training models and making predictions. This saves time by avoiding the need to code algorithms from scratch.
  2. Easy Model Evaluation: With tools like cross-validation and performance metrics it helps to measure how well our model works and identify areas for improvement.
  3. Wide Algorithm Support: It offers many popular machine learning algorithms including classification, regression and clustering which gives us flexibility to choose the right model for our problem.
  4. Smooth Integration: Built on top of important Python libraries like NumPy and SciPy so it fits into our existing data analysis workflow.
  5. Simple and Consistent Interface: The same straightforward syntax works across different models, making it easier to learn and to switch between algorithms.
  6. Model Tuning Made Easy: Tools like grid search help us fine-tune our model's settings to improve accuracy without extra hassle (see the sketch after this list).
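
As a small illustration of that last point, here is a minimal grid search over the regularization strength C of the Logistic Regression model trained earlier; the grid values are chosen purely for illustration:

Python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Candidate values for C, chosen only as an example
param_grid = {'C': [0.01, 0.1, 1, 10]}

# 5-fold cross-validation over the parameter grid
grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)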

With its accessible tools and reliable performance, Scikit-learn makes machine learning practical and achievable for everyone.


