Scikit-learn (often shortened to “sklearn”) is a free, open-source machine learning library for Python that provides simple and efficient tools for data analysis and modeling. Built on NumPy, SciPy, and matplotlib, scikit-learn has become the go-to library for many data scientists and machine learning practitioners. Whether you’re a beginner just starting your machine learning journey or an experienced practitioner looking for reliable implementations, scikit-learn offers a consistent interface that makes experimenting with different algorithms straightforward and accessible.
The library was initially developed by David Cournapeau as part of a Google Summer of Code project in 2007. Since then, it has grown into a robust ecosystem maintained by a diverse community of contributors worldwide. The name “scikit-learn” comes from the fact that it’s a “SciKit” (SciPy Toolkit), an add-on package for SciPy, focusing specifically on machine learning algorithms.
The functionality that scikit-learn provides includes:
Classification and regression algorithms
Clustering and dimensionality reduction
Model selection and evaluation tools
Data preprocessing and feature engineering utilities
Installing scikit-learn is straightforward with Python’s package manager, pip. Scikit-learn depends on NumPy and SciPy, but pip will install these automatically if they aren’t already present.
pip install -U scikit-learn
Alternatively, if you’re using Anaconda, you can install scikit-learn using conda:
conda install scikit-learn
To verify your installation, you can import the library in Python:
import sklearn
print(sklearn.__version__)
If the installation was successful, this code will print the version of scikit-learn installed on your system.
Step-by-step: building your first scikit-learn model
Let’s create a machine learning model using scikit-learn. We’ll walk through the complete workflow for building a model:
Step 1: Load a dataset
First, we’ll load the Iris dataset, one of scikit-learn’s built-in datasets:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
print("\nFirst 5 rows of X:\n", X[:5])
Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Step 2: Split the dataset
Next, we’ll divide our data into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
X_train shape: (105, 4)
X_test shape: (45, 4)
y_train shape: (105,)
y_test shape: (45,)
Step 3: Train the model
Now we’ll train a K-Nearest Neighbors classifier using our training data:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
Step 4: Make predictions
With our trained model, we can now make predictions on the test data:
y_pred = knn.predict(X_test)
print("First 5 predictions:", y_pred[:5])
print("First 5 actual values:", y_test[:5])
Step 5: Evaluate the model
Next, we’ll evaluate how well our model performed:
from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(y_test, y_pred))
Step 6: Make new predictions
Finally, let’s use our model to predict the species of some new iris flowers:
new_samples = [[5.1, 3.5, 1.4, 0.2],
               [6.3, 3.3, 6.0, 2.5],
               [5.9, 3.0, 4.2, 1.5]]
new_predictions = knn.predict(new_samples)
for i, pred in enumerate(new_predictions):
    print(f"Sample {i+1}: Predicted as {iris.target_names[pred]}")
This example demonstrates the typical machine learning workflow with scikit-learn, from loading data to making new predictions with a trained model.
Key features of Scikit-learn
Scikit-learn stands out among machine learning libraries due to several key features that make it user-friendly and powerful:
Consistent API
Scikit-learn provides a uniform interface where most estimators follow the same pattern:
Initialize: model = Algorithm(params)
Train: model.fit(X_train, y_train)
Predict: y_pred = model.predict(X_test)
Evaluate: score = model.score(X_test, y_test)
This consistency makes experimenting with different algorithms quick and intuitive.
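To make this concrete, here is a minimal sketch (the built-in Iris data and the two classifiers are arbitrary choices) showing how the same calls work unchanged across estimators:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# The same fit/score calls work for both estimators, no other changes needed
for model in (DecisionTreeClassifier(random_state=42), SVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))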
Wide range of algorithms
Scikit-learn provides implementations of many popular machine learning algorithms, including:
Supervised learning: Linear/logistic regression, decision trees, random forests, SVMs
Unsupervised learning: K-means, hierarchical clustering, PCA, t-SNE
Model selection: Cross-validation, grid search, hyperparameter tuning
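To illustrate the model selection tools in the last bullet, here is a minimal cross-validation sketch (the choice of logistic regression and 5 folds is arbitrary):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Five train/evaluate rounds, each holding out a different fifth of the data
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")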
Preprocessing tools
Data preprocessing is a crucial step in any machine learning pipeline. Scikit-learn offers various tools for:
Feature scaling (StandardScaler, MinMaxScaler)
Encoding (OneHotEncoder, LabelEncoder)
Feature selection and extraction
Missing value handling
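As a quick sketch of three of these tools on toy data (note that the sparse_output parameter of OneHotEncoder requires scikit-learn 1.2 or newer; older versions use sparse=False):
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Feature scaling: transform a numeric column to zero mean, unit variance
ages = np.array([[23.0], [35.0], [58.0], [41.0]])
print(StandardScaler().fit_transform(ages))
# Encoding: one binary column per category
colors = np.array([['red'], ['green'], ['red'], ['blue']])
print(OneHotEncoder(sparse_output=False).fit_transform(colors))
# Missing value handling: replace NaNs with the column mean
incomes = np.array([[40.0], [np.nan], [60.0]])
print(SimpleImputer(strategy='mean').fit_transform(incomes))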
Pipeline support
The Pipeline class allows you to chain multiple preprocessing steps and a final estimator into a single object, making your workflow more organized and less prone to errors like data leakage:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
Model persistence
Scikit-learn makes it easy to save your trained models for later use:
import joblib  # sklearn.externals.joblib was removed; import joblib directly
joblib.dump(model, 'model.pkl')  # model is any fitted estimator, e.g. the pipe above
loaded_model = joblib.load('model.pkl')
Where is Scikit-learn used?
Scikit-learn is widely used across various domains and industries:
Data science and research
Researchers use scikit-learn to prototype models quickly and analyze experimental data. Its accessibility and extensive documentation make it ideal for academic research and publication.
Business and industry
Companies use scikit-learn for:
Customer segmentation and behavior analysis
Demand forecasting and inventory management
Fraud detection and risk assessment
Recommendation systems
Education
Scikit-learn’s simplicity makes it an excellent tool for teaching machine learning concepts, which is why many educational platforms, including Codecademy, use it in their courses.
Prototyping
Even teams that ultimately deploy models using other frameworks often prototype using scikit-learn due to its quick setup and ease of use.
Use cases of Scikit-Learn
Let’s explore some common use cases for scikit-learn with practical examples:
Classification
Classification involves predicting a categorical label. Here’s a simple example using the famous Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Regression
Regression is used to predict continuous values. Here’s an example using the California housing dataset (the Boston housing dataset seen in older tutorials was removed from scikit-learn in version 1.2):
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
Clustering
Clustering is an unsupervised learning technique for grouping similar data points:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, random_state=0)
y_kmeans = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.title('K-means Clustering Results')
plt.show()
Dimensionality reduction
PCA (Principal Component Analysis) is commonly used for reducing the number of features while preserving variance:
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
X = digits.data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
plt.figure(figsize=(10, 8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=digits.target, cmap='viridis', alpha=0.5)
plt.colorbar()
plt.title('PCA of Digits Dataset')
plt.show()
Model selection
Scikit-learn provides tools for finding the best hyperparameters:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
grid = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy',
    verbose=1
)
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.2f}")
Scikit-Learn vs TensorFlow vs PyTorch
When choosing a machine learning library, it’s important to understand how scikit-learn compares to alternatives:
Feature                      Scikit-learn                             TensorFlow                            PyTorch
Primary focus                Classical ML algorithms                  Deep learning                         Deep learning
Learning curve               Gentle                                   Steep                                 Moderate
Performance with large data  Limited                                  Excellent                             Excellent
GPU acceleration             Limited                                  Extensive                             Extensive
Neural network support       Basic                                    Advanced                              Advanced
Deployment                   Simple                                   Production-ready                      Research-friendly
Community size               Large                                    Very large                            Large and growing
Ideal use cases              Classical ML, prototyping, tabular data  Production deep learning, deployment  Research, experimentation, flexibility
When to choose Scikit-Learn
You’re learning machine learning fundamentals
You need quick prototyping with classical algorithms
Your dataset fits in memory
You want a consistent, simple API
You’re working with structured, tabular data
When to choose the alternatives:
TensorFlow: For production-ready deep learning, mobile/edge deployment, or when using the TensorFlow Extended (TFX) ecosystem
PyTorch: For research-oriented projects, rapid experimentation, or when dynamic computational graphs are needed
Scikit-learn is an invaluable tool in any data scientist’s toolkit, offering a perfect balance of simplicity and power. Its consistent API, comprehensive documentation, and wide range of algorithms make it an excellent choice for beginners and experienced practitioners alike.
Whether you’re classifying emails, predicting stock prices, segmenting customers, or reducing dimensionality for visualization, scikit-learn provides the tools you need to build effective machine learning models in Python.
Ready to deepen your scikit-learn skills? Explore Codecademy’s Machine Learning with Python course, which covers scikit-learn in depth, from basic concepts to advanced techniques.
Frequently asked questions
1. Is scikit-learn better than TensorFlow?
Neither is inherently “better” – they serve different purposes. Scikit-learn excels at traditional machine learning algorithms with a simple, consistent API, making it ideal for beginners and for quickly prototyping models. TensorFlow specializes in deep learning and neural networks, offering more flexibility and computational power for complex models, especially those requiring GPU acceleration. Choose scikit-learn for classical machine learning tasks and TensorFlow for deep learning projects.
2. What is the difference between sklearn and scikit-learn?
There is no difference – “sklearn” is simply the abbreviation used in Python import statements for the scikit-learn library. When importing the library, you use import sklearn, but the full name of the project is “scikit-learn.” This naming convention follows Python’s import system requirements, where hyphens aren’t allowed in module names.
3. What are the advantages of using scikit-learn?
Scikit-learn offers several advantages:
Consistency: All algorithms follow the same API pattern
Comprehensive documentation: Extensive examples and tutorials
Integration: Works seamlessly with NumPy, Pandas, and matplotlib
Preprocessing tools: Robust toolset for data preparation
Model selection: Built-in cross-validation and hyperparameter tuning
Active community: Regular updates and responsive support
Low dependencies: Minimal external requirements beyond NumPy and SciPy
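To illustrate the integration point above, here is a small sketch of fitting an estimator directly on a pandas DataFrame (the as_frame option requires scikit-learn 0.23 or newer; the choice of logistic regression is arbitrary):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris(as_frame=True)
df = iris.frame  # a pandas DataFrame with named feature columns plus 'target'
model = LogisticRegression(max_iter=200)
model.fit(df[iris.feature_names], df['target'])
print(model.score(df[iris.feature_names], df['target']))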
4. What is the difference between Keras and scikit-learn?
Keras and scikit-learn serve different purposes and excel in different areas. Keras is a high-level neural networks API that runs on top of TensorFlow, specializing in deep learning models like convolutional neural networks and recurrent neural networks. Scikit-learn focuses on traditional machine learning algorithms like decision trees, SVMs, and linear models.