Last Updated : 12 Jul, 2025
In this article, we will build and deploy a Machine Learning model using Flask. We will train a Decision Tree Classifier on the Adult Income Dataset, preprocess the data, and evaluate model accuracy. After training, we’ll save the model and create a Flask web application where users can input data and get real-time predictions about income classification. This will demonstrate how to integrate ML models into web applications using Flask.
Installation and Setup
To create a basic Flask app, refer to: Create Flask App
After creating and activating a virtual environment, install Flask and the other libraries required for this project using these commands:
pip install flask
pip install pandas
pip install numpy
pip install scikit-learn
File Structure
After completing the project, our file structure should look similar to this:
Dataset and Model Selection
We are using the Adult Income Dataset from the UCI Machine Learning Repository. This dataset contains information about individuals, including age, education, occupation, and marital status, with the goal of predicting whether their income exceeds $50K per year.
To download the dataset click here.
Dataset Preview-
Dataset
We are going to use the Decision Tree Classifier, a popular supervised learning algorithm. It is easy to interpret, flexible, and works well with both numerical and categorical data. The model learns patterns from historical data and predicts whether a person’s income is above or below $50K based on their attributes.
Preprocessing Dataset
The dataset consists of 14 attributes and a class label telling whether the income of the individual is less than or more than 50K a year. Before training our machine learning model, we need to clean and preprocess the dataset to ensure better accuracy and efficiency. Create a file named "preprocessing.py"; it will contain the code to preprocess the dataset. Here’s how we prepare the data:
Handling Missing Values: The dataset may contain missing values represented by "?". These are replaced with NaN and then filled using the mode (most frequent value) of each column.
Python
# Filling missing values
df.replace("?", np.nan, inplace=True)
df.fillna(df.mode().iloc[0], inplace=True) # Fill missing values with the mode
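As a quick check (not part of the original script), you can confirm that no missing values remain after this step:
Python
# Optional sanity check: count missing values per column after imputation
print(df.isna().sum())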
Simplifying Categorical Data:
The marital status column is simplified by grouping its values into three broader categories: "divorced", "married" and "not married".
Python
# Discretization (simplifying marital status)
df.replace(['Divorced', 'Married-AF-spouse', 'Married-civ-spouse',
'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed'],
['divorced', 'married', 'married', 'married',
'not married', 'not married', 'not married'], inplace=True)
Encoding Categorical Variables:
Machine learning models work with numbers, so each categorical column is converted into integer codes using LabelEncoder. A mapping dictionary is also built to record which integer corresponds to which original category, and two redundant columns are dropped.
Python
# Label Encoding
category_col = ['workclass', 'race', 'education', 'marital-status', 'occupation',
                'relationship', 'gender', 'native-country', 'income']
label_encoder = preprocessing.LabelEncoder()

# Creating a mapping dictionary of integer codes to original categories
mapping_dict = {}
for col in category_col:
    df[col] = label_encoder.fit_transform(df[col])
    mapping_dict[col] = dict(enumerate(label_encoder.classes_))
print(mapping_dict)

# Dropping redundant columns
df.drop(['fnlwgt', 'educational-num'], axis=1, inplace=True)
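For reference (illustrative only, since the exact output depends on your copy of the dataset), the printed mapping for the marital-status column should look roughly like {0: 'divorced', 1: 'married', 2: 'not married'}. These integer codes are the same values used later in the option tags of the HTML form:
Python
# Illustrative: inspect how a single column was encoded
print(mapping_dict['marital-status'])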
Splitting Features and Target:
The dataset is split into features (X) and target labels (Y), where the target column represents income classification (≤50K or >50K).
Python
# Splitting features and target
X = df.iloc[:, :-1].values # All columns except last
Y = df.iloc[:, -1].values # Only last column
Training and Saving the Model
Now that we have preprocessed our dataset, we can train our machine learning model on it and save it. The dataset is divided into 70% training data and 30% testing data to evaluate the model’s performance, and we use the pickle library to save the trained model locally.
Python
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
# Initialize and Train Decision Tree Classifier
dt_clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=5, min_samples_leaf=5)
dt_clf_gini.fit(X_train, y_train)
# Save Model Using Pickle
with open("model.pkl", "wb") as model_file:
    pickle.dump(dt_clf_gini, model_file)
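Before wiring the model into Flask, it can help to confirm that the saved pickle loads and predicts correctly. The following is a minimal sketch, assuming model.pkl is in the current working directory; the 12 encoded feature values are made-up placeholders:
Python
import pickle
import numpy as np

# Load the model saved above
with open("model.pkl", "rb") as model_file:
    model = pickle.load(model_file)

# Hypothetical encoded input in training-column order: age, workclass, education,
# marital-status, occupation, relationship, race, gender, capital-gain,
# capital-loss, hours-per-week, native-country (12 features)
sample = np.array([39, 6, 9, 2, 0, 1, 4, 1, 2174, 0, 40, 38]).reshape(1, 12)
print(model.predict(sample))  # 1 -> income more than 50K, 0 -> 50K or less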
Creating app.py
Create a file named "app.py"; it will contain the code of our main Flask app.
Python
# Importing libraries
import numpy as np
import flask
import pickle
from flask import Flask, render_template, request

# Creating an instance of the Flask class
app = Flask(__name__)

# Tell Flask which URLs should trigger the function index()
@app.route('/')
@app.route('/index')
def index():
    return flask.render_template('index.html')

# Prediction function
def ValuePredictor(to_predict_list):
    to_predict = np.array(to_predict_list).reshape(1, 12)
    loaded_model = pickle.load(open(r"path_to_the_saved_model", "rb"))
    result = loaded_model.predict(to_predict)
    return result[0]

@app.route('/result', methods=['POST'])
def result():
    if request.method == 'POST':
        to_predict_list = request.form.to_dict()
        to_predict_list = list(to_predict_list.values())
        to_predict_list = list(map(int, to_predict_list))
        result = ValuePredictor(to_predict_list)
        if int(result) == 1:
            prediction = 'Income more than 50K'
        else:
            prediction = 'Income less than 50K'
        return render_template("result.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
Code Breakdown:
- index(): serves the home page (index.html) for both the "/" and "/index" URLs.
- ValuePredictor(): reshapes the 12 submitted values into a (1, 12) NumPy array, loads the saved model with pickle and returns the predicted class.
- result(): handles the form's POST request, converts the submitted values to integers (the option values in the form already match the label-encoded training data), calls ValuePredictor() and renders result.html with a readable prediction message.
We create all the HTML files inside a templates folder, since that is where Flask looks for templates. Here are the HTML files we need to create for this app:
index.html
This page contains a form that takes input from the user and sends it to the "/result" route in app.py, which processes it and predicts the output using the saved model.
HTML
<html>
<body>
<h3>Income Prediction Form</h3>
<div>
<form action="/result" method="POST">
<label for="age">Age</label>
<input type="text" id="age" name="age">
<br>
<label for="w_class">Working Class</label>
<select id="w_class" name="w_class">
<option value="0">Federal-gov</option>
<option value="1">Local-gov</option>
<option value="2">Never-worked</option>
<option value="3">Private</option>
<option value="4">Self-emp-inc</option>
<option value="5">Self-emp-not-inc</option>
<option value="6">State-gov</option>
<option value="7">Without-pay</option>
</select>
<br>
<label for="edu">Education</label>
<select id="edu" name="edu">
<option value="0">10th</option>
<option value="1">11th</option>
<option value="2">12th</option>
<option value="3">1st-4th</option>
<option value="4">5th-6th</option>
<option value="5">7th-8th</option>
<option value="6">9th</option>
<option value="7">Assoc-acdm</option>
<option value="8">Assoc-voc</option>
<option value="9">Bachelors</option>
<option value="10">Doctorate</option>
<option value="11">HS-grad</option>
<option value="12">Masters</option>
<option value="13">Preschool</option>
<option value="14">Prof-school</option>
<option value="15">Some-college</option>
</select>
<br>
<label for="martial_stat">Marital Status</label>
<select id="martial_stat" name="martial_stat">
<option value="0">divorced</option>
<option value="1">married</option>
<option value="2">not married</option>
</select>
<br>
<label for="occup">Occupation</label>
<select id="occup" name="occup">
<option value="0">Adm-clerical</option>
<option value="1">Armed-Forces</option>
<option value="2">Craft-repair</option>
<option value="3">Exec-managerial</option>
<option value="4">Farming-fishing</option>
<option value="5">Handlers-cleaners</option>
<option value="6">Machine-op-inspect</option>
<option value="7">Other-service</option>
<option value="8">Priv-house-serv</option>
<option value="9">Prof-specialty</option>
<option value="10">Protective-serv</option>
<option value="11">Sales</option>
<option value="12">Tech-support</option>
<option value="13">Transport-moving</option>
</select>
<br>
<label for="relation">Relationship</label>
<select id="relation" name="relation">
<option value="0">Husband</option>
<option value="1">Not-in-family</option>
<option value="2">Other-relative</option>
<option value="3">Own-child</option>
<option value="4">Unmarried</option>
<option value="5">Wife</option>
</select>
<br>
<label for="race">Race</label>
<select id="race" name="race">
<option value="0">Amer Indian Eskimo</option>
<option value="1">Asian Pac Islander</option>
<option value="2">Black</option>
<option value="3">Other</option>
<option value="4">White</option>
</select>
<br>
<label for="gender">Gender</label>
<select id="gender" name="gender">
<option value="0">Female</option>
<option value="1">Male</option>
</select>
<br>
<label for="c_gain">Capital Gain </label>
<input type="text" id="c_gain" name="c_gain">btw:[0-99999]
<br>
<label for="c_loss">Capital Loss </label>
<input type="text" id="c_loss" name="c_loss">btw:[0-4356]
<br>
<label for="hours_per_week">Hours per Week </label>
<input type="text" id="hours_per_week" name="hours_per_week">btw:[1-99]
<br>
<label for="native-country">Native Country</label>
<select id="native-country" name="native-country">
<option value="0">Cambodia</option>
<option value="1">Canada</option>
<option value="2">China</option>
<option value="3">Columbia</option>
<option value="4">Cuba</option>
<option value="5">Dominican Republic</option>
<option value="6">Ecuador</option>
<option value="7">El Salvador</option>
<option value="8">England</option>
<option value="9">France</option>
<option value="10">Germany</option>
<option value="11">Greece</option>
<option value="12">Guatemala</option>
<option value="13">Haiti</option>
<option value="14">Netherlands</option>
<option value="15">Honduras</option>
<option value="16">HongKong</option>
<option value="17">Hungary</option>
<option value="18">India</option>
<option value="19">Iran</option>
<option value="20">Ireland</option>
<option value="21">Italy</option>
<option value="22">Jamaica</option>
<option value="23">Japan</option>
<option value="24">Laos</option>
<option value="25">Mexico</option>
<option value="26">Nicaragua</option>
<option value="27">Outlying-US(Guam-USVI-etc)</option>
<option value="28">Peru</option>
<option value="29">Philippines</option>
<option value="30">Poland</option>
<option value="31">Portugal</option>
<option value="32">Puerto-Rico</option>
<option value="33">Scotland</option>
<option value="34">South</option>
<option value="35">Taiwan</option>
<option value="36">Thailand</option>
<option value="37">Trinadad&Tobago</option>
<option value="38">United States</option>
<option value="39">Vietnam</option>
<option value="40">Yugoslavia</option>
</select>
<br>
<input type="submit" value="Submit">
</form>
</div>
</body>
</html>
Output:
result.html
A simple page that renders the predicted output.
HTML
<!doctype html>
<html>
<body>
<h1> {{ prediction }}</h1>
</body>
</html>
Complete preprocessing.py Code
Python
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pickle

# Load dataset
df = pd.read_csv(r"path_to_the_dataset")
# Filling missing values
df.replace("?", np.nan, inplace=True)
df.fillna(df.mode().iloc[0], inplace=True) # Fill missing values with the mode
# Discretization (simplifying marital status)
df.replace(['Divorced', 'Married-AF-spouse', 'Married-civ-spouse',
'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed'],
['divorced', 'married', 'married', 'married',
'not married', 'not married', 'not married'], inplace=True)
# Label Encoding
category_col = ['workclass', 'race', 'education', 'marital-status', 'occupation',
'relationship', 'gender', 'native-country', 'income']
label_encoder = preprocessing.LabelEncoder()
# Creating a mapping dictionary
mapping_dict = {}
for col in category_col:
    df[col] = label_encoder.fit_transform(df[col])
    mapping_dict[col] = dict(enumerate(label_encoder.classes_))
print(mapping_dict)
# Dropping redundant columns
df.drop(['fnlwgt', 'educational-num'], axis=1, inplace=True)
# Splitting features and target
X = df.iloc[:, :-1].values # All columns except last
Y = df.iloc[:, -1].values # Only last column
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
# Initialize and Train Decision Tree Classifier
dt_clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=5, min_samples_leaf=5)
dt_clf_gini.fit(X_train, y_train)
# Predictions
y_pred_gini = dt_clf_gini.predict(X_test)
# Accuracy Score
print("Decision Tree using Gini Index\nAccuracy:", accuracy_score(y_test, y_pred_gini) * 100)
# Save Model Using Pickle
with open("model.pkl", "wb") as model_file:
    pickle.dump(dt_clf_gini, model_file)
Running the Application
To run the application, use the command "python app.py" in the terminal and visit the development URL "http://127.0.0.1:5000". Below is a snapshot of the output and testing.
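To verify the prediction route without the browser form, you can also POST the same fields programmatically. This is a minimal sketch (not part of the original project) using the requests library; the field names come from index.html and the values are hypothetical encoded inputs:
Python
import requests

# Hypothetical encoded form values; the keys match the input names in index.html
form_data = {
    "age": "39", "w_class": "6", "edu": "9", "martial_stat": "2",
    "occup": "0", "relation": "1", "race": "4", "gender": "1",
    "c_gain": "2174", "c_loss": "0", "hours_per_week": "40",
    "native-country": "38",
}

# Assumes the Flask app is running locally on the default port
response = requests.post("http://127.0.0.1:5000/result", data=form_data)
print(response.text)  # HTML of result.html containing the prediction message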
User Input