Last Updated : 12 Jul, 2025
Analyzing the selling price of used cars is essential for making informed decisions in the automotive market. Using Python, we can efficiently process and visualize data to uncover key factors influencing car prices. This analysis not only aids buyers and sellers but also enables predictive modeling for future price estimation. This article will explore how to analyze the selling price of used cars using Python.
Step 1: Understanding the Dataset
The dataset contains various attributes of used cars, including price, make, body style, horsepower and more. Our goal is to analyze these factors and determine their impact on selling price. To download the file used in this example, click here.
Problem Statement: Our friend Otis wants to sell his car but isn't sure what price to ask. He wants to maximize profit while still offering buyers a reasonable deal. To help Otis, we will analyze the dataset and determine the factors affecting car prices.
Step 2: Converting .data File to .csv
If the dataset is in .data format, convert it to .csv before continuing. Now we can proceed with loading the dataset into Python.
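The conversion can also be done in pandas. The following is a minimal sketch, assuming the .data file is comma-separated with no header row; the helper name `convert_data_to_csv`, the file names and the two sample rows are made up for illustration. It keeps pandas' default index column because the loading step drops the first column afterwards, which is an assumption about how output.csv was produced.

```python
import os
import tempfile

import pandas as pd


def convert_data_to_csv(src_path, dst_path):
    """Read a headerless, comma-separated .data file and save it as .csv."""
    # header=None: the raw file has no header row, so columns are numbered 0, 1, 2, ...
    raw = pd.read_csv(src_path, header=None)
    # Keep the default index column; the loading step drops the first column.
    raw.to_csv(dst_path)
    return raw


# Demo on a tiny made-up sample written to a temporary directory (not the real dataset).
sample_text = "3,?,alfa-romero,gas,13495\n1,164,audi,gas,13950\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.data")
    dst = os.path.join(tmp, "output.csv")
    with open(src, "w") as f:
        f.write(sample_text)
    converted = convert_data_to_csv(src, dst)
    # Same loading pattern as Step 4: read the file, then drop the first column.
    reloaded = pd.read_csv(dst).iloc[:, 1:]

print(converted.shape)
print(reloaded.head())
```

For the real file, only the call `convert_data_to_csv(<your .data path>, 'output.csv')` would be needed.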
Step 3: Install and Import Required Python Libraries
To analyze the data, install the following Python libraries using the command below:
pip install pandas numpy matplotlib seaborn scipy
Then import numpy, pandas, matplotlib, seaborn and scipy:
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats works on older SciPy versions
Step 4: Load the Dataset
Now we load the dataset into a Pandas DataFrame and preview its first five rows.
Python
df = pd.read_csv('output.csv')
# Drop the first column (presumably a saved index column, not a car attribute)
df = df.iloc[:, 1:]
df.head()
Output:
Step 5: Assign Column Headers
To make our dataset more readable, we assign column headers:
Python
headers = ["symboling", "normalized-losses", "make",
"fuel-type", "aspiration","num-of-doors",
"body-style","drive-wheels", "engine-location",
"wheel-base","length", "width","height", "curb-weight",
"engine-type","num-of-cylinders", "engine-size",
"fuel-system","bore","stroke", "compression-ratio",
"horsepower", "peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers
df.head()
Output:
Step 6: Check for Missing Values
Missing values can impact our analysis. Let's check if any columns contain missing values.
Python
data = df  # alias: data and df refer to the same DataFrame
data.isna().any()  # isnull() is an alias of isna(), so one call is enough
Output:
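Note that isna() only flags true NaN values. If missing entries are stored as a placeholder string such as ?, which the price column may contain, they are not reported. The sketch below shows one way to count and convert such placeholders; the small frame is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Small made-up frame; "?" stands in for a missing value.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "honda"],
    "price": ["13950", "?", "7295"],
    "horsepower": ["102", "101", "?"],
})

# isna() finds nothing here because "?" is an ordinary string.
nan_counts_before = sample.isna().sum()

# Count the placeholder per column, then convert it to NaN.
placeholder_counts = (sample == "?").sum()
cleaned = sample.replace("?", np.nan)
nan_counts_after = cleaned.isna().sum()

print(nan_counts_before)
print(placeholder_counts)
print(nan_counts_after)
```

After the replacement, the usual isna()-based checks report the affected columns.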
Step 7: Convert MPG to L/100km
Since fuel consumption is measured differently in different regions, we convert miles per gallon (MPG) to liters per 100 kilometers (L/100km).
Python
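# 235 is a rounded conversion factor from miles per gallon to L/100km
data['city-mpg'] = 235 / df['city-mpg']
# Rename using the actual column name 'city-mpg' so the rename takes effect
data.rename(columns = {'city-mpg': "city-L / 100km"}, inplace = True)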
print(data.columns)
data.dtypes
Output:
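The constant 235 comes from the unit definitions. As a worked check, assuming US gallons (1 US gallon = 3.785411784 L) and 1 mile = 1.609344 km, the exact factor is about 235.21. The function name `mpg_to_l_per_100km` is introduced here only for illustration.

```python
KM_PER_MILE = 1.609344
LITERS_PER_US_GALLON = 3.785411784  # assumption: the mpg figures use US gallons


def mpg_to_l_per_100km(mpg):
    """Convert miles per gallon to liters per 100 km."""
    # One gallon covers `mpg` miles, so 100 km needs
    # 100 / (mpg * KM_PER_MILE) gallons, converted here to liters.
    return 100 * LITERS_PER_US_GALLON / (KM_PER_MILE * mpg)


# The factor that 235 approximates.
factor = 100 * LITERS_PER_US_GALLON / KM_PER_MILE
print(round(factor, 3))

# Compare the exact conversion with the rounded shortcut for 21 mpg.
print(mpg_to_l_per_100km(21), 235 / 21)
```

Because the function uses only arithmetic, it also works element-wise on a pandas Series.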
Step 8: Convert Price Column to Integer
The price column should be numerical, but it may contain string values like ?. We need to clean and convert it:
Python
data.price.unique()
data = data[data.price != '?'].copy()  # copy avoids a SettingWithCopyWarning on the next line
data['price'] = data['price'].astype(int)
data.dtypes
Output:
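The filter above only handles the literal string ?. As an alternative (not the approach used in this article), pd.to_numeric with errors="coerce" turns any non-numeric entry into NaN, which can then be dropped. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up sample; "?" marks a missing price.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "honda"],
    "price": ["13950", "?", "7295"],
})

# errors="coerce" converts anything that is not a number (here "?") to NaN.
sample["price"] = pd.to_numeric(sample["price"], errors="coerce")

# Drop rows without a price, then store the remaining prices as integers.
cleaned = sample.dropna(subset=["price"]).copy()
cleaned["price"] = cleaned["price"].astype(int)

print(cleaned)
print(cleaned.dtypes)
```

This variant also copes with other stray strings, at the cost of silently turning them into missing values.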
Step 9: Normalize Features
To ensure fair comparisons between different features, we normalize the length, width and height columns by dividing each by its maximum value. To categorize cars based on their price, we divide the price range into three equal-width categories: Low, Medium and High.
Python
data['length'] = data['length']/data['length'].max()
data['width'] = data['width']/data['width'].max()
data['height'] = data['height']/data['height'].max()
# binning- grouping values
bins = np.linspace(min(data['price']), max(data['price']), 4)
group_names = ['Low', 'Medium', 'High']
data['price-binned'] = pd.cut(data['price'], bins,
labels = group_names,
include_lowest = True)
print(data['price-binned'])
plt.hist(data['price-binned'])
plt.show()
Output:
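To see how the bin edges are formed, the sketch below repeats the binning on five made-up prices. np.linspace with 4 points yields 3 equal-width intervals, so the bins can hold very different numbers of cars when prices are skewed.

```python
import numpy as np
import pandas as pd

# Made-up prices, skewed toward the low end.
prices = pd.Series([5118, 9000, 18000, 30000, 45400])

# 4 evenly spaced edges between the minimum and maximum give 3 equal-width bins.
bins = np.linspace(prices.min(), prices.max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
binned = pd.cut(prices, bins, labels=group_names, include_lowest=True)

print(bins)
print(binned.value_counts())
```

If balanced groups matter more than equal widths, quantile-based binning is a possible alternative.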
Step 10: Convert Categorical Data to Numerical
Machine learning models require numerical data. We convert categorical variables into numerical ones using one-hot encoding:
Python
pd.get_dummies(data['fuel-type']).head()
data.describe()
Output:
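The get_dummies call returns a new frame and leaves the original data unchanged. To use the encoded columns in a model, they still need to be joined back. One way to do that is sketched below on made-up data; the prefix and the resulting column names are illustrative choices.

```python
import pandas as pd

# Made-up sample with one categorical column.
sample = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "price": [13950, 25552, 7295],
})

# One indicator column per category, stored as 0/1 integers.
dummies = pd.get_dummies(sample["fuel-type"], prefix="fuel-type", dtype=int)

# Replace the original text column with its indicator columns.
encoded = pd.concat([sample.drop(columns="fuel-type"), dummies], axis=1)

print(encoded)
```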
Step 11: Data Visualization
Python
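# Box plot of price alone
plt.boxplot(data['price'])
plt.show()
# Price distribution for each drive-wheels category
sns.boxplot(x ='drive-wheels', y ='price', data = data)
plt.show()
# Engine size against price
plt.scatter(data['engine-size'], data['price'])
plt.title('Scatterplot of Engine Size vs Price')
plt.xlabel('Engine size')
plt.ylabel('Price')
plt.grid()
plt.show()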
Output:
Step 12: Grouping Data by Drive-Wheels and Body-Style
Grouping data helps identify trends based on key variables:
Python
test = data[['drive-wheels', 'body-style', 'price']]
data_grp = test.groupby(['drive-wheels', 'body-style'],
as_index = False).mean()
data_grp
Output:
Step 13: Create a Pivot Table & Heatmap
Python
data_pivot = data_grp.pivot(index = 'drive-wheels',
columns = 'body-style')
data_pivot
plt.pcolor(data_pivot, cmap ='RdBu')
plt.colorbar()
plt.show()
Output:
Step 14: Perform ANOVA Test
The Analysis of Variance (ANOVA) test helps determine if different groups have significantly different means.
Python
data_anova = data[['make', 'price']]
# Group by a single key so get_group accepts a plain string
grouped_anova = data_anova.groupby('make')
anova_results = sp.stats.f_oneway(
    grouped_anova.get_group('honda')['price'],
    grouped_anova.get_group('subaru')['price']
)
print(anova_results)
# Regression plot of engine size against price
sns.regplot(x ='engine-size', y ='price', data = data)
plt.ylim(0, )
plt.show()
Output:
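f_oneway returns an F statistic and a p-value. A small p-value suggests that the group means differ; a large one means the test found no evidence of a difference. The sketch below illustrates the reading on made-up price lists, with 0.05 assumed as the significance threshold; the numbers are not results from the dataset.

```python
from scipy import stats

# Made-up price samples: two similar groups and one clearly more expensive group.
group_a = [7295, 7895, 8845, 9095, 10295]
group_b = [7053, 7603, 9233, 9960, 11259]
group_c = [30760, 36880, 36000, 41315]

alpha = 0.05  # conventional significance threshold (an assumption)

# Similar groups: expect a small F statistic and a large p-value.
f_similar, p_similar = stats.f_oneway(group_a, group_b)

# Very different groups: expect a large F statistic and a small p-value.
f_different, p_different = stats.f_oneway(group_a, group_c)

for label, f_stat, p_value in [
    ("a vs b", f_similar, p_similar),
    ("a vs c", f_different, p_different),
]:
    verdict = "means differ" if p_value < alpha else "no evidence of a difference"
    print(label, round(f_stat, 2), verdict)
```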
This step-by-step analysis helps in understanding the key factors influencing the selling price of used cars. Careful data cleaning, visualization and statistical testing make the findings more reliable and easier to interpret.