Pandas is an excellent tool for working with smaller datasets, typically up to two or three gigabytes. Once a dataset grows beyond that, working with it in Pandas can become problematic, because Pandas loads the entire dataset into memory before processing it; if the data is too large for the available RAM, this leads to memory errors. Even with smaller datasets, memory problems can arise, since preprocessing and modifications often create duplicate copies of the DataFrame.
Despite these challenges, several techniques let you handle larger datasets efficiently with Pandas in Python. Let’s explore methods that allow you to work with millions of records while minimizing memory usage.
How to handle Large Datasets in Python?
- Use Efficient Datatypes: Use smaller types (e.g., int32 instead of int64, float32 instead of float64) to reduce memory usage.
- Load Less Data: Use the usecols parameter in pd.read_csv() to load only the necessary columns, reducing memory consumption.
- Chunking: Use the chunksize parameter in pd.read_csv() to read the dataset in smaller chunks, processing each chunk iteratively.
- Optimise Pandas dtypes: Use the astype method to convert columns to more memory-efficient types after loading the data, if appropriate.

Use Efficient Datatypes
import pandas as pd
# Define the size of the dataset
num_rows = 1000000 # 1 million rows
# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)
# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)
# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())
# Convert to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')
# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())
Output:
Memory usage before conversion:
16000128
Memory usage after conversion:
5000128
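Downcasting after loading only helps once the full-width columns have already been materialised. If the data is read from a file, the dtype parameter of pd.read_csv() can request the smaller types up front. A minimal sketch, assuming a hypothetical file large_data.csv with an integer column A and a float column B:

import pandas as pd

# Hypothetical file and column names, used only for illustration.
# The smaller types are used as the file is parsed, so the wider
# int64/float64 columns never have to fit in memory.
df = pd.read_csv('large_data.csv', dtype={'A': 'int32', 'B': 'float32'})
print(df.dtypes)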
Load Less Data
import pandas as pd
# Create sample DataFrame
data = {'A': range(1000),
        'B': range(1000),
        'C': range(1000),
        'D': range(1000)}
# Load only specific columns
df = pd.DataFrame(data)
df_subset = df[['A', 'D']]
print('Specific Columns of the DataFrame')
print(df_subset.head())
Output:
Specific Columns of the DataFrame
   A  D
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
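The example above selects columns from a DataFrame that is already fully in memory. When reading from a file, the usecols parameter of pd.read_csv() avoids loading the unwanted columns in the first place. A minimal sketch, assuming a hypothetical dataset.csv that contains columns A, B, C and D:

import pandas as pd

# Only columns A and D are parsed; B and C are skipped entirely
df_subset = pd.read_csv('dataset.csv', usecols=['A', 'D'])
print(df_subset.head())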
Sampling:
import pandas as pd
# Create sample DataFrame
data = {'A': range(1000),
        'B': range(1000),
        'C': range(1000),
        'D': range(1000)}
# Sample 10% of the dataset
df = pd.DataFrame(data)
df_sample = df.sample(frac=0.1, random_state=42)
print(df_sample.head())
Output:
       A    B    C    D
521  521  521  521  521
737  737  737  737  737
740  740  740  740  740
660  660  660  660  660
411  411  411  411  411
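df.sample() needs the full DataFrame in memory before it can draw the sample. If the file itself is too large to load, one workaround is to skip rows while reading by passing a callable to the skiprows parameter of pd.read_csv(). A minimal sketch, assuming a hypothetical big_data.csv with a header row, keeping roughly 10% of the rows:

import random
import pandas as pd

# Keep the header (row 0) and, on average, 10% of the remaining rows
keep_fraction = 0.1
df_sample = pd.read_csv(
    'big_data.csv',
    skiprows=lambda i: i > 0 and random.random() > keep_fraction
)
print(len(df_sample))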
Chunking:
import pandas as pd
# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}
df = pd.DataFrame(data)
# Process data in chunks of 1000 rows
chunk_size = 1000
# groupby yields (group_key, chunk) pairs
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)
Output:
(0,        A    B
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
995  995  995
996  996  996
997  997  997
998  998  998
999  999  999

[1000 rows x 2 columns])
(1,          A     B
1000  1000  1000
1001  1001  1001
1002  1002  1002
1003  1003  1003
1004  1004  1004
...    ...   ...
1995  1995  1995
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999

[1000 rows x 2 columns])
...
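The grouping above splits a DataFrame that already fits in memory. For a file that does not, the chunksize parameter of pd.read_csv() streams the data in pieces instead, so only one chunk is held in memory at a time. A minimal sketch, assuming a hypothetical big_data.csv with a numeric column A:

import pandas as pd

chunk_size = 100000
total = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv('big_data.csv', chunksize=chunk_size):
    # Aggregate each chunk so only the running total is kept
    total += chunk['A'].sum()
print(total)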
Optimising Pandas dtypes:
import pandas as pd
# Create sample DataFrame
data = {'date_column': ['2022-01-01', '2022-01-02', '2022-01-03'],
        'numeric_column': [1.234, 2.345, 3.456]}
df = pd.DataFrame(data)
# Convert inefficient dtypes
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], downcast='float')
print(df.dtypes)
Output:
date_column       datetime64[ns]
numeric_column           float32
dtype: object
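pd.to_datetime and pd.to_numeric cover dates and numbers; for low-cardinality text columns, astype('category') is another large memory saver, since repeated strings are stored once and referenced by small integer codes. A minimal sketch with an illustrative DataFrame:

import pandas as pd

# A repetitive string column, typical of city/status/label fields
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Chennai', 'Delhi'] * 250000})
print("Memory usage as object dtype:")
print(df.memory_usage(deep=True).sum())
# Convert the column to the category dtype
df['city'] = df['city'].astype('category')
print("Memory usage as category dtype:")
print(df.memory_usage(deep=True).sum())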
Parallelising Pandas with Dask:
import dask.dataframe as dd
import pandas as pd
# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}
df = pd.DataFrame(data)
# Load data using Dask
ddf = dd.from_pandas(df, npartitions=4)
# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)
Output:
         B
A
0        0.0
1        1.0
2        2.0
3        3.0
4        4.0
...      ...
9995  9995.0
9996  9996.0
9997  9997.0
9998  9998.0
9999  9999.0

[10000 rows x 1 columns]
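Here the Dask DataFrame is built from a pandas DataFrame that already fits in memory, which keeps the demo self-contained. For genuinely large files, data is usually loaded directly with dd.read_csv(), so it is read lazily, partition by partition. A minimal sketch, assuming a hypothetical large_file.csv with columns A and B:

import dask.dataframe as dd

# The file is read lazily in partitions rather than all at once
ddf = dd.read_csv('large_file.csv')
# Work is only executed when compute() is called
result = ddf.groupby('A')['B'].mean().compute()
print(result)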