Handling Large Datasets in Pandas

Last Updated : 23 Jul, 2025

Pandas is an excellent tool for working with smaller datasets, typically up to two or three gigabytes. Beyond that size, using Pandas becomes problematic, because it loads the entire dataset into memory before processing it, which can exhaust the available RAM. Even with smaller datasets, memory problems can arise, since preprocessing and modifications often create duplicate copies of the DataFrame.

Despite these challenges, several techniques let you handle larger datasets efficiently with Pandas in Python. Let's explore methods that allow you to work with millions of records while minimizing memory usage.

How to handle Large Datasets in Python?
  1. Use Efficient Datatypes: Utilize more memory-efficient data types (e.g., int32 instead of int64, float32 instead of float64) to reduce memory usage.
  2. Load Less Data: Use the usecols parameter in pd.read_csv() to load only the necessary columns, reducing memory consumption.
  3. Sampling: For exploratory data analysis or testing, consider working with a sample of the dataset instead of the entire dataset.
  4. Chunking: Use the chunksize parameter in pd.read_csv() to read the dataset in smaller chunks, processing each chunk iteratively.
  5. Optimizing Pandas dtypes: Use the astype method to convert columns to more memory-efficient types after loading the data, if appropriate.
  6. Parallelizing Pandas with Dask: Use Dask, a parallel computing library, to scale Pandas workflows to larger-than-memory datasets by leveraging parallel processing.
Using Efficient Data Types:
import pandas as pd

# Define the size of the dataset
num_rows = 1000000  # 1 million rows

# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)

# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)

# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())

# Convert to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')

# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())

Output:

Memory usage before conversion:
16000128
Memory usage after conversion:
5000128
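
Downcasting helps numeric columns; for string columns with only a few distinct values, the category dtype often gives a similar saving. Below is a minimal sketch of that idea (the city column and its values are illustrative, not part of the example above):

import pandas as pd

# Illustrative DataFrame with a highly repetitive string column
df = pd.DataFrame({'city': ['London', 'Paris', 'Berlin'] * 100000})

print("Memory as object dtype:", df.memory_usage(deep=True).sum())

# Store repeated strings as a category: values become small integer codes
df['city'] = df['city'].astype('category')

print("Memory as category dtype:", df.memory_usage(deep=True).sum())
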
Load Less Data:
import pandas as pd

# Create sample DataFrame
data = {'A': range(1000),
        'B': range(1000), 
        'C': range(1000), 
        'D': range(1000)}

# Build the full DataFrame, then keep only the columns that are needed
df = pd.DataFrame(data)
df_subset = df[['A', 'D']]
print('Specific Columns of the DataFrame')
print(df_subset.head())

Output:

Specific Columns of the DataFrame
   A  D
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
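
The snippet above selects columns after the full DataFrame already exists. With data on disk, the same idea is applied at read time through the usecols parameter of pd.read_csv, so unused columns never enter memory. A minimal sketch, assuming a hypothetical file large_dataset.csv containing columns A through D:

import pandas as pd

# 'large_dataset.csv' is a hypothetical file used for illustration
# Read only columns A and D; columns B and C are never loaded into memory
df_subset = pd.read_csv('large_dataset.csv', usecols=['A', 'D'])
print(df_subset.head())
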
Sampling:
import pandas as pd

# Create sample DataFrame
data = {'A': range(1000),
        'B': range(1000), 
        'C': range(1000), 
        'D': range(1000)}

df = pd.DataFrame(data)

# Sample 10% of the rows, with a fixed seed for reproducibility
df_sample = df.sample(frac=0.1, random_state=42)
print(df_sample.head())

Output:

       A    B    C    D
521  521  521  521  521
737  737  737  737  737
740  740  740  740  740
660  660  660  660  660
411  411  411  411  411
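
df.sample works once the data is already in memory. If the file itself is too large to load, a rough random sample can be taken while reading by passing a callable to the skiprows parameter of pd.read_csv. A sketch, again assuming a hypothetical large_dataset.csv with a header row:

import pandas as pd
import random

# 'large_dataset.csv' is a hypothetical file used for illustration
# Keep each data row with probability 0.1 (roughly a 10% sample);
# row 0 is the header and is never skipped
df_sample = pd.read_csv(
    'large_dataset.csv',
    skiprows=lambda i: i > 0 and random.random() > 0.1
)
print(df_sample.head())
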
Chunking:
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}

df = pd.DataFrame(data)

# Process the DataFrame in chunks of 1000 rows; each iteration yields a
# (chunk_number, chunk) tuple
chunk_size = 1000
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)

Output:

(0,        A    B
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
995  995  995
996  996  996
997  997  997
998  998  998
999  999  999
[1000 rows x 2 columns])
(1,          A     B
1000  1000  1000
1001  1001  1001
1002  1002  1002
1003  1003  1003
1004  1004  1004
...    ...   ...
1995  1995  1995
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999
[1000 rows x 2 columns])
(2,          A     B
2000  2000  2000
2001  2001  2001
2002  2002  2002
2003  2003  2003
2004  2004  2004
...    ...   ...
2995  2995  2995
2996  2996  2996
2997  2997  2997
2998  2998  2998
2999  2999  2999
[1000 rows x 2 columns])
(3,          A     B
3000  3000  3000
3001  3001  3001
3002  3002  3002
3003  3003  3003
3004  3004  3004
...    ...   ...
3995  3995  3995
3996  3996  3996
3997  3997  3997
3998  3998  3998
3999  3999  3999
[1000 rows x 2 columns])
...
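
The example above splits an in-memory DataFrame by its index. When the data lives in a file, the chunksize parameter of pd.read_csv (mentioned in the list above) does the splitting during the read, yielding one DataFrame per chunk. A minimal sketch, assuming a hypothetical large_dataset.csv with a numeric column A:

import pandas as pd

# 'large_dataset.csv' is a hypothetical file used for illustration
total = 0

# Read 1000 rows at a time; each chunk is an ordinary DataFrame
for chunk in pd.read_csv('large_dataset.csv', chunksize=1000):
    # Aggregate chunk by chunk instead of holding the whole file in memory
    total += chunk['A'].sum()

print(total)
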
Optimizing Pandas dtypes:
import pandas as pd

# Create sample DataFrame
data = {'date_column': ['2022-01-01', '2022-01-02', '2022-01-03'],
        'numeric_column': [1.234, 2.345, 3.456]}

df = pd.DataFrame(data)

# Convert inefficient dtypes
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], downcast='float')

print(df.dtypes)

Output:

date_column       datetime64[ns]
numeric_column           float32
dtype: object
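
When the schema is known in advance, the same conversions can be requested at load time, so the larger default dtypes are never materialized. A sketch assuming a hypothetical large_dataset.csv with the same two columns:

import pandas as pd

# 'large_dataset.csv' is a hypothetical file used for illustration
# Parse the CSV directly into the target dtypes
df = pd.read_csv(
    'large_dataset.csv',
    dtype={'numeric_column': 'float32'},
    parse_dates=['date_column'],
)
print(df.dtypes)
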
Parallelizing Pandas with Dask:
import dask.dataframe as dd
import pandas as pd

# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}

df = pd.DataFrame(data)

# Convert the pandas DataFrame into a Dask DataFrame with 4 partitions
ddf = dd.from_pandas(df, npartitions=4)

# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)

Output:

           B
A           
0        0.0
1        1.0
2        2.0
3        3.0
4        4.0
...      ...
9995  9995.0
9996  9996.0
9997  9997.0
9998  9998.0
9999  9999.0
[10000 rows x 1 columns]
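
Converting an existing pandas DataFrame, as above, still requires the data to fit in memory first. For genuinely large files, Dask can read the CSV itself, lazily and in partitions; dd.read_csv also accepts glob patterns for reading many files at once. A sketch with a hypothetical file name:

import dask.dataframe as dd

# 'large_dataset.csv' is a hypothetical file used for illustration
# Build a lazy task graph; nothing is read until compute() is called
ddf = dd.read_csv('large_dataset.csv')

# The aggregation runs in parallel across partitions at compute time
result = ddf.groupby('A')['B'].mean().compute()
print(result.head())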

