Fast, Automated Feature Engineering and Selection using Polars!
featurewiz_polars is a high-performance Python library designed to accelerate your machine learning workflows by automatically creating and selecting the best features from your dataset. It leverages the speed and memory efficiency of the Polars DataFrame library.
Get started in minutes! Here's a minimal example to create some mock data:
```python
import polars as pl

# Create a sample Polars DataFrame
data = {
    'col1': [1, 2, 1, 3, 4, 5, 1, 6],
    'col2': [10.0, 11.5, 10.0, 12.5, 13.0, 14.5, 10.0, 15.0],
    'category': ['A', 'B', 'A', 'B', 'C', 'A', 'A', 'C'],
    'target': [0, 1, 0, 1, 1, 0, 0, 1]
}
df = pl.DataFrame(data)
```
Alternatively, you can load a CSV file directly into a Polars DataFrame for use with featurewiz-polars. Use this code snippet for featurewiz-polars pipelines:
```python
import polars as pl
from featurewiz_polars import polars_train_test_split

# Load a CSV file into a Polars DataFrame
df = pl.read_csv(datapath + filename, null_values=['NULL', 'NA'],
                 try_parse_dates=True, infer_schema_length=10000,
                 ignore_errors=True)

# Before feature selection, always split the data first
target = 'target'
predictors = [x for x in df.columns if x not in [target]]
X = df[predictors]
y = df[target]

# BEWARE WHEN USING SCIKIT-LEARN `train_test_split` WITH POLARS DATAFRAMES!
# sklearn's train_test_split can return different train/test rows on each run:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instead, split with polars_train_test_split and a fixed random_state, which
# guarantees the same train/test rows every time.
X_train, X_test, y_train, y_test = polars_train_test_split(
    X, y, test_size=0.2, random_state=42)
```
Once you have performed the train-test split on your Polars DataFrame, you can initialize featurewiz_polars and perform feature engineering and selection:
```python
from featurewiz_polars import FeatureWiz

# Initialize FeatureWiz for classification
wiz = FeatureWiz(model_type="Classification", estimator=None, corr_limit=0.7,
                 category_encoders='onehot', classic=True, verbose=0)

# Fit and transform the training data
X_transformed, y_transformed = wiz.fit_transform(X_train, y_train)

# Transform the test data
X_test_transformed = wiz.transform(X_test)

# Transform the test target variable
y_test_transformed = wiz.y_encoder.transform(y_test)
```
Now you can display the selected features and use them further in your model training pipelines:
```python
# View results
print("Selected Features:")
print(wiz.selected_features)
# Example Output: ['col1', 'col2', 'category_A', 'category_B']

print("\nTransformed DataFrame head:")
print(X_transformed.head())
# Example Output: Polars DataFrame with only the selected features
```
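The selected features can then feed any downstream model. A minimal sketch, assuming scikit-learn is available (RandomForestClassifier is an arbitrary choice here, not a library default):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train any downstream model on the selected features.
# X_transformed and X_test_transformed are Polars DataFrames,
# so convert them to NumPy before handing them to scikit-learn.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_transformed.to_numpy(), np.asarray(y_transformed))
preds = clf.predict(X_test_transformed.to_numpy())
print(preds[:5])
```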
🤔 Why Use featurewiz_polars?
While there are many tools for feature manipulation, featurewiz_polars offers a unique combination of speed, automation, and specific algorithms:
Vs. Original featurewiz (Pandas):
Speed & Memory: Built entirely on Polars, featurewiz_polars
offers exceptional speed and memory efficiency. It is particularly well-suited for handling datasets that exceed the limits of Pandas, leveraging Polars' multi-threaded processing and highly optimized Rust-based backend for superior performance.
Modern Backend: Harnesses the cutting-edge Polars ecosystem, so you benefit from its ongoing performance improvements.
Vs. scikit-learn Preprocessing/Selection:
featurewiz_polars simplifies the process: imputation, categorical encoding, correlated-feature removal, and model-based feature selection are bundled into a single fit/transform call, as the sketch below illustrates.
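For contrast, here is a rough sketch of the equivalent manual scikit-learn wiring (illustrative only; the column names and estimator choice are hypothetical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ['col1', 'col2']        # hypothetical column names
categorical_cols = ['category']

# Manual route: assemble imputation, encoding, and selection yourself
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
pipe = Pipeline([
    ('prep', preprocess),
    ('select', SelectFromModel(RandomForestClassifier(random_state=42))),
])
# featurewiz_polars collapses all of these steps into one fit/transform call.
```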
When to Use featurewiz_polars:
With featurewiz_polars, you get speed, simplicity, and cutting-edge techniques, all in one package.
Install featurewiz_polars directly from PyPI:
```bash
pip install featurewiz_polars
```
Or, install the latest development version directly from GitHub:
```bash
pip install git+https://github.com/AutoViML/featurewiz_polars.git
```
Unlike many brand-new libraries, the featurewiz-polars library is well-tested. To help you get started, I've provided example scripts such as fs_test.py, which demonstrates concisely how to unit-test the library using the two datasets in the ./data sub-folder.
To run the unit tests, run the following commands in your terminal:

```bash
cd tests
python fs_test.py
```
Additionally, I have provided two scripts for benchmarking. The fs_lazytransform_test.py script lets you compare the performance of featurewiz-polars on features engineered with the lazytransform library. This is a great boon for users who want to effortlessly create hundreds of features using lazytransform and then use featurewiz-polars for feature selection.
```bash
cd tests
python fs_lazytransform_test.py
```
For a more in-depth benchmarking comparison between featurewiz-polars and another popular MRMR library, use the fs_mr_comparison_test.py script:
```bash
cd tests
python fs_mr_comparison_test.py
```
If you prefer working in a Jupyter Notebook or Google Colab, here are direct links to run featurewiz-polars in Colab:
I have provided additional examples in the ./examples sub-folder. Anybody can open a copy of my GitHub-hosted notebooks in Colab; to make it easier, I have created Open-in-Colab links to those notebooks below:
1. Feature Selection Only with the FeatureWiz transformer
You have already seen this in the Quick Start section.
2. Feature Selection and Model Training with the FeatureWiz_Model pipeline
This approach combines both feature selection and model training into a single scikit-learn pipeline.
```python
from featurewiz_polars import FeatureWiz_Model
from xgboost import XGBRegressor

# Initialize FeatureWiz_Model for regression with an XGBoost regressor
wiz_model = FeatureWiz_Model(model_type="Regression", model=XGBRegressor(),
                             corr_limit=0.7, category_encoders='onehot',
                             classic=True, verbose=0)

# Fit and transform the training data
X_transformed, y_transformed = wiz_model.fit_transform(X_train, y_train)

# Make predictions on the test data
y_pred = wiz_model.predict(X_test)

# View results
print(wiz_model.selected_features)
```
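To sanity-check the trained pipeline, the predictions can be scored with any standard metric. A quick sketch, assuming y_test is a Polars Series and scikit-learn is installed:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Score the pipeline's predictions on the held-out data
rmse = np.sqrt(mean_squared_error(y_test.to_numpy(), y_pred))
print(f"Test RMSE: {rmse:.4f}")
```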
Key Points:
- Use the FeatureWiz_Model class to combine feature selection and model training in one step.
- The fit_transform method fits the feature-selection process and trains the specified model on the training data.
- The predict method both transforms the test data using the learned feature selection and makes predictions with the trained model, streamlining the entire process.

The FeatureWiz class and the FeatureWiz_Model class are designed for building data pipelines that run feature engineering, selection, and model training on Polars DataFrames. All you need to do is load your data into Polars DataFrames and start calling these pipelines.
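If your data currently lives in pandas, moving it into Polars is a one-liner before calling these pipelines (a minimal sketch; the file name is hypothetical):

```python
import pandas as pd
import polars as pl

# Convert an existing pandas DataFrame into a Polars DataFrame
# so it can be passed to FeatureWiz or FeatureWiz_Model.
pdf = pd.read_csv('my_data.csv')   # hypothetical file name
df = pl.from_pandas(pdf)
```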
These pipelines accept the following arguments:

- estimator (estimator object, optional): The estimator used by featurewiz to perform feature selection. Only the following estimators are supported: XGBoost, CatBoost, RandomForest, and LightGBM.
- model (estimator object, optional): The estimator used in the pipeline to train a new model after feature selection. If None, a default estimator (Random Forest) is trained after selection. Defaults to None. This model argument can differ from the estimator argument above. Only the following estimators are supported: XGBoost, CatBoost, RandomForest, and LightGBM.
- model_type (str, optional): The type of model to be built ('classification' or 'regression'). Determines the appropriate preprocessing and feature-selection strategies. Defaults to 'classification'.
- category_encoders (str, optional): The type of encoding to apply to categorical features ('target', 'onehot', etc.). 'woe' encoding is only available for classification model types. Defaults to 'target'.
- imputation_strategy (str, optional): The strategy for handling missing values ('mean', 'median', 'zeros'). Determines how missing data is filled in before feature selection. Defaults to 'mean'.
- corr_limit (float, optional): The correlation threshold for removing highly correlated features. Features with a correlation above this threshold are targeted for removal. Defaults to 0.7.
- classic (bool, optional): If True, implements the original classic featurewiz algorithm using Polars. If False, implements the train-validation-split recursive-xgboost version, which is faster and uses train/validation splits to stabilize the selected features. Defaults to False.
- verbose (int, optional): Controls the verbosity of the output during feature selection: 0 for minimal output, higher values for more detailed information. Defaults to 0.
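Putting several of these arguments together, a typical initialization might look like this (a sketch; the LightGBM estimator and the specific values are arbitrary choices):

```python
from featurewiz_polars import FeatureWiz
from lightgbm import LGBMClassifier

# Example initialization exercising most of the arguments above
wiz = FeatureWiz(
    model_type='classification',    # task type
    estimator=LGBMClassifier(),     # model used internally for selection
    corr_limit=0.8,                 # drop features correlated above 0.8
    category_encoders='target',     # target-encode categorical columns
    imputation_strategy='median',   # fill missing values with the median
    classic=False,                  # use the newer split-driven method
    verbose=1,                      # print progress details
)
```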
Select either the old featurewiz method or the new method using the classic argument in the new library: if you set classic=True, you will get features similar to the old feature-selection method; if you set it to False, you will use the new feature-selection method. I suggest you try both methods to see which set of features works best for your dataset.
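One simple way to compare the two modes is to run the selector twice and inspect the resulting feature sets (a sketch, reusing X_train and y_train from the earlier split):

```python
from featurewiz_polars import FeatureWiz

# Classic recursive-xgboost selection
wiz_classic = FeatureWiz(model_type='classification', classic=True)
wiz_classic.fit_transform(X_train, y_train)

# Newer split-driven selection
wiz_new = FeatureWiz(model_type='classification', classic=False)
wiz_new.fit_transform(X_train, y_train)

print('classic:', wiz_classic.selected_features)
print('new    :', wiz_new.selected_features)
print('common :', set(wiz_classic.selected_features) & set(wiz_new.selected_features))
```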
The new featurewiz-polars library uses an improved method for recursive_xgboost feature selection, known as Split-Driven Recursive_XGBoost. In this method, Polars is used under the hood to speed up calculations on large datasets, and the selection is additionally repeated over multiple train/validation splits to stabilize the chosen features.
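The general idea can be sketched as follows. This is a conceptual illustration only, not the library's actual implementation: the function name, voting scheme, and thresholds are all assumptions, and it assumes a recent xgboost with NumPy inputs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def split_driven_selection_sketch(X, y, feature_names,
                                  n_splits=5, top_k=10, min_votes=3):
    """Conceptual sketch: repeat XGBoost-based selection over several
    random train/validation splits, keeping features chosen consistently."""
    votes = {}
    for seed in range(n_splits):
        # A fresh train/validation split on every round
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        model = XGBClassifier(n_estimators=200, early_stopping_rounds=20,
                              verbosity=0)
        model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
        # Vote for the top-k features by importance on this split
        top = np.argsort(model.feature_importances_)[::-1][:top_k]
        for idx in top:
            votes[idx] = votes.get(idx, 0) + 1
    # Keep only features selected in enough independent splits
    return [feature_names[i] for i, v in votes.items() if v >= min_votes]
```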
If you are processing massive datasets with Polars' speed and efficiency while leveraging the power of featurewiz_polars to build high-quality MLOps workflows, I welcome your feedback and comments at rsesha2001 at yahoo dot com so I can make the library more useful to you in the months to come. Please star this repo, open a pull request, or report an issue. Every one of these makes this repo more useful and better for everyone!
Apache License 2.0
Copyright & All Rights Reserved 2025 Ram Seshadri