Pipeline components that support partial_fit.
The goal of scikit-partial is to offer a pipeline that can run `partial_fit`. This enables online learning on an entire pipeline.
You can install everything with pip:

```
python -m pip install --upgrade pip
python -m pip install scikit-partial
```
Assuming that you use a stateless featurizer in your pipeline, such as HashingVectorizer or the language models from whatlies, you can choose to pre-train your scikit-learn model beforehand and fine-tune it later using models that offer the `.partial_fit()` API. If you're unfamiliar with this API, you might appreciate this course on calmcode.
```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

from skpartial.pipeline import make_partial_pipeline

url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Construct a pipeline with components that are `.partial_fit()` compatible
pipe = make_partial_pipeline(HashingVectorizer(), SGDClassifier(loss="log"))

# Run the learning algorithm on batches of data
for i in range(10):
    # We could also do a whole bunch of data augmentation here!
    pipe.partial_fit(X, y, classes=[0, 1])
```
When is this pattern useful? Let's consider spelling errors. Suppose that we'd like our algorithm to be robust against typos. Then we can simulate typos on our `X` inside of our learning loop.
The following pipeline components are provided:
```python
from skpartial.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)
```
These tools allow you to declare pipelines that support `.partial_fit()`. Note that all components used in these pipelines need to have `.partial_fit()` implemented.