Deploying machine learning data pipelines and algorithms should not be a time-consuming or difficult task. MLeap allows data scientists and engineers to deploy machine learning pipelines from Spark and Scikit-learn to a portable format and execution engine.
Documentation is available at https://combust.github.io/mleap-docs/.
Read Serializing a Spark ML Pipeline and Scoring with MLeap to gain a full sense of what is possible.
Using the MLeap execution engine and serialization format, we provide a performant, portable and easy-to-integrate production library for machine learning data pipelines and algorithms.
For portability, we build our software on the JVM and only use serialization formats that are widely-adopted.
We also provide a high level of integration with existing technologies.
Performance, portability, and ease of integration with existing ML technologies are the guiding goals of this project.
Other versions besides those listed below may also work (especially more recent Java versions for the JRE), but these are the configurations that MLeap is tested against.
| MLeap Version | Spark Version | Scala Version | Java Version | Python Version | XGBoost Version | Tensorflow Version |
| --- | --- | --- | --- | --- | --- | --- |
| 0.23.3 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.2 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.1 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.0 | 3.4.0 | 2.12.13 | 11 | 3.7, 3.8 | 1.7.3 | 2.10.1 |
| 0.22.0 | 3.3.0 | 2.12.13 | 11 | 3.7, 3.8 | 1.6.1 | 2.7.0 |
| 0.21.1 | 3.2.0 | 2.12.13 | 11 | 3.7 | 1.6.1 | 2.7.0 |
| 0.21.0 | 3.2.0 | 2.12.13 | 11 | 3.6, 3.7 | 1.6.1 | 2.7.0 |
| 0.20.0 | 3.2.0 | 2.12.13 | 8 | 3.6, 3.7 | 1.5.2 | 2.7.0 |
| 0.19.0 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.3.1 | 2.4.1 |
| 0.18.1 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.0.0 | 2.4.1 |
| 0.18.0 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.0.0 | 2.4.1 |
| 0.17.0 | 2.4.5 | 2.11.12, 2.12.10 | 8 | 3.6, 3.7 | 1.0.0 | 1.11.0 |

To use the MLeap runtime, link with SBT:

libraryDependencies += "ml.combust.mleap" %% "mleap-runtime" % "0.23.3"
Or with Maven:

<dependency>
    <groupId>ml.combust.mleap</groupId>
    <artifactId>mleap-runtime_2.12</artifactId>
    <version>0.23.3</version>
</dependency>
For Spark integration, with SBT:

libraryDependencies += "ml.combust.mleap" %% "mleap-spark" % "0.23.3"
Or with Maven:

<dependency>
    <groupId>ml.combust.mleap</groupId>
    <artifactId>mleap-spark_2.12</artifactId>
    <version>0.23.3</version>
</dependency>
Install MLeap from PyPI with pip install mleap.
For more complete examples, see our other Git repository: MLeap Demos
Create and Export a Spark Pipeline

The first step is to create our pipeline in Spark. For our example we will manually build a simple Spark ML pipeline.
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.bundle.SparkBundleContext
import org.apache.spark.ml.feature.{Binarizer, StringIndexer}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.util.Using

val datasetName = "./examples/spark-demo.csv"

val dataframe: DataFrame = spark.sqlContext.read.format("csv")
  .option("header", true)
  .load(datasetName)
  .withColumn("test_double", col("test_double").cast("double"))

// Use out-of-the-box Spark transformers like you normally would
val stringIndexer = new StringIndexer().
  setInputCol("test_string").
  setOutputCol("test_index")

val binarizer = new Binarizer().
  setThreshold(0.5).
  setInputCol("test_double").
  setOutputCol("test_bin")

val pipelineEstimator = new Pipeline()
  .setStages(Array(stringIndexer, binarizer))

val pipeline = pipelineEstimator.fit(dataframe)

// then serialize pipeline
val sbc = SparkBundleContext().withDataset(pipeline.transform(dataframe))
Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bf =>
  pipeline.writeBundle.save(bf)(sbc).get
}
The dataset used for training can be found here
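If you later need to read the bundle back into Spark itself (rather than into the MLeap runtime, which is covered later), SparkSupport also exposes a load path. Below is a minimal sketch, assuming the loadSparkBundle helper on BundleFile; consult the serialization documentation for the authoritative API.

import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import scala.util.Using

// read the saved bundle back as a Spark ML transformer (sketch)
val sparkBundle = Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bf =>
  bf.loadSparkBundle().get
}.get
val loadedSparkPipeline = sparkBundle.root // an org.apache.spark.ml.Transformer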
Spark pipelines are not meant to be run outside of Spark. They require a DataFrame and therefore a SparkContext to run. These are expensive data structures and libraries to include in a project. With MLeap, there is no dependency on Spark to execute a pipeline. MLeap dependencies are lightweight and we use fast data structures to execute your ML pipelines.
Import the MLeap library in your PySpark job
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
See the PySpark Integration section of python/README.md for more.
Create and Export a Scikit-Learn Pipeline

import pandas as pd
from mleap.sklearn.pipeline import Pipeline
from mleap.sklearn.preprocessing.data import FeatureExtractor, LabelEncoder, ReshapeArrayToN1
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame(['a', 'b', 'c'], columns=['col_a'])
categorical_features = ['col_a']

feature_extractor_tf = FeatureExtractor(input_scalars=categorical_features,
                                        output_vector='imputed_features',
                                        output_vector_items=categorical_features)

# Label Encoder for x1 Label
label_encoder_tf = LabelEncoder(input_features=feature_extractor_tf.output_vector_items,
                                output_features='{}_label_le'.format(categorical_features[0]))

# Reshape the output of the LabelEncoder to N-by-1 array
reshape_le_tf = ReshapeArrayToN1()

# Vector Assembler for x1 One Hot Encoder
one_hot_encoder_tf = OneHotEncoder(sparse=False)
one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf,
                          output_features='{}_label_one_hot_encoded'.format(categorical_features[0]))

one_hot_encoder_pipeline_x0 = Pipeline([
    (feature_extractor_tf.name, feature_extractor_tf),
    (label_encoder_tf.name, label_encoder_tf),
    (reshape_le_tf.name, reshape_le_tf),
    (one_hot_encoder_tf.name, one_hot_encoder_tf)
])

one_hot_encoder_pipeline_x0.mlinit()
one_hot_encoder_pipeline_x0.fit_transform(data)
one_hot_encoder_pipeline_x0.serialize_to_bundle('/tmp', 'mleap-scikit-test-pipeline', init=True)
# array([[ 1.,  0.,  0.],
#        [ 0.,  1.,  0.],
#        [ 0.,  0.,  1.]])

Load and Transform Using MLeap
Because we export Spark and Scikit-learn pipelines to a standard format, we can use either our Spark-trained pipeline or our Scikit-learn pipeline from the previous steps to demonstrate usage of MLeap in this section. The choice is yours!
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import scala.util.Using

// load the Spark pipeline we saved in the previous section
val bundle = Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bundleFile =>
  bundleFile.loadMleapBundle().get
}.get

// create a simple LeapFrame to transform
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import ml.combust.mleap.core.types._

// MLeap makes extensive use of monadic types like Try
val schema = StructType(StructField("test_string", ScalarType.String),
  StructField("test_double", ScalarType.Double)).get
val data = Seq(Row("hello", 0.6), Row("MLeap", 0.2))
val frame = DefaultLeapFrame(schema, data)

// transform the dataframe using our pipeline
val mleapPipeline = bundle.root
val frame2 = mleapPipeline.transform(frame).get
val data2 = frame2.dataset

// get data from the transformed rows and make some assertions
assert(data2(0).getDouble(2) == 1.0) // string indexer output
assert(data2(0).getDouble(3) == 1.0) // binarizer output

// the second row
assert(data2(1).getDouble(2) == 2.0)
assert(data2(1).getDouble(3) == 0.0)
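Positional getters like getDouble(2) depend on column order; you can also project the transformed LeapFrame down to the generated columns by name. A minimal sketch, assuming LeapFrame's select helper (which, like transform, returns a Try):

// keep only the generated output columns and print the resulting rows (sketch)
val projected = frame2.select("test_index", "test_bin").get
projected.dataset.foreach(println)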
For additional guides and examples, please see our documentation.
Please ensure you have sbt 1.9.3, Java 11, and Scala 2.12.18 installed, then run:
git submodule update --init --recursive
sbt compile
Thank you to Swoop for supporting the XGBoost integration.
See LICENSE and NOTICE file in this repository.
Copyright 20 Combust, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.