Note
Estimated Reading Time: 15 minutes
Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your python code does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.
Note
It is possible to use a Jupyter notebook, but you will have to deploy a Jupyter server on the remote cluster head node and connect to it.
Starting and connecting to the cluster#This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs. You can check the Amazon EC2 pricing page.
It is possble to manually create AWS EC2 instances and configure them or just use the Ray CLI to create and initialize a Ray cluster on AWS using Modin’s Ray cluster setup config, which we are going to utilize in this example. Refer to Ray’s autoscaler options page on how to modify the file.
More details on how to launch a Ray cluster can be found on Ray’s cluster docs.
To start up the Ray cluster, run the following command in your terminal:
ray up modin-cluster.yaml
Once the head node has completed initialization, you can optionally connect to it by running the following command.
ray attach modin-cluster.yaml
To exit the ssh session and return back into your local shell session, type:
Executing in a cluster environment#Modin lets you instantly speed up your workflows with a large data by scaling pandas on a cluster. In this tutorial, we will use a 12.5 GB big_yellow.csv
file that was created by concatenating a 200MB NYC Taxi dataset file 64 times. Preparing this file was provided as part of our Modin’s Ray cluster setup config.
If you want to use the other dataset, you should provide it to each of the cluster nodes with the same path. We recomnend doing this by customizing the setup_commands
section of the Modin’s Ray cluster setup config.
To run any script in a remote cluster, you need to submit it to the Ray. In this way, the script file is sent to the the remote cluster head node and executed there.
In this tutorial, we provide the exercise_5.py script, which reads the data from the CSV file and executes such pandas operations as count, groupby and map. As the result, you will see the size of the file being read and the execution time of the entire script.
You can submit this script to the existing remote cluster by running the following command.
ray submit modin-cluster.yaml exercise_5.py
To download or upload files to the cluster head node, use ray rsync_down
or ray rsync_up
. It may help if you want to use some other Python modules that should be available to execute your own script or download a result file after executing the script.
# download a file from the cluster to the local machine: ray rsync_down modin-cluster.yaml '/path/on/cluster' '/local/path' # upload a file from the local machine to the cluster: ray rsync_up modin-cluster.yaml '/local/path' '/path/on/cluster'Shutting down the cluster#
Now that we have finished the computation, we need to shut down the cluster with ray down command.
ray down modin-cluster.yaml
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4