RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from http://cloud.google.com/vertex-ai/docs/workbench/instances/create-dataproc-enabled below:

Create a Dataproc Spark-enabled Vertex AI Workbench instance

Stay organized with collections Save and categorize content based on your preferences.

Create a Dataproc Spark-enabled instance Dataproc Serverless is now Google Cloud Serverless for Apache Spark. Until updated, documents, the Google Cloud console, and JupyterLab pages will refer to the previous name.

This page describes how to create a Dataproc Spark-enabled Vertex AI Workbench instance. This page also describes the benefits of the Dataproc JupyterLab extension and provides an overview on how to use the extension with Dataproc Serverless and Dataproc on Compute Engine.

Overview of the Dataproc JupyterLab extension

Vertex AI Workbench instances have the Dataproc JupyterLab extension preinstalled, as of version M113 and later.

Note: You can also install and use the Dataproc JupyterLab extension on your local machine or a Compute Engine VM instance.

The Dataproc JupyterLab extension provides two ways to run Apache Spark notebook jobs: Dataproc clusters and Dataproc Serverless.

Dataproc clusters include a rich set of features with control over the infrastructure that Spark runs on. You choose the size and configuration of your Spark cluster, allowing for customization and control over your environment. This approach is ideal for complex workloads, long-running jobs, and fine-grained resource management.
Dataproc Serverless eliminates infrastructure concerns. You submit your Spark jobs, and Google handles the provisioning, scaling, and optimization of resources behind the scenes. This serverless approach offers a cost-efficient option for data science and ML workloads.

With both options, you can use Spark for data processing and analysis. The choice between Dataproc clusters and Dataproc Serverless depends on your specific workload requirements, required level of control, and resource usage patterns.

Benefits of using Dataproc Serverless for data science and ML workloads include:

No cluster management: You don't need to worry about provisioning, configuring, or managing Spark clusters. This saves you time and resources.
Autoscaling: Dataproc Serverless automatically scales up and down based on the workload, so you only pay for the resources you use.
High performance: Dataproc Serverless is optimized for performance and takes advantage of Google Cloud's infrastructure.
Integration with other Google Cloud technologies: Dataproc Serverless integrates with other Google Cloud products, such as BigQuery and Dataplex Universal Catalog.

For more information, see the Dataproc Serverless documentation.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Go to project selector
Enable the Cloud Resource Manager, Dataproc, and Notebooks APIs.

Enable the APIs
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Go to project selector
Enable the Cloud Resource Manager, Dataproc, and Notebooks APIs.

Enable the APIs

Required roles

To ensure that the service account has the necessary permissions to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster, ask your administrator to grant the service account the following IAM roles:

Important: You must grant these roles to the service account, not to your user account. Failure to grant the roles to the correct principal might result in permission errors.

Dataproc Worker (roles/dataproc.worker) on your project
Dataproc Editor (roles/dataproc.editor) on the cluster for the dataproc.clusters.use permission

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster:

dataproc.agents.create
dataproc.agents.delete
dataproc.agents.get
dataproc.agents.update
dataproc.tasks.lease
dataproc.tasks.listInvalidatedLeases
dataproc.tasks.reportStatus
dataproc.clusters.use

Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.

Create an instance with Dataproc enabled

To create a Vertex AI Workbench instance with Dataproc enabled, do the following:

In the Google Cloud console, go to the Instances page.

Go to Instances
Click add_box Create new.
In the New instance dialog, click Advanced options.
In the Create instance dialog, in the Details section, make sure Enable Dataproc Serverless Interactive Sessions is selected.
Make sure Workbench type is set to Instance.
In the Environment section, make sure you use the latest version or a version numbered M113 or higher.
Click Create.

Vertex AI Workbench creates an instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.

Note: Specific network configurations could affect your ability to use the Dataproc extension. For more information on how to ensure that your network configuration is compatible, see Network configuration options. Open JupyterLab

Next to your instance's name, click Open JupyterLab.

The JupyterLab Launcher tab opens in your browser. By default it contains sections for Dataproc Serverless Notebooks and Dataproc Jobs and Sessions. If there are Jupyter-ready clusters in the selected project and region, there will be a section called Dataproc Cluster Notebooks.

Use the extension with Dataproc Serverless

Dataproc Serverless runtime templates that are in the same region and project as your Vertex AI Workbench instance appear in the Dataproc Serverless Notebooks section of the JupyterLab Launcher tab.

To create a runtime template, see Create a Dataproc Serverless runtime template.

To open a new serverless Spark notebook, click a runtime template. It takes about a minute for the remote Spark kernel to start. After the kernel starts, you can start coding.

Use the extension with Dataproc on Compute Engine

If you created a Dataproc on Compute Engine Jupyter cluster, the Launcher tab has a Dataproc Cluster Notebooks section.

Four cards appear for each Jupyter-ready Dataproc cluster that you have access to in that region and project.

To change the region and project, do the following:

Select Settings > Cloud Dataproc Settings.
On the Setup Config tab, under Project Info, change the Project ID and Region, and then click Save.

These changes don't take effect until you restart JupyterLab.
To restart JupyterLab, select File > Shut Down, and then click Open JupyterLab on the Vertex AI Workbench instances page.

To create a new notebook, click a card. After the remote kernel on the Dataproc cluster starts, you can start writing your code and then run it on your cluster.

Manage Dataproc on an instance using the gcloud CLI and the API

This section describes ways to manage Dataproc on a Vertex AI Workbench instance.

Change the region of your Dataproc cluster

Your Vertex AI Workbench instance's default kernels, such as Python and TensorFlow, are local kernels that run in the instance's VM. On a Dataproc Spark-enabled Vertex AI Workbench instance, your notebook runs on a Dataproc cluster through a remote kernel. The remote kernel runs on a service outside of your instance's VM, which lets you access any Dataproc cluster within the same project.

By default Vertex AI Workbench uses Dataproc clusters within the same region as your instance, but you can change the Dataproc region as long as the Component Gateway and the optional Jupyter component are enabled on the Dataproc cluster.

Test Access

The Dataproc JupyterLab extension is enabled by default for Vertex AI Workbench instances. To test access to Dataproc, you can check access to your instance's remote kernels by sending the following curl request to the kernels.googleusercontent.com domain:

curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://PROJECT_ID-dot-REGION.kernels.googleusercontent.com/api/kernelspecs | jq .

If the curl command fails, check to make sure that:

Your DNS entries are configured correctly.
There is a cluster available in the same project (or you will need to create one if it doesn't exist).
Your cluster has both the Component Gateway and the optional Jupyter component enabled.

Turn off Dataproc

Vertex AI Workbench instances are created with Dataproc enabled by default. You can create a Vertex AI Workbench instance with Dataproc turned off by setting the disable-mixer metadata key to true.

gcloud workbench instances create INSTANCE_NAME --metadata=disable-mixer=true

Enable Dataproc

You can enable Dataproc on a stopped Vertex AI Workbench instance by updating the metadata value.

gcloud workbench instances update INSTANCE_NAME --metadata=disable-mixer=false

Manage Dataproc using Terraform

Dataproc for Vertex AI Workbench instances on Terraform is managed using the disable-mixer key in the metadata field. Turn on Dataproc by setting the disable-mixer metadata key to false. Turn off Dataproc by setting the disable-mixer metadata key to true.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.

Troubleshoot

To diagnose and resolve issues related to creating a Dataproc Spark-enabled instance, see Troubleshooting Vertex AI Workbench.

What's next

For more information about the Dataproc JupyterLab extension, see Use the JupyterLab extension to develop serverless Spark workloads.
To learn more about Dataproc Serverless, see the Dataproc Serverless documentation
Learn how to run Dataproc Serverless workloads without provisioning and managing clusters.
To learn more about using Spark with Google Cloud products and services, see Spark on Google Cloud.
Browse the available Dataproc templates on GitHub.
Learn about Serverless Spark through the serverless-spark-workshop on GitHub.
Read the Apache Spark documentation.

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4