Cloud Storage connector


The Cloud Storage connector open source Java library lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.

Benefits of the Cloud Storage connector

Connector setup on Dataproc clusters

The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory. The following subsections describe steps you can take to complete connector setup on Dataproc clusters.

Note: To set up the connector on other clusters, see Non-Dataproc clusters.

VM service account

When you run the connector on Dataproc cluster nodes or on other Compute Engine VMs, the google.cloud.auth.service.account.enable property is set to false by default. This means you don't need to configure the VM service account credentials for the connector; the VM metadata server provides them.

The Dataproc VM service account must have permission to access your Cloud Storage bucket.
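For example, you can grant the VM service account access to a bucket with an IAM policy binding. The bucket name and service account email below are placeholders, and roles/storage.objectAdmin is one suitable role when jobs both read and write objects; adjust the role to your access needs.

```shell
# Grant the Dataproc VM service account object read/write access to a bucket.
# Replace BUCKET and SERVICE_ACCOUNT_EMAIL with your own values.
gcloud storage buckets add-iam-policy-binding gs://BUCKET \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectAdmin"
```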

User-selected connector versions

The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the image version pages. If your application depends on a non-default connector version deployed on your cluster, you can perform one of the following actions to use your selected connector version:
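One such approach, sketched below, is to pin a connector version at cluster creation with the public connectors initialization action. The initialization-action path, metadata key, and version number shown are taken from the dataproc-initialization-actions GitHub repository and are assumptions here; verify them against that repository before use.

```shell
# Sketch: pin a specific Cloud Storage connector version at cluster creation.
# The bucket path, metadata key, and version value below are examples; confirm
# current values in the dataproc-initialization-actions repository.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --initialization-actions=gs://goog-dataproc-initialization-actions-REGION/connectors/connectors.sh \
    --metadata=GCS_CONNECTOR_VERSION=2.2.21
```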

Connector setup on non-Dataproc clusters

You can take the following steps to set up the Cloud Storage connector on a non-Dataproc cluster, such as an Apache Hadoop or Spark cluster that you use to move on-premises HDFS data to Cloud Storage.

  1. Download the connector.

  2. Install the connector.

    Follow the GitHub instructions to install, configure, and test the Cloud Storage connector.
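As a sketch of those steps for a Hadoop 3 cluster: download the connector jar onto the Hadoop classpath, then register the gs:// scheme in core-site.xml. The gs://hadoop-lib/gcs/ download location and the property names below follow the connector's GitHub documentation; confirm them there for your Hadoop version.

```shell
# Download the Hadoop 3 build of the connector into the Hadoop classpath.
gcloud storage cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
    "${HADOOP_HOME}/share/hadoop/common/lib/"

# Then register the gs:// scheme in core-site.xml (XML shown as a comment):
#   <property>
#     <name>fs.gs.impl</name>
#     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
#   </property>
#   <property>
#     <name>fs.AbstractFileSystem.gs.impl</name>
#     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
#   </property>
```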

Connector usage

You can use the connector to access Cloud Storage data in the following ways:
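For example, once the connector is configured, standard Hadoop tooling can address Cloud Storage objects directly with gs:// URIs. The bucket and path names below are placeholders:

```shell
# List objects in a bucket with the Hadoop CLI.
hadoop fs -ls gs://BUCKET/

# Copy HDFS data into Cloud Storage with DistCp.
hadoop distcp hdfs:///data/logs gs://BUCKET/logs
```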

Java usage

The Cloud Storage connector requires Java 8.

The following is a sample Maven POM dependency management section for the Cloud Storage connector. Replace hadoopX-X.X.X with the connector version for your Hadoop version. For additional information, see Dependency Management.

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version>
    <scope>provided</scope>
</dependency>

For a shaded version:

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version>
    <scope>provided</scope>
    <classifier>shaded</classifier>
</dependency>
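The provided scope above assumes the connector jar is already on the cluster classpath, as it is on Dataproc. On other clusters, one option is to supply the shaded jar at submit time; the jar path, class name, and bucket below are hypothetical placeholders for illustration:

```shell
# Supply the shaded connector jar to a Spark job on a non-Dataproc cluster.
# The jar path, main class, and bucket are placeholders.
spark-submit \
    --jars /path/to/gcs-connector-hadoopX-X.X.X-shaded.jar \
    --class com.example.MyJob \
    my-job.jar gs://BUCKET/input
```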
Connector support

The Cloud Storage connector is supported by Google Cloud for use with Google Cloud products and use cases. When used with Dataproc, it is supported at the same level as Dataproc. For more information, see Get support.

Connect to Cloud Storage using gRPC

By default, the Cloud Storage connector on Dataproc uses the Cloud Storage JSON API. This section shows you how to enable the Cloud Storage connector to use gRPC.

Usage considerations

Using the Cloud Storage connector with gRPC includes the following considerations:

Requirements

The following requirements apply when using gRPC with the Cloud Storage connector:

Enable gRPC on the Cloud Storage connector

You can enable gRPC on the Cloud Storage connector at the cluster or job level. When enabled at the cluster level, Cloud Storage connector read requests use gRPC. When enabled at the job level instead, Cloud Storage connector read requests use gRPC for that job only.

Enable a cluster

To enable gRPC on the Cloud Storage connector at the cluster level, set the core:fs.gs.client.type=STORAGE_CLIENT property when you create a Dataproc cluster. Once gRPC is enabled at the cluster level, Cloud Storage connector read requests made by jobs running on the cluster use gRPC.

gcloud CLI example:

gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=core:fs.gs.client.type=STORAGE_CLIENT

Replace the following:

  * CLUSTER_NAME: a name for your Dataproc cluster.
  * PROJECT_ID: your Google Cloud project ID.
  * REGION: the region for the cluster, such as us-central1.

Enable a job

To enable gRPC on the Cloud Storage connector for a specific job, include --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT when you submit a job.

Example: Run a job on an existing cluster that uses gRPC to read from Cloud Storage.

  1. Create a local /tmp/line-count.py PySpark script that reads a Cloud Storage text file and outputs the number of lines in the file (gRPC is enabled for the job in step 4).

    cat <<EOF >"/tmp/line-count.py"
    #!/usr/bin/python
    import sys
    from pyspark.sql import SparkSession
    path = sys.argv[1]
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.text(path)
    line_count = df.count()
    print("There are {} lines in file: {}".format(line_count, path))
    EOF
    
  2. Create a local /tmp/line-count-sample.txt text file.

    cat <<EOF >"/tmp/line-count-sample.txt"
    Line 1
    Line 2
    line 3
    EOF
    
  3. Upload local /tmp/line-count.py and /tmp/line-count-sample.txt to your bucket in Cloud Storage.

    gcloud storage cp /tmp/line-count* gs://BUCKET
    
  4. Run the line-count.py job on your cluster. Set --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT to enable gRPC for Cloud Storage connector read requests.

    gcloud dataproc jobs submit pyspark gs://BUCKET/line-count.py \
    --cluster=CLUSTER_NAME \
    --project=PROJECT_ID  \
    --region=REGION \
    --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT \
    -- gs://BUCKET/line-count-sample.txt
    

    Replace the following:

      * BUCKET: the name of your Cloud Storage bucket.
      * CLUSTER_NAME: the name of your cluster.
      * PROJECT_ID: your Google Cloud project ID.
      * REGION: the region of your cluster.

Generate gRPC client-side metrics

You can configure the Cloud Storage connector to generate gRPC related metrics in Cloud Monitoring. The gRPC related metrics can help you to do the following:

For information about how to configure the Cloud Storage connector to generate gRPC related metrics, see Use gRPC client-side metrics.

Resources

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-10-13 UTC.


