The Cloud Storage connector open source Java library lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.
Benefits of the Cloud Storage connector

- HDFS compatibility: You can access your data in Cloud Storage using the gs:// prefix instead of hdfs://.
- Quick startup: In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time.

The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/
directory. The following subsections describe steps you can take to complete connector setup on Dataproc clusters.
When running the connector on Dataproc cluster nodes and other Compute Engine VMs, the google.cloud.auth.service.account.enable property is set to false by default, which means you don't need to configure the VM service account credentials for the connector; VM service account credentials are provided by the VM metadata server.
The Dataproc VM service account must have permission to access your Cloud Storage bucket.
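A minimal sketch of granting that permission follows; the bucket name and project number are placeholders, and this assumes the cluster uses the default Compute Engine service account (your cluster may use a custom service account instead):

```
# Sketch: grant the Dataproc VM service account read/write access to the
# bucket. PROJECT_NUMBER and BUCKET are placeholders; substitute your own.
gcloud storage buckets add-iam-policy-binding gs://BUCKET \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
```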
User-selected connector versions

The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the image version pages. If your application depends on a non-default connector version deployed on your cluster, you can create the cluster with the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by applications running on the cluster to the specified connector version.

Non-Dataproc clusters

You can take the following steps to set up the Cloud Storage connector on a non-Dataproc cluster, such as an Apache Hadoop or Spark cluster that you use to move on-premises HDFS data to Cloud Storage.
1. Download the connector.

   Connector versions are available in a Cloud Storage bucket (using a latest version is not recommended for production applications). Connector jars follow the gcs-connector-HADOOP_VERSION-CONNECTOR_VERSION.jar name pattern, for example, gs://hadoop-lib/gcs/gcs-connector-hadoop2-2.1.1.jar. Shaded jars include a -shaded suffix in the name.

2. Install the connector.
Follow the GitHub instructions to install, configure, and test the Cloud Storage connector.
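Configuration typically happens in the cluster's core-site.xml. The following is a minimal sketch based on the connector's GitHub install instructions; the keyfile path is illustrative, and the exact property values can vary by connector version, so treat this as a starting point rather than a definitive configuration:

```xml
<!-- core-site.xml (sketch): register the gs:// scheme and credentials. -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile.json</value>
</property>
```

On non-Google Cloud hosts there is no VM metadata server, which is why a service account keyfile is configured explicitly here.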
You can use the connector to access Cloud Storage data in the following ways:

- In a Spark or Hadoop job, using the gs:// prefix, for example: gs://bucket/dir/file
- From the Hadoop shell, for example: hadoop fs -ls gs://bucket/dir/file
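For orientation, a gs:// URI names a bucket followed by an object path. The following is a local illustration only (placeholder bucket and object names), not a connector feature:

```shell
# Sketch: anatomy of a gs:// path (placeholder bucket and object names).
URI="gs://bucket/dir/file"
REST="${URI#gs://}"          # strip the scheme -> "bucket/dir/file"
echo "bucket: ${REST%%/*}"   # first segment is the bucket
echo "object: ${REST#*/}"    # remainder is the object name
```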
The Cloud Storage connector requires Java 8.
The following is a sample Maven POM dependency management section for the Cloud Storage connector. For additional information, see Dependency Management.
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version>
    <scope>provided</scope>
</dependency>
For a shaded version:
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version>
    <scope>provided</scope>
    <classifier>shaded</classifier>
</dependency>

Connector support
The Cloud Storage connector is supported by Google Cloud for use with Google Cloud products and use cases. When used with Dataproc, it is supported at the same level as Dataproc. For more information, see Get support.
Connect to Cloud Storage using gRPC

By default, the Cloud Storage connector on Dataproc uses the Cloud Storage JSON API. This section shows you how to enable the Cloud Storage connector to use gRPC.
Usage considerations

The following requirements apply when using gRPC with the Cloud Storage connector:
- Your Dataproc cluster VPC network must support direct connectivity. This means that the network's routes and firewall rules must allow egress traffic to reach 34.126.0.0/18 and 2001:4860:8040::/42.
- When creating a Dataproc cluster, you must use Cloud Storage connector version 2.2.23 or later with image version 2.1.56 or later, or Cloud Storage connector version 3.0.0 or later with image version 2.2.0 or later. The Cloud Storage connector version installed on each Dataproc image version is listed in the Dataproc image version pages.
- For Dataproc on GKE, GKE node pool version 1.28.5-gke.1199000 with gke-metadata-server 0.4.285 is recommended. This combination supports direct connectivity.

You or your organization administrator must grant Identity and Access Management roles that include the permissions necessary to set up and make gRPC requests to the Cloud Storage connector.
You can enable gRPC on the Cloud Storage connector at the cluster or job level. Once enabled on the cluster, Cloud Storage connector read requests use gRPC. If enabled on a job instead of at the cluster level, Cloud Storage connector read requests use gRPC for that job only.
Enable gRPC on a cluster

To enable gRPC on the Cloud Storage connector at the cluster level, set the core:fs.gs.client.type=STORAGE_CLIENT property when you create a Dataproc cluster. Once gRPC is enabled at the cluster level, Cloud Storage connector read requests made by jobs running on the cluster use gRPC.
gcloud CLI example:
gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=core:fs.gs.client.type=STORAGE_CLIENT
Replace CLUSTER_NAME with a name for your cluster, PROJECT_ID with your Google Cloud project ID, and REGION with the cluster region.
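To confirm that the property landed on the cluster, one option is to inspect the cluster's software configuration with gcloud's describe command (a sketch; the format expression assumes the standard Dataproc cluster resource fields):

```
# Sketch: print the cluster properties and check for
# fs.gs.client.type=STORAGE_CLIENT in the output.
gcloud dataproc clusters describe CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --format="value(config.softwareConfig.properties)"
```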
Enable gRPC on a job

To enable gRPC on the Cloud Storage connector for a specific job, include --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT when you submit a job.
Example: Run a job on an existing cluster that uses gRPC to read from Cloud Storage.
1. Create a local /tmp/line-count.py PySpark script that uses gRPC to read a Cloud Storage text file and output the number of lines in the file.
cat <<EOF >"/tmp/line-count.py"
#!/usr/bin/python
import sys
from pyspark.sql import SparkSession

path = sys.argv[1]
spark = SparkSession.builder.getOrCreate()
rdd = spark.read.text(path)
lines_counter = rdd.count()
print("There are {} lines in file: {}".format(lines_counter, path))
EOF
2. Create a local /tmp/line-count-sample.txt text file.
cat <<EOF >"/tmp/line-count-sample.txt"
Line 1
Line 2
line 3
EOF
3. Upload local /tmp/line-count.py and /tmp/line-count-sample.txt to your bucket in Cloud Storage.
gcloud storage cp /tmp/line-count* gs://BUCKET
4. Run the line-count.py job on your cluster. Set --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT to enable gRPC for Cloud Storage connector read requests.
gcloud dataproc jobs submit pyspark gs://BUCKET/line-count.py \
    --cluster=CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT \
    -- gs://BUCKET/line-count-sample.txt
Replace BUCKET with the name of your Cloud Storage bucket, CLUSTER_NAME with your cluster name, PROJECT_ID with your Google Cloud project ID, and REGION with the cluster region.
You can configure the Cloud Storage connector to generate gRPC-related metrics in Cloud Monitoring.
For information about how to configure the Cloud Storage connector to generate gRPC-related metrics, see Use gRPC client-side metrics.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-10-13 UTC.