
Dataflow overview


Dataflow is a Google Cloud service that provides unified stream and batch data processing at scale. Use Dataflow to create data pipelines that read from one or more sources, transform the data, and write the data to a destination.

Typical use cases for Dataflow include the following:

  - Moving data between systems
  - ETL (extract-transform-load) workflows that load data into a data warehouse such as BigQuery
  - Powering BI dashboards
  - Real-time streaming analytics

Dataflow uses the same programming model for both batch and stream analytics. Streaming pipelines can achieve very low latency. You can ingest, process, and analyze fluctuating volumes of real-time data. By default, Dataflow guarantees exactly-once processing of every record. For streaming pipelines that can tolerate duplicates, you can often reduce cost and improve latency by enabling at-least-once mode.
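For example, with the Apache Beam Python SDK, at-least-once mode is set through a Dataflow service option when you construct the pipeline options. The following is a minimal sketch; the project, region, and bucket values are placeholders, and the exact service option name should be confirmed against the current Dataflow documentation:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: opt a streaming pipeline into at-least-once mode on Dataflow.
options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",   # placeholder bucket
    dataflow_service_options=["streaming_mode_at_least_once"],
)
```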

Advantages of Dataflow

This section describes some of the advantages of using Dataflow.

Managed

Dataflow is a fully managed service. That means Google manages all of the resources needed to run Dataflow. When you run a Dataflow job, the Dataflow service allocates a pool of worker VMs to execute the pipeline. You don't need to provision or manage these VMs. When the job completes or is cancelled, Dataflow automatically deletes the VMs. You're billed for the compute resources that your job uses. For more information about costs, see Dataflow pricing.

Scalable

Dataflow is designed to support batch and streaming pipelines at large scale. Data is processed in parallel, so the work is distributed across multiple VMs.

Dataflow can autoscale by provisioning extra worker VMs, or by shutting down some worker VMs if fewer are needed. It also optimizes the work, based on the characteristics of the pipeline. For example, Dataflow can dynamically rebalance work among the VMs, so that parallel work completes more efficiently.
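For example, when you launch a pipeline with the Beam Python SDK, you can cap how far the worker pool scales. This is a minimal sketch using the SDK's worker options; the specific values are only illustrative:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative autoscaling bounds for a Dataflow job; tune to your workload.
options = PipelineOptions(
    runner="DataflowRunner",
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale based on throughput
    num_workers=2,                             # initial worker count
    max_num_workers=50,                        # upper bound for autoscaling
)
```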

Portable

Dataflow is built on the open source Apache Beam project. Apache Beam lets you write pipelines using a language-specific SDK. Apache Beam supports Java, Python, and Go SDKs, as well as multi-language pipelines.

Dataflow executes Apache Beam pipelines. If you decide later to run your pipeline on a different platform, such as Apache Flink or Apache Spark, you can do so without rewriting the pipeline code.
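For example, the same Beam pipeline code can target a local runner during development and Dataflow (or another runner) later, with only the runner option changing. A minimal sketch in the Python SDK:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(runner_name):
    # Only the runner option changes; the pipeline code stays the same.
    options = PipelineOptions(runner=runner_name)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["a", "b", "a"])
         | beam.combiners.Count.PerElement()
         | beam.Map(print))

run("DirectRunner")      # run locally for testing
# run("DataflowRunner")  # same code, executed on Dataflow
# run("FlinkRunner")     # or on Apache Flink
```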

Flexible

You can use Dataflow for relatively simple pipelines, such as moving data. However, it's also suitable for more advanced applications, such as real-time streaming analytics. A solution built on Dataflow can grow with your needs as you move from batch to streaming or encounter more advanced use cases.

Dataflow supports several different ways to create and execute pipelines, depending on your needs:

  - Run a Google-provided template for common data movement scenarios, optionally customizing it with user-defined functions (UDFs).
  - Write your own pipeline with the Apache Beam SDK for more complex processing.

Observable

You can monitor the status of your Dataflow jobs through the Dataflow monitoring interface in the Google Cloud console. The monitoring interface includes a graphical representation of your pipeline, showing the progress and execution details of each pipeline stage. The monitoring interface makes it easier to spot problems such as bottlenecks or high latency. You can also profile your Dataflow jobs to monitor CPU usage and memory allocation.
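For example, profiling can be enabled through a Dataflow service option when the job is launched. A minimal sketch with the Python SDK; the job name is a placeholder, and the service option name should be verified against the current Dataflow documentation:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: opt the job into Cloud Profiler so CPU and memory usage can be
# inspected alongside the Dataflow monitoring interface.
options = PipelineOptions(
    runner="DataflowRunner",
    job_name="my-observable-job",  # placeholder job name
    dataflow_service_options=["enable_google_cloud_profiler"],
)
```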

How it works

Dataflow uses a data pipeline model, where data moves through a series of stages. Stages can include reading data from a source, transforming and aggregating the data, and writing the results to a destination.

Pipelines can range from very simple to more complex processing. For example, a pipeline might do the following:

  - Read raw data from a source such as Pub/Sub.
  - Transform the data, for example by filtering or aggregating records.
  - Write the results to a destination such as BigQuery.
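A minimal sketch of such a pipeline with the Apache Beam Python SDK follows. It reads and writes text files to stay self-contained; a real pipeline might use the Pub/Sub or BigQuery connectors instead, and the bucket paths are placeholders:

```python
import apache_beam as beam

# Read, transform/aggregate, write -- the three stages described above.
with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "SplitWords" >> beam.FlatMap(lambda line: line.split())
     | "CountWords" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))
```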

A pipeline defined in Apache Beam doesn't specify how it is executed. Running the pipeline is the job of a runner, which executes the pipeline on a specific platform. Apache Beam supports multiple runners, including a Dataflow runner.

To use Dataflow with your Apache Beam pipelines, specify the Dataflow runner. The runner uploads your executable code and dependencies to a Cloud Storage bucket and creates a Dataflow job. Dataflow then allocates a pool of VMs to execute the pipeline.
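For example, with the Python SDK the runner and the resources Dataflow needs are passed as pipeline options. This is a minimal sketch; the project, region, and bucket names are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",                    # run on Dataflow, not locally
    project="my-project",                       # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",        # placeholder buckets for
    staging_location="gs://my-bucket/staging",  # temp and staged code
)

# Any Beam pipeline can be launched with these options.
with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * x)
     | beam.Map(print))
```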

A typical ETL and BI solution using Dataflow and other Google Cloud services includes the following stages:

  1. Pub/Sub ingests data from an external system.
  2. Dataflow reads the data from Pub/Sub and writes it to BigQuery. During this stage, Dataflow might transform or aggregate the data.
  3. BigQuery acts as a data warehouse, allowing data analysts to run ad hoc queries on the data.
  4. Looker provides real-time BI insights from the data stored in BigQuery.
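A sketch of stage 2 in the Python SDK is shown below: a streaming pipeline that reads messages from Pub/Sub and writes rows to BigQuery. The topic, table, schema, and message format are assumptions for illustration:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True, runner="DataflowRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/my-topic")    # placeholder topic
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           table="my-project:my_dataset.events",           # placeholder table
           schema="event_id:STRING,event_time:TIMESTAMP",  # placeholder schema
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```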

For basic data movement scenarios, you might run a Google-provided template. Some templates support user-defined functions (UDFs) written in JavaScript. UDFs let you add custom processing logic to a template. For more complex pipelines, start with the Apache Beam SDK.

