org.apache.spark.sql.Dataset<T>
Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations produce new Datasets; actions trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
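As a rough illustration of the laziness described above, the following Scala sketch builds a filtered Dataset, prints its plans with explain, and only then triggers execution with count. The local master setting, the application name, and the object and method names are assumptions made for this example; they are not part of the Dataset API.

```scala
import org.apache.spark.sql.SparkSession

object LazyPlanSketch {
  // Hypothetical helper: counts the even ids in [0, 100) on a local session.
  def countEvens(): Long = {
    val spark = SparkSession.builder()
      .master("local[1]")            // assumption: run locally with one thread
      .appName("lazy-plan-sketch")   // assumption: arbitrary application name
      .getOrCreate()
    import spark.implicits._

    // Transformation only: this builds a logical plan, no job runs yet.
    val evens = spark.range(0, 100).filter($"id" % 2 === 0)

    // Prints the logical plans and the physical plan to the console.
    evens.explain(true)

    // Action: the plan is optimized and executed here.
    val n = evens.count()
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit =
    println(countEvens()) // 50
}
```

Nothing is computed until count is called; explain only inspects the plan.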
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficient data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
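A minimal sketch of the Person example, assuming a Scala case class stands in for the class described in the text. It derives an encoder with Encoders.product and prints the schema that the encoder maps Person to. The object name and the val name are hypothetical.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

object EncoderSketch {
  // The Person class from the text: name (string) and age (int).
  case class Person(name: String, age: Int)

  // Derive an encoder for Person; no SparkSession is needed for this step.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def main(args: Array[String]): Unit = {
    // The encoder's schema lists the columns Spark uses to represent Person.
    personEncoder.schema.printTreeString()
    println(personEncoder.schema.fieldNames.mkString(", ")) // name, age
  }
}
```

In typical Scala code the encoder is supplied implicitly through spark.implicits._, so it rarely has to be constructed by hand as it is here.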
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
val people = spark.read.parquet("...").as[Person] // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a map to the existing one:
val names = people.map(_.name) // in Scala; names is a Dataset[String]
Dataset<String> names = people.map(
(MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age"); // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
people.filter(people.col("age").gt(30))
.join(department, people.col("deptId").equalTo(department.col("id")))
.groupBy(department.col("name"), people.col("gender"))
.agg(avg(people.col("salary")), max(people.col("age")));
Constructors
(Java-specific) Aggregates on the entire Dataset without groups.
Aggregates on the entire Dataset without groups.
Aggregates on the entire Dataset without groups.
(Scala-specific) Aggregates on the entire Dataset without groups.
(Scala-specific) Aggregates on the entire Dataset without groups.
Returns a new Dataset with an alias set.
alias(scala.Symbol alias)
(Scala-specific) Returns a new Dataset with an alias set.
Selects column based on the column name and returns it as a Column.
Returns a new Dataset with an alias set.
Returns a new Dataset where each record has been mapped on to the specified type.
(Scala-specific) Returns a new Dataset with an alias set.
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
Eagerly checkpoint a Dataset and return the new Dataset.
Returns a checkpointed version of this Dataset.
Returns a new Dataset that has exactly numPartitions partitions when fewer partitions are requested.
Selects column based on the column name and returns it as a Column.
Returns an array that contains all rows in this Dataset.
Returns a Java list that contains all rows in this Dataset.
Selects column based on the column name specified as a regex and returns it as a Column.
Returns all column names as an array.
abstract long
Returns the number of rows in the Dataset.
void
Creates a global temporary view using the given name.
void
Creates or replaces a global temporary view using the given name.
void
Creates a local temporary view using the given name.
void
Creates a local temporary view using the given name.
Explicit cartesian join with another DataFrame.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.
Returns a new Dataset that contains only the unique rows from this Dataset.
Returns a new Dataset with a column dropped.
Returns a new Dataset with columns dropped.
Returns a new Dataset with column dropped.
Returns a new Dataset with columns dropped.
Returns a new Dataset with columns dropped.
drop(scala.collection.immutable.Seq<String> colNames)
Returns a new Dataset with columns dropped.
Returns a new Dataset that contains only the unique rows from this Dataset.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
Returns a new Dataset with duplicate rows removed, within watermark.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within watermark.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within watermark.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within watermark.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within watermark.
Returns all column names and their data types as an array.
Returns a new Dataset containing rows in this Dataset but not in another Dataset.
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates.
Returns a Column object for an EXISTS subquery.
void
Prints the physical plan to the console for debugging purposes.
void
Prints the plans (logical and physical) to the console for debugging purposes.
abstract void
Prints the plans (logical and physical) with a format specified by a given explain mode.
explode(String inputColumn, String outputColumn, scala.Function1<A,scala.collection.IterableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$4)
explode(scala.collection.immutable.Seq<Column> input, scala.Function1<Row,scala.collection.IterableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$3)
Filters rows using the given SQL expression.
(Java-specific) Returns a new Dataset that only contains elements where func returns true.
Filters rows using the given condition.
(Scala-specific) Returns a new Dataset that only contains elements where func returns true.
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
flatMap(scala.Function1<T,scala.collection.IterableOnce<U>> func, Encoder<U> evidence$7)
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
void
(Java-specific) Runs func
on each element of this Dataset.
void
foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
Applies a function f
to all rows.
void
(Java-specific) Runs func
on each partition of this Dataset.
abstract void
foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
Applies a function f
to each partition of this Dataset.
Groups the Dataset using the specified columns, so that we can run aggregation on them.
Groups the Dataset using the specified columns, so that we can run aggregation on them.
Groups the Dataset using the specified columns, so we can run aggregation on them.
Groups the Dataset using the specified columns, so we can run aggregation on them.
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them.
groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, scala.collection.immutable.Seq<Column> cols)
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them.
Returns the first n rows.
Specifies some hint on the current Dataset.
Specifies some hint on the current Dataset.
Returns a best-effort snapshot of the files that compose this Dataset.
Returns a new Dataset containing rows only in both this Dataset and another Dataset.
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates.
abstract boolean
Returns true if the Dataset is empty.
abstract boolean
Returns true if the collect and take methods can be run locally (without any Spark executors).
abstract boolean
Returns true if this Dataset contains one or more sources that continuously return data as it arrives.
Returns the content of the Dataset as a JavaRDD of Ts.
Join with another DataFrame.
Inner equi-join with another DataFrame using the given column.
(Java-specific) Inner equi-join with another DataFrame using the given columns.
(Java-specific) Equi-join with another DataFrame using the given columns.
Equi-join with another DataFrame using the given column.
Inner join with another DataFrame, using the given join expression.
Join with another DataFrame, using the given join expression.
(Scala-specific) Inner equi-join with another DataFrame using the given columns.
(Scala-specific) Equi-join with another DataFrame using the given columns.
Joins this Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
Lateral join with another DataFrame.
Lateral join with another DataFrame.
Lateral join with another DataFrame.
Lateral join with another DataFrame.
Returns a new Dataset by taking the first n rows.
Eagerly locally checkpoints a Dataset and returns the new Dataset.
Locally checkpoints a Dataset and returns the new Dataset.
Locally checkpoints a Dataset and returns the new Dataset.
(Java-specific) Returns a new Dataset that contains the result of applying func to each element.
(Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
(Java-specific) Returns a new Dataset that contains the result of applying f to each partition.
mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$6)
(Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
Merges a set of updates, insertions, and deletions based on a source table into a target table.
Selects a metadata column based on its logical column name, and returns it as a Column.
Define (named) metrics to observe on the Dataset.
Define (named) metrics to observe on the Dataset.
Observe (named) metrics through an org.apache.spark.sql.Observation instance.
Observe (named) metrics through an org.apache.spark.sql.Observation instance.
Returns a new Dataset by skipping the first n rows.
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions.
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
Persist this Dataset with the given storage level.
void
Prints the schema to the console in a nice tree format.
void
Prints the schema up to the given level to the console in a nice tree format.
abstract org.apache.spark.sql.execution.QueryExecution
Randomly splits this Dataset with the provided weights.
Randomly splits this Dataset with the provided weights.
Returns a Java list that contains randomly split Dataset with the provided weights.
Represents the content of the Dataset as an RDD of T.
(Java-specific) Reduces the elements of this Dataset using the specified binary function.
(Scala-specific) Reduces the elements of this Dataset using the specified binary function.
void
Returns a new Dataset that has exactly numPartitions partitions.
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions.
repartition(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions.
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions.
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions.
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
abstract boolean
Returns true when the logical query plans inside both Datasets are equal and therefore return the same results.
sample(boolean withReplacement, double fraction)
Returns a new Dataset by sampling a fraction of rows, using a random seed.
sample(boolean withReplacement, double fraction, long seed)
Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.
sample(double fraction, long seed)
Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.
Returns a Column object for a SCALAR subquery containing exactly one row and one column.
Returns the schema of this Dataset.
Selects a set of columns.
Selects a set of columns.
Selects a set of column based expressions.
Returns a new Dataset by computing the given Column expression for each element.
<U1, U2> Dataset<scala.Tuple2<U1,U2>>
Returns a new Dataset by computing the given Column expressions for each element.
<U1, U2, U3> Dataset<scala.Tuple3<U1,U2,U3>>
Returns a new Dataset by computing the given Column expressions for each element.
<U1, U2, U3, U4>
Dataset<scala.Tuple4<U1,U2,U3,U4>>
Returns a new Dataset by computing the given Column expressions for each element.
<U1, U2, U3, U4, U5>
Dataset<scala.Tuple5<U1,U2,U3,U4,U5>>
Returns a new Dataset by computing the given Column expressions for each element.
Selects a set of column based expressions.
Selects a set of SQL expressions.
Selects a set of SQL expressions.
abstract int
Returns a hashCode of the logical query plan against this Dataset.
void
Displays the top 20 rows of Dataset in a tabular form.
void
Displays the top 20 rows of Dataset in a tabular form.
void
Displays the Dataset in a tabular form.
abstract void
show(int numRows, boolean truncate)
Displays the Dataset in a tabular form.
void
show(int numRows, int truncate)
Displays the Dataset in a tabular form.
abstract void
show(int numRows, int truncate, boolean vertical)
Displays the Dataset in a tabular form.
Returns a new Dataset sorted by the specified column, all in ascending order.
Returns a new Dataset sorted by the specified column, all in ascending order.
Returns a new Dataset sorted by the given expressions.
sort(scala.collection.immutable.Seq<Column> sortExprs)
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
Computes specified statistics for numeric and string columns.
Computes specified statistics for numeric and string columns.
Returns the last n rows in the Dataset.
Returns the first n rows in the Dataset.
Returns the first n rows in the Dataset as a list.
Returns a new DataFrame where each row is reconciled to match the specified schema.
Converts this strongly typed collection of data to a generic DataFrame.
Converts this strongly typed collection of data to a generic DataFrame with columns renamed.
toDF(scala.collection.immutable.Seq<String> colNames)
Converts this strongly typed collection of data to a generic DataFrame with columns renamed.
Returns the content of the Dataset as a JavaRDD of Ts.
Returns the content of the Dataset as a Dataset of JSON strings.
Returns an iterator that contains all rows in this Dataset.
Concise syntax for chaining custom transformations.
Transposes a DataFrame, switching rows to columns.
Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
Filters rows using the given SQL expression.
Filters rows using the given condition.
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Returns a new Dataset with a column renamed.
(Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that have the same names.
(Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that have the same names.
(Java-specific) Returns a new Dataset with columns renamed.
(Scala-specific) Returns a new Dataset with columns renamed.
Returns a new Dataset by updating an existing column with metadata.
Defines an event time watermark for this Dataset.
Interface for saving the content of the non-streaming Dataset out into external storage.
Interface for saving the content of the streaming Dataset out into external storage.
Create a write configuration builder for v2 sources.
public Dataset()
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
expr
- (undocumented)
exprs
- (undocumented)
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg("age" -> "max", "salary" -> "avg")
ds.groupBy().agg("age" -> "max", "salary" -> "avg")
aggExpr
- (undocumented)
aggExprs
- (undocumented)
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
exprs
- (undocumented)
(Java-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
exprs
- (undocumented)
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
expr
- (undocumented)
exprs
- (undocumented)
Returns a new Dataset with an alias set. Same as the as method.
alias
- (undocumented)
(Scala-specific) Returns a new Dataset with an alias set. Same as the as method.
alias
- (undocumented)
Selects column based on the column name and returns it as a Column.
colName
- (undocumented)
a.b
.
Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
If U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
If U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
If U is a primitive type (e.g. String, Int), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
evidence$1
- (undocumented)
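The mapping rules for as can be sketched in Scala as follows. The object name, the sample row, and the local master setting are assumptions made for this example. For the primitive case the sketch selects a single column first, so it does not depend on how extra columns are treated.

```scala
import org.apache.spark.sql.SparkSession

object AsSketch {
  // Hypothetical domain class whose field names match the column names.
  case class Person(name: String, age: Int)

  def run(): (Person, (String, Int), String) = {
    val spark = SparkSession.builder()
      .master("local[1]")     // assumption: run locally with one thread
      .appName("as-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 31)).toDF("name", "age")

    // Class: columns are matched to fields by name.
    val asClass = df.as[Person].head()
    // Tuple: columns are matched by ordinal (name -> _1, age -> _2).
    val asTuple = df.as[(String, Int)].head()
    // Primitive: a single selected column is read as a String.
    val asPrimitive = df.select("name").as[String].head()

    spark.stop()
    (asClass, asTuple, asPrimitive)
  }

  def main(args: Array[String]): Unit =
    println(run()) // (Person(Ann,31),(Ann,31),Ann)
}
```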
alias
- (undocumented)
alias
- (undocumented)
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
eager
- Whether to checkpoint this dataframe immediately
Returns a new Dataset that has exactly numPartitions partitions when fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency: e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
numPartitions
- (undocumented)
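A small Scala sketch of the coalesce behavior described above, contrasted with repartition. The starting partition count of 8, the object and method names, and the local master setting are assumptions chosen for this example.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceSketch {
  // Returns the partition counts after coalesce(2), coalesce(16) and repartition(16).
  def partitionCounts(): (Int, Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")        // assumption: run locally with two threads
      .appName("coalesce-sketch")
      .getOrCreate()

    // Start from a Dataset with 8 partitions.
    val ds = spark.range(0, 1000, 1, 8)

    // Fewer partitions requested: narrow dependency, no shuffle.
    val fewer = ds.coalesce(2).rdd.getNumPartitions
    // More partitions requested: coalesce keeps the current count.
    val unchanged = ds.coalesce(16).rdd.getNumPartitions
    // repartition adds a shuffle and can increase the count.
    val more = ds.repartition(16).rdd.getNumPartitions

    spark.stop()
    (fewer, unchanged, more)
  }

  def main(args: Array[String]): Unit =
    println(partitionCounts()) // (2,8,16)
}
```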
Selects column based on the column name and returns it as a Column.
colName
- (undocumented)
a.b
.
Selects column based on the column name specified as a regex and returns it as a Column.
colName
- (undocumented)
Returns an array that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For the Java API, use collectAsList().
Returns a Java list that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
public abstract long count()
Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
A global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It is tied to a system-preserved database, global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
viewName
- (undocumented)
AnalysisException
- if the view name is invalid or already exists
Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
A global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It is tied to a system-preserved database, global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
viewName
- (undocumented)
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
viewName
- (undocumented)
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
A local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It is not tied to any database, i.e. we can't use db1.view1 to reference a local temporary view.
viewName
- (undocumented)
AnalysisException
- if the view name is invalid or already exists
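The difference in scope between global and local temporary views can be sketched in Scala as follows. The view name local1, the object name, and the local master setting are assumptions made for this example; view1 and global_temp come from the text above. The createOrReplace variants are used so the sketch can be rerun without an AnalysisException for an existing view.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  // Returns (rows read from the global view in a second session,
  //          whether the local view is visible in that second session).
  def run(): (Long, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: run locally with one thread
      .appName("temp-view-sketch")
      .getOrCreate()

    // Global view: registered under the global_temp database.
    spark.range(3).createOrReplaceGlobalTempView("view1")
    // Local view: scoped to the session that created it.
    spark.range(3).createOrReplaceTempView("local1")

    // A second session in the same application.
    val other = spark.newSession()
    val globalRows = other.sql("SELECT * FROM global_temp.view1").count()
    val localVisible = other.catalog.tableExists("local1")

    spark.stop()
    (globalRows, localVisible)
  }

  def main(args: Array[String]): Unit =
    println(run()) // (3,false)
}
```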
Explicit cartesian join with another DataFrame.
right
- Right side of the join operation.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)
cols
- (undocumented)
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)
cols
- (undocumented)
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
ds.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
Use summary(java.lang.String...) for expanded statistics and control over which statistics to compute.
cols
- Columns to compute statistics on.
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
ds.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
Use summary(java.lang.String...) for expanded statistics and control over which statistics to compute.
cols
- Columns to compute statistics on.
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
Note that for a streaming Dataset, this method returns distinct rows only once regardless of the output mode, so the behavior may differ from DISTINCT in SQL against a streaming Dataset.
equals
function defined on T
.
Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).
This method can only be used to drop top-level columns. The colName string is treated literally without further interpretation.
colNames
- (undocumented)
Returns a new Dataset with columns dropped.
This method can only be used to drop top-level columns. This is a no-op if the Dataset doesn't have columns with equivalent expressions.
col
- (undocumented)
cols
- (undocumented)
Returns a new Dataset with a column dropped. This is a no-op if the schema doesn't contain the column name.
This method can only be used to drop top-level columns. The colName string is treated literally, without further interpretation.
Note: drop(colName) has different semantics from drop(col(colName)). For example:
1. Multiple columns have the same colName:
val df1 = spark.range(0, 2).withColumn("key1", lit(1))
val df2 = spark.range(0, 2).withColumn("key2", lit(2))
val df3 = df1.join(df2)
df3.show
// +---+----+---+----+
// | id|key1| id|key2|
// +---+----+---+----+
// | 0| 1| 0| 2|
// | 0| 1| 1| 2|
// | 1| 1| 0| 2|
// | 1| 1| 1| 2|
// +---+----+---+----+
df3.drop("id").show()
// output: the two 'id' columns are both dropped.
// |key1|key2|
// +----+----+
// | 1| 2|
// | 1| 2|
// | 1| 2|
// | 1| 2|
// +----+----+
df3.drop(col("id")).show()
// ...AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous...
2. colName contains special characters, such as a dot:
val df = spark.range(0, 2).withColumn("a.b.c", lit(1))
df.show()
// +---+-----+
// | id|a.b.c|
// +---+-----+
// | 0| 1|
// | 1| 1|
// +---+-----+
df.drop("a.b.c").show()
// +---+
// | id|
// +---+
// | 0|
// | 1|
// +---+
df.drop(col("a.b.c")).show()
// no column matches the expression 'a.b.c'
// +---+-----+
// | id|a.b.c|
// +---+-----+
// | 0| 1|
// | 1| 1|
// +---+-----+
colName
- (undocumented)
Returns a new Dataset with a column dropped.
This method can only be used to drop a top-level column. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
Note: drop(col(colName)) has different semantics from drop(colName); please refer to Dataset#drop(colName: String).
col
- (undocumented)
Returns a new
Dataset
with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will limit the state accordingly. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
col1
- (undocumented)
cols
- (undocumented)
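The difference between deduplicating on all columns and on a subset can be sketched as follows for a static batch Dataset. The local SparkSession, the sample data, and the column names are assumptions for illustration. Which of the duplicate rows survives is not specified, so the sketch only looks at row counts.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("drop-duplicates").getOrCreate()
import spark.implicits._

// Hypothetical events: the first two rows differ only in "ts".
val events = Seq(("u1", "click", 1), ("u1", "click", 2), ("u2", "view", 3))
  .toDF("user", "action", "ts")

// Considering all columns, every row is unique.
val fullCount = events.dropDuplicates().count()

// Considering only (user, action), the two "u1"/"click" rows collapse into one.
val subsetCount = events.dropDuplicates("user", "action").count()
```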
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for
distinct
.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will limit the state accordingly. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will limit the state accordingly. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
colNames
- (undocumented)
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will limit the state accordingly. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
colNames
- (undocumented)
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic, "Events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark." Users are encouraged to set the watermark's delay threshold longer than the maximum timestamp difference among duplicated events.
Note: data older than the watermark will be dropped.
col1
- (undocumented)
cols
- (undocumented)
Returns a new Dataset with duplicate rows removed, within watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic, "Events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark." Users are encouraged to set the watermark's delay threshold longer than the maximum timestamp difference among duplicated events.
Note: data older than the watermark will be dropped.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic, "Events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark." Users are encouraged to set the watermark's delay threshold longer than the maximum timestamp difference among duplicated events.
Note: data older than the watermark will be dropped.
colNames
- (undocumented)
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to
EXCEPT DISTINCT
in SQL.
other
- (undocumented)
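Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.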
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to
EXCEPT ALL
in SQL.
other
- (undocumented)
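Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also, as is standard in SQL, this function resolves columns by position (not by name).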
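The difference between the EXCEPT DISTINCT and EXCEPT ALL semantics can be seen in a small sketch. The local SparkSession and the sample values are assumptions for illustration; results are sorted after collecting because row order is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("except-demo").getOrCreate()
import spark.implicits._

val left = Seq(1, 1, 1, 2, 3).toDS()
val right = Seq(1, 3).toDS()

// EXCEPT DISTINCT: values of left that do not occur in right, deduplicated.
val distinctDiff = left.except(right).collect().sorted

// EXCEPT ALL: each row of right cancels at most one matching row of left,
// so two of the three 1s remain.
val allDiff = left.exceptAll(right).collect().sorted
```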
Returns a
Column
object for an EXISTS subquery.
The exists
method provides a way to create a boolean column that checks for the presence of related records in a subquery. When applied within a DataFrame
, this method allows you to filter rows based on whether matching records exist in the related dataset. The resulting Column
object can be used directly in filtering conditions or as a computed column.
mode
- specifies the expected output format of plans.
simple: Print only a physical plan.
extended: Print both logical and physical plans.
codegen: Print a physical plan and generated codes if they are available.
cost: Print a logical plan and statistics if they are available.
formatted: Split explain output into two sections: a physical plan outline and node details.
public void explain(boolean extended)
extended
- default false
. If false
, prints only the physical plan.
public void explain()
(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a
LATERAL VIEW
in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode()
or flatMap()
. The following example uses these alternatives to count the number of books that contain a given word:
case class Book(title: String, words: String)
val ds: Dataset[Book]
val allWords = ds.select($"title", explode(split($"words", " ")).as("word"))
val bookCountPerWord = allWords.groupBy("word").agg(count_distinct("title"))
Using flatMap()
this can similarly be exploded as:
ds.flatMap(_.words.split(" "))
input
- (undocumented)
f
- (undocumented)
evidence$3
- (undocumented)
(Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a
LATERAL VIEW
in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode()
:
ds.select(explode(split($"words", " ")).as("word"))
or flatMap()
:
ds.flatMap(_.words.split(" "))
inputColumn
- (undocumented)
outputColumn
- (undocumented)
f
- (undocumented)
evidence$4
- (undocumented)
Filters rows using the given condition.
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
condition
- (undocumented)
Filters rows using the given SQL expression.
peopleDs.filter("age > 15")
conditionExpr
- (undocumented)
(Scala-specific) Returns a new Dataset that only contains elements where
func
returns
true
.
func
- (undocumented)
(Java-specific) Returns a new Dataset that only contains elements where
func
returns
true
.
func
- (undocumented)
func
- (undocumented)
evidence$7
- (undocumented)
f
- (undocumented)
encoder
- (undocumented)
Applies a function
f
to all rows.
f
- (undocumented)
(Java-specific) Runs
func
on each element of this Dataset.
func
- (undocumented)
Applies a function
f
to each partition of this Dataset.
f
- (undocumented)
(Java-specific) Runs
func
on each partition of this Dataset.
func
- (undocumented)
Groups the Dataset using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
ds.groupBy($"department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
cols
- (undocumented)
Groups the Dataset using the specified columns, so that we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department.
ds.groupBy("department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
col1
- (undocumented)
cols
- (undocumented)
func
- (undocumented)
evidence$2
- (undocumented)
func
- (undocumented)
encoder
- (undocumented)
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns, grouped by specific grouping sets.
ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg()
// Compute the max age and average salary, grouped by specific grouping sets.
ds.groupingSets(Seq(Seq($"department", $"gender"), Seq()), $"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
groupingSets
- (undocumented)
cols
- (undocumented)
Returns the first
n
rows.
n
- (undocumented)
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plans can be broadcast:
df1.join(df2.hint("broadcast"))
The following code specifies that this Dataset could be rebalanced with the given number of partitions:
df1.hint("rebalance", 10)
name
- the name of the hint
parameters
- the parameters of the hint; each parameter should be a Column, Expression, or Symbol, or a value that can be converted into a Literal
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to
INTERSECT
in SQL.
other
- (undocumented)
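Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.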
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent to
INTERSECT ALL
in SQL.
other
- (undocumented)
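Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also, as is standard in SQL, this function resolves columns by position (not by name).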
public abstract boolean isEmpty()
Returns true if the
Dataset
is empty.
public abstract boolean isLocal()
Returns true if the
collect
and
take
methods can be run locally (without any Spark executors).
public abstract boolean isStreaming()
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a
StreamingQuery
using the
start()
method in
DataStreamWriter
. Methods that return a single answer, e.g.
count()
or
collect()
, will throw an
AnalysisException
when there is a streaming source present.
Returns the content of the Dataset as a
JavaRDD
of
T
s.
Join with another
DataFrame
.
Behaves as an INNER JOIN and requires a subsequent join predicate.
right
- Right side of the join operation.
Inner equi-join with another
DataFrame
using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
right
- Right side of the join operation.
usingColumn
- Name of the column to join on. This column must exist on both sides.
Note: If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
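The following sketch contrasts a join on a column name with a join on an explicit expression, to show that the join column appears only once in the former. The local SparkSession, the sample data, and the column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("join-using").getOrCreate()
import spark.implicits._

// Hypothetical inputs sharing a "user_id" column.
val users = Seq((1, "ann"), (2, "bob")).toDF("user_id", "name")
val orders = Seq((1, 9.5), (1, 3.0), (3, 7.0)).toDF("user_id", "amount")

// Join on a column name: "user_id" appears once in the output schema.
val usingJoin = users.join(orders, "user_id")

// Join on an expression: both "user_id" columns are kept.
val exprJoin = users.join(orders, users("user_id") === orders("user_id"))

val usingColumns = usingJoin.columns.toSeq
val exprColumns = exprJoin.columns.toSeq
```

Both joins return the same two matching rows; only the output schema differs.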
(Java-specific) Inner equi-join with another
DataFrame
using the given columns. See the Scala-specific overload for more details.
right
- Right side of the join operation.
usingColumns
- Names of the columns to join on. These columns must exist on both sides.
(Scala-specific) Inner equi-join with another
DataFrame
using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
right
- Right side of the join operation.
usingColumns
- Names of the columns to join on. These columns must exist on both sides.
Note: If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Equi-join with another
DataFrame
using the given column. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the
crossJoin
method.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING
syntax.
right
- Right side of the join operation.
usingColumn
- Name of the column to join on. This column must exist on both sides.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.
Note: If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
(Java-specific) Equi-join with another
DataFrame
using the given columns. See the Scala-specific overload for more details.
right
- Right side of the join operation.
usingColumns
- Names of the columns to join on. These columns must exist on both sides.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.
(Scala-specific) Equi-join with another
DataFrame
using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the
crossJoin
method.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING
syntax.
right
- Right side of the join operation.
usingColumns
- Names of the columns to join on. These columns must exist on both sides.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.
Note: If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Inner join with another
DataFrame
, using the given join expression.
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
right
- (undocumented)
joinExprs
- (undocumented)
Join with another
DataFrame
, using the given join expression. The following performs a full outer join between
df1
and
df2
.
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")
// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
right
- Right side of the join.
joinExprs
- Join expression.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.
Joins this Dataset returning a
Tuple2
for each pair where
condition
evaluates to true.
This is similar to the relation join
function with one important difference in the result schema. Since joinWith
preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1
and _2
.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
other
- Right side of the join.
condition
- Join expression.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer.
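A minimal sketch of the nested result schema described above. Tuples stand in for domain objects here to keep the snippet short; case classes behave the same way. The local SparkSession and the sample data are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("join-with").getOrCreate()
import spark.implicits._

// Hypothetical inputs: (id, name) and (userId, amount).
val users = Seq((1, "ann"), (2, "bob")).toDS()
val orders = Seq((1, 9.5), (3, 7.0)).toDS()

// Each result element is a pair holding the complete left and right objects.
val pairs = users.joinWith(orders, users("_1") === orders("_1"), "inner")

val pairColumns = pairs.columns.toSeq // the two sides are nested under _1 and _2
val matched = pairs.collect().toSeq
```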
Using inner equi-join to join this Dataset returning a
Tuple2
for each pair where
condition
evaluates to true.
other
- Right side of the join.
condition
- Join expression.
Lateral join with another
DataFrame
.
Behaves as a JOIN LATERAL.
right
- Right side of the join operation.
Lateral join with another
DataFrame
.
Behaves as a JOIN LATERAL.
right
- Right side of the join operation.
joinExprs
- Join expression.
Lateral join with another
DataFrame
.
right
- Right side of the join operation.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, left, leftouter, left_outer.
Lateral join with another
DataFrame
.
right
- Right side of the join operation.
joinExprs
- Join expression.
joinType
- Type of join to perform. Default inner. Must be one of: inner, cross, left, leftouter, left_outer.
Returns a new Dataset by taking the first
n
rows. The difference between this function and
head
is that
head
is an action and returns an array (by triggering query execution) while
limit
returns a new Dataset.
n
- (undocumented)
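The distinction between limit and head can be sketched as follows. The local SparkSession and the use of spark.range as sample data are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("limit-vs-head").getOrCreate()

val ds = spark.range(0, 100)

// limit is a transformation: it returns a new Dataset and runs nothing yet.
val limited = ds.limit(5)

// head is an action: it executes the query and returns a local array.
val firstRows = ds.head(5)

// Execution of the limited Dataset happens only when an action such as count() is called.
val limitedCount = limited.count()
```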
eager
- Whether to checkpoint this dataframe immediately
eager
- Whether to checkpoint this dataframe immediately
storageLevel
- StorageLevel with which to checkpoint the data.
(Scala-specific) Returns a new Dataset that contains the result of applying
func
to each element.
func
- (undocumented)
evidence$5
- (undocumented)
(Java-specific) Returns a new Dataset that contains the result of applying
func
to each element.
func
- (undocumented)
encoder
- (undocumented)
(Scala-specific) Returns a new Dataset that contains the result of applying
func
to each partition.
func
- (undocumented)
evidence$6
- (undocumented)
(Java-specific) Returns a new Dataset that contains the result of applying
f
to each partition.
f
- (undocumented)
encoder
- (undocumented)
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of
groupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed. This is an alias for
unpivot
.
ids
- Id columns
values
- Value columns to unpivot
variableColumnName
- Name of the variable column
valueColumnName
- Name of the value column
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of
groupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed. This is an alias for
unpivot
.
ids
- Id columns
variableColumnName
- Name of the variable column
valueColumnName
- Name of the value column
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
This is equivalent to calling Dataset#unpivot(Array, Array, String, String)
where values
is set to all non-id columns that exist in the DataFrame.
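As a small sketch of the wide-to-long reshaping, the following unpivots two value columns while keeping one identifier column. The local SparkSession, the sample data, and the column names (id, q1, q2, quarter, sales) are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("unpivot-demo").getOrCreate()
import spark.implicits._

// Hypothetical wide table: one row per id, one column per quarter.
val wide = Seq((1, 1.0, 2.0), (2, 3.0, 4.0)).toDF("id", "q1", "q2")

// Explicit value columns: each (id, quarter) pair becomes its own row.
val longForm = wide.unpivot(Array($"id"), Array($"q1", $"q2"), "quarter", "sales")

// Without explicit values, all non-id columns are unpivoted.
val melted = wide.melt(Array($"id"), "quarter", "sales")

val longColumns = longForm.columns.toSeq
```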
Merges a set of updates, insertions, and deletions based on a source table into a target table.
Scala Examples:
spark.table("source")
  .mergeInto("target", $"source.id" === $"target.id")
  .whenMatched($"salary" === 100)
  .delete()
  .whenNotMatched()
  .insertAll()
  .whenNotMatchedBySource($"salary" === 100)
  .update(Map(
    "salary" -> lit(200)
  ))
  .merge()
table
- (undocumented)
condition
- (undocumented)
Selects a metadata column based on its logical column name, and returns it as a
Column
.
A metadata column can be accessed this way even if the underlying data source defines a data column with a conflicting name.
colName
- (undocumented)
Returns a
DataFrameNaFunctions
for working with missing data.
// Dropping rows containing any null values.
ds.na.drop()
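Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:
- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
- It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
Please note that continuous execution is currently not supported.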
The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.
name
- (undocumented)
expr
- (undocumented)
exprs
- (undocumented)
Observe (named) metrics through an
org.apache.spark.sql.Observation
instance. This method does not support streaming datasets.
A user can retrieve the metrics by accessing org.apache.spark.sql.Observation.get
.
// Observe row count (rows) and highest id (maxid) in the Dataset while writing it
val observation = Observation("my_metrics")
val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
observed_ds.write.parquet("ds.parquet")
val metrics = observation.get
observation
- (undocumented)
expr
- (undocumented)
exprs
- (undocumented)
IllegalArgumentException
- If this is a streaming Dataset (this.isStreaming == true)
Returns a new Dataset by skipping the first
n
rows.
n
- (undocumented)
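Skipping rows is often combined with limit to read one "page" of a sorted result. The following is a minimal sketch of that pattern; the local SparkSession, the sample data, and the page position are assumptions for illustration. The explicit orderBy makes the page contents deterministic.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("offset-demo").getOrCreate()
import spark.implicits._

// Sort first, then skip the first 3 rows and keep the next 2.
val page = spark.range(0, 10).orderBy($"id").offset(3).limit(2)

val ids = page.collect().map(_.longValue).toSeq
```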
Returns a new Dataset sorted by the given expressions. This is an alias of the
sort
function.
sortCol
- (undocumented)
sortCols
- (undocumented)
Returns a new Dataset sorted by the given expressions. This is an alias of the
sort
function.
sortExprs
- (undocumented)
Persist this Dataset with the default storage level (
MEMORY_AND_DISK
).
newLevel
- One of: MEMORY_ONLY
, MEMORY_AND_DISK
, MEMORY_ONLY_SER
, MEMORY_AND_DISK_SER
, DISK_ONLY
, MEMORY_ONLY_2
, MEMORY_AND_DISK_2
, etc.
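A short sketch of persisting with an explicit storage level and reading the level back. The local SparkSession and the choice of MEMORY_ONLY are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()

val ds = spark.range(0, 100)

// Mark the Dataset for caching with an explicit level instead of the default.
ds.persist(StorageLevel.MEMORY_ONLY)
val level = ds.storageLevel

// Release the cached data once it is no longer needed.
ds.unpersist()
```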
public void printSchema()
public void printSchema(int level)
level
- (undocumented)
public abstract org.apache.spark.sql.execution.QueryExecution queryExecution()
weights
- weights for splits, will be normalized if they don't sum to 1.
seed
- Seed for sampling.
For Java API, use randomSplitAsList(double[],long)
.
weights
- weights for splits, will be normalized if they don't sum to 1.
weights
- weights for splits, will be normalized if they don't sum to 1.
seed
- Seed for sampling.
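The following sketch splits a Dataset with weights that do not sum to 1, relying on the normalization described above (3.0 and 1.0 are treated as 0.75 and 0.25). The local SparkSession, the sample data, the weights, and the seed are assumptions for illustration. The individual split sizes are random, so only the combined size is checked.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("random-split").getOrCreate()

val ds = spark.range(0, 1000)

// A fixed seed makes the split reproducible for the same input.
val Array(train, test) = ds.randomSplit(Array(3.0, 1.0), 42L)

// The splits are disjoint and together cover the input.
val total = train.count() + test.count()
```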
Represents the content of the Dataset as an
RDD
of
T
.
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given
func
must be commutative and associative or the result may be non-deterministic.
func
- (undocumented)
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given
func
must be commutative and associative or the result may be non-deterministic.
func
- (undocumented)
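As a minimal sketch, addition is both commutative and associative, so it is a safe function to pass to reduce. The local SparkSession and the sample values are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("reduce-demo").getOrCreate()
import spark.implicits._

// The order in which partitions are combined does not affect a sum.
val total = Seq(1, 2, 3, 4).toDS().reduce((a, b) => a + b)
```

A non-associative function such as subtraction could give different results depending on how the data is partitioned.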
Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the
SparkSession
that was used to create this Dataset.
tableName
- (undocumented)
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions
- (undocumented)
partitionExprs
- (undocumented)
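The following sketch shows hash repartitioning by an expression into a fixed number of partitions. The column names, data, and partition count are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("repartition-example").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)).toDF("key", "value")

// Rows with the same key are hashed to the same partition.
val byKey = df.repartition(4, $"key")
val numParts = byKey.rdd.getNumPartitions
```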
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs
- (undocumented)
Returns a new Dataset that has exactly
numPartitions
partitions.
numPartitions
- (undocumented)
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions
- (undocumented)
partitionExprs
- (undocumented)
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs
- (undocumented)
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition
.
numPartitions
- (undocumented)
partitionExprs
- (undocumented)
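A hedged sketch of range partitioning follows. The data and partition count are assumptions. Because range boundaries are estimated by sampling, the example only inspects the number of partitions and the row count.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("repartitionByRange-example").getOrCreate()
import spark.implicits._

val df = spark.range(0, 100).toDF("id")

// Each partition receives a contiguous range of ids; rows inside a partition
// are not sorted unless sortWithinPartitions is applied afterwards.
val ranged = df.repartitionByRange(4, $"id")
val numParts = ranged.rdd.getNumPartitions
val rows = ranged.count()
```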
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition
.
partitionExprs
- (undocumented)
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition
.
numPartitions
- (undocumented)
partitionExprs
- (undocumented)
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition
.
partitionExprs
- (undocumented)
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns rolled up by department and group.
ds.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group.
ds.rollup("department", "group").avg()
// Compute the max age and average salary, rolled up by department and gender.
ds.rollup("department", "gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)
cols
- (undocumented)
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns rolled up by department and group.
ds.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group.
ds.rollup("department", "group").avg()
// Compute the max age and average salary, rolled up by department and gender.
ds.rollup("department", "gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)
cols
- (undocumented)
Returns
true
when the logical query plans inside both
Dataset
s are equal and therefore return the same results.
other
- (undocumented)
Note: this comparison is very fast, but it can still return false for Datasets that return the same results, for instance, from different plans. Such false-negative semantics can be useful, for example, when caching.
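As an illustration, the sketch below compares DataFrames built from identical and from different local data. The data, column names, and local session settings are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("sameSemantics-example").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 2), (4, 5)).toDF("col1", "col2")
val df2 = Seq((1, 2), (4, 5)).toDF("col1", "col2")
val df3 = Seq((0, 2), (4, 5)).toDF("col1", "col2")

val same = df1.sameSemantics(df2)        // identical plans and data
val different = df1.sameSemantics(df3)   // the underlying data differs
val sameHash = df1.semanticHash() == df2.semanticHash()
```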
Returns a new
Dataset
by sampling a fraction of rows (without replacement), using a user-supplied seed.
fraction
- Fraction of rows to generate, range [0.0, 1.0].
seed
- Seed for sampling.
Note: this is not guaranteed to provide exactly the specified fraction of the count of the given Dataset.
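A short sketch of seeded sampling without replacement. The fraction, seed, and local session settings are assumptions. With the same input, partitioning, and seed, two samples should select the same rows, while the sample size itself is only approximate.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("sample-example").getOrCreate()

val ds = spark.range(0, 1000)

// Same fraction and seed on the same input.
val s1 = ds.sample(0.1, 7L).collect().toSeq
val s2 = ds.sample(0.1, 7L).collect().toSeq
```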
Returns a new
Dataset
by sampling a fraction of rows (without replacement), using a random seed.
fraction
- Fraction of rows to generate, range [0.0, 1.0].
Note: this is not guaranteed to provide exactly the specified fraction of the count of the given Dataset.
Returns a new
Dataset
by sampling a fraction of rows, using a user-supplied seed.
withReplacement
- Sample with replacement or not.
fraction
- Fraction of rows to generate, range [0.0, 1.0].
seed
- Seed for sampling.
Note: this is not guaranteed to provide exactly the specified fraction of the count of the given Dataset.
Returns a new
Dataset
by sampling a fraction of rows, using a random seed.
withReplacement
- Sample with replacement or not.
fraction
- Fraction of rows to generate, range [0.0, 1.0].
Note: this is not guaranteed to provide exactly the specified fraction of the count of the given Dataset.
Return a
Column
object for a SCALAR Subquery containing exactly one row and one column.
The scalar()
method is useful for extracting a Column
object that represents a scalar value from a DataFrame, especially when the DataFrame results from an aggregation or single-value computation. This returned Column
can then be used directly in select
clauses or as predicates in filters on the outer DataFrame, enabling dynamic data filtering and calculations based on scalar values.
Selects a set of column based expressions.
ds.select($"colA", $"colB" + 1)
cols
- (undocumented)
Selects a set of columns. This is a variant of
select
that can only select existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
col
- (undocumented)
cols
- (undocumented)
Selects a set of column based expressions.
ds.select($"colA", $"colB" + 1)
cols
- (undocumented)
Selects a set of columns. This is a variant of
select
that can only select existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
col
- (undocumented)
cols
- (undocumented)
Returns a new Dataset by computing the given
Column
expression for each element.
val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
c1
- (undocumented)
Returns a new Dataset by computing the given
Column
expressions for each element.
c1
- (undocumented)
c2
- (undocumented)
Returns a new Dataset by computing the given
Column
expressions for each element.
c1
- (undocumented)
c2
- (undocumented)
c3
- (undocumented)
Returns a new Dataset by computing the given
Column
expressions for each element.
c1
- (undocumented)
c2
- (undocumented)
c3
- (undocumented)
c4
- (undocumented)
Returns a new Dataset by computing the given
Column
expressions for each element.
c1
- (undocumented)
c2
- (undocumented)
c3
- (undocumented)
c4
- (undocumented)
c5
- (undocumented)
Selects a set of SQL expressions. This is a variant of
select
that accepts SQL expressions.
// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
exprs
- (undocumented)
Selects a set of SQL expressions. This is a variant of
select
that accepts SQL expressions.
// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
exprs
- (undocumented)
public abstract int semanticHash()
Returns a
hashCode
of the logical query plan against this
Dataset
.
Note: unlike the standard
hashCode
, the hash is calculated against the query plan simplified by tolerating cosmetic differences such as attribute names.
public void show(int numRows)
Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to show
public void show()
public void show(boolean truncate)
truncate
- Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
public abstract void show(int numRows, boolean truncate)
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to show
truncate
- Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
public void show(int numRows, int truncate)
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to show
truncate
- If set to more than 0, truncates strings to truncate
characters and all cells will be aligned right.
public abstract void show(int numRows, int truncate, boolean vertical)
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
If vertical
is enabled, this command prints output rows vertically (one line per column value):
-RECORD 0-------------------
year | 1980
month | 12
AVG('Adj Close) | 0.503218
MAX('Adj Close) | 0.595103
-RECORD 1-------------------
year | 1981
month | 01
AVG('Adj Close) | 0.523289
MAX('Adj Close) | 0.570307
-RECORD 2-------------------
year | 1982
month | 02
AVG('Adj Close) | 0.436504
MAX('Adj Close) | 0.475256
-RECORD 3-------------------
year | 1983
month | 03
AVG('Adj Close) | 0.410516
MAX('Adj Close) | 0.442194
-RECORD 4-------------------
year | 1984
month | 04
AVG('Adj Close) | 0.450090
MAX('Adj Close) | 0.483521
numRows
- Number of rows to show
truncate
- If set to more than 0, truncates strings to truncate
characters and all cells will be aligned right.
vertical
- If set to true, prints output rows vertically (one line per column value).
Returns a new Dataset sorted by the specified column, all in ascending order.
// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
sortCol
- (undocumented)
sortCols
- (undocumented)
Returns a new Dataset sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
sortExprs
- (undocumented)
Returns a new Dataset sorted by the specified column, all in ascending order.
// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
sortCol
- (undocumented)
sortCols
- (undocumented)
Returns a new Dataset sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
sortExprs
- (undocumented)
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol
- (undocumented)
sortCols
- (undocumented)
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs
- (undocumented)
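The sketch below illustrates that sorting is applied per partition rather than globally. The data, partition count, and local session settings are assumptions. `glom()` is used only to inspect the contents of each partition.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("sortWithinPartitions-example").getOrCreate()
import spark.implicits._

val ds = Seq(5, 3, 8, 1, 9, 2).toDS()

// Spread the rows over two partitions, then sort inside each partition.
val sortedParts = ds.repartition(2).sortWithinPartitions($"value")

// Collect each partition as its own array to inspect the local ordering.
val partitions: Array[Array[Int]] = sortedParts.rdd.glom().collect()
```

Each inner array is sorted, but concatenating the arrays does not necessarily yield a globally sorted sequence.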
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol
- (undocumented)
sortCols
- (undocumented)
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs
- (undocumented)
Returns a
DataFrameStatFunctions
for working with statistic functions.
// Finding frequent items in column with name 'a'.
ds.stat.freqItems(Seq("a"))
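Computes specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, arbitrary approximate percentiles specified as a percentage (e.g. 75%), count_distinct, and approx_count_distinct.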
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg
function instead.
ds.summary().show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// 25% 24.0 176.0
// 50% 24.0 176.0
// 75% 32.0 180.0
// max 92.0 192.0
ds.summary("count", "min", "25%", "75%", "max").show()
// output:
// summary age height
// count 10.0 10.0
// min 18.0 163.0
// 25% 24.0 176.0
// 75% 32.0 180.0
// max 92.0 192.0
To do a summary for specific columns first select them:
ds.select("age", "height").summary().show()
Specify statistics to output custom summaries:
ds.summary("count", "count_distinct").show()
The distinct count isn't included by default.
You can also run approximate distinct counts which are faster:
ds.summary("count", "approx_count_distinct").show()
See also describe(java.lang.String...)
for basic statistics.
statistics
- Statistics from above list to be computed.
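Computes specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, arbitrary approximate percentiles specified as a percentage (e.g. 75%), count_distinct, and approx_count_distinct.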
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg
function instead.
ds.summary().show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// 25% 24.0 176.0
// 50% 24.0 176.0
// 75% 32.0 180.0
// max 92.0 192.0
ds.summary("count", "min", "25%", "75%", "max").show()
// output:
// summary age height
// count 10.0 10.0
// min 18.0 163.0
// 25% 24.0 176.0
// 75% 32.0 180.0
// max 92.0 192.0
To do a summary for specific columns first select them:
ds.select("age", "height").summary().show()
Specify statistics to output custom summaries:
ds.summary("count", "count_distinct").show()
The distinct count isn't included by default.
You can also run approximate distinct counts which are faster:
ds.summary("count", "approx_count_distinct").show()
See also describe(java.lang.String...)
for basic statistics.
statistics
- Statistics from above list to be computed.
Returns the last
n
rows in the Dataset.
Running tail requires moving data into the application's driver process, and doing so with a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)
Returns the first
n
rows in the Dataset.
Running take requires moving data into the application's driver process, and doing so with a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)
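A small sketch contrasting `take` and `tail` with a modest `n`. The range and local session settings are assumptions. Both calls bring their results to the driver, so `n` is kept small.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("take-tail-example").getOrCreate()

val ds = spark.range(0, 100)

// spark.range yields java.lang.Long values; convert them to Scala Long for comparison.
val first3 = ds.take(3).map(_.longValue).toSeq
val last3 = ds.tail(3).map(_.longValue).toSeq
```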
Returns the first
n
rows in the Dataset as a list.
Running take requires moving data into the application's driver process, and doing so with a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)
Returns a new DataFrame where each row is reconciled to match the specified schema. Spark will:
schema
- (undocumented)
Converts this strongly typed collection of data to generic
DataFrame
with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a
DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2`
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
colNames
- (undocumented)
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic
Row
objects that allow fields to be accessed by ordinal or name.
Converts this strongly typed collection of data to generic
DataFrame
with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a
DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2`
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
colNames
- (undocumented)
Returns the content of the Dataset as a
JavaRDD
of
T
s.
Returns an iterator that contains all rows in this Dataset.
The iterator will consume as much memory as the largest partition in this Dataset.
Concise syntax for chaining custom transformations.
def featurize(ds: Dataset[T]): Dataset[U] = ...
ds
.transform(featurize)
.transform(...)
t
- (undocumented)
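A concrete version of the chaining pattern might look like the sketch below. The helper names `onlyEven` and `doubled` are hypothetical, as are the data and the local session settings.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("transform-example").getOrCreate()
import spark.implicits._

// Hypothetical custom transformations, each taking and returning a Dataset.
def onlyEven(ds: Dataset[Int]): Dataset[Int] = ds.filter($"value" % 2 === 0)
def doubled(ds: Dataset[Int]): Dataset[Int] = ds.select(($"value" * 2).as[Int])

val result = Seq(1, 2, 3, 4).toDS()
  .transform(onlyEven)
  .transform(doubled)
  .collect()
  .toSeq
```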
Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.
Please note:
- All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type.
- The name of the column into which the original column names are transposed defaults to "key".
- null values in the index column are excluded from the column names for the transposed table, which are ordered in ascending order.
val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2")
df.show()
// output:
// +---+----+----+
// | id|val1|val2|
// +---+----+----+
// | A| 1| 2|
// | B| 3| 4|
// +---+----+----+
df.transpose($"id").show()
// output:
// +----+---+---+
// | key| A| B|
// +----+---+---+
// |val1| 1| 3|
// |val2| 2| 4|
// +----+---+---+
// schema:
// root
// |-- key: string (nullable = false)
// |-- A: integer (nullable = true)
// |-- B: integer (nullable = true)
df.transpose().show()
// output:
// +----+---+---+
// | key| A| B|
// +----+---+---+
// |val1| 1| 3|
// |val2| 2| 4|
// +----+---+---+
// schema:
// root
// |-- key: string (nullable = false)
// |-- A: integer (nullable = true)
// |-- B: integer (nullable = true)
indexColumn
- The single column that will be treated as the index for the transpose operation. This column will be used to pivot the data, transforming the DataFrame such that the values of the indexColumn become the new columns in the transposed DataFrame.
Transposes a DataFrame, switching rows to columns. This function transforms the DataFrame such that the values in the first column become the new columns of the DataFrame.
This is equivalent to calling Dataset#transpose(Column)
where indexColumn
is set to the first column.
Please note:
- All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type.
- The name of the column into which the original column names are transposed defaults to "key".
- Non-"key" column names for the transposed table are ordered in ascending order.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL
in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct()
.
Also as standard in SQL, this function resolves columns by position (not by name):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// | 1| 2| 3|
// | 4| 5| 6|
// +----+----+----+
Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName(org.apache.spark.sql.Dataset<T>)
to resolve columns by field name in the typed objects.
other
- (undocumented)
Returns a new Dataset containing union of rows in this Dataset and another Dataset. This is an alias for
union
.
This is equivalent to UNION ALL
in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct()
.
Also as standard in SQL, this function resolves columns by position (not by name).
other
- (undocumented)
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is different from both UNION ALL
and UNION DISTINCT
in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct()
.
The difference between this function and union(org.apache.spark.sql.Dataset<T>)
is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// | 1| 2| 3|
// | 6| 4| 5|
// +----+----+----+
Note that this supports nested columns in struct and array types. Nested columns in map types are not currently supported.
other
- (undocumented)
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
The difference between this function and union(org.apache.spark.sql.Dataset<T>)
is that this function resolves columns by name (not by position).
When the parameter allowMissingColumns
is true
, the set of column names in this and other Dataset
can differ; missing columns will be filled with null. Further, the missing columns of this Dataset
will be added at the end in the schema of the union result:
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
df1.unionByName(df2, true).show
// output: "col3" is missing at left df1 and added at the end of schema.
// +----+----+----+----+
// |col0|col1|col2|col3|
// +----+----+----+----+
// | 1| 2| 3|NULL|
// | 5| 4|NULL| 6|
// +----+----+----+----+
df2.unionByName(df1, true).show
// output: "col2" is missing at left df2 and added at the end of schema.
// +----+----+----+----+
// |col1|col0|col3|col2|
// +----+----+----+----+
// | 4| 5| 6|NULL|
// | 2| 1|NULL| 3|
// +----+----+----+----+
Note that this supports nested columns in struct and array types. With allowMissingColumns
, missing nested columns of struct columns with the same name will also be filled with null values and added to the end of struct. Nested columns in map types are not currently supported.
other
- (undocumented)
allowMissingColumns
- (undocumented)
blocking
- Whether to block until all blocks are deleted.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to
groupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed.
This function is useful to massage a DataFrame into a format where some columns are identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows, leaving just two non-id columns, named as given by variableColumnName
and valueColumnName
.
val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
df.show()
// output:
// +---+---+----+
// | id|int|long|
// +---+---+----+
// | 1| 11| 12|
// | 2| 21| 22|
// +---+---+----+
df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value").show()
// output:
// +---+--------+-----+
// | id|variable|value|
// +---+--------+-----+
// | 1| int| 11|
// | 1| long| 12|
// | 2| int| 21|
// | 2| long| 22|
// +---+--------+-----+
// schema:
// root
// |-- id: integer (nullable = false)
// |-- variable: string (nullable = false)
// |-- value: long (nullable = true)
When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.
All "value" columns must share a least common data type. Unless they are the same data type, all "value" columns are cast to the nearest common data type. For instance, types IntegerType
and LongType
are cast to LongType
, while IntegerType
and StringType
do not have a common data type and unpivot
fails with an AnalysisException
.
ids
- Id columns
values
- Value columns to unpivot
variableColumnName
- Name of the variable column
valueColumnName
- Name of the value column
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to
groupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed.
ids
- Id columns
variableColumnName
- Name of the variable column
valueColumnName
- Name of the value column
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
This is equivalent to calling Dataset#unpivot(Array, Array, String, String)
where values
is set to all non-id columns that exist in the DataFrame.
Filters rows using the given condition. This is an alias for
filter
.
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
condition
- (undocumented)
Filters rows using the given SQL expression.
peopleDs.where("age > 15")
conditionExpr
- (undocumented)
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
column
's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.
colName
- (undocumented)
col
- (undocumented)
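Note: this method introduces a projection internally. Therefore, calling it multiple times, for instance in a loop to add multiple columns, can generate big plans that cause performance issues and even
StackOverflowException
. To avoid this, use select
with multiple columns at once.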
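A sketch of adding several derived columns in a single `select` projection rather than through repeated `withColumn` calls. The column names, expressions, and local session settings are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("select-many-example").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("a", "b")

// One projection: keep all existing columns and add two derived ones.
val wide = df.select(
  col("*"),
  (col("a") + col("b")).as("sum"),
  (col("a") * col("b")).as("product")
)

val rows = wide.as[(Int, Int, Int, Int)].collect().toSeq
```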
existingName
- (undocumented)
newName
- (undocumented)
(Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that have the same names.
colsMap
is a map of column name and column; each column must only refer to attributes supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
colsMap
- (undocumented)
(Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that have the same names.
colsMap
is a map of column name and column; each column must only refer to attributes supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
colsMap
- (undocumented)
(Scala-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn't contain existingName.
colsMap
is a map of existing column name and new column name.
colsMap
- (undocumented)
AnalysisException
- if there are duplicate names in resulting projection
(Java-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn't contain existingName.
colsMap
is a map of existing column name and new column name.
colsMap
- (undocumented)
columnName
- (undocumented)
metadata
- (undocumented)
Defines an event time watermark for this
Dataset
. A watermark tracks a point in time before which we assume no more late data is going to arrive.
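Spark will use this watermark for several purposes:
- To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
- To minimize the amount of state that we need to keep for on-going aggregations,
mapGroupsWithState
and dropDuplicates
operators.
The current watermark is computed by looking at the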
MAX(eventTime)
seen across all of the partitions in the query minus a user specified
delayThreshold
. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least
delayThreshold
behind the actual event time. In some cases we may still process records that arrive more than
delayThreshold
late.
eventTime
- the name of the column that contains the event time of the row.
delayThreshold
- the minimum delay to wait for late data to arrive, relative to the latest record that has been processed, in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.
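A hedged sketch of declaring a watermark on a streaming Dataset before a windowed aggregation. It uses the built-in `rate` source, whose rows carry `timestamp` and `value` columns. The delay threshold, window size, and local session settings are assumptions, and the query is only defined here, not started.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

// Local session for illustration; master and app name are assumptions.
val spark = SparkSession.builder().master("local[2]").appName("watermark-example").getOrCreate()
import spark.implicits._

// Synthetic streaming input from the built-in rate source.
val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// Tolerate events up to 10 minutes late, then count events per 5-minute window.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()
```

To produce output, the query would still need to be started through `counts.writeStream` with a sink and an output mode.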
Create a write configuration builder for v2 sources.
This builder is used to configure and execute write operations. For example, to append to an existing table, run:
df.writeTo("catalog.db.table").append()
This can also be used to create or replace existing tables:
df.writeTo("catalog.db.table").partitionedBy($"col").createOrReplace()
table
- (undocumented)