Spark SQL provides spark.read().xml("file_1_path", "file_2_path") to read a file or a directory of files in XML format into a Spark DataFrame, and dataframe.write().xml("path") to write to an XML file. The rowTag option must be specified to indicate the XML element that maps to a DataFrame row. The option() function can be used to customize read or write behavior, such as the handling of XML attributes, XSD validation, compression, and so on.
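For example, here is a minimal write sketch; the DataFrame contents and the output path are illustrative, not taken from the examples below:

df = spark.createDataFrame([(28, "laglangyue")], ["age", "name"])
# rowTag is required for writing as well; rootTag and compression are optional
# (their defaults are "ROWS" and no compression, per the option table below).
df.write \
    .option("rowTag", "person") \
    .option("rootTag", "people") \
    .mode("overwrite") \
    .xml("/tmp/people-xml")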
# An XML dataset is pointed to by path.
# The path can be either a single xml file or a directory storing xml files
path = "examples/src/main/resources/people.xml"
peopleDF = spark.read.option("rowTag", "person").format("xml").load(path)
# The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# | name|
# +------+
# |Justin|
# +------+
# Alternatively, a DataFrame can be created for an XML dataset represented by
# an RDD[String] storing one XML record per string
xmlStrings = ["""
    <person>
        <name>laglangyue</name>
        <job>Developer</job>
        <age>28</age>
    </person>
"""]
xmlRDD = spark.sparkContext.parallelize(xmlStrings)
otherPeople = spark.read \
    .option("rowTag", "person") \
    .xml(xmlRDD)
otherPeople.show()
# +---+---------+----------+
# |age| job| name|
# +---+---------+----------+
# | 28|Developer|laglangyue|
# +---+---------+----------+
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._
// An XML dataset is pointed to by path.
// The path can be either a single xml file or a directory storing xml files
val path = "examples/src/main/resources/people.xml"
val peopleDF = spark.read.option("rowTag", "person").xml(path)
// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// | name|
// +------+
// |Justin|
// +------+
// Alternatively, a DataFrame can be created for an XML dataset represented by a Dataset[String]
val otherPeopleDataset = spark.createDataset(
  """
    |<person>
    |    <name>laglangyue</name>
    |    <job>Developer</job>
    |    <age>28</age>
    |</person>
    |""".stripMargin :: Nil)
val otherPeople = spark.read
  .option("rowTag", "person")
  .xml(otherPeopleDataset)
otherPeople.show()
// +---+---------+----------+
// |age| job| name|
// +---+---------+----------+
// | 28|Developer|laglangyue|
// +---+---------+----------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
// An XML dataset is pointed to by path.
// The path can be either a single xml file or a directory storing xml files
String path = "examples/src/main/resources/people.xml";
Dataset<Row> peopleDF = spark.read().option("rowTag", "person").xml(path);
// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagerNamesDF = spark.sql(
    "SELECT name FROM people WHERE age BETWEEN 13 AND 19");
teenagerNamesDF.show();
// +------+
// | name|
// +------+
// |Justin|
// +------+
// Alternatively, a DataFrame can be created for an XML dataset represented by a Dataset[String]
List<String> xmlData = Collections.singletonList(
    "<person>" +
    "<name>laglangyue</name><job>Developer</job><age>28</age>" +
    "</person>");
Dataset<String> otherPeopleDataset = spark.createDataset(Lists.newArrayList(xmlData),
    Encoders.STRING());
Dataset<Row> otherPeople = spark.read()
    .option("rowTag", "person")
    .xml(otherPeopleDataset);
otherPeople.show();
// +---+---------+----------+
// |age| job| name|
// +---+---------+----------+
// | 28|Developer|laglangyue|
// +---+---------+----------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
Data Source Option

Data source options of XML can be set via:
- the .option/.options methods of
  - DataFrameReader
  - DataFrameWriter
  - DataStreamReader
  - DataStreamWriter
- the built-in functions below (illustrated in the sketch after this list)
  - from_xml
  - to_xml
  - schema_of_xml
- the OPTIONS clause at CREATE TABLE USING DATA_SOURCE
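As a hedged sketch of the built-in functions, assuming a DataFrame with a string column of XML records (the sample record and column name are illustrative):

from pyspark.sql.functions import from_xml, lit, schema_of_xml

df = spark.createDataFrame([("<person><age>28</age></person>",)], ["value"])
# Infer a schema from a sample record, then parse the string column with it.
# Options are passed as a dict; FAILFAST makes malformed records raise an error.
schema = schema_of_xml(lit("<person><age>28</age></person>"))
parsed = df.select(from_xml(df.value, schema, {"mode": "FAILFAST"}).alias("person"))
parsed.show(truncate=False)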
Each option below lists its scope (whether it applies to read, write, or both) and its default value.

rowTag (scope: read/write; no default, required)
    The row tag of your xml files to treat as a row. For example, in this xml: <books><book></book>...</books>, the appropriate value would be book. This is a required option for both read and write.

samplingRatio (scope: read; default: 1.0)
    Defines the fraction of rows used for schema inferring. XML built-in functions ignore this option.

excludeAttribute (scope: read; default: false)
    Whether to exclude attributes in elements.

mode (scope: read; default: PERMISSIVE)
    Allows a mode for dealing with corrupt records during parsing.
    - PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, corrupt records are dropped during parsing. When inferring a schema, a columnNameOfCorruptRecord field is implicitly added to the output schema.
    - DROPMALFORMED: ignores whole corrupted records. This mode is unsupported in the XML built-in functions.
    - FAILFAST: throws an exception when it meets corrupted records.

inferSchema (scope: read; default: true)
    If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type.

columnNameOfCorruptRecord (scope: read; default: the value of the spark.sql.columnNameOfCorruptRecord configuration)
    Allows renaming the new field that holds the malformed string created by PERMISSIVE mode.

attributePrefix (scope: read/write; default: _)
    The prefix for attributes, used to differentiate attributes from elements. This will be the prefix for field names. It can be empty for reading XML, but not for writing.

valueTag (scope: read/write; default: _VALUE)
    The tag used for the value when there are attributes in an element that has no child elements.

encoding (scope: read/write; default: UTF-8)
    For reading, decodes the XML files by the given encoding type. For writing, specifies the encoding (charset) of saved XML files. XML built-in functions ignore this option.

ignoreSurroundingSpaces (scope: read; default: true)
    Defines whether surrounding whitespace in values being read should be skipped.

rowValidationXSDPath (scope: read; default: null)
    Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema, whether provided or inferred.

ignoreNamespace (scope: read; default: false)
    If true, namespace prefixes on XML elements and attributes are ignored. The tags <abc:author> and <def:author> would, for example, be treated as if both were just <author>. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only on its children. Also note that XML parsing is in general not namespace-aware, even when this is false.

timeZone (scope: read/write; default: the value of the spark.sql.session.timeZone configuration)
    Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasources or partition values. The following formats of timeZone are supported: a region-based zone ID in the form area/city, such as America/Los_Angeles, or a zone offset in the form (+|-)HH:mm, for example -08:00 or +01:00 (UTC and Z are supported as aliases of +00:00). Other short names like CST are not recommended, because they can be ambiguous.

timestampFormat (scope: read/write; default: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX])
    Sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime Patterns. This applies to timestamp type.

timestampNTZFormat (scope: read/write; default: yyyy-MM-dd'T'HH:mm:ss[.SSS])
    Sets the string that indicates a format for timestamps without a time zone. Custom date formats follow the formats at Datetime Patterns. This applies to the timestamp-without-timezone type; note that zone-offset and time-zone components are not supported when writing or reading this data type.

dateFormat (scope: read/write; default: yyyy-MM-dd)
    Sets the string that indicates a date format. Custom date formats follow the formats at Datetime Patterns. This applies to date type.

locale (scope: read/write; default: en-US)
    Sets a locale as a language tag in IETF BCP 47 format. For instance, the locale is used while parsing dates and timestamps.

rootTag (scope: write; default: ROWS)
    Root tag of the xml files. For example, in this xml: <books><book></book>...</books>, the appropriate value would be books. It can include basic attributes by specifying a value like books foo="bar".

declaration (scope: write; default: version="1.0" encoding="UTF-8" standalone="yes")
    Content of the XML declaration to write at the start of every output XML file, before the rootTag. For example, a value of foo causes <?foo?> to be written. Set it to the empty string to suppress the declaration.

arrayElementName (scope: write; default: item)
    Name of the XML element that encloses each element of an array-valued column when writing.

nullValue (scope: read/write; default: null)
    Sets the string representation of a null value. The default is the string null. When this is set to null, no attributes or elements are written for fields with null values.

wildcardColName (scope: read; default: xs_any)
    Name of a column existing in the provided schema which is interpreted as a 'wildcard'. It must have type string or array of strings. It will match any XML child element that is not otherwise matched by the schema. The XML of the child becomes the string value of the column. If an array, then all unmatched elements will be returned as an array of strings. As its name implies, it is meant to emulate XSD's xs:any type.

compression (scope: write; default: none)
    Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate). XML built-in functions ignore this option.

validateName (scope: write; default: true)
    If true, throws an error on XML element name validation failure. For example, SQL field names can have spaces, but XML element names cannot.
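A hedged read sketch combining several of these options; the user-defined schema and the corrupt-record column name are illustrative, following the PERMISSIVE-mode note above:

from pyspark.sql.types import LongType, StringType, StructField, StructType

# Declaring a string field with the configured corrupt-record column name
# keeps malformed rows instead of dropping them (see the mode option above).
schema = StructType([
    StructField("name", StringType()),
    StructField("age", LongType()),
    StructField("_corrupt_record", StringType()),
])
people = (spark.read
    .schema(schema)
    .option("rowTag", "person")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .option("ignoreSurroundingSpaces", "true")
    .xml("examples/src/main/resources/people.xml"))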
Other generic options can be found in Generic File Source Options.