Applies to: Databricks SQL Databricks Runtime 16.1 and above
A collation is a set of rules that determines how string comparisons are performed. Collations are used to compare strings in a case-insensitive, accent-insensitive, or trailing space insensitive manner, or to sort strings in a specific language-aware order.
Strings in Databricks are represented as UTF-8 encoded Unicode characters. By default Databricks compares strings by their binary UTF8 representation. This is known as UTF8_BINARY
collation. UTF8_BINARY
comparisons are fast and appropriate in many cases, but may not be suitable for all applications, especially those that require language-aware sorting or comparisons.
Aside from language-aware comparisons, a common use case is enforcing case-insensitivity. Databricks has the UTF8_LCASE
collation specifically for this purpose. It converts strings to lowercase before comparing them using the fast UTF8_BINARY
collation.
For language-aware comparisons, Databricks employs the following technologies:
These technologies are encapsulated in a set of named collations that can be used in SQL statements.
Collation namesâBecause identifying collations by their LDML specification can be complex and challenging to read, Databricks has a set of easier-to-use named system collations.
Syntaxâ{ UTF8_BINARY |
UTF8_LCASE |
{ UNICODE | locale } [ _ modifier [...] ] }
locale
language_code [ _ script_code ] [ _ country_code ]
modifier
{ CS | CI | AS | AI | RTRIM }
UTF8_BINARY
A meta-locale binary collation that compares strings byte by byte based on the UTF-8 byte representation. UTF8_BINARY
is the default and most lightweight collation for string comparison in Databricks.
In this collation 'A' (x'65') < 'B' (x'66') < ⦠< 'Z' (x'90').
However, 'Z' (x'90') < 'a' (x'97'), and 'A' (x'65') <> 'a' (x'97').
Further, characters such as 'Ã' (x'C384') are greater than 'Z' and 'z' in this collation.
UTF8_LCASE
A lightweight meta-locale case-insensitive collation that compares strings using their UTF-8 byte representation after converting the strings to lower case.
UTF8_LCASE
is the collation used for identifiers in Databricks.
For example:
SQL
ORDER BY col COLLATE UTF8_LCASE
is equivalent to
SQL
ORDER BY LOWER(col) COLLATE UTF8_BINARY
UNICODE
The ICU root locale.
This collation, known in CLDR as the 'root' locale (LDML specification: 'und-u') imposes a language agnostic order, which tries to be intuitive overall. In this collation, like characters are grouped. For example: 'a' < 'A' < 'Ã' < 'b'. 'A' is not considered equivalent to 'a'. Therefore, the collation is case-sensitive. 'a' is not considered equivalent to 'ä'. Therefore, the collation is accent-sensitive.
locale
A locale-aware collation based on the CLDR tables.
The locale is specified as a language code, an optional script code, and an optional country code. locale
is case-insensitive.
modifier
Specifies the collation behavior regarding case sensitivity and accent sensitivity.
Applies to: Databricks SQL Databricks Runtime 16.2 and above
Applies to: Databricks SQL Databricks Runtime 16.2 and above
You can specify RTRIM
, either CS
or CI
, and either AS
or AI
at most once and in any order. The modifiers themselves are case-insensitive.
When processing a collation, Databricks normalizes collation names by removing defaults. For example, SR_CYR_SRN_CS_AS
is normalized to SR
.
For a list of supported collations, see Supported collations.
ExamplesâSQL
system.builtin.unicode
unicode
DE
fr_CAN
zh_Hant_MAC
de_CI_AI
`UTF8_BINARY`
Default collationâ
The default collation applies when using STRING
literals, parameter markers, functions without STRING
parameters producing strings, and when defining column, field or variable types without a COLLATE
clause.
The default collation is derived in one of the following ways:
For DDL statements such as ALTER TABLE
, CREATE VIEW
, CREATE TABLE
, and CREATE FUNCTION
:
DEFAULT COLLATION
clause is specified, the default collation is UTF8_BINARY
.For DML (UPDATE, DELETE, INSERT, MERGE), and query statements the default collation is UTF8_BINARY
.
To decide which collation to use for a given string Databricks defines collation precedence rules.
The rules assign 4 levels of precedence to collations:
Explicit
The collation has been explicitly assigned to a string using COLLATE expression.
Examples
SQL
vin COLLATE UTF8_BINARY LIKE 'ZFF%'
ORDER BY vorname COLLATE DE
Implicit
The collation is implicitly assigned by the column, field, column-alias, variable, or routine parameter reference. This includes the result of a subquery as long as the collation is not None.
Examples
SQL
employee.name LIKE 'Mc%'
translate(session.tempvar, 'Z', ',')
Default
A STRING
literal, named or unnamed parameter marker, or a STRING
produced by a function from another type.
Examples
SQL
'Hello'
EXECUTE IMMEDIATE 'SELECT :parm1' USING 'Hello' AS parm1;
EXECUTE IMMEDIATE 'SELECT ?' USING 'Hello';
CAST(5 AS STRING)
to_char(DATE'2016-04-08', 'y')
session_user()
The assigned collation is the Default Collation.
None
A STRING
result of a function, operator or set operation (e.g. UNION
) that takes more than one STRING
argument which have different implicit collations.
Examples
SQL
SELECT fr || de AS freutsch FROM VALUES('Voulez vous ' COLLATE FR), 'Kartoffelsupp...' COLLATE DE) AS T(fr, de)
SELECT 'Voulez vous ' COLLATE FR UNION ALL SELECT 'Kartoffelsupp...' COLLATE DE
When deriving the collation for a STRING
result, the collation precedence rules are applied in the following ways:
If the expression:
matches the definitions above
The collation and precedence is as defined.
is a function or operator with a single STRING
parameter, returning a STRING
The collation and precedence is that of the STRING
parameter.
is a function or operator with two or more STRING
parameters
with the same collations and precedence
The collation and precedence is that of the STRING
parameters.
with different collations or precedence
Let C1
and C2
be distinct collations and let D
be the default collation. The precedence and the collation is determined by the following table:
SQL
> SELECT 'hello' = 'hello ' COLLATE UNICODE_RTRIM;
true
> CREATE TABLE words(fr STRING COLLATE FR, de STRING COLLATE DE, en STRING COLLATE EN);
> INSERT INTO words VALUES ('Salut', 'Hallo', 'Hello');
> SELECT collation('Ciao');
UTF8_BINARY
> SELECT collation(user());
UTF8_BINARY
> SELECT collation(upper('Ciao'));
UTF8_BINARY
> SELECT collation(fr || 'Ciao') FROM words;
FR
> SELECT collation('Salut' COLLATE FR || de) FROM words;
FR
> SELECT collation(de || fr) FROM words;
null
> SELECT collation('Salut' COLLATE FR || 'Ciao');
FR
> SELECT collation('Salut' COLLATE FR || 'Hallo' COLLATE DE);
COLLATION_MISMATCH.EXPLICIT
> SELECT collation('Ciao' COLLATE IT || (fr || de)) FROM words;
IT
> SELECT collation(en || (fr || de)) FROM words;
null
> SELECT collation((fr || ltrim('H' COLLATE EN, fr)) || fr) FROM words;
EN
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4