A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool below:

Best practices for serverless SQL pool - Azure Synapse Analytics

In this article, you'll find a collection of best practices for using serverless SQL pool. Serverless SQL pool is a resource in Azure Synapse Analytics. If you're working with a dedicated SQL pool, see Best practices for dedicated SQL pools for specific guidance.

Serverless SQL pool allows you to query files in your Azure Storage accounts. It doesn't have local storage or ingestion capabilities. All files that the queries target are external to serverless SQL pool. Everything related to reading files from storage might affect query performance.

Some generic guidelines are:

Client applications and network connections

Make sure your client application is connected to the closest possible Azure Synapse workspace with the optimal connection.

Storage and content layout

Here are best practices for storage and content layout in serverless SQL pool.

Colocate your storage and serverless SQL pool

To minimize latency, colocate your Azure Storage account or Azure Cosmos DB analytic storage and your serverless SQL pool endpoint. Storage accounts and endpoints provisioned during workspace creation are located in the same region.

For optimal performance, if you access other storage accounts with serverless SQL pool, make sure they're in the same region. If they aren't in the same region, there will be increased latency for the data's network transfer between the remote region and the endpoint's region.

Colocate your Azure Cosmos DB analytical storage and serverless SQL pool

Make sure your Azure Cosmos DB analytical storage is placed in the same region as an Azure Synapse workspace. Cross-region queries might cause huge latencies. Use the region property in the connection string to explicitly specify the region where the analytical store is placed (see Query Azure Cosmos DB by using serverless SQL pool): account=<database account name>;database=<database name>;region=<region name>'

Azure Storage throttling

Multiple applications and services might access your storage account. Storage throttling occurs when the combined IOPS or throughput generated by applications, services, and serverless SQL pool workloads exceeds the limits of the storage account. As a result, you'll experience a significant negative effect on query performance.

When throttling is detected, serverless SQL pool has built-in handling to resolve it. Serverless SQL pool makes requests to storage at a slower pace until throttling is resolved.

Tip

For optimal query execution, don't stress the storage account with other workloads during query execution.

Prepare files for querying

If possible, you can prepare files for better performance:

CSV optimizations

Here are best practices for using CSV files in serverless SQL pool.

Use PARSER_VERSION 2.0 to query CSV files

You can use a performance-optimized parser when you query CSV files. For details, see PARSER_VERSION.

Manually create statistics for CSV files

Serverless SQL pool relies on statistics to generate optimal query execution plans. Statistics are automatically created for columns using sampling and in most cases sampling percentage will be less than 100%. This flow is the same for every file format. Have in mind that when reading CSV with parser version 1.0 sampling isn't supported and automatic creation of statistics won't happen with sampling percentage less than 100%. For small tables with estimated low cardinality (number of rows) automatic statistics creation will be triggered with sampling percentage of 100%. That means that fullscan is triggered and automatic statistics are created even for CSV with parser version 1.0. In case statistics aren't automatically created, create statistics manually for columns that you use in queries, particularly those used in DISTINCT, JOIN, WHERE, ORDER BY, and GROUP BY. Check statistics in serverless SQL pool for details.

Delta Lake optimizations

Here are best practices for using Delta Lake files in serverless SQL pool.

Optimize checkpoints

Query performance of Delta Lake format is influenced by the number of JSON files in the _delta_log directory. To ensure optimal performance, avoid accumulating too many JSON files. Ideally, the log should contain only the latest Parquet checkpoint file with no additional JSON files. However, this setup may not be optimal for write-heavy workloads.

A balanced approach is to maintain around 10 JSON files between checkpoints, which typically offers good performance for both readers and writers. Be cautious of configurations that delay checkpoint creation, as they can lead to excessive JSON file accumulation and degrade query performance.

Set the following table property to ensure a checkpoint is created after every 10 JSON log files:

ALTER TABLE tableName SET TBLPROPERTIES ('delta.checkpointInterval' = '10')
Data types

Here are best practices for using data types in serverless SQL pool.

Use appropriate data types

The data types you use in your query affect performance and concurrency. You can get better performance if you follow these guidelines:

Check inferred data types

Schema inference helps you quickly write queries and explore data without knowing file schemas. The cost of this convenience is that inferred data types might be larger than the actual data types. This discrepancy happens when there isn't enough information in the source files to make sure the appropriate data type is used. For example, Parquet files don't contain metadata about maximum character column length. So serverless SQL pool infers it as varchar(8000).

Have in mind that the situation can be different in case of the shareable managed and external Spark tables exposed in the SQL engine as external tables. Spark tables provide different data types than the Synapse SQL engines. Mapping between Spark table data types and SQL types can be found here.

You can use the system stored procedure sp_describe_first_results_set to check the resulting data types of your query.

The following example shows how you can optimize inferred data types. This procedure is used to show the inferred data types:

EXEC sp_describe_first_result_set N'
    SELECT
        vendor_id, pickup_datetime, passenger_count
    FROM  
        OPENROWSET(
            BULK ''https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*'',
            FORMAT=''PARQUET''
        ) AS nyc';

Here's the result set:

is_hidden column_ordinal name system_type_name max_length 0 1 vendor_id varchar(8000) 8000 0 2 pickup_datetime datetime2(7) 8 0 3 passenger_count int 4

After you know the inferred data types for the query, you can specify appropriate data types:

SELECT
    vendorID, tpepPickupDateTime, passengerCount
FROM  
    OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=2018/puMonth=*/*.snappy.parquet',
        FORMAT='PARQUET'
    )  
    WITH (
        vendorID varchar(4), -- we used length of 4 instead of the inferred 8000
        tpepPickupDateTime datetime2,
        passengerCount int
    ) AS nyc;
Filter optimization

Here are best practices for using queries in serverless SQL pool.

Push wildcards to lower levels in the path

You can use wildcards in your path to query multiple files and folders. Serverless SQL pool lists files in your storage account, starting from the first asterisk (*), by using the storage API. It eliminates files that don't match the specified path. Reducing the initial list of files can improve performance if there are many files that match the specified path up to the first wildcard.

Use filename and filepath functions to target specific partitions

Data is often organized in partitions. You can instruct serverless SQL pool to query particular folders and files. Doing so reduces the number of files and the amount of data the query needs to read and process. An added bonus is that you'll achieve better performance.

For more information, read about the filename and filepath functions and see the examples for querying specific files.

Tip

Always cast the results of the filepath and filename functions to appropriate data types. If you use character data types, be sure to use the appropriate length.

Functions used for partition elimination, filepath, and filename, aren't currently supported for external tables, other than those created automatically for each table created in Apache Spark for Azure Synapse Analytics.

If your stored data isn't partitioned, consider partitioning it. That way you can use these functions to optimize queries that target those files. When you query partitioned Apache Spark for Azure Synapse tables from serverless SQL pool, the query automatically targets only the necessary files.

Use proper collation to utilize predicate pushdown for character columns

Data in a Parquet file is organized in row groups. Serverless SQL pool skips row groups based on the specified predicate in the WHERE clause, which reduces IO. The result is increased query performance.

Predicate pushdown for character columns in Parquet files is supported for Latin1_General_100_BIN2_UTF8 collation only. You can specify collation for a particular column by using a WITH clause. If you don't specify this collation by using a WITH clause, the database collation is used.

Optimize repeating queries

Here are best practices for using CETAS in serverless SQL pool.

Use CETAS to enhance query performance and joins

CETAS is one of the most important features available in serverless SQL pool. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.

You can use CETAS to materialize frequently used parts of queries, like joined reference tables, to a new set of files. You can then join to this single external table instead of repeating common joins in multiple queries.

As CETAS generates Parquet files, statistics are automatically created when the first query targets this external table. The result is improved performance for subsequent queries targeting table generated with CETAS.

Query Azure data

Serverless SQL pools enable you to query data in Azure Storage or Azure Cosmos DB by using external tables and the OPENROWSET function. Make sure that you have proper permission set up on your storage.

Query CSV data

Learn how to query a single CSV file or folders and multiple CSV files. You can also query partitioned files

Query Parquet data

Learn how to query Parquet files with nested types. You can also query partitioned files.

Query Delta Lake

Learn how to query Delta Lake files with nested types.

Query Azure Cosmos DB data

Learn how to query Azure Cosmos DB analytical store. You can use an online generator to generate the WITH clause based on a sample Azure Cosmos DB document. You can create views on top of Azure Cosmos DB containers.

Query JSON data

Learn how to query JSON files. You can also query partitioned files.

Create views, tables, and other database objects

Learn how to create and use views and external tables or set up row-level security. If you have partitioned files, make sure you use partitioned views.

Copy and transform data (CETAS)

Learn how to store query results to storage by using the CETAS command.

Next steps

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4