APPLIES TO: Azure Cosmos DB for PostgreSQL (powered by the Citus database extension to PostgreSQL)
Choosing each table's distribution column is one of the most important modeling decisions you'll make. Azure Cosmos DB for PostgreSQL stores rows in shards based on the value of the rows' distribution column.
The correct choice groups related data together on the same physical nodes, which makes queries fast and adds support for all SQL features. An incorrect choice makes the system run slowly.
General tipsHere are four criteria for choosing the ideal distribution column for your distributed tables.
Pick a column that is a central piece in the application workload.
You might think of this column as the "heart," "central piece," or "natural dimension" for partitioning data.
Examples:
device_id
in an IoT workloadsecurity_id
for a financial app that tracks securitiesuser_id
in user analyticstenant_id
for a multi-tenant SaaS applicationPick a column with decent cardinality, and an even statistical distribution.
The column should have many values, and distribute thoroughly and evenly between all shards.
Examples:
Pick a column that benefits your existing queries.
For a transactional or operational workload (where most queries take only a few milliseconds), pick a column that appears as a filter in WHERE
clauses for at least 80% of queries. For instance, the device_id
column in SELECT * FROM events WHERE device_id=1
.
For an analytical workload (where most queries take 1-2 seconds), pick a column that enables queries to be parallelized across worker nodes. For instance, a column frequently occurring in GROUP BY clauses, or queried over multiple values at once.
Pick a column that is present in the majority of large tables.
Tables over 50 GB should be distributed. Picking the same distribution column for all of them enables you to co-locate data for that column on worker nodes. Co-location makes it efficient to run JOINs and rollups, and enforce foreign keys.
The other (smaller) tables can be local or reference tables. If the smaller table needs to JOIN with distributed tables, make it a reference table.
We've seen general criteria for picking the distribution column. Now let's see how they apply to common use cases.
Multi-tenant appsThe multi-tenant architecture uses a form of hierarchical database modeling to distribute queries across nodes in the cluster. The top of the data hierarchy is known as the tenant ID and needs to be stored in a column on each table.
Azure Cosmos DB for PostgreSQL inspects queries to see which tenant ID they involve and finds the matching table shard. It routes the query to a single worker node that contains the shard. Running a query with all relevant data placed on the same node is called colocation.
The following diagram illustrates colocation in the multi-tenant data model. It contains two tables, Accounts and Campaigns, each distributed by account_id
. The shaded boxes represent shards. Green shards are stored together on one worker node, and blue shards are stored on another worker node. Notice how a join query between Accounts and Campaigns has all the necessary data together on one node when both tables are restricted to the same account_id.
To apply this design in your own schema, identify what constitutes a tenant in your application. Common instances include company, account, organization, or customer. The column name will be something like company_id
or customer_id
. Examine each of your queries and ask yourself, would it work if it had more WHERE clauses to restrict all tables involved to rows with the same tenant ID? Queries in the multi-tenant model are scoped to a tenant. For instance, queries on sales or inventory are scoped within a certain store.
Read the multi-tenant tutorial for an example of how to build this kind of application.
Real-time appsThe multi-tenant architecture introduces a hierarchical structure and uses data colocation to route queries per tenant. By contrast, real-time architectures depend on specific distribution properties of their data to achieve highly parallel processing.
We use "entity ID" as a term for distribution columns in the real-time model. Typical entities are users, hosts, or devices.
Real-time queries typically ask for numeric aggregates grouped by date or category. Azure Cosmos DB for PostgreSQL sends these queries to each shard for partial results and assembles the final answer on the coordinator node. Queries run fastest when as many nodes contribute as possible, and when no single node must do a disproportionate amount of work.
Best practicesRead the real-time dashboard tutorial for an example of how to build this kind of application.
Time-series dataIn a time-series workload, applications query recent information while they archive old information.
The most common mistake in modeling time-series information in Azure Cosmos DB for PostgreSQL is to use the timestamp itself as a distribution column. A hash distribution based on time distributes times seemingly at random into different shards rather than keeping ranges of time together in shards. Queries that involve time generally reference ranges of time, for example, the most recent data. This type of hash distribution leads to network overhead.
Best practicesRetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4