RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://github.com/ClickHouse/ClickHouse/issues/47333 below:

Unexpected scan when using projections + order by on replacingmergetree · Issue #47333 · ClickHouse/ClickHouse · GitHub

Describe the unexpected behaviour
Clickhouse version: Clickhouse cloud (version 22.13)

We have the following test table setup:

CREATE TABLE test_table (
    `timestamp` DateTime64(3),
    `user_id` LowCardinality(String),
    PROJECTION by_user_id
    (
        SELECT *
        ORDER BY 
            `user_id`,
            `timestamp`
    )
)
ENGINE = ReplicatedReplacingMergeTree()
PARTITION BY toYYYYMM(timestamp)
PRIMARY KEY timestamp
ORDER BY (timestamp, user_id)
SETTINGS index_granularity = 1024;

We have created the additional projection to support filtering the events by user_id.

We populate the table with sample data

INSERT INTO test_table(timestamp, user_id)
SELECT 
	now() - randUniform(1, 5000000.)
	, 'user_a' AS user_id
FROM numbers(1000000)	
	;
	
INSERT INTO test_table(timestamp, user_id)
SELECT 
	now() - randUniform(1, 5000000.)
	, 'user_b' AS user_id
FROM numbers(1000000)	
	;

SELECT user_id, COUNT(), MIN(timestamp) AS timestamp_min, MAX(timestamp) AS timestamp_max  FROM test_table GROUP BY user_id;

┌─user_id─┬─count()─┬───────────timestamp_min─┬───────────timestamp_max─┐
│ user_b  │  906779 │ 2023-01-09 14:47:57.000 │ 2023-03-08 11:41:15.000 │
│ user_a  │  906420 │ 2023-01-09 14:47:56.000 │ 2023-03-08 11:41:11.000 │
└─────────┴─────────┴─────────────────────────┴─────────────────────────┘

Everything is fine if we run

SELECT * FROM test_table ORDER BY timestamp DESC LIMIT 10;

10 rows in set. Elapsed: 0.003 sec. Processed 2.77 thousand rows, 25.02 KB (1.10 million rows/s., 9.91 MB/s.)

But things get weird if we start writing queries that use the projection.

Much higher number of rows read with WHERE clause

SELECT * FROM test_table WHERE user_id='user_a' LIMIT 10;

10 rows in set. Elapsed: 0.009 sec. Processed 182.27 thousand rows, 1.64 MB (21.16 million rows/s., 190.44 MB/s.)

Shouldn't this still read ~2k rows since the projection is ordered by user_id, timestamp ?

"Full scan" when using WHERE clause with ORDER BY

SELECT * FROM test_table WHERE user_id='user_a' ORDER BY timestamp DESC LIMIT 10;

10 rows in set. Elapsed: 0.025 sec. Processed 906.77 thousand rows, 8.16 MB (36.33 million rows/s., 326.97 MB/s.)

Same as above, this should read only the first ~2k rows since it's exactly along the projection ORDER BY, instead it's reading all of the entries for user user_a.

Additional context

EXPLAINs for the above queries

EXPLAIN plan actions=1,indexes=1 SELECT * FROM test_table WHERE user_id='user_a' LIMIT 10;
┌─explain────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))                        │
│ Actions: INPUT :: 0 -> user_id LowCardinality(String) : 0          │
│          INPUT :: 1 -> timestamp DateTime64(3) : 1                 │
│ Positions: 1 0                                                     │
│   Limit (preliminary LIMIT (without OFFSET))                       │
│   Limit 10                                                         │
│   Offset 0                                                         │
│     ReadFromStorage (MergeTree(with Normal projection by_user_id)) │
└────────────────────────────────────────────────────────────────────┘

EXPLAIN plan actions=1,indexes=1 SELECT * FROM test_table WHERE user_id='user_a' ORDER BY timestamp DESC LIMIT 10;
┌─explain────────────────────────────────────────────────────────────────┐
│ Expression (Projection)                                                │
│ Actions: INPUT :: 0 -> user_id LowCardinality(String) : 0              │
│          INPUT :: 1 -> timestamp DateTime64(3) : 1                     │
│ Positions: 1 0                                                         │
│   Limit (preliminary LIMIT (without OFFSET))                           │
│   Limit 10                                                             │
│   Offset 0                                                             │
│     Sorting (Sorting for ORDER BY)                                     │
│     Sort description: timestamp DESC                                   │
│     Limit 10                                                           │
│       Expression (Before ORDER BY)                                     │
│       Actions: INPUT :: 0 -> user_id LowCardinality(String) : 0        │
│                INPUT :: 1 -> timestamp DateTime64(3) : 1               │
│       Positions: 0 1                                                   │
│         ReadFromStorage (MergeTree(with Normal projection by_user_id)) │
└────────────────────────────────────────────────────────────────────────┘

We are currently using materialized views to achieve the same goal but I was trying out if projection could be a fit for our use case (as it looked much more ergonomic in terms of maintenance and consistency), after going through https://clickhouse.com/blog/clickhouse-faster-queries-with-projections-and-primary-indexes

Could be linked to:

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4