RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://github.com/ClickHouse/ClickBench/issues/384 below:

[Discussion] How should Cassandra and ScyllaDB (API-compatible with Cassandra) be benchmarked? · Issue #384 · ClickHouse/ClickBench · GitHub

Disclaimer: I am not an expert in these two systems. I am just sharing my short analysis and thoughts. If my analysis contains errors, I would be glad if someone points them out!

The README lists Cassandra and ScyllaDB (API-compatible with Cassandra) in the Systems Included section, but there are no benchmarks for these systems available yet.

TL;DR

How can Cassandra and ScyllaDB, despite their limitations, be benchmarked on ClickBench?

Long version

I am wondering how Cassandra and ScyllaDB could be benchmarked on ClickBench in a way that the results are comparable to other systems?

While there are workarounds for some limitations, e.g., replacing the NOT EQUAL operator with NOT IN in predicates, and replacing arithmetic functions and the REGEX_REPLACE function by user-defined functions (Lua, WASM), I do think that many queries cannot be ran like intended.

Issue 1: There seems to be no support for

COUNT(DISTINCT <column>)

OFFSET <offset>

This affects queries Q5, Q6, Q9, Q10, Q11, Q12, Q14, Q23 and Q39, Q40, Q41, Q42, Q43, respectively.

Furthermore, there are also limitations on ORDER BY.

Issue 2: Queries containing an ORDER BY need to be restricted by an EQ or IN on the partition key.

Some queries use derived fields in the ORDER BY. Hence, these queries may not be runnable on these systems at all, or at least not in the intended way with on-the-fly calculation of the derived fields. This affects queries Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15, Q16, Q17, Q19, Q22, Q28, Q29, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43.

In general, Cassandra and ScyllaDB seem to heavily rely on the primary/partition/clustering key for data querying. Data modeling should be (has to be? 🤔) done according to the expected query patterns. In case of heterogeneous query patterns, tables need to be duplicated and stored with different primary/partition/clustering keys.

In addition to the restrictions on ORDER BY, there are also restrictions on GROUP BY.

Issue 3: GROUP BY arguments must form a prefix of the primary key.

For example, Q17 and Q18 (GROUP BY UserID, SearchPhrase) could be ran on the same table by setting the primary key accordingly (at least in theory, because the combination of UserID, SearchPhrase might not be unique). However, then it is impossible to run Q9 and Q10 (GROUP BY RegionID), Q11 (GROUP BY MobilePhoneModel), Q12 (GROUP BY MobilePhone, MobilePhoneModel), etc. on the same table.

Any thoughts on this? Am I missing something?

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4