chDB is an in-process SQL OLAP engine powered by ClickHouse[^1]. For more details, see The birth of chDB.
Get started with chDB using our Installation and Usage Examples.
Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).
```bash
python3 -m chdb SQL [OutputFormat]
python3 -m chdb "SELECT 1,'abc'" Pretty
```
The following methods are available to access on-disk and in-memory data formats:
🗂️ Connection-based API (recommended)

```python
import chdb

# Create a connection (in-memory by default)
conn = chdb.connect(":memory:")
# Or use file-based: conn = chdb.connect("test.db")

# Create a cursor
cur = conn.cursor()

# Execute queries
cur.execute("SELECT number, toString(number) as str FROM system.numbers LIMIT 3")

# Fetch data in different ways
print(cur.fetchone())    # Single row: (0, '0')
print(cur.fetchmany(2))  # Multiple rows: ((1, '1'), (2, '2'))

# Get column information
print(cur.column_names())  # ['number', 'str']
print(cur.column_types())  # ['UInt64', 'String']

# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
    print(row)

# Always close resources when done
cur.close()
conn.close()
```
For more details, see examples/connect.py.
🗂️ Query on file (Parquet, CSV, JSON, Arrow, ORC and 60+ formats)

You can execute SQL and return data in the desired format.
```python
import chdb

res = chdb.query('select version()', 'Pretty'); print(res)

# See more data type formats in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV'); print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, "
      f"storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, "
      f"elapsed {res.elapsed()} seconds")

# See more formats in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')
```

🗂️ Query on table (Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame
```python
import chdb.dataframe as cdf
import pandas as pd

# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                    tbl1=df1, tbl2=df2)
print(ret_tbl)

# Query on the result DataFrame table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

# Pandas DataFrames are automatically registered as temporary tables in ClickHouse
chdb.query("SELECT * FROM Python(df1) t1 JOIN Python(df2) t2 ON t1.a = t2.c").show()
```

🗂️ Query with Stateful Session
```python
from chdb import session as chs

# Create DB, table, and view in a temp session; auto cleanup when the session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query(
    "CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;"
)
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))
```
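Sessions can also be file-backed so that data persists across runs. Below is a minimal sketch, assuming `Session` accepts a directory path and that `close()` releases the session; the path `./chdb_data` is illustrative:

```python
from chdb import session as chs

# A file-backed session keeps its databases between runs
sess = chs.Session("./chdb_data")
sess.query("CREATE DATABASE IF NOT EXISTS db_keep ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_keep.t (x Int32) ENGINE = MergeTree ORDER BY x")
sess.query("INSERT INTO db_keep.t VALUES (1), (2)")
sess.close()

# Reopen later: the data is still there
sess2 = chs.Session("./chdb_data")
print(sess2.query("SELECT count() FROM db_keep.t", "CSV"))
sess2.close()
```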
see also: test_stateful.py.
🗂️ Query with Python DB-API 2.0

```python
import chdb.dbapi as dbapi

print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()
```

🗂️ Query with UDF (User Defined Functions)
```python
from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12, 22)"))
```
Some notes on the chDB Python UDF (User Defined Function) decorator:

1. The function should be stateless; only UDFs are supported, not UDAFs (User Defined Aggregation Functions).
2. The default return type is String. To change it, pass the return type as an argument to the decorator.
3. The function takes arguments of type String: as the input is TabSeparated, all arguments are strings.
4. The function is called for each line of input, roughly like this:

```python
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

for line in sys.stdin:
    args = line.strip().split('\t')
    lhs = args[0]
    rhs = args[1]
    print(sum_udf(lhs, rhs))
    sys.stdout.flush()
```

5. The function must be a pure Python function. Import all Python modules used inside the function itself:

```python
def func_use_json(arg):
    import json
    ...
```

6. The Python interpreter used is the same one that runs the script; get it from `sys.executable`.
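For example, building on note 2, a UDF that declares a non-String return type might look like the following sketch; the keyword name `return_type` is an assumption here, not confirmed by this document:

```python
from chdb.udf import chdb_udf
from chdb import query

# Assumed keyword: return_type (see the note on return types above)
@chdb_udf(return_type="Int32")
def mul_udf(lhs, rhs):
    return int(lhs) * int(rhs)

print(query("select mul_udf(6, 7)"))
```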
see also: test_udf.py.
🗂️ Streaming Query

Process large datasets with constant memory usage through chunked streaming.
```python
from chdb import session as chs

sess = chs.Session()

# Example 1: Basic streaming query
rows_cnt = 0
with sess.send_query("SELECT * FROM numbers(200000)", "CSV") as stream_result:
    for chunk in stream_result:
        rows_cnt += chunk.rows_read()
print(rows_cnt)  # 200000

# Example 2: Manual iteration with fetch()
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
    chunk = stream_result.fetch()
    if chunk is None:
        break
    rows_cnt += chunk.rows_read()
print(rows_cnt)  # 200000

# Example 3: Early cancellation demo
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
    chunk = stream_result.fetch()
    if chunk is None:
        break
    if rows_cnt > 0:
        stream_result.close()
        break
    rows_cnt += chunk.rows_read()
print(rows_cnt)  # 65409

# Example 4: Using PyArrow RecordBatchReader for batch export
# and integration with other libraries
import pyarrow as pa
from deltalake import write_deltalake

# Get the streaming result in Arrow format
stream_result = sess.send_query("SELECT * FROM numbers(100000)", "Arrow")

# Create a RecordBatchReader with a custom batch size (default rows_per_batch=1000000)
batch_reader = stream_result.record_batch(rows_per_batch=10000)

# Use the RecordBatchReader with external libraries like Delta Lake
write_deltalake(
    table_or_uri="./my_delta_table",
    data=batch_reader,
    mode="overwrite"
)
stream_result.close()
sess.close()
```
Important note: When using streaming queries, if the `StreamingResult` is not fully consumed (due to errors or early termination), you must explicitly call `stream_result.close()` to release resources, or use the `with` statement for automatic cleanup. Failure to do so may block subsequent queries.
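A minimal sketch of the safe pattern, reusing the `with`-based form from Example 1 above:

```python
from chdb import session as chs

sess = chs.Session()

# The with statement closes the stream even on early exit or error,
# so subsequent queries on the session are not blocked.
with sess.send_query("SELECT * FROM numbers(1000000)", "CSV") as stream_result:
    for chunk in stream_result:
        break  # stop early; cleanup still happens when the block exits

sess.close()
```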
For more details, see test_streaming_query.py and test_arrow_record_reader_deltalake.py.
Query on Pandas DataFrame

```python
import chdb
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}},
            {'id': 2, 'tags': ['normal'], 'metadata': {'created': '2024-02-01'}},
            {'id': 3, 'name': 'tom'},
            {'id': 4, 'value': '100'},
            {'id': 5, 'value': 101},
            {'id': 6, 'value': 102},
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(df) WHERE dict_col.value='100'").show()
```
Query on Arrow Table

```python
import chdb
import pyarrow as pa

arrow_table = pa.table(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'value': 'tom'},
            {'id': 2, 'value': 'jerry'},
            {'id': 3, 'value': 'auxten'},
            {'id': 4, 'value': 'tom'},
            {'id': 5, 'value': 'jerry'},
            {'id': 6, 'value': 'auxten'},
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(arrow_table) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(arrow_table) WHERE dict_col.value='tom'").show()
```

Query on chdb.PyReader class instance
You must inherit from the `chdb.PyReader` class and implement the `read` method. The `read` method should:

1. Return a list of lists: the first dimension is the column, the second is the row, and the column order should match the first argument `col_names` of `read`.
2. Return an empty list when there is no more data to read.
3. Be stateful; the cursor should be updated inside the `read` method.

An optional `get_schema` method can be implemented to return the schema of the table. The prototype is `def get_schema(self) -> List[Tuple[str, str]]:`; the return value is a list of tuples, each containing the column name and the column type. The column type should be one of the ClickHouse data types: https://clickhouse.com/docs/en/sql-reference/data-types

```python
import chdb

class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
        print("Python func read", col_names, count, self.cursor)
        if self.cursor >= len(self.data["a"]):
            self.cursor = 0
            return []
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block

    def get_schema(self):
        return [
            ("a", "int"),
            ("b", "str"),
            ("dict_col", "json")
        ]

reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}},
            {'id': 2, 'tags': ['normal'], 'metadata': {'created': '2024-02-01'}},
            {'id': 3, 'name': 'tom'},
            {'id': 4, 'value': '100'},
            {'id': 5, 'value': 101},
            {'id': 6, 'value': 102}
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(reader) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(reader) WHERE dict_col.value='100'").show()
```
see also: test_query_py.py and test_query_json.py.
chDB automatically converts Python dictionary objects to ClickHouse JSON types from these sources:
- Pandas DataFrame
  - Columns with `object` dtype are sampled (default 10,000 rows) to detect JSON structures. Control the sampling via settings (see the sketch after this list):

    ```sql
    SET pandas_analyze_sample = 10000 -- Default sampling
    SET pandas_analyze_sample = 0     -- Force String type
    SET pandas_analyze_sample = -1    -- Force JSON type
    ```

  - Columns fall back to `String` if sampling finds non-dictionary values.
- Arrow Table
  - `struct` type columns are automatically mapped to JSON columns.
- chdb.PyReader
  - Implement custom schema mapping in `get_schema()`:

    ```python
    def get_schema(self):
        return [
            ("c1", "JSON"),   # Explicit JSON mapping
            ("c2", "String")
        ]
    ```
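As referenced in the Pandas item above, here is a sketch of forcing the `String` type for a dict column on a single query; it assumes the standard ClickHouse `SETTINGS` clause applies to `pandas_analyze_sample`:

```python
import chdb
import pandas as pd

df = pd.DataFrame({"c": [{"k": 1}, {"k": 2}]})

# pandas_analyze_sample = 0 forces object-dtype columns to String instead of JSON
print(chdb.query(
    "SELECT * FROM Python(df) SETTINGS pandas_analyze_sample = 0", "CSV"
))
```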
When converting Python dictionary objects to JSON columns:
- Nested Structures: nested dictionaries are recursively converted to nested JSON objects.
- Primitive Types: integers, floats, strings, and booleans are converted to the corresponding JSON value types.
- Complex Objects: non-dictionary values that are not primitives are converted to strings.
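A small illustration of these rules; the `doc` column and its values are made up for this sketch, and the dotted-path access follows the `dict_col.id` examples above:

```python
import chdb
import pandas as pd

df = pd.DataFrame({
    "doc": [
        # Nested structure: recursively converted to a nested JSON object
        {"user": {"id": 1, "tags": ["a", "b"]}},
        # Primitive types: mapped to the matching JSON value types
        {"count": 42, "ratio": 0.5, "ok": True},
    ]
})

# Dotted paths address nested JSON fields
chdb.query("SELECT doc.user.id, doc.count FROM Python(df)").show()
```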
For more examples, see examples and tests.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
We welcome bindings for other languages; please refer to bindings for more details.
Please refer to VERSION-GUIDE.md for more details.
Apache 2.0, see LICENSE for more information.
chDB is mainly based on ClickHouse[^1]. For trademark and other reasons, I named it chDB.
[^1]: ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.