postgresql - Efficient key value store in Postgres

I want to use Postgres (11 on Cloud SQL) as an efficient key-value store. I have about 200 GB of dictionaries (average size 10 kB; the structures differ and can be nested). I'm thinking about leveraging the improved hash indexes. Here is the schema:

        CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

        CREATE TABLE IF NOT EXISTS key_val (
            id uuid DEFAULT uuid_generate_v4(),
            value jsonb,
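            -- PostgreSQL has no UNIQUE hash indexes, so the exclusion
            -- constraint below emulates a unique key via hash equality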
            EXCLUDE using hash (id with =)
        );

        CREATE INDEX IF NOT EXISTS idx_key_val ON key_val USING hash (id);
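        -- note: the EXCLUDE constraint above already creates its own hash
        -- index, so this second hash index on id is redundant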

Get, update, and insert are quite simple, but I'm not sure how to implement an efficient upsert.

        INSERT INTO key_val VALUES ($1, $2)
        ON CONFLICT ON CONSTRAINT key_val_id_excl DO UPDATE SET value = ($2)

fails with WrongObjectTypeError: ON CONFLICT DO UPDATE not supported with exclusion constraints.

Possible solutions:

  1. Upsert via a retry loop (https://stackoverflow.com/a/8702291/336186); see the sketch after this list.
  2. Use hstore, but I'd lose the jsonb query improvements, and a single dictionary can exceed 20 kB.
  3. Use a btree index: lookups would be slower (my estimate is 2x-3x) and the index would be much larger.
  4. Another option
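
For option 1, here is a minimal sketch of the retry-loop upsert from the linked answer, adapted to this schema (the function name upsert_key_val is illustrative). The handler must catch exclusion_violation rather than the usual unique_violation, because uniqueness here is enforced by an exclusion constraint:

    CREATE OR REPLACE FUNCTION upsert_key_val(p_id uuid, p_value jsonb)
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        LOOP
            -- first try to UPDATE an existing row
            UPDATE key_val SET value = p_value WHERE id = p_id;
            IF found THEN
                RETURN;
            END IF;
            -- no row yet: try to INSERT; if a concurrent transaction
            -- inserts the same id first, retry the UPDATE
            BEGIN
                INSERT INTO key_val (id, value) VALUES (p_id, p_value);
                RETURN;
            EXCEPTION WHEN exclusion_violation THEN
                -- do nothing, loop back to the UPDATE
            END;
        END LOOP;
    END;
    $$;

The application would then call SELECT upsert_key_val($1, $2) instead of the failing INSERT ... ON CONFLICT.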

EDIT

Info: Mac, 2.3 GHz Intel Core i9, 16 GB RAM; PostgreSQL 11.4

\d+
public | user_profiles | table | postgres | 16 GB
# number of records
SELECT COUNT(*) FROM user_profiles;  -- 3 095 348
# hash index size in bytes (~83 MB)
SELECT pg_table_size('idx_user_profiles');
87 334 912
# btree index size in bytes (~93 MB)
SELECT pg_table_size('idx_user_profiles_btree');
97 705 984

For the btree index:

postgres=# \d user_profiles
                Table "public.user_profiles"
 Column | Type  | Collation | Nullable |      Default
--------+-------+-----------+----------+--------------------
 key    | uuid  |           | not null | uuid_generate_v4()
 value  | jsonb |           |          |
Indexes:
    "user_profiles_pkey" PRIMARY KEY, btree (key)


postgres=# SELECT * FROM user_profiles WHERE key = '2cfc4dbf-a1b9-46b3-8c15-a03f51dde890';
Time: 3.126 ms

INSERT INTO user_profiles (value) VALUES ('{"type": "_app_retail","user_id": "a628.......    0  ]}');
INSERT 0 1
Time: 4.496 ms # with standard btree index, no optimization
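
With the btree primary key in place, the native upsert that failed against the exclusion constraint works directly (the literal values here are illustrative):

    INSERT INTO user_profiles (key, value)
    VALUES ('2cfc4dbf-a1b9-46b3-8c15-a03f51dde890', '{"type": "_app_retail"}')
    ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;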

For the hash index:

\d+ user_profiles
                                    Table "public.user_profiles"
 Column | Type  | Collation | Nullable |      Default       | Storage  | Stats target | Description
--------+-------+-----------+----------+--------------------+----------+--------------+-------------
 key    | uuid  |           | not null | uuid_generate_v4() | plain    |              |
 value  | jsonb |           |          |                    | extended |              |
Indexes:
    "idx_user_profiles" hash (key)


INSERT INTO user_profiles...);
INSERT 0 1
Time: 1.690 ms
# key does not exist
SELECT * FROM user_profiles WHERE key = '2cfc4dbf-a1b9-46b3-8c15-a03f51dde891'; 
Time: 0.514 ms
# key exists
postgres=# SELECT * FROM user_profiles WHERE key = '2cfc4dbf-a1b9-46b3-8c15-a03f51dde890';
Time: 1.747 ms
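
To see where the lookup time goes, EXPLAIN (ANALYZE, BUFFERS) on the same query reports how many shared-buffer pages each index variant touches; the "shared hit" figures in the article quoted at the end are obtained this way:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM user_profiles
    WHERE key = '2cfc4dbf-a1b9-46b3-8c15-a03f51dde890';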

To confirm this, I wrote a Python script using asyncpg.

import asyncio
import uuid
from time import time
import os

import asyncpg
import json  # stdlib json is sufficient here (the original used flask's wrapper)
# pip(env) install asyncpg


TABLE_NAME = "user_profiles"
EXAMPLE_PROFILE = {
    "type": "whatever",
    "user_id": "8378c54f-3a39-41af-b4ab-5a514aa1b941",
    "same_action_count": 8,
    "active_time": 156.36,
    "is_premium": 0.0,
    "early_premium": 0.0,
    "referral_type": "empty",
    "age": 200,
    "age_is_minor": 0,
    "age_is_student": 0,
    "age_is_young_adult": 0,
    "age_is_mid": 1,
    "age_is_senior": 0,
    "integral_balance_w": 0.0,
    "average_balance": 0.0,
    "integral_balance": 0.0,
    "sum_total_in": 0.0,
    "third_party_company": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

async def create_connection():
    conn = await asyncpg.connect(
        user="postgres",
        password=os.environ.get("SECRET_DB_PASSWORD", "postgres"),
        database="postgres",
        host=os.environ.get("DB_HOST", "127.0.0.1"),
    )
    # register text codecs for the json type; the jsonb values below are
    # passed as pre-dumped strings, which asyncpg accepts natively anyway
    await conn.set_type_codec("json", encoder=json.dumps, decoder=json.loads, schema="pg_catalog")
    return conn


async def create_table(conn, table_name):
    await conn.execute(
        f"""
        CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

        CREATE TABLE IF NOT EXISTS {table_name} (
            id uuid DEFAULT uuid_generate_v4(),
            value jsonb,
            EXCLUDE using hash (id with =)
        );

        CREATE INDEX IF NOT EXISTS idx_{table_name} ON {table_name} USING hash (id);
    """
    )


async def run():
    conn = await create_connection()
    await create_table(conn, TABLE_NAME)
    profile = EXAMPLE_PROFILE.copy()
    for i in range(100):
        profile["user_id"] = str(uuid.uuid4())
        hash_id = str(uuid.uuid4())
        str_profile = json.dumps(profile)
        tic = time()
        _ = await conn.execute(f"INSERT INTO {TABLE_NAME} VALUES ($1, $2)", hash_id, str_profile)
        toc = time()
        print(i, (toc - tic) * 1000)

    await conn.close()


asyncio.run(run())

Inserts average about 1.2 ms with the hash index, versus roughly 3 ms with btree.

From http://amitkapila16.blogspot.com/2017/03/hash-indexes-are-faster-than-btree.html:

"We can see here that the hash index performs better than the btree index and the performance difference is in the range of 10 to 22%. In some other workloads we have seen a better performance like with hash index on varchar columns and even in the community, it has been reported that there is performance improvement in the range of 40-60% when hash indexes are used for unique index columns."

From https://medium.com/@jorsol/postgresql-10-features-hash-indexes-484f319db281:

"Here is where hash indexes start to shine: a simple table with 2 columns, a serial and a text column, and 319,894,674 records; the table size is 23 GB, the serial column is the primary key (no good reason, just added to it) and the size of the PK is 6852 MB. A query without an index on the sha1 column takes 4 minutes (thanks to the parallel workers). The size of the B-Tree index: 20 GB. The size of the Hash index: 8192 MB (8 GB), less than half the B-Tree :-) and the execution time is roughly the same as B-Tree. Another advantage of smaller indexes is that they fit better in memory, meaning fewer disk reads: 'Buffers: shared hit=2' vs 'Buffers: shared hit=6'."

