Yellowbrick is an elastic, massively parallel processing (MPP) SQL database that runs in the cloud and on-premises, using Kubernetes for scale, resilience, and cloud portability. Yellowbrick is designed to address the largest and most complex business-critical data warehousing use cases. Its efficiency at scale also enables it to be used as a high-performance, scalable vector database to store and search vectors with SQL.
Using Yellowbrick as the vector store for ChatGPT

This tutorial demonstrates how to create a simple chatbot backed by ChatGPT that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). What you'll need: a Yellowbrick Sandbox account and an OpenAI API key.
The tutorial is divided into six parts. First, we'll use LangChain to create a baseline chatbot that interacts with ChatGPT without a vector store. Second, we'll create an embeddings table in Yellowbrick that will serve as the vector store. Third, we'll load a series of documents (the Administration chapter of the Yellowbrick Manual). Fourth, we'll create the vector representations of those documents and store them in a Yellowbrick table. Fifth, we'll send the same queries to the improved chatbot to see the results. Finally, we'll add an LSH index to the vector store to speed up similarity search.
%pip install --upgrade --quiet langchain
%pip install --upgrade --quiet langchain-openai langchain-community
%pip install --upgrade --quiet psycopg2-binary
%pip install --upgrade --quiet tiktoken
Setup: Enter the information used to connect to Yellowbrick and the OpenAI API
Our chatbot integrates with ChatGPT via the LangChain library, so you'll need an OpenAI API key first. You can create one in the API keys section of your OpenAI account.
You'll also need your username, password, and database name from the welcome email you received when you signed up for the Yellowbrick Sandbox account.
The following should be modified to include the information for your Yellowbrick database and your OpenAI API key:
YBUSER = "[SANDBOX USER]"
YBPASSWORD = "[SANDBOX PASSWORD]"
YBDATABASE = "[SANDBOX_DATABASE]"
YBHOST = "trialsandbox.sandbox.aws.yellowbrickcloud.com"
OPENAI_API_KEY = "[OPENAI API KEY]"
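If you prefer not to hard-code secrets in a notebook, you can prompt for them at runtime instead. This is an optional sketch using Python's getpass (which the next cell also imports); run it before the cell that sets the OPENAI_API_KEY environment variable.
from getpass import getpass

# Optional: overwrite the placeholder values above by prompting for secrets at runtime.
YBPASSWORD = getpass("Yellowbrick password: ")
OPENAI_API_KEY = getpass("OpenAI API key: ")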
import os
import pathlib
import re
import sys
import urllib.parse as urlparse
from getpass import getpass
import psycopg2
from IPython.display import Markdown, display
from langchain.chains import LLMChain, RetrievalQAWithSourcesChain
from langchain_community.vectorstores import Yellowbrick
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
yellowbrick_connection_string = (
f"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YBDATABASE}"
)
YB_DOC_DATABASE = "sample_data"
YB_DOC_TABLE = "yellowbrick_documentation"
embedding_table = "my_embeddings"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from langchain_core.prompts.chat import (
ChatPromptTemplate,
HumanMessagePromptTemplate,
SystemMessagePromptTemplate,
)
Part 1: Creating a baseline chatbot backed by ChatGPT without a Vector Store
We will use LangChain to query ChatGPT. As there is no vector store, ChatGPT will have no context in which to answer the question.
system_template = """If you don't know the answer, make up your best guess."""
messages = [
SystemMessagePromptTemplate.from_template(system_template),
HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0,
max_tokens=256,
)
chain = LLMChain(
llm=llm,
prompt=prompt,
verbose=False,
)
def print_result_simple(query):
result = chain(query)
output_text = f"""### Question:
{query}
### Answer:
{result["text"]}
"""
display(Markdown(output_text))
print_result_simple("How many databases can be in a Yellowbrick Instance?")
print_result_simple("What's an easy way to add users in bulk to Yellowbrick?")
Part 2: Connect to Yellowbrick and create the embeddings table
To load your document embeddings into Yellowbrick, you need to create a table to store them in. Note that the Yellowbrick database the table lives in must be UTF-8 encoded.
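If you're not sure which encoding your database uses, you can check before creating the table. This sketch assumes Yellowbrick exposes the PostgreSQL-compatible pg_database catalog and pg_encoding_to_char() function; if your environment doesn't support them, check the encoding from the Yellowbrick console instead.
import psycopg2

# Report the encoding of the database named in the connection string
# (assumes a PostgreSQL-compatible catalog).
conn = psycopg2.connect(yellowbrick_connection_string)
cursor = conn.cursor()
cursor.execute(
    "SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database()"
)
print("Database encoding:", cursor.fetchone()[0])
cursor.close()
conn.close()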
Create a table in a UTF-8 database with the following schema, providing a table name of your choice:
try:
conn = psycopg2.connect(yellowbrick_connection_string)
except psycopg2.Error as e:
print(f"Error connecting to the database: {e}")
exit(1)
cursor = conn.cursor()
create_table_query = f"""
CREATE TABLE IF NOT EXISTS {embedding_table} (
doc_id uuid NOT NULL,
embedding_id smallint NOT NULL,
embedding double precision NOT NULL
)
DISTRIBUTE ON (doc_id);
truncate table {embedding_table};
"""
try:
cursor.execute(create_table_query)
print(f"Table '{embedding_table}' created successfully!")
except psycopg2.Error as e:
print(f"Error creating table: {e}")
conn.rollback()
conn.commit()
cursor.close()
conn.close()
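As a quick sanity check before moving on, you can confirm that the new (still empty) embeddings table is queryable:
# Sanity check: the freshly created embeddings table should exist and contain no rows yet.
conn = psycopg2.connect(yellowbrick_connection_string)
cursor = conn.cursor()
cursor.execute(f"SELECT COUNT(*) FROM {embedding_table}")
print(f"Rows currently in '{embedding_table}':", cursor.fetchone()[0])
cursor.close()
conn.close()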
Part 3: Extract document paths and contents from an existing Yellowbrick table

We'll use these documents to create embeddings in the next step.
yellowbrick_doc_connection_string = (
f"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}"
)
print(yellowbrick_doc_connection_string)
conn = psycopg2.connect(yellowbrick_doc_connection_string)
cursor = conn.cursor()
query = f"SELECT path, document FROM {YB_DOC_TABLE}"
cursor.execute(query)
yellowbrick_documents = cursor.fetchall()
print(f"Extracted {len(yellowbrick_documents)} documents successfully!")
cursor.close()
conn.close()
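It can help to peek at what was extracted before creating embeddings; each row is a (path, document) pair:
# Preview the first extracted document: its path and the start of its contents.
sample_path, sample_document = yellowbrick_documents[0]
print("Path:", sample_path)
print("First 500 characters of the document:")
print(sample_document[:500])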
Part 4: Load the Yellowbrick Vector Store with Documents
Go through the documents, split them into digestible chunks, create the embeddings, and insert them into the Yellowbrick table. This takes around 5 minutes.
DOCUMENT_BASE_URL = "https://docs.yellowbrick.com/6.7.1/"
separator = "\n## "
chunk_size_limit = 2000
max_chunk_overlap = 200
documents = [
Document(
page_content=document[1],
metadata={"source": DOCUMENT_BASE_URL + document[0].replace(".md", ".html")},
)
for document in yellowbrick_documents
]
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size_limit,
chunk_overlap=max_chunk_overlap,
separators=[separator, "\n\n", "\n", ",", " ", ""],
)
split_docs = text_splitter.split_documents(documents)
docs_text = [doc.page_content for doc in split_docs]
embeddings = OpenAIEmbeddings()
vector_store = Yellowbrick.from_documents(
documents=split_docs,
embedding=embeddings,
connection_string=yellowbrick_connection_string,
table=embedding_table,
)
print(f"Created vector store with {len(documents)} documents")
Part 5: Creating a chatbot that uses Yellowbrick as the vector store
Next, we add Yellowbrick as a vector store. The vector store has been populated with embeddings representing the administrative chapter of the Yellowbrick product documentation.
We'll send the same queries as above to see the improved responses.
system_template = """Use the following pieces of context to answer the user's question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
SystemMessagePromptTemplate.from_template(system_template),
HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
vector_store = Yellowbrick(
OpenAIEmbeddings(),
yellowbrick_connection_string,
embedding_table,
)
chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0,
max_tokens=256,
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
return_source_documents=True,
chain_type_kwargs=chain_type_kwargs,
)
def print_result_sources(query):
result = chain(query)
output_text = f"""### Question:
{query}
### Answer:
{result["answer"]}
### Sources:
{result["sources"]}
### All relevant sources:
{", ".join(list(set([doc.metadata["source"] for doc in result["source_documents"]])))}
"""
display(Markdown(output_text))
print_result_sources("How many databases can be in a Yellowbrick Instance?")
print_result_sources("Whats an easy way to add users in bulk to Yellowbrick?")
Part 6: Introducing an Index to Increase Performance
Yellowbrick also supports indexing using the Locality-Sensitive Hashing (LSH) approach. This is an approximate nearest-neighbor search technique that lets you trade off accuracy for faster similarity search. The index introduces two new tunable parameters:

Number of hyperplanes: the num_hyperplanes value supplied when the LSH index is created (see the code below). The more documents you have, the more hyperplanes are needed. LSH is a form of dimensionality reduction: the original embeddings are transformed into lower-dimensional binary vectors whose number of components equals the number of hyperplanes (illustrated in the toy sketch below).

Hamming distance: an integer controlling the breadth of the approximate nearest-neighbor search over the hashed vectors; a larger Hamming distance increases recall at the cost of search speed.

Here's how you can create an index on the embeddings we loaded into Yellowbrick. We'll also re-run the previous chat session, but this time the retrieval will use the index. Note that for such a small number of documents, you won't see the benefit of indexing in terms of performance.
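Before the full example, here is a toy illustration of the idea (not Yellowbrick's internal implementation; it assumes numpy is available): random hyperplanes reduce an embedding to a short binary signature, and Hamming distance counts the bits on which two signatures differ.
import numpy as np

# Toy LSH sketch: each random hyperplane contributes one bit, so an embedding of any
# dimensionality becomes a num_hyperplanes-bit signature.
rng = np.random.default_rng(0)
num_hyperplanes = 8
embedding_dim = 1536  # OpenAI's default embedding model produces 1536-dimensional vectors


def lsh_signature(vector, hyperplanes):
    # The sign of each dot product records which side of the hyperplane the vector falls on.
    return (hyperplanes @ vector > 0).astype(int)


hyperplanes = rng.normal(size=(num_hyperplanes, embedding_dim))
a = rng.normal(size=embedding_dim)
b = a + rng.normal(scale=0.1, size=embedding_dim)  # a slightly perturbed neighbor of a

sig_a = lsh_signature(a, hyperplanes)
sig_b = lsh_signature(b, hyperplanes)
print("Signature A:", sig_a)
print("Signature B:", sig_b)
print("Hamming distance:", int(np.sum(sig_a != sig_b)))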
system_template = """Use the following pieces of context to answer the user's question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
SystemMessagePromptTemplate.from_template(system_template),
HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
vector_store = Yellowbrick(
OpenAIEmbeddings(),
yellowbrick_connection_string,
embedding_table,
)
lsh_params = Yellowbrick.IndexParams(
Yellowbrick.IndexType.LSH, {"num_hyperplanes": 8, "hamming_distance": 2}
)
vector_store.create_index(lsh_params)
chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0,
max_tokens=256,
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vector_store.as_retriever(
k=5, search_kwargs={"index_params": lsh_params}
),
return_source_documents=True,
chain_type_kwargs=chain_type_kwargs,
)
def print_result_sources(query):
result = chain(query)
output_text = f"""### Question:
{query}
### Answer:
{result["answer"]}
### Sources:
{result["sources"]}
### All relevant sources:
{", ".join(list(set([doc.metadata["source"] for doc in result["source_documents"]])))}
"""
display(Markdown(output_text))
print_result_sources("How many databases can be in a Yellowbrick Instance?")
print_result_sources("Whats an easy way to add users in bulk to Yellowbrick?")
Next Steps:
This code can be modified to ask different questions. You can also load your own documents into the vector store. The LangChain module is very flexible and can parse a large variety of files (including HTML, PDF, etc.).
You can also modify this to use Hugging Face embedding models and Meta's Llama 2 LLM for a completely private chatbot experience.
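For example, swapping the embedding model is a small change when loading the vector store. The sketch below assumes the langchain-huggingface and sentence-transformers packages are installed; the model name and the my_hf_embeddings table name are illustrative, and a different embedding model produces vectors of a different dimensionality, so load them into a fresh table.
# Sketch: embed the same chunks with a local Hugging Face model instead of OpenAI.
# Assumes: pip install langchain-huggingface sentence-transformers
from langchain_community.vectorstores import Yellowbrick
from langchain_huggingface import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# split_docs and yellowbrick_connection_string come from the earlier steps of this tutorial.
hf_vector_store = Yellowbrick.from_documents(
    documents=split_docs,
    embedding=hf_embeddings,
    connection_string=yellowbrick_connection_string,
    table="my_hf_embeddings",  # hypothetical table for the new embedding space
)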