A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_vector_chain1.html below:

DBMS_VECTOR_CHAIN

Use the DBMS_VECTOR_CHAIN.CREATE_PREFERENCE helper procedure to create a vectorizer preference, to be used when creating or updating hybrid vector indexes.

Purpose

To create a vectorizer preference.

This allows you to customize vector search parameters of a hybrid vector indexing pipeline. The goal of a vectorizer preference is to provide you with a straightforward way to configure how to chunk or embed your documents, without requiring a deep understanding of various chunking or embedding strategies.

Usage Notes

A vectorizer preference is a JSON object that collectively holds user-specified values related to the following chunking, embedding, or vector index creation parameters:

All vector index preferences follow the same JSON syntax as defined for their corresponding DBMS_VECTOR and DBMS_VECTOR_CHAIN APIs.

After creating a vectorizer preference, you can use the VECTORIZER parameter to pass this preference name in the paramstring of the PARAMETERS clause for CREATE_HYBRID_VECTOR_INDEX and ALTER_INDEX SQL statements.

Creating a preference is optional. If you do not specify any optional preference, then the index is created with system defaults.

Syntax

DBMS_VECTOR_CHAIN.CREATE_PREFERENCE (
    PREF_NAME     IN VARCHAR2,
    PREF_TYPE     IN VARCHAR2,
    PARAMS        IN JSON default NULL
);

PREF_NAME

Specify the name of the vectorizer preference to create.

PREF_TYPE

Type of preference. The only supported preference type is:

DBMS_VECTOR_CHAIN.VECTORIZER

PARAMS

Specify vector search-specific parameters in JSON format:

Embedding Parameter:

{ "model" : <embedding_model_for_vector_generation> }

For example:

{ "model" : MY_INDB_MODEL }

model specifies the name under which your ONNX embedding model is stored in the database.

If you do not have an in-database embedding model in ONNX format, then perform the steps listed in Oracle Database AI Vector Search User's Guide.

Chunking Parameters:

{
    "by"           :  mode,
    "max"          :  max,
    "overlap"      :  overlap,
    "split"        :  split_condition,
    "vocabulary"   :  vocabulary_name,
    "language"     :  nls_language,
    "normalize"    :  normalize_mode,
    "extended"     :  boolean
}

For example:

JSON(
 '{ "by"           :    "vocabulary",
    "max"          :    "100",
    "overlap"      :    "0",
    "split"        :    "none",
    "vocabulary"   :    "myvocab",
    "language"     :    "american",
    "normalize"    :    "all" 
  }')

If you specify split as custom and normalize as options, then you must additionally specify the custom_list and norm_options parameters, respectively:

JSON(
 '{ "by"           :    "vocabulary",
    "max"          :    "100",
    "overlap"      :    "0",
    "split"        :    "custom",
    "custom_list"  :    [ "<p>" , "<s>" ],
    "vocabulary"   :    "myvocab",
    "language"     :    "american",
    "normalize"    :    "options",
    "norm_options" :    [ "whitespace" ] 
  }')

The following table describes all the chunking parameters:

Parameter Description and Acceptable Values

by

Specify a mode for splitting your data, that is, to split by counting the number of characters, words, or vocabulary tokens.

Valid values:

Default value: words

max

Specify a limit on the maximum size of each chunk. This setting splits the input text at a fixed point where the maximum limit occurs in the larger text. The units of max correspond to the by mode, that is, to split data when it reaches the maximum size limit of a certain number of characters, words, numbers, punctuation marks, or vocabulary tokens.

Valid values:

Default value: 100

split [by]

Specify where to split the input text when it reaches the maximum size limit. This helps to keep related data together by defining appropriate boundaries for chunks.

Valid values:

Default value: recursively

overlap

Specify the amount (as a positive integer literal or zero) of the preceding text that the chunk should contain, if any. This helps in logically splitting up related text (such as a sentence) by including some amount of the preceding chunk text.

The amount of overlap depends on how the maximum size of the chunk is measured (in characters, words, or vocabulary tokens). The overlap begins at the specified split condition (for example, at newline).

Valid value: 5% to 20% of max

Default value: 0

language

Specify the language of your input data.

This clause is important, especially when your text contains certain characters (for example, punctuations or abbreviations) that may be interpreted differently in another language.

Valid values:

Note: You must use escape characters with any language abbreviation that is also a SQL reserved word (for example, language abbreviations such as IN, AS, OR, IS).

For example:

SELECT dbms_vector_chain.utl_to_chunks('this is an example', 
   JSON('{ "language" : "\"in\"" }')) 
from dual;
SELECT dbms_vector_chain.utl_to_chunks('this is an example', 
   JSON_OBJECT('language' value '"in"' RETURNING JSON)) 
from dual;

Default value: NLS_LANGUAGE from session

normalize

Automatically pre-processes or post-processes issues (such as multiple consecutive spaces and smart quotes) that may arise when documents are converted into text. Oracle recommends you to use a normalization mode to extract high-quality chunks.

Valid values:

Default value: none

extended

Increases the output limit of a VARCHAR2 string to 32767 bytes, without requiring you to set the max_string_size parameter to extended.

Default value: 4000 or 32767 (when max_string_size=extended)

Vector Index Parameters:

{ 
  "distance"        :  <vector_distance>,
  "accuracy"        :  <vector_accuracy>, 
  "vector_idxtype"  :  <vector_idxtype>
}

For example:

{ 
  "distance"        :  COSINE,
  "accuracy"        :  95, 
  "vector_idxtype"  :  HNSW
}
Parameter Description

distance

Distance metric or mathematical function used to compute the distance between vectors:

Note: Currently, the HAMMING and JACCARD vector distance metrics are not supported with hybrid vector indexes.

For detailed information on each of these metrics, see Vector Distance Functions and Operators.

Default value: COSINE

accuracy

Target accuracy at which the approximate search should be performed when running an approximate search query using vector indexes.

As explained in Understand Approximate Similarity Search Using Vector Indexes, you can specify non-default target accuracy values either by specifying a percentage value or by specifying internal parameters values, depending on the index type you are using.

Valid range for both HNSW and IVF vector indexes is:

> 0 and <= 100

Default value: None

vector_idxtype

Type of vector index to create:

For detailed information on each of these index types, see Manage the Different Categories of Vector Indexes.

Default value: IVF

Paths Parameter:

This field allows specification of an array of path objects. There can be many path objects and each path object must specify a

type

and a

path_list

.

Note:

If the user does not specify the

paths

field, the whole document would be considered.

"paths":[
        {"type"      : "<path_type>",
         "path_list" : ["<path_list_array>"]
        }
        ]

Let us consider a sample JSON document:

 {
      "person": 
       {
         "bio": "James is a data scientist who specializes in natural language .. ",
         "profile":
          {
              "text" : "James is a data scientist with expertise in Python programming...",
              "embedding" : [1.60541728E-001,5.76677322E-002,4.0473938E-003,1.2037459E-001,-5.98970801E-004, ..]
          },
         "avatar": "https://example.com/images/James.jpg"
       },
      "product": 
      {
        "description": "A social media analytics tool.", "It helps brands track...",
        "image": "https://example.com/images/data_tool.jpg",
        "embedding" : [1.60541728E-001,5.76677322E-002,4.0473938E-003,1.2037459E-001,-5.98970801E-004, ..]
      }
 }

And a path_list corresponding to the above JSON is provided here:

"paths": [
         {"type"       : "VECTOR",
          "path_list" : ["$.person.profile.embedding", "$.product.embedding"]
         },
         {"type"       : "STRING",
          "path_list" : ["$.person.bio", "$.product.description"]
         }
         ] 

The following table describes the details of paths parameter:

Parameter

Accepted Values

type

The possible values for this field are:

path_list

Accepts an array of paths with at least one path in valid JSON format - ($.a.b.c.d).

Note: For the VECTOR option, Oracle currently accepts one vector array per path.

Example

begin
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_vec_spec',
     DBMS_VECTOR_CHAIN.VECTORIZER,
     json('{ "vector_idxtype" :  "hnsw",
             "model"          :  "my_doc_model",
             "by"             :  "words",
             "max"            :  100,
             "overlap"        :  10,
             "split"          :  "recursively,
             "language"       :  "english",
             "paths":          : [
                                 {
                                  "type"      : "VECTOR",
                                  "path_list" : ["$.person.profile.embedding"]
                                  }
                                 ]  
              }'));
end;
/

CREATE HYBRID VECTOR INDEX my_hybrid_idx on 
    doc_table(text_column) 
    parameters('VECTORIZER my_vec_spec');

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4