pw.xpacks.llm.splitters

A library of text splitters - routines which split a long text into smaller chunks.

class BaseSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None, max_batch_size=None)[source]

Abstract base class for splitters that split a long text into smaller chunks.

__call__(text, **kwargs)[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
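As an illustration, here is a minimal sketch of a custom splitter. It assumes, based on the chunk(text, metadata={}, **kwargs) entry documented below, that subclasses implement chunk() and return a list of (chunk, metadata) pairs; the class name and the sentence-based logic are hypothetical.

from pathway.xpacks.llm.splitters import BaseSplitter

class SentenceSplitter(BaseSplitter):
    # Hypothetical subclass: split on full stops and keep the input metadata.
    # Assumes chunk() is the method subclasses override and that it returns
    # a list of (text, metadata) pairs, mirroring the output format shown in
    # the TokenCountSplitter example below.
    def chunk(self, text, metadata={}, **kwargs):
        parts = [p.strip() for p in text.split(".") if p.strip()]
        return [(part, dict(metadata)) for part in parts]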

class NullSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None, max_batch_size=None)[source]

A splitter which returns its argument as one long text with null metadata.

The null splitter always returns a list of length one containing the full text and empty metadata.

__call__(text, **kwargs)[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
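For illustration, a usage sketch following the same calling pattern as the TokenCountSplitter example below; the table contents are made up.

from pathway.xpacks.llm.splitters import NullSplitter
import pathway as pw

t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
splitter = NullSplitter()
# Each row yields a single chunk: the full text with empty metadata.
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)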

class RecursiveSplitter(chunk_size=500, chunk_overlap=0, separators=SEPARATORS, is_separator_regex=False, encoding_name=None, model_name=None, hf_tokenizer=None)[source]

Splitter that splits a long text into smaller chunks based on a set of separators. Chunking is performed recursively: the text is split using the first separator in the list, then the second, and so on, until every chunk is shorter than chunk_size. Chunk length is measured by the number of characters if none of encoding_name, model_name, or hf_tokenizer is provided; otherwise it is measured by the number of tokens that the given tokenizer would output.

Under the hood it is a wrapper around langchain_text_splitters.RecursiveCharacterTextSplitter (MIT license).

__call__(text, **kwargs)[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
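A usage sketch, following the calling pattern of the TokenCountSplitter example below. The parameter values are illustrative; with encoding_name set, chunk_size is measured in tokens of that encoding rather than in characters.

from pathway.xpacks.llm.splitters import RecursiveSplitter
import pathway as pw

t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
# Illustrative values: 100-token chunks with a 10-token overlap,
# measured with the cl100k_base encoding.
splitter = RecursiveSplitter(
    chunk_size=100,
    chunk_overlap=10,
    encoding_name="cl100k_base",
)
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)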

class TokenCountSplitter(min_tokens=50, max_tokens=500, encoding_name='cl100k_base')[source]

Splits a given string or a list of strings into chunks based on token count.

This splitter tokenizes the input texts and splits them into smaller parts (“chunks”), ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points, such as punctuation marks. The splitter expects its input to be a Pathway column of strings, or of pairs of a string and a metadata dict.

All default arguments may be overridden in the UDF call.

Example:

from pathway.xpacks.llm.splitters import TokenCountSplitter
import pathway as pw
t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)

text     | chunks
cooltext | (('cool', pw.Json({})), ('text', pw.Json({})))

__call__(text, **kwargs)[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

chunk(text, metadata={}, **kwargs)[source]

Split a given string into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
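A sketch of calling chunk directly on a plain Python string, outside of a Pathway table. The exact return type is an assumption inferred from the table example above.

from pathway.xpacks.llm.splitters import TokenCountSplitter

splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
# Expected, by analogy with the table example above: a list of
# (chunk, metadata) pairs, e.g. roughly [('cool', {}), ('text', {})].
chunks = splitter.chunk("cooltext", metadata={})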

