A library of text splitters - routines which split a long text into smaller chunks.
class BaseSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None, max_batch_size=None)[source]
Abstract base class for splitters that split a long text into smaller chunks.
__call__(text, **kwargs)[source]
Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
Parameters:
text (ColumnExpression) – input column containing text to be split. Each entry can be either a plain string or a pair (text, metadata).
Metadata are propagated to all chunks created from the same input string.
If no metadata is provided, an empty dictionary is used.
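For orientation, a minimal sketch of what a concrete subclass could look like, assuming the per-string chunk(text, metadata) method documented further below is the override point (this may differ in your Pathway version):
from pathway.xpacks.llm.splitters import BaseSplitter

class SentenceSplitter(BaseSplitter):
    # Hypothetical subclass: one chunk per sentence.
    def chunk(self, text: str, metadata: dict = {}) -> list[tuple[str, dict]]:
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        # Every chunk gets a copy of the input's metadata.
        return [(s + ".", dict(metadata)) for s in sentences]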
class NullSplitter[source]
A splitter which returns its argument as one long text with null metadata.
The null splitter always returns a list of length one, containing the full text and empty metadata.
__call__(text, **kwargs)[source]
Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
Parameters:
text (ColumnExpression) – input column containing text to be split. Each entry can be either a plain string or a pair (text, metadata).
Metadata are propagated to all chunks created from the same input string.
If no metadata is provided, an empty dictionary is used.
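An illustrative call, assuming the class is exported as NullSplitter (mirroring the TokenCountSplitter example further below):
import pathway as pw
from pathway.xpacks.llm.splitters import NullSplitter

t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
splitter = NullSplitter()
t += t.select(chunks=splitter(pw.this.text))
# Expected: a single chunk per row, e.g. (('cooltext', pw.Json({})),)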
class RecursiveSplitter(chunk_size=..., chunk_overlap=..., separators=..., is_separator_regex=..., encoding_name=..., model_name=..., hf_tokenizer=...)[source]
Splitter that splits a long text into smaller chunks based on a set of separators. Chunking is performed recursively: the text is split using the first separator in the list, then the second, and so on, until the text is split into chunks shorter than chunk_size.
The length of the chunks is measured by the number of characters in the text if none of encoding_name, model_name or hf_tokenizer is provided. Otherwise, the length of the chunks is measured by the number of tokens that the chosen tokenizer would output.
Under the hood it is a wrapper around langchain_text_splitters.RecursiveCharacterTextSplitter (MIT license).
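To see the recursive strategy in isolation, the underlying langchain class can be called directly; in this sketch length is measured in characters, since no tokenizer is configured:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=20,                    # maximum chunk length, in characters here
    chunk_overlap=0,
    separators=["\n\n", "\n", " "],   # tried in order; finer ones only if needed
)
print(splitter.split_text("first paragraph\n\nsecond, much longer paragraph"))
# ['first paragraph', 'second, much longer', 'paragraph']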
Parameters:
chunk_size (int) – maximum size of a chunk in characters/tokens.
chunk_overlap (int) – number of characters/tokens to overlap between chunks.
separators (list[str]) – list of strings to split the text on.
is_separator_regex (bool) – whether the separators are regular expressions.
encoding_name (str | None) – name of the encoding from tiktoken. For the list of available encodings please refer to the tiktoken documentation: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
model_name (str | None) – name of the model from tiktoken. See the link above for more details.
__call__(text, **kwargs)[source]
Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
Parameters:
text (ColumnExpression) – input column containing text to be split. Each entry can be either a plain string or a pair (text, metadata).
Metadata are propagated to all chunks created from the same input string.
If no metadata is provided, an empty dictionary is used.
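A hedged usage sketch; the class name RecursiveSplitter and the keyword names follow the parameter descriptions above, so verify them against your Pathway version:
import pathway as pw
from pathway.xpacks.llm.splitters import RecursiveSplitter

t = pw.debug.table_from_markdown(
    '''| text
1| some long document text'''
)
# Measure chunk length in tiktoken tokens instead of characters.
splitter = RecursiveSplitter(chunk_size=400, chunk_overlap=0, encoding_name="cl100k_base")
t += t.select(chunks=splitter(pw.this.text))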
class TokenCountSplitter(min_tokens=..., max_tokens=..., encoding_name=...)[source]
Splits a given string or a list of strings into chunks based on token count.
This splitter tokenizes the input texts and splits them into smaller parts ("chunks"), ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points such as punctuation marks. The splitter expects the input to be a Pathway column of strings, or of pairs of a string and a dict of metadata.
All default arguments may be overridden in the UDF call.
Parameters:
min_tokens (int) – minimum number of tokens in a chunk of text.
max_tokens (int) – maximum size of a chunk in tokens.
encoding_name (str) – name of the encoding from tiktoken. For a list of available encodings please refer to the tiktoken documentation: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
Example:
from pathway.xpacks.llm.splitters import TokenCountSplitter
import pathway as pw
t = pw.debug.table_from_markdown(
'''| text
1| cooltext'''
)
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
t += t.select(chunks = splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)
text | chunks
cooltext | (('cool', pw.Json({})), ('text', pw.Json({})))
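The token counts that drive min_tokens/max_tokens come from tiktoken, so you can reproduce them directly; this sketch assumes the cl100k_base encoding, a common default:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("cooltext")
print(len(tokens))                            # token count used for chunk sizing
print([enc.decode([tok]) for tok in tokens])  # likely ['cool', 'text']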
__call__(text, **kwargs)[source]
Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
Parameters:
text (ColumnExpression) – input column containing text to be split. Each entry can be either a plain string or a pair (text, metadata).
Metadata are propagated to all chunks created from the same input string.
If no metadata is provided, an empty dictionary is used.
chunk(text, metadata={})[source]
Split a given string into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
Parameters:
text (str) – input text to be split.
metadata (dict) – metadata associated with the input text.
Returns a list of pairs (text, metadata). Metadata are propagated to all chunks created from the same input string.
If no metadata is provided, an empty dictionary is used.
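A hedged sketch of calling the per-string method directly; the method name chunk and the plain-dict return shape are assumptions based on this section:
from pathway.xpacks.llm.splitters import TokenCountSplitter

splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
print(splitter.chunk("cooltext", {"source": "demo"}))
# Expected: [('cool', {'source': 'demo'}), ('text', {'source': 'demo'})]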