How to recursively split text by characters

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

  1. How the text is split: by list of characters.
  2. How the chunk size is measured: by number of characters.
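
To make the recursion concrete, here is a simplified sketch of the idea, not LangChain's actual implementation: try the first separator, and any piece that is still too large is re-split with the next separator in the list. (The real splitter additionally merges small adjacent pieces back together up to the chunk size and handles overlap.)

# Simplified illustration of recursive splitting -- not LangChain's code.
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too large: fall back to the next, finer-grained separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

recursive_split("one two three four", ["\n\n", "\n", " ", ""], chunk_size=5)
# ['one', 'two', 'three', 'four']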

Below we show example usage.

To obtain the string content directly, use .split_text.

To create LangChain Document objects (e.g., for use in downstream tasks), use .create_documents.

%pip install -qU langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load an example document.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, just to show the splitting behavior.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'
text_splitter.split_text(state_of_the_union)[:2]
['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']

Let's go through the parameters set above for RecursiveCharacterTextSplitter:

  * chunk_size: the maximum size of a chunk, where size is determined by length_function.
  * chunk_overlap: the target overlap between chunks. Overlapping chunks helps mitigate loss of information when context is divided between chunks.
  * length_function: the function that determines the size of a chunk.
  * is_separator_regex: whether the separator list (defaulting to ["\n\n", "\n", " ", ""]) should be interpreted as regex.

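For instance, length_function does not have to be len. As a minimal sketch (the word-count measure below is illustrative and not part of the example above), chunk size can be measured in words instead of characters:

word_splitter = RecursiveCharacterTextSplitter(
    chunk_size=20,  # at most roughly 20 words per chunk
    chunk_overlap=5,
    length_function=lambda text: len(text.split()),  # measure size in words
    is_separator_regex=False,
)
word_chunks = word_splitter.split_text(state_of_the_union)
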
Splitting text from languages without word boundaries

Some writing systems do not have word boundaries, for example Chinese, Japanese, and Thai. Splitting text with the default separator list of ["\n\n", "\n", " ", ""] can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:

text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    # Other arguments (chunk_size, chunk_overlap, etc.) as before.
)
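
As a minimal usage sketch (the sample sentence and chunk size below are illustrative, not from the original example):

# Illustrative only: a short Japanese sentence split with the CJK-aware separators above.
cjk_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ".", ",", "\u200b", "\uff0c", "\u3001", "\uff0e", "\u3002", ""],
    chunk_size=15,
    chunk_overlap=0,
)
cjk_splitter.split_text("これは長い日本語の文章です。句読点で分割されます。")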
