Using Unstructured open source
Chunking functions in unstructured use metadata and document elements detected with partition functions to post-process elements into more useful “chunks” for use cases such as retrieval-augmented generation (RAG).
Chunking in unstructured differs from other chunking mechanisms you may be familiar with. Typical approaches start with the text extracted from the document and form chunks based on plain-text features: character sequences like "\n\n" or "\n" that might indicate a paragraph boundary or list-item boundary. Because unstructured uses specific knowledge about each document format to partition the document into semantic units (document elements), text-splitting is only needed when a single element exceeds the desired maximum chunk size. Apart from that case, every chunk contains one or more whole elements, preserving the coherence of the semantic units established during partitioning. A few concepts about chunking are worth introducing before discussing the details.
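The difference between the two approaches can be sketched in plain Python. The helper below is a hypothetical simplification, not the library's implementation: it packs whole semantic units into chunks and falls back to text-splitting only when a single unit exceeds the maximum.

```python
# Hypothetical sketch of element-based chunking (NOT the unstructured
# implementation): whole semantic units are packed into chunks, and
# text-splitting is used only when a single unit exceeds max_characters.
def pack_elements(texts, max_characters=500):
    chunks, current = [], ""
    for text in texts:
        if len(text) > max_characters:
            if current:
                chunks.append(current)
                current = ""
            # Oversized unit: resort to text-splitting.
            chunks.extend(
                text[i:i + max_characters]
                for i in range(0, len(text), max_characters)
            )
        elif len(current) + len(text) + 2 <= max_characters:
            current = f"{current}\n\n{text}" if current else text
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks

# Two small units are combined whole; the oversized third is split.
chunks = pack_elements(["A heading", "A short paragraph.", "x" * 1200], max_characters=500)
```

Note that only the oversized unit is ever cut at an arbitrary character position; the two small units stay intact inside a single chunk.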
Chunking produces a sequence of CompositeElement, Table, or TableChunk elements. Each “chunk” is an instance of one of these three types.
All chunking strategies accept the following options; an individual strategy (such as by_title) may have additional options:
max_characters: int (default=500) - the hard maximum size for a chunk. No chunk will exceed this number of characters. A single element that by itself exceeds this size will be divided into two or more chunks using text-splitting.
new_after_n_chars: int (default=max_characters) - the “soft” maximum size for a chunk. A chunk that already exceeds this number of characters will not be extended, even if the next element would fit without exceeding the specified hard maximum. This can be used in conjunction with max_characters to set a “preferred” size, like “I prefer chunks of around 1000 characters, but I’d rather have a chunk of 1500 (max_characters) than resort to text-splitting”. This would be specified with (..., max_characters=1500, new_after_n_chars=1000).
overlap: int (default=0) - only when using text-splitting to break up an oversized chunk, include this number of characters from the end of the prior chunk as a prefix on the next. This can mitigate the effect of splitting the semantic unit represented by the oversized element at an arbitrary position based on text length.
overlap_all: bool (default=False) - also apply overlap between “normal” chunks, not just when text-splitting to break up an oversized element. Because normal chunks are formed from whole elements that each have a clean semantic boundary, this option may “pollute” normal chunks. You’ll need to decide based on your use case whether this option is right for you.
Chunking during partitioning
Chunking can be performed as part of partitioning by supplying the chunking_strategy argument. The current strategy options are basic and by_title (described below):
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="basic")
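The interplay of the hard and soft maxima described above can be sketched with a hypothetical helper (not the library's actual logic): a chunk that has reached new_after_n_chars accepts no further elements, and an element is only ever appended while the combined length stays within max_characters.

```python
# Hypothetical sketch (NOT unstructured's implementation) of the
# hard-max / soft-max rule: max_characters is never exceeded, and a
# chunk that already reached new_after_n_chars is not extended.
def should_start_new_chunk(current_len, next_len,
                           max_characters=1500, new_after_n_chars=1000):
    if current_len == 0:
        return False  # an empty chunk always accepts its first element
    if current_len >= new_after_n_chars:
        return True   # soft max reached: close the chunk
    # otherwise close only if the hard max would be exceeded
    return current_len + next_len > max_characters

# A 1,100-character chunk is not extended (soft max is 1,000) ...
assert should_start_new_chunk(1100, 200) is True
# ... but a 900-character chunk accepts a 400-character element,
# since 1,300 is still under the 1,500 hard max.
assert should_start_new_chunk(900, 400) is False
```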
Calling a chunking function
Chunking can also be performed separately from partitioning by calling a chunking function directly. This can be convenient, for example, when tuning chunking parameters. Chunking typically runs much faster than partitioning, especially when partitioning requires OCR or model inference, so performing the two steps separately allows a faster feedback loop:
from unstructured.chunking.basic import chunk_elements
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_elements(elements)
# -- OR --
from unstructured.chunking.title import chunk_by_title
chunks = chunk_by_title(elements)
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-" * 80)
    input()
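When tuning chunking parameters, it often helps to look at the resulting chunk-size distribution rather than reading chunks one by one. A small stand-alone sketch (plain strings stand in for chunk objects here; with real chunks you would measure len(chunk.text)):

```python
# Summarize chunk lengths to judge how well a parameter choice fills
# the chunking window. Plain strings stand in for chunk objects; with
# real chunks you would use len(chunk.text) instead of len(t).
def size_summary(chunk_texts):
    sizes = sorted(len(t) for t in chunk_texts)
    return {
        "count": len(sizes),
        "min": sizes[0],
        "max": sizes[-1],
        "mean": sum(sizes) / len(sizes),
    }

summary = size_summary(["a" * 480, "b" * 120, "c" * 500])
```

Many small chunks (a low mean relative to max_characters) can suggest raising combine_text_under_n_chars or reconsidering the chosen strategy.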
Chunking strategies
There are currently two chunking strategies, basic and by_title. The by_title strategy shares most behaviors with the basic strategy, so we’ll describe the baseline strategy first.
The “basic” chunking strategy
The basic strategy combines sequential elements to maximally fill each chunk while respecting both the max_characters (hard-max) and new_after_n_chars (soft-max) option values.
A Table element is always isolated and never combined with another element. A Table can be oversized, like any other text element, and in that case is divided into two or more TableChunk elements using text-splitting.
overlap is applied between chunks formed by splitting oversized elements, and is also applied between other chunks when overlap_all is True.
The “by_title” chunking strategy
The by_title strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk. In addition to the behaviors of the basic strategy above, the by_title strategy has the following behaviors:
A Title element is considered to start a new section. When a Title element is encountered, the prior chunk is closed and a new chunk started, even if the Title element would fit in the prior chunk.
Page boundaries can optionally be respected using the multipage_sections argument. This defaults to True, meaning that a page break does not start a new chunk. Setting this to False will separate elements that occur on different pages into distinct chunks.
Partitioning may occasionally identify a short line (a list item, for example) as a Title element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the combine_text_under_n_chars argument. This defaults to the same value as max_characters, such that sequential small sections are combined to maximally fill the chunking window. Setting this to 0 will disable section combining.
Chunks and their original elements
Each chunk records the elements it was formed from in its .metadata.orig_elements field:
>>> from unstructured.chunking.basic import chunk_elements
>>> from unstructured.documents.elements import NarrativeText, Title
>>> elements = [
... Title("Lorem Ipsum"),
... NarrativeText("Lorem ipsum dolor sit."),
... ]
>>> chunk = chunk_elements(elements)[0]
>>> chunk.text
'Lorem Ipsum\n\nLorem ipsum dolor sit.'
>>> chunk.metadata.orig_elements
[Title("Lorem Ipsum"), NarrativeText("Lorem ipsum dolor sit.")]
These elements retain all their original metadata, so they can be used to access metadata that cannot reliably be consolidated onto the chunk, for example:
>>> {e.metadata.page_number for e in chunk.metadata.orig_elements}
{2, 3}
>>> [e.metadata.coordinates for e in chunk.metadata.orig_elements]
[<CoordinatesMetadata ...>, <CoordinatesMetadata ...>, ...]
>>> [
...     e.metadata.image_path
...     for e in chunk.metadata.orig_elements
...     if e.metadata.image_path is not None
... ]
['/tmp/lorem.jpg', '/tmp/ipsum.png']
During serialization, .metadata.orig_elements is compressed into Base64-encoded gzipped JSON. To deserialize it, use the elements_from_base64_gzipped_json function. For example:
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_from_base64_gzipped_json
elements = partition('local-ingest-source/fake-email.eml', chunking_strategy='basic', include_orig_elements=True)

print("Before:\n")
for element in elements:
    metadata = element.metadata.to_dict()
    print(f"Element ID: {element.id}")
    print(f"  Compressed orig_elements: {metadata['orig_elements']}")
    print("\n")
# Output:
# -------
# Before:
#
# Element ID: 083776ca703b1925e5fef69bb2635f1f
# Compressed orig_elements: eJztlM1uGyEUhV8Fzbq2GGbwDFZUZdNFpDaLyOomiix+Lg7tABbcaRtFfffCJFXSyiuvLbFA3HsOHD7B/XMDE3gIuHem2ZJGmLaDnmnBGTfajF0/aqm6zSi4gnEwzQfSeEBpJMrS/9xYN8HeuAQaY3qqFlPUclq5cICMqxznpKGqamOQHmqLld9hBV66aQ1+qtVJhsMsi6SU7xsIh+ZhWc2499E462A5HqOMrdoyNrt22NJ+225WlG8prR65xrAp+sXji0R8hJ/kLioXcgzkyqfX6fUcMqZZ45zArF38uGy2yDEu4tuIf/VXb/PrEPEfqY7+VTurb+UG6hF3JTb5VLM1v0sF4dfL8qPLpAxJsDYs4QlGMmcgNiYyB4dLKa9rFnw6Ljd1K1OS6H7ArvoUw/+BjZveMt5zxQSVgo2S8kGr1ggtKLOivwA7E9iNP8aEMiA5Rhcwb99j2Tmc4BQOBtyOYuwko5J3RnLRtszowYpOWj5ccJyL4y5mKK8nASnC9yg+u4w3CP4UDWWMVEpT3fZUmb63RvORgR3o2Ovy011onEnjq4sT4AsPNc1wGsjDH93mEjs=
print("After:\n")
for element in elements:
    metadata = element.metadata.to_dict()
    print(f"Element ID: {element.id}")
    orig_elements = elements_from_base64_gzipped_json(metadata["orig_elements"])
    print("  Uncompressed orig_elements:")
    for orig_element in orig_elements:
        print(f"    {orig_element.category}: {orig_element.text}")
    print("\n")
# Output:
# -------
# After:
#
# Element ID: 083776ca703b1925e5fef69bb2635f1f
# Uncompressed orig_elements:
# NarrativeText: This is a test email to use for unit tests.
# Title: Important points:
# ListItem: Roses are red
# ListItem: Violets are blue
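Conceptually, the serialized field is gzip-compressed JSON encoded as Base64. The stdlib round trip below illustrates the idea; it is a sketch of the encoding scheme only, not the library's exact element serialization format:

```python
import base64
import gzip
import json

# Encode a list of (simplified) element dicts the way the orig_elements
# field is conceptually stored: JSON -> gzip -> Base64 text.
elements = [{"type": "Title", "text": "Important points:"}]
encoded = base64.b64encode(
    gzip.compress(json.dumps(elements).encode("utf-8"))
).decode("ascii")

# Decoding reverses each step: Base64 -> gunzip -> JSON.
decoded = json.loads(gzip.decompress(base64.b64decode(encoded)).decode("utf-8"))
```

In practice you should prefer elements_from_base64_gzipped_json, which also reconstructs the proper element types rather than plain dicts.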
Learn more: Chunking for RAG best practices