Chunking

Chunking functions in unstructured use metadata and document elements detected by partition functions to post-process elements into more useful “chunks” for use cases such as retrieval-augmented generation (RAG).

Chunking Basics

Chunking in unstructured differs from other chunking mechanisms you may be familiar with. Typical approaches start with the text extracted from the document and form chunks based on plain-text features: character sequences like "\n\n" or "\n" that might indicate a paragraph boundary or list-item boundary. Because unstructured uses specific knowledge about each document format to partition the document into semantic units (document elements), it only needs to resort to text-splitting when a single element exceeds the desired maximum chunk size. In every other case, a chunk contains one or more whole elements, preserving the coherence of the semantic units established during partitioning. A few concepts about chunking are worth introducing before discussing the details.

Chunking Options

The following options are available to tune chunking behavior. They are keyword arguments that can be passed in a partitioning or chunking function call; a sketch of them in use appears after the first code example below. All of these options have defaults and need only be specified when a non-default setting is required. Specific chunking strategies (such as by_title) may have additional options.

Chunking

Chunking can be performed as part of partitioning or as a separate step after partitioning.

Specifying a chunking strategy while partitioning

Chunking can be performed as part of partitioning by specifying a value for the chunking_strategy argument. The current options are basic and by_title (described below).
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="basic")
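
To illustrate the tuning options mentioned above, here is a minimal sketch. It assumes the commonly available keyword arguments max_characters (hard maximum chunk size), new_after_n_chars (soft maximum), and overlap (characters of overlap applied when an oversized element is split); check your installed version for the exact names and defaults.

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import NarrativeText, Title

elements = [
    Title("Lorem Ipsum"),
    # A single element longer than max_characters forces text-splitting:
    NarrativeText("Lorem ipsum dolor sit amet. " * 40),
]

chunks = chunk_elements(
    elements,
    max_characters=500,     # hard maximum; oversized elements are split
    new_after_n_chars=400,  # soft maximum; close the chunk once it reaches this size
    overlap=50,             # overlap between the splits of an oversized element
)

for chunk in chunks:
    print(len(chunk.text), repr(chunk.text[:40]))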

Calling a chunking function

Chunking can also be performed separately from partitioning by calling a chunking function directly. This may be convenient, for example, when tuning chunking parameters. Chunking is typically faster than partitioning, especially when partitioning requires OCR or model inference, so running the two steps separately allows a faster feedback loop:
from unstructured.chunking.basic import chunk_elements
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_elements(elements)

# -- OR --

from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-" * 80)
    input()  # pause between chunks; press Enter to continue
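
To see how the two strategies differ on the same input, here is a quick sketch; the behaviors it demonstrates are described in the next section, and the elements are illustrative:

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import NarrativeText, Title

elements = [
    Title("Section One"),
    NarrativeText("A short paragraph."),
    Title("Section Two"),
    NarrativeText("Another short paragraph."),
]

# basic packs elements to fill each chunk, ignoring section boundaries:
print(len(chunk_elements(elements)))  # -> 1 chunk

# by_title closes the current chunk whenever a new section (Title) starts.
# combine_text_under_n_chars=0 disables by_title's combining of small
# sections so the boundary behavior is visible in this tiny example:
print(len(chunk_by_title(elements, combine_text_under_n_chars=0)))  # -> 2 chunks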

Chunking Strategies

There are currently two chunking strategies, basic and by_title. The by_title strategy shares most behaviors with the basic strategy, so we’ll describe the baseline strategy first.

“basic” chunking strategy

The basic strategy combines sequential elements to maximally fill each chunk, up to the desired maximum chunk size, and resorts to text-splitting only when a single element exceeds that size.

“by_title” chunking strategy

In addition to the behaviors of the basic strategy, the by_title chunking strategy preserves section boundaries, and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.

Recovering Chunk Elements

In general, a chunk consolidates multiple document elements to maximally fill a chunk of the desired size. Information is naturally lost in this consolidation: for example, which element a portion of the text came from, and certain metadata, like page number and coordinates, that cannot always be resolved to a single value. The original elements combined to make a chunk can be accessed using the .metadata.orig_elements field on the chunk:
>>> from unstructured.chunking.basic import chunk_elements
>>> from unstructured.documents.elements import NarrativeText, Title
>>> elements = [
...     Title("Lorem Ipsum"),
...     NarrativeText("Lorem ipsum dolor sit."),
... ]
>>> chunk = chunk_elements(elements)[0]
>>> chunk.text
'Lorem Ipsum\n\nLorem ipsum dolor sit.'
>>> chunk.metadata.orig_elements
[Title("Lorem Ipsum"), NarrativeText("Lorem ipsum dolor sit.")]

These elements retain all their original metadata, so they can be used to access metadata that cannot reliably be consolidated, for example:
>>> {e.metadata.page_number for e in chunk.metadata.orig_elements}
{2, 3}

>>> [e.metadata.coordinates for e in chunk.metadata.orig_elements]
[<CoordinatesMetadata ...>, <CoordinatesMetadata ...>, ...]

>>> [
...     e.metadata.image_path
...     for e in chunk.metadata.orig_elements
...     if e.metadata.image_path is not None
... ]
['/tmp/lorem.jpg', '/tmp/ipsum.png']
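
In a RAG pipeline, these recovered values can be folded back into chunk-level metadata, for example to cite sources. A minimal sketch, assuming the partitioned elements carry page_number metadata; the page_range helper is illustrative, not part of the library:

def page_range(chunk):
    """Summarize the pages a chunk's text came from (illustrative helper)."""
    pages = sorted(
        e.metadata.page_number
        for e in chunk.metadata.orig_elements
        if e.metadata.page_number is not None
    )
    if not pages:
        return None
    return f"p. {pages[0]}" if pages[0] == pages[-1] else f"pp. {pages[0]}-{pages[-1]}"

A chunk drawing from pages 2 and 3, as above, would yield "pp. 2-3", which can accompany the chunk text as a citation.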

During serialization, .metadata.orig_elements is serialized as gzipped JSON and encoded as Base64. To deserialize it, you can use the elements_from_base64_gzipped_json function. For example:
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_from_base64_gzipped_json

elements = partition(
    "local-ingest-source/fake-email.eml",
    chunking_strategy="basic",
    include_orig_elements=True,
)

print("Before:\n")

for element in elements:
    metadata = element.metadata.to_dict()
    print(f"Element ID: {element.id}")
    print(f"  Compressed orig_elements: {metadata["orig_elements"]}")
    print("\n")

# Output:
# -------
# Before:
#
# Element ID: 083776ca703b1925e5fef69bb2635f1f
#   Compressed orig_elements: eJztlM1uGyEUhV8Fzbq2GGbwDFZUZdNFpDaLyOomiix+Lg7tABbcaRtFfffCJFXSyiuvLbFA3HsOHD7B/XMDE3gIuHem2ZJGmLaDnmnBGTfajF0/aqm6zSi4gnEwzQfSeEBpJMrS/9xYN8HeuAQaY3qqFlPUclq5cICMqxznpKGqamOQHmqLld9hBV66aQ1+qtVJhsMsi6SU7xsIh+ZhWc2499E462A5HqOMrdoyNrt22NJ+225WlG8prR65xrAp+sXji0R8hJ/kLioXcgzkyqfX6fUcMqZZ45zArF38uGy2yDEu4tuIf/VXb/PrEPEfqY7+VTurb+UG6hF3JTb5VLM1v0sF4dfL8qPLpAxJsDYs4QlGMmcgNiYyB4dLKa9rFnw6Ljd1K1OS6H7ArvoUw/+BjZveMt5zxQSVgo2S8kGr1ggtKLOivwA7E9iNP8aEMiA5Rhcwb99j2Tmc4BQOBtyOYuwko5J3RnLRtszowYpOWj5ccJyL4y5mKK8nASnC9yg+u4w3CP4UDWWMVEpT3fZUmb63RvORgR3o2Ovy011onEnjq4sT4AsPNc1wGsjDH93mEjs=

print ("After:\n")

for element in elements:
    metadata = element.metadata.to_dict()
    print(f"Element ID: {element.id}")
    orig_elements = elements_from_base64_gzipped_json(metadata["orig_elements"])
    print(f"  Uncompressed orig_elements:")
    for orig_element in orig_elements:
        print(f"    {orig_element.category}: {orig_element.text}")
    print("\n")

# Output:
# -------
# After:
#
# Element ID: 083776ca703b1925e5fef69bb2635f1f
#   Uncompressed orig_elements:
#     NarrativeText: This is a test email to use for unit tests.
#     Title: Important points:
#     ListItem: Roses are red
#     ListItem: Violets are blue
Learn more: "Chunking for RAG: best practices"
