Showing content from https://pymupdf.readthedocs.io/en/latest/pymupdf4llm below:

PyMuPDF4LLM - PyMuPDF 1.26.3 documentation

PyMuPDF4LLM#

PyMuPDF4LLM aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.

Features#

Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents
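
The page chunking feature can be sketched as follows. This is a minimal illustration, not the package's own example. It assumes that calling to_markdown with page_chunks=True returns a list of dictionaries, one per page, each with a "text" entry and a "metadata" entry that contains a "page" number; check the API documentation for the exact layout. The helpers load_page_chunks and summarize_chunks are hypothetical names introduced here.

```python
def load_page_chunks(path):
    """Return per-page chunks for a file (assumed: a list of dictionaries)."""
    import pymupdf4llm  # imported lazily so summarize_chunks works without it

    # page_chunks=True asks for one chunk per page instead of a single string.
    return pymupdf4llm.to_markdown(path, page_chunks=True)


def summarize_chunks(chunks):
    """Build (page, character_count) pairs from page-chunk dictionaries.

    Assumes each chunk has "text" and a "metadata" dictionary with "page".
    Missing keys are tolerated: page becomes None, text becomes "".
    """
    summary = []
    for chunk in chunks:
        page = chunk.get("metadata", {}).get("page")
        summary.append((page, len(chunk.get("text", ""))))
    return summary
```

With a real file, summarize_chunks(load_page_chunks("input.pdf")) would give a quick overview of how much text each page produced, which can help when choosing chunk sizes for a RAG pipeline. The file name "input.pdf" is a placeholder.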

Functionality#

This package converts the pages of a file to text in Markdown format using PyMuPDF.
Standard text and tables are detected, arranged in the correct reading order, and then converted together into GitHub-compatible Markdown text.
Header lines are identified by their font size and prefixed with one or more # characters accordingly.
Bold, italic, and mono-spaced text, as well as code blocks, are detected and formatted accordingly. The same applies to ordered and unordered lists.
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
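
Page selection can be sketched like this. The helper to_zero_based_pages is hypothetical (it is not part of PyMuPDF4LLM): it turns a human-style, 1-based range string such as "1-3,5" into the 0-based list described above. The function convert_subset, also a name chosen for illustration, passes that list to to_markdown through its pages argument.

```python
def to_zero_based_pages(spec: str) -> list[int]:
    """Convert a 1-based range string such as "1-3,5" to sorted 0-based page numbers.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty items such as a trailing comma
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based inclusive to 0-based page numbers.
        pages.update(range(first - 1, last))
    return sorted(pages)


def convert_subset(path: str, spec: str) -> str:
    """Return Markdown for the selected pages only."""
    import pymupdf4llm  # imported lazily so the helper above works without it

    return pymupdf4llm.to_markdown(path, pages=to_zero_based_pages(spec))
```

For example, convert_subset("input.pdf", "1-3,5") would convert the first three pages and the fifth page, where "input.pdf" is a placeholder file name.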

Installation#

Install the package via pip with:

pip install pymupdf4llm

Extracting a file as a LlamaIndex document#

PyMuPDF4LLM supports direct conversion to a LlamaIndex document. A document is first converted into Markdown format and then returned as a LlamaIndex document, as follows:

import pymupdf4llm
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")

Using with PyMuPDF Pro#

For Office document support, PyMuPDF4LLM works seamlessly with PyMuPDF Pro. Assuming you have PyMuPDF Pro installed, you can work with Office documents as expected:

import pymupdf4llm
import pymupdf.pro
pymupdf.pro.unlock()
md_text = pymupdf4llm.to_markdown("sample.doc")

As you can see, PyMuPDF Pro functionality is available within the PyMuPDF4LLM context.

API#

See the PyMuPDF4LLM API.

Further Resources#

Sample code#

Blogs#
