PyMuPDF4LLM#
PyMuPDF4LLM aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
Features#
Support for multi-column pages
Support for image and vector graphics extraction (and inclusion of references in the Markdown text)
Support for page chunking output
Direct support for output as LlamaIndex Documents
Functionality#
This package converts the pages of a file to text in Markdown format using PyMuPDF.
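The basic conversion, together with the page chunking option listed under Features, can be sketched roughly as follows. This assumes that `to_markdown` accepts `page_chunks` and `write_images` keyword arguments and that each chunk is a dictionary with `"text"` and `"metadata"` entries (with a 1-based `"page"` number inside the metadata); check the API reference for the exact keys in your installed version. The helper names `extract_page_chunks` and `chunk_summaries` are hypothetical.

```python
def extract_page_chunks(path, write_images=False):
    """Return one chunk (a dict) per page instead of one Markdown string."""
    import pymupdf4llm  # imported lazily so the helper below works without it

    # Assumed keyword arguments: page_chunks and write_images.
    return pymupdf4llm.to_markdown(
        path, page_chunks=True, write_images=write_images
    )


def chunk_summaries(chunks, preview=40):
    """Hypothetical helper: pair each chunk's page number with a text preview."""
    summaries = []
    for index, chunk in enumerate(chunks):
        # Fall back to the chunk position if no page number is recorded.
        page = chunk.get("metadata", {}).get("page", index + 1)
        text = chunk.get("text", "")
        summaries.append((page, text[:preview]))
    return summaries
```

With a real file, `chunk_summaries(extract_page_chunks("input.pdf"))` would give a quick per-page overview before the chunks are passed to a RAG pipeline; `input.pdf` is a placeholder name.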
Standard text and tables are detected, arranged in the correct reading order, and then converted together to GitHub-compatible Markdown text.
Header lines are identified by their font size and prefixed with the appropriate number of # characters.
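To make the font-size idea concrete, here is a minimal sketch of how header levels could be derived from font sizes. It is not PyMuPDF4LLM's actual implementation: the function names `header_prefix_map` and `prefix_line`, and the rule "the most frequent size is body text, larger sizes become headers", are assumptions chosen purely for illustration.

```python
from collections import Counter


def header_prefix_map(font_sizes, max_levels=6):
    """Map font sizes larger than the body size to Markdown '#' prefixes."""
    rounded = [round(size) for size in font_sizes]
    # Assumption: the most frequent size is the body text size.
    body_size = Counter(rounded).most_common(1)[0][0]
    # Larger sizes become headers; the largest one gets a single '#'.
    header_sizes = sorted({s for s in rounded if s > body_size}, reverse=True)
    return {
        size: "#" * level + " "
        for level, size in enumerate(header_sizes[:max_levels], start=1)
    }


def prefix_line(text, font_size, prefixes):
    """Prepend the header prefix for this font size, if there is one."""
    return prefixes.get(round(font_size), "") + text
```

For example, with sizes collected from a document where most text is 10 pt, a line set in the largest size would come out prefixed with `# `, the next largest with `## `, and body text would be left unchanged.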
Bold, italic, and mono-spaced text, as well as code blocks, are detected and formatted accordingly. The same applies to ordered and unordered lists.
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
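A minimal sketch of page selection, assuming `to_markdown` accepts the 0-based list through a `pages` keyword argument. The helper `zero_based_pages` is hypothetical; it converts an inclusive 1-based range, as a person would read it from a PDF viewer, into that list.

```python
def zero_based_pages(first, last):
    """Convert an inclusive 1-based page range into 0-based page numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def extract_pages(path, first, last):
    """Convert only the requested pages of a document to Markdown."""
    import pymupdf4llm  # imported lazily; only needed for the real conversion

    # Assumed keyword: pages takes a list of 0-based page numbers.
    return pymupdf4llm.to_markdown(path, pages=zero_based_pages(first, last))
```

Calling `extract_pages("input.pdf", 2, 4)` would process the pages with 0-based numbers 1, 2 and 3; `input.pdf` is a placeholder file name.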
Install the package via pip with:
    pip install pymupdf4llm
Extracting a file as a LlamaIndex document#
PyMuPDF4LLM supports direct conversion to a LlamaIndex document. The file is first converted into Markdown format, and a LlamaIndex document is then returned as follows:
    import pymupdf4llm
    llama_reader = pymupdf4llm.LlamaMarkdownReader()
    llama_docs = llama_reader.load_data("input.pdf")
Using with PyMuPDF Pro#
For Office document support, PyMuPDF4LLM works seamlessly with PyMuPDF Pro. Assuming you have PyMuPDF Pro installed, you will be able to work with Office documents as expected:
    import pymupdf4llm
    import pymupdf.pro
    pymupdf.pro.unlock()
    md_text = pymupdf4llm.to_markdown("sample.doc")
As you can see, PyMuPDF Pro functionality is available within the PyMuPDF4LLM context.
API#
See the PyMuPDF4LLM API.
Further Resources#
Sample code#
Blogs#