Showing content from https://pymupdf.readthedocs.io/en/latest/pymupdf4llm below:

PyMuPDF4LLM - PyMuPDF 1.26.3 documentation

PyMuPDF4LLM

PyMuPDF4LLM aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.

Features

Functionality

Installation

Install the package via pip with:

pip install pymupdf4llm
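Once the package is installed, basic Markdown extraction might be sketched as follows. The `pymupdf4llm.to_markdown` call is the one used elsewhere on this page; the helper names `save_markdown` and `pdf_to_markdown_file` and the file names are illustrative assumptions, not part of the library.

```python
from pathlib import Path


def save_markdown(md_text: str, out_path: str) -> Path:
    """Write Markdown text to out_path as UTF-8 bytes and return the path."""
    path = Path(out_path)
    path.write_bytes(md_text.encode("utf-8"))
    return path


def pdf_to_markdown_file(pdf_path: str, out_path: str) -> Path:
    """Convert a PDF to Markdown and store the result on disk (hypothetical helper)."""
    # Imported lazily so save_markdown stays usable without the package installed.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path)  # Markdown for the whole document
    return save_markdown(md_text, out_path)


# Usage sketch; "input.pdf" and "output.md" are placeholder file names:
# pdf_to_markdown_file("input.pdf", "output.md")
```

Writing the Markdown to a file keeps the extraction step separate from whatever chunking or indexing step consumes it later.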

Extracting a file as a LlamaIndex document

PyMuPDF4LLM supports direct conversion to a LlamaIndex document. The file is first converted to Markdown, and the result is then returned as a LlamaIndex document, as follows:

import pymupdf4llm

# Convert the PDF to Markdown, then wrap the result as LlamaIndex documents
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")
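
To inspect what the reader returned, a small helper along these lines may be useful. It is only a sketch: `summarize_docs` is a hypothetical name, and it assumes each returned object exposes a `text` string and a `metadata` dict, as LlamaIndex `Document` objects generally do.

```python
def summarize_docs(docs, preview_chars: int = 80):
    """Return one (metadata, preview) pair per document."""
    summaries = []
    for doc in docs:
        # getattr with defaults keeps the sketch tolerant of objects
        # that lack the assumed attributes.
        text = getattr(doc, "text", "") or ""
        metadata = getattr(doc, "metadata", {}) or {}
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(text.split())[:preview_chars]
        summaries.append((dict(metadata), preview))
    return summaries


# Usage sketch (requires pymupdf4llm and LlamaIndex to be installed):
# import pymupdf4llm
# docs = pymupdf4llm.LlamaMarkdownReader().load_data("input.pdf")
# for metadata, preview in summarize_docs(docs):
#     print(metadata, preview)
```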

Using with PyMuPDF Pro

For Office document support, PyMuPDF4LLM works with PyMuPDF Pro. With PyMuPDF Pro installed, Office documents can be converted in the same way as PDFs:

import pymupdf4llm
import pymupdf.pro

pymupdf.pro.unlock()  # enable PyMuPDF Pro features before converting
md_text = pymupdf4llm.to_markdown("sample.doc")

As this shows, PyMuPDF Pro functionality is available within the PyMuPDF4LLM context.
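
When a script handles both PDFs and Office files, one possible pattern is to unlock PyMuPDF Pro only for files that need it. The sketch below rests on assumptions: the suffix list in `OFFICE_SUFFIXES` is illustrative rather than an authoritative list of formats supported by PyMuPDF Pro, and `needs_pro` and `to_markdown_any` are hypothetical helper names.

```python
from pathlib import Path

# Assumption: suffixes treated as Office formats in this sketch.
OFFICE_SUFFIXES = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def needs_pro(path: str) -> bool:
    """Return True if the file suffix is in the assumed Office set."""
    return Path(path).suffix.lower() in OFFICE_SUFFIXES


def to_markdown_any(path: str) -> str:
    """Convert a PDF or Office file to Markdown, unlocking Pro only when needed."""
    import pymupdf4llm  # lazy import: only needed when converting

    if needs_pro(path):
        import pymupdf.pro  # requires PyMuPDF Pro to be installed

        pymupdf.pro.unlock()
    return pymupdf4llm.to_markdown(path)


# Usage sketch; "sample.doc" is the placeholder file name used above:
# md_text = to_markdown_any("sample.doc")
```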

API

See the PyMuPDF4LLM API.

Further Resources

Sample code

Blogs
