I am trying to extract text from various PDF documents to use in an NLP project. When I call page.extract_text(), random whitespace appears inside the extracted words, even though the PDF document has no spaces at those positions.
Environment

Using VS Code and running via Command Prompt.

    $ python -m platform
    Windows-10-10.0.22621-SP0
    $ python -c "import PyPDF2;print(PyPDF2.__version__)"
    2.12.1

Code + PDF
This is a minimal, complete example that shows the issue:
test_doc.pdf
(The PDF was generated using the default settings in Microsoft Word.) It looks like this:
The code is:
    import os

    from PyPDF2 import PdfReader, __version__

    pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))
    print(f"PyPDF2=={__version__}")

    # Concatenate the extracted text of every page.
    text = ""
    for page in pdf.pages:
        page_content = page.extract_text()
        text = text + page_content
    print(text)

Output
PyPDF2==2.12.1
This is a test document by Ethan Nelson.
Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for
testing purposes : 341 Maple st Paytonville Maine 45681.
Anyway, there are random whitespaces here .
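
As a stopgap while the extraction itself is unresolved, a post-processing pass along these lines might remove the stray spaces around punctuation. This is only a sketch: `tidy_punctuation_spacing` is a hypothetical helper of my own, not part of PyPDF2, and it cannot repair splits inside words such as "ph one", since nothing in the text alone distinguishes those from real word boundaries.

```python
import re


def tidy_punctuation_spacing(text: str) -> str:
    """Remove stray spaces before closing punctuation and after opening brackets."""
    # Drop spaces/tabs that precede . , ; : ! ? or a closing bracket.
    text = re.sub(r"[ \t]+([.,;:!?)\]])", r"\1", text)
    # Drop spaces/tabs that follow an opening bracket.
    text = re.sub(r"([(\[])[ \t]+", r"\1", text)
    return text


sample = "Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber ."
print(tidy_punctuation_spacing(sample))
```

On the sample line this cleans up "( 000)" and the space before each period, but "ph one mu mber" is left as it is, so it does not address the underlying problem.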