Showing content from https://github.com/sdtblck/PDFextract below:

sdtblck/PDFextract: Extracting pdfs using pdfminer.six and pyPDF2

Extracts text from PDFs using pdfminer.six and PyPDF2.

pip install -r requirements.txt

python pdf_extract.py

By default, the command above parses all PDFs in 'samples' and saves the output .txt files to 'output'. To parse a different folder of PDFs, pass its path with --path_to_folder; to change the output folder, use --out_path.

For example: python pdf_extract.py --path_to_folder /Users/user/my_pdfs --out_path /Users/documents/parsed_pdfs
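
The folder-in, folder-out behaviour described above can be sketched in Python. This is a minimal illustration, not the repository's code: the helper name plan_outputs is hypothetical, and the rule that each output file is named after the PDF's stem with a .txt extension is an assumption.

```python
from pathlib import Path


def plan_outputs(path_to_folder="samples", out_path="output"):
    """Pair each PDF in path_to_folder with a .txt path inside out_path.

    Assumption: one .txt file per PDF, named after the PDF's stem.
    The defaults mirror the folder names mentioned in the README.
    """
    in_dir = Path(path_to_folder)
    out_dir = Path(out_path)
    pairs = []
    for pdf in sorted(in_dir.iterdir()):
        # Only regular files with a .pdf extension (case-insensitive) count.
        if pdf.is_file() and pdf.suffix.lower() == ".pdf":
            pairs.append((pdf, out_dir / (pdf.stem + ".txt")))
    return pairs
```

Calling plan_outputs("/Users/user/my_pdfs", "/Users/documents/parsed_pdfs") would list the (input, output) path pairs for the example command without reading or writing any file contents.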

usage: pdf_extract.py [-h] [--path_to_folder PATH_TO_FOLDER]
                      [--out_path OUT_PATH] [-nf] [--size SIZE]

CLI for PDFextract - extracts plaintext from PDF files

optional arguments:
  -h, --help            show this help message and exit
  --path_to_folder PATH_TO_FOLDER
                        Path to folder containing pdfs
  --out_path OUT_PATH   Output location for final .txt file
  -nf, --no_filter      turn off cleaning & filtering resulting txt files
  --size SIZE           Do not process files larger than this size per page in
                        bytes (mostly images) - default 300000
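
As a rough guide to how these options fit together, the help text above could be produced by an argparse setup along the following lines. This is a sketch under stated assumptions, not the actual contents of pdf_extract.py: the function names build_parser and too_large are hypothetical, the 'samples' and 'output' defaults are taken from the description earlier in this README, and the per-page size check is one plausible reading of "size per page in bytes" (file size divided by page count).

```python
import argparse


def build_parser():
    """Build a parser whose options match the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="pdf_extract.py",
        description="CLI for PDFextract - extracts plaintext from PDF files",
    )
    parser.add_argument("--path_to_folder", type=str, default="samples",
                        help="Path to folder containing pdfs")
    parser.add_argument("--out_path", type=str, default="output",
                        help="Output location for final .txt file")
    parser.add_argument("-nf", "--no_filter", action="store_true",
                        help="turn off cleaning & filtering resulting txt files")
    parser.add_argument("--size", type=int, default=300000,
                        help="Do not process files larger than this size per "
                             "page in bytes (mostly images) - default 300000")
    return parser


def too_large(file_size_bytes, num_pages, max_bytes_per_page=300000):
    """Return True if a PDF exceeds the per-page size limit.

    Assumption: "size per page" means total file size / number of pages.
    A file with no pages is treated as skippable (also an assumption).
    """
    if num_pages <= 0:
        return True
    return file_size_bytes / num_pages > max_bytes_per_page
```

With this sketch, build_parser().parse_args(["-nf", "--size", "100000"]) yields a namespace with no_filter set and a stricter size limit, and too_large can then be applied to each PDF before extraction.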
