Showing content from https://www.npmjs.com/package/pdf-text-extract below:

PDF Text Extract

Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.

Installation

npm install --save pdf-text-extract

You will need the pdftotext binary available on your path. Packages are available for many different operating systems.

See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command.

Usage

As a module

extract(filePath, [options], [pdftotextcommand], callback)

Options and pdftotextcommand are not required.
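
Since both middle arguments are optional, the function has to work out which argument is the callback. The following is a minimal sketch of one way such a signature can be resolved; normalizeArgs is a hypothetical helper written for illustration, and the default command name 'pdftotext' is an assumption based on the description above, not code taken from the module.

```javascript
// Sketch only: one way to resolve the optional arguments of
// extract(filePath, [options], [pdftotextcommand], callback).
// normalizeArgs is a hypothetical helper, not part of pdf-text-extract.
function normalizeArgs (filePath, options, pdfToTextCommand, callback) {
  if (typeof options === 'function') {
    // Called as extract(filePath, callback)
    callback = options
    options = {}
    pdfToTextCommand = 'pdftotext' // assumed default command name
  } else if (typeof pdfToTextCommand === 'function') {
    // Called as extract(filePath, options, callback)
    callback = pdfToTextCommand
    pdfToTextCommand = 'pdftotext'
  }
  return {
    filePath: filePath,
    options: options || {},
    command: pdfToTextCommand || 'pdftotext',
    callback: callback
  }
}
```

With this approach the callback is always the last argument supplied, whichever of the three call forms is used.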

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})

The output will be an array where each entry is a page of text. If you want a single string containing all pages, set the option splitPages: false.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})
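
To illustrate the difference between the two result shapes: pdftotext conventionally separates pages in its output with a form feed character (\f). The sketch below assumes that convention and shows how a single string could be split into pages. splitPages here is a hypothetical standalone function, not the module's internal code.

```javascript
// Sketch only: split raw pdftotext-style output into pages.
// Assumes pages are separated by the form feed character (\f).
function splitPages (text) {
  var pages = text.split('\f')
  // A trailing form feed leaves an empty last entry; drop it.
  if (pages.length > 0 && pages[pages.length - 1] === '') {
    pages.pop()
  }
  return pages
}
```

For example, splitPages('first page\fsecond page\f') yields an array with two entries, one per page.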

You can set the following options:

firstPage: First page to extract
lastPage: Last page to extract
resolution: in dpi, as is specified by pdftotext -r
crop: Should be an object { x:x, y:y, w:w, h:h }
layout: Should be either layout, raw or htmlmeta. Default: layout
encoding: Should be either UCS-2, ASCII7, Latin1, UTF-8, ZapfDingbats or Symbol. Default: UTF-8
eol: End of line convention. One of either: unix, dos or mac
ownerPassword: Owner password (for encrypted files)
userPassword: User password (for encrypted files)
splitPages: If true, the result will be an array of pages. Default: true.
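
Most of these options correspond to flags of the pdftotext command. As a rough sketch of that correspondence, the hypothetical buildArgs function below turns an options object into an argument list. The flags used (-f, -l, -r, -x, -y, -W, -H, -layout, -raw, -htmlmeta, -enc, -eol, -opw, -upw) are those of the Poppler pdftotext tool; the exact mapping and ordering the module uses internally is an assumption.

```javascript
// Sketch only: map the documented options onto pdftotext flags.
// buildArgs is a hypothetical helper, not the module's own code.
function buildArgs (filePath, options) {
  options = options || {}
  var args = []
  if (options.firstPage) args.push('-f', String(options.firstPage))
  if (options.lastPage) args.push('-l', String(options.lastPage))
  if (options.resolution) args.push('-r', String(options.resolution))
  if (options.crop) {
    // Crop area: top-left corner (x, y), width W and height H
    args.push('-x', String(options.crop.x), '-y', String(options.crop.y))
    args.push('-W', String(options.crop.w), '-H', String(options.crop.h))
  }
  // layout, raw or htmlmeta; the documented default is layout
  args.push('-' + (options.layout || 'layout'))
  // The documented default encoding is UTF-8
  args.push('-enc', options.encoding || 'UTF-8')
  if (options.eol) args.push('-eol', options.eol)
  if (options.ownerPassword) args.push('-opw', options.ownerPassword)
  if (options.userPassword) args.push('-upw', options.userPassword)
  // A '-' in place of the output file makes pdftotext write to stdout
  args.push(filePath, '-')
  return args
}
```

Calling buildArgs('a.pdf', { firstPage: 2, lastPage: 3 }) would produce the arguments for extracting only pages 2 and 3 with the default layout and encoding.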

If needed, you can pass optional arguments to the extract function. These will be passed to the child_process.spawn call.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log accepts a label plus a value; console.dir treats its second argument as options
  console.log('extracted pages', pages)
})

You can also override the pdftotext command if the binary is installed in a location that is not on the PATH environment variable.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log accepts a label plus a value; console.dir treats its second argument as options
  console.log('extracted pages', pages)
})
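
If you prefer promises to callbacks, the callback-last signature is straightforward to wrap. The following is a minimal sketch; extractAsync is a hypothetical helper that receives the extract function as its first argument, so it makes no assumptions about the module beyond the (err, result) callback shape shown above.

```javascript
// Sketch only: wrap an extract-style function (callback last) in a Promise.
// extractAsync is a hypothetical helper, not part of pdf-text-extract.
function extractAsync (extractFn, filePath, options) {
  return new Promise(function (resolve, reject) {
    extractFn(filePath, options || {}, function (err, pages) {
      if (err) {
        reject(err)
        return
      }
      resolve(pages)
    })
  })
}
```

Usage would look like extractAsync(require('pdf-text-extract'), filePath).then(function (pages) { console.dir(pages) }), with errors handled in a .catch handler.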

As a command line tool

npm install -g pdf-text-extract

Execute with the filePath as an argument. The output will be a JSON-formatted array of pages.

pdf-text-extract ./test/data/multipage.pdf
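
Because the command prints JSON, another program can consume its output directly. A minimal sketch, assuming the entire standard output is a single JSON array of page strings; parsePages is a hypothetical helper.

```javascript
// Sketch only: parse the command's stdout into an array of pages.
// Assumes stdout contains exactly one JSON array.
function parsePages (stdout) {
  var pages = JSON.parse(stdout)
  if (!Array.isArray(pages)) {
    throw new TypeError('expected a JSON array of pages')
  }
  return pages
}
```

The stdout string could come, for instance, from Node's child_process.execFile('pdf-text-extract', [filePath], callback), whose callback receives (err, stdout, stderr).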

Test