The "Pages" epic has been promised for some time. I think it is important to make some progress, preferably for 12.0. The approach I am proposing for the short term is to concentrate on aspects that would be of benefit to pdfarranger. This provides some focus while at the same time providing an opportunity to test drive any enhancements to see how they are performing from a user perspective.
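Taking this approach, the initial targets would be to make page assembly and transformations using job JSON more practical and to preserve hyperlinks from foreign pdfs, with the next target of preserving outlines from foreign pdfs.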
make page assembly and transformations using job JSON more practical
There are two aspects to this:
allow the input sources to be specified outside of the --pages option
See QPDF "pages epic" qpdf#1104 (comment). This would also be a step towards allowing memory buffers as an alternative input source for programmatic use of the job interface.
allow transformations to be specified inside the --pages option
See QPDF "pages epic" qpdf#1104 (comment). The current way of specifying transformations such as rotations is not practical from job JSON. At present it is easier to assemble the output QPDF using job JSON, apply the rotations to the QPDF object directly, and then use job JSON again to write the output. Having this option would also be useful for the CLI: given --pages file1.pdf 3-z file2.pdf, what are the page numbers to use to rotate file2?
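To make the CLI question concrete, here is a minimal sketch of the bookkeeping a user currently has to do by hand. It assumes that rotation page ranges refer to page numbers in the assembled output, and it treats file2.pdf without a range as all pages (written explicitly as 1-z). The helpers expand_range and output_positions are hypothetical, handle only simple a-b ranges, and are not part of qpdf.

```python
def expand_range(page_range: str, n_pages: int) -> list[int]:
    """Expand a simplified 'a-b' range; 'z' stands for the last page."""
    def page(token: str) -> int:
        return n_pages if token == "z" else int(token)

    first, sep, last = page_range.partition("-")
    start = page(first)
    end = page(last) if sep else start
    return list(range(start, end + 1))


def output_positions(specs, page_counts):
    """Map each input file to the (first, last) output page numbers it occupies.

    Simplification: each file is assumed to appear only once in specs.
    """
    positions = {}
    next_page = 1
    for name, page_range in specs:
        count = len(expand_range(page_range, page_counts[name]))
        positions[name] = (next_page, next_page + count - 1)
        next_page += count
    return positions


specs = [("file1.pdf", "3-z"), ("file2.pdf", "1-z")]
# The answer for file2 changes with the length of file1.
print(output_positions(specs, {"file1.pdf": 10, "file2.pdf": 4}))
print(output_positions(specs, {"file1.pdf": 20, "file2.pdf": 4}))
```

The pages of file2 land at different output positions depending on how many pages file1 has, which is why attaching the transformation to the page spec itself would be more convenient.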
For both items the next step is to refactor QPDFJob::handlePageSpecs. This should also provide an opportunity to substantially reduce qpdf's footprint when combining pages from multiple files, where qpdf can be very memory hungry. For example, extracting a single page from the pdf-spec test file uses over 22.5MB, and the footprint grows pretty much linearly with the number of input files. By the time we reach the default keep-files-open threshold we are looking at a 4.5GB footprint.
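As a quick check of the figures above, the following sketch applies the linear growth described. It assumes a default keep-files-open threshold of 200 files, which is what the two quoted numbers imply. The per-file cost is the measurement from the text, not something the code measures.

```python
PER_FILE_MB = 22.5               # footprint quoted above for one pdf-spec input
KEEP_FILES_OPEN_THRESHOLD = 200  # assumed default threshold


def footprint_mb(n_files: int, per_file_mb: float = PER_FILE_MB) -> float:
    """Estimated footprint if every input stays in memory (linear model)."""
    return n_files * per_file_mb


total_mb = footprint_mb(KEEP_FILES_OPEN_THRESHOLD)
print(f"{total_mb:.0f} MB, i.e. about {total_mb / 1000:.1f} GB")
```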
Splitting the work of handlePageSpecs into two stages (copying the page objects into the primary input, followed by inserting the pages into the page tree in the required order) would allow us to have only two QPDF objects in memory at a time: the primary input and one other.
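The two-stage idea can be modelled as follows. This is only a toy sketch: Doc and Loader are hypothetical stand-ins for QPDF objects and file handling, pages are plain strings, and nothing here reflects how handlePageSpecs is actually written. The point is that stage 1 visits one foreign file at a time, so the peak number of open documents stays at two.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for an open QPDF object."""
    name: str
    pages: list


class Loader:
    """Hypothetical file opener that records how many documents are open at once."""

    def __init__(self, files):
        self.files = files  # maps file name -> list of page labels
        self.open_count = 0
        self.peak = 0

    def open(self, name):
        self.open_count += 1
        self.peak = max(self.peak, self.open_count)
        return Doc(name, list(self.files[name]))

    def close(self, doc):
        self.open_count -= 1


def assemble(loader, primary_name, specs):
    """specs is the required output order as (file name, 1-based page number)."""
    primary = loader.open(primary_name)

    # Group the requested pages by file so each foreign file is opened once.
    wanted = {}
    for name, page_no in specs:
        wanted.setdefault(name, set()).add(page_no)

    # Stage 1: copy page objects into the primary, one foreign file at a time.
    copied = {}
    for name, page_nos in wanted.items():
        doc = primary if name == primary_name else loader.open(name)
        for page_no in page_nos:
            copied[(name, page_no)] = doc.pages[page_no - 1]
        if doc is not primary:
            loader.close(doc)

    # Stage 2: build the page tree in the required order from the copies.
    primary.pages = [copied[spec] for spec in specs]
    return primary


files = {"a.pdf": ["a1", "a2", "a3"], "b.pdf": ["b1", "b2"], "c.pdf": ["c1"]}
loader = Loader(files)
order = [("b.pdf", 2), ("a.pdf", 1), ("c.pdf", 1), ("b.pdf", 1)]
result = assemble(loader, "a.pdf", order)
print(result.pages, "peak open documents:", loader.peak)
```

Interleaved specs such as b, a, c, b do not force b.pdf to stay open, because ordering is deferred to stage 2.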
preserve hyperlinks from foreign pdfs
The next main step is to import named destinations from foreign files. The main issue seems to be dealing with name clashes.
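One possible way to handle name clashes, shown purely as an illustration, is to rename a clashing foreign destination and keep a mapping so that links on the imported pages can be rewritten to the new name. The function import_destinations, the tag argument, and the name.tag.n naming scheme are all hypothetical. Destinations are modelled as a plain dict rather than a PDF name tree.

```python
def import_destinations(primary: dict, foreign: dict, tag: str) -> dict:
    """Merge foreign named destinations into primary, renaming on clashes.

    Returns a mapping from each foreign name to the name it received in
    primary, so that links copied from the foreign file can be updated.
    """
    renamed = {}
    for name, target in foreign.items():
        new_name = name
        counter = 0
        # Keep generating candidates until the name is unused in primary.
        while new_name in primary:
            counter += 1
            new_name = f"{name}.{tag}.{counter}"
        primary[new_name] = target
        renamed[name] = new_name
    return renamed


primary = {"intro": "primary page 1"}
foreign = {"intro": "foreign page 1", "ch1": "foreign page 2"}
mapping = import_destinations(primary, foreign, tag="f2")
print(mapping)
print(sorted(primary))
```

Names that do not clash are kept unchanged, so only links to renamed destinations would need rewriting.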
preserve outlines from foreign pdfs
This will also rely on the import of named destinations and therefore logically follows preservation of links.
The main difficulty I see is how outlines from multiple pdfs should be merged. One idea is to simply concatenate the top-level entries from the various inputs and provide a facility to dump the outlines to a JSON file for editing and reloading. Trying to come up with an interface that allows users to specify how outlines should be combined feels challenging. My gut feeling is that in most of the more complex cases it will require user interaction and therefore should not be handled by qpdf directly.
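The concatenate-and-dump idea could look roughly like this. The outline schema (title, dest, kids) and the function names are assumptions made for the sketch. No such JSON format is defined in this proposal.

```python
import json


def merge_outlines(outlines_per_input):
    """Concatenate the top-level outline entries of each input, in input order."""
    merged = []
    for outline in outlines_per_input:
        merged.extend(outline)
    return merged


def dump_outline(outline) -> str:
    """Serialise the merged outline so a user can edit it by hand."""
    return json.dumps(outline, indent=2)


def load_outline(text: str):
    """Reload an outline after editing."""
    return json.loads(text)


first = [{"title": "Part A", "dest": "a.intro", "kids": [
    {"title": "A.1", "dest": "a.1", "kids": []}]}]
second = [{"title": "Part B", "dest": "b.intro", "kids": []}]

merged = merge_outlines([first, second])
text = dump_outline(merged)
print(text)
```

A user, or a tool such as pdfarranger, could then reorder or nest the entries in the JSON and feed the result back in, which keeps the interactive part outside qpdf.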