Description
For PDF-1.5 files with a compressed trailer and compressed object streams (/ObjStm), all other PDF software, including qpdf and pdfinfo, uses the correct page dimensions read from /MediaBox:
$ pdfinfo 'x.pdf' | grep pts
Page size: 596 x 842 pts (A4)
However, when using Pandoc and outputting to .docx:
$ echo '' > x.md &&
pandoc x.md -o x.docx
the image dimensions are incorrectly detected, and the image is inserted into the .docx with incorrect aspect ratio and size.
Manually walking through the PDF objects using qpdf shows a compressed cross-reference (xref) stream and compressed object streams (/ObjStm) in use. A compressed object stream packs multiple objects into a single /FlateDecode-compressed stream:
$ for object in trailer 18 40 1 2 ; do qpdf --show-object=$object x.pdf -- ; done
<< /Filter /FlateDecode /Info 39 0 R /Length 155 /Root 40 0 R /Size 43 /Type /XRef /W [ 1 2 2 ] >>
% Object is stream. Dictionary:
<< /Filter /FlateDecode /First 17 /Length 41 0 R /N 3 /Type /ObjStm >>
<< /Pages 1 0 R /Type /Catalog >>
<< /Count 2 /Kids [ 2 0 R 12 0 R ] /Type /Pages >>
<< /Contents 4 0 R /Group << /CS /DeviceRGB /I true /S /Transparency /Type /Group >> /MediaBox [ 0 0 596 842 ] /Parent 1 0 R /Resources 3 0 R /Type /Page >>
The /Page object with its /MediaBox [left bottom right top] is there, but it can only be obtained by (fully) walking, parsing, and unpacking the PDF file.
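To illustrate what "unpacking" means here, the following is a minimal Python sketch, not Pandoc's implementation. It builds a small synthetic object stream (the object numbers and dictionaries mirror the qpdf output above, but the byte offsets are made up), then decompresses it and reads the /MediaBox. The function names are hypothetical, and the sketch assumes the /MediaBox sits directly in the page dictionary as literal numbers; a real reader would also have to decode the xref stream, resolve indirect references, and handle a /MediaBox inherited from the /Pages parent.

```python
import re
import zlib

def unpack_object_stream(raw, n, first):
    """Split one /ObjStm into {object number: object bytes}.

    raw   -- the still-compressed stream data
    n     -- the /N entry (number of objects in the stream)
    first -- the /First entry (offset of the first object after decoding)
    """
    data = zlib.decompress(raw)
    # The decoded stream starts with N pairs "objnum offset";
    # each offset is relative to /First.
    header = data[:first].split()
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for i, (num, off) in enumerate(pairs):
        start = first + off
        end = first + pairs[i + 1][1] if i + 1 < n else len(data)
        objects[num] = data[start:end].strip()
    return objects

# Assumption: the box is written as four literal numbers, not indirect references.
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]"
)

def media_box_size(obj):
    """Return (width, height) in points, or None if obj has no /MediaBox."""
    m = MEDIABOX.search(obj)
    if m is None:
        return None
    left, bottom, right, top = (float(g) for g in m.groups())
    return (right - left, top - bottom)

def build_demo_stream():
    """Build a synthetic compressed object stream for demonstration only."""
    bodies = {
        40: b"<< /Pages 1 0 R /Type /Catalog >>",
        1: b"<< /Count 2 /Kids [ 2 0 R 12 0 R ] /Type /Pages >>",
        2: b"<< /MediaBox [ 0 0 596 842 ] /Parent 1 0 R /Type /Page >>",
    }
    index, payload = [], b""
    for num, body in bodies.items():
        index.append(b"%d %d" % (num, len(payload)))
        payload += body + b"\n"
    header = b" ".join(index) + b"\n"
    return zlib.compress(header + payload), len(bodies), len(header)

if __name__ == "__main__":
    raw, n, first = build_demo_stream()
    objs = unpack_object_stream(raw, n, first)
    print(media_box_size(objs[2]))
```

A dimension lookup that only scans the uncompressed bytes of the file for /MediaBox would find nothing in such a file, because the page dictionary exists only inside the compressed stream.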
The PDF-1.5 files in question are not particularly exotic; they were printed from Firefox, via Cairo:
obj 39 0
<<
/CreationDate (D:20240611224431+02'00)
/Creator (Mozilla Firefox 126.0)
/Producer (cairo 1.17.4 \(https://cairographics.org\))
>>
(See previous issue "Wrong image size for pdf images in docx" #4322; newly opened here at the request of @jgm.)