[cl-pdf-devel] Some problems with pdf-parser
Piotr Chamera
piotr_chamera at poczta.onet.pl
Fri Jun 10 16:48:25 UTC 2011
Hi,
I just started with cl-pdf and it works great for me :)
but I found some problems in pdf-parser and need advice
on how to fix them. I am a rather novice Lisper, so I may be wrong
in my guesses below...
1. In file cl-pdf, function find-cross-reference-start:
the function searches for 'startxref' in the buffer _from the beginning_
and can find the wrong place if there are two such sections at the end
of the file (in the buffer), e.g. after a small incremental change
appended at the end of the file.
Proposal: change
(let ((position (search "startxref" buffer)))
to
(let ((position (search "startxref" buffer :from-end t)))
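A minimal sketch of the difference, using a made-up buffer string with two 'startxref' markers. The buffer contents and the variable names below are illustrative only and are not taken from cl-pdf:

```lisp
;; Hypothetical tail of a PDF with one incremental update appended:
;; the older startxref comes first, the newer one last.
(defparameter *sample-buffer*
  "startxref 116 %%EOF startxref 89502 %%EOF")

;; The default search returns the leftmost match (the older section).
(defparameter *first-match*
  (search "startxref" *sample-buffer*))

;; With :from-end t, search returns the start of the rightmost match
;; (the section written last).
(defparameter *last-match*
  (search "startxref" *sample-buffer* :from-end t))
```

Here *first-match* points at the older marker and *last-match* at the one closest to the end of the buffer, which is the one a reader should start from.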
2. In file cl-pdf, function make-indirect-object:
(defun make-indirect-object (obj-number gen-number position)
  (let ((object (or (car (gethash (cons obj-number gen-number)
                                  *indirect-objects*))
                    (make-instance 'indirect-object
                                   :obj-number obj-number
                                   :gen-number gen-number
                                   :content :unread
                                   :no-link t))))
    (setf (gethash (cons obj-number gen-number) *indirect-objects*)
          (cons object position))
    object))
I am working on a file generated by Adobe Acrobat Distiller
and then cropped in Adobe Acrobat, so at the end of the file there are
a few modified objects with duplicate numbers (and generations,
which may be a bug in Acrobat?). When indirect-object objects
are read from the file (in the order of the cross-reference tables, which
are read from newest to oldest), the newer ones are overwritten
by older ones with the same number. We end up with a readable PDF,
but with some object revisions dropped.
I have added some debugging prints to the above function (and some
others), and for a sample file I got this reading order:
startxref position: 89502
xref position: 89502
making obj: 4 0 position 85386
making obj: 5 0 position 89106
making obj: 8 0 position 89309
making obj: 7 0 position 0
xref position: 116
making obj: 6 0 position 16
making obj: 7 0 position 1150
making obj: 8 0 position 1227
making obj: 9 0 position 1411
making obj: 10 0 position 1554
(..)
making obj: 37 0 position 936
xref position: 85210
making obj: 1 0 position 81250
making obj: 2 0 position 81284
making obj: 3 0 position 81308
making obj: 4 0 position 81359
making obj: 5 0 position 85007
This shows that the file contains 4 duplicated objects and that
they are overwritten by older versions (4 0, 5 0, 8 0, 7 0).
I think the solution would be to drop older objects when a
newer version with the same number and generation has already been read?
Something like this:
(defun make-indirect-object (obj-number gen-number position)
  ;; The table stores (indirect-object . position) pairs.
  (let ((object (gethash (cons obj-number gen-number) *indirect-objects*)))
    (if object
        ;; Already registered while reading a newer xref section: keep it.
        (progn
          (format T "obj alredy present: ~s ~s at position ~s (dropped older one at position ~s)~%"
                  obj-number gen-number
                  (cdr object) position)
          (car object))
        ;; Seen for the first time: create and register it.
        (progn
          (format T "making obj: ~s ~s position ~s ~%"
                  obj-number gen-number position)
          (let ((new-object (make-instance 'indirect-object
                                           :obj-number obj-number
                                           :gen-number gen-number
                                           :content :unread
                                           :no-link t)))
            (setf (gethash (cons obj-number gen-number) *indirect-objects*)
                  (cons new-object position))
            new-object)))))
which gives, on the same example file:
startxref position: 89502
xref position: 89502
making obj: 4 0 position 85386
making obj: 5 0 position 89106
making obj: 8 0 position 89309
making obj: 7 0 position 0
xref position: 116
making obj: 6 0 position 16
obj alredy present: 7 0 at position 0 (dropped older one at position 1150)
obj alredy present: 8 0 at position 89309 (dropped older one at position 1227)
making obj: 9 0 position 1411
making obj: 10 0 position 1554
(...)
making obj: 37 0 position 936
xref position: 85210
making obj: 1 0 position 81250
making obj: 2 0 position 81284
making obj: 3 0 position 81308
obj alredy present: 4 0 at position 85386 (dropped older one at position 81359)
obj alredy present: 5 0 at position 89106 (dropped older one at position 85007)
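The "first entry wins" rule can also be tried in isolation. The sketch below is not cl-pdf code: *demo-objects* and register-position are hypothetical names, and the table stores bare positions instead of (object . position) pairs. Note that a table keyed by freshly consed (number . generation) pairs needs an EQUAL test, otherwise lookups never match:

```lisp
;; Hypothetical stand-in for the object table, keyed by
;; (obj-number . gen-number). EQUAL is required because every
;; lookup builds a new cons.
(defparameter *demo-objects* (make-hash-table :test #'equal))

(defun register-position (obj-number gen-number position)
  "Store POSITION for the object unless an entry already exists.
Returns the position that ends up in the table."
  (let ((key (cons obj-number gen-number)))
    (multiple-value-bind (existing found) (gethash key *demo-objects*)
      (if found
          existing                        ; keep the entry read first
          (setf (gethash key *demo-objects*) position)))))

;; Same order as in the log: the newest xref section is read first.
(register-position 4 0 85386)   ; from the newest section
(register-position 4 0 81359)   ; older duplicate, ignored
(register-position 9 0 1411)    ; no duplicate
```

After these calls the table still maps 4 0 to position 85386, the value registered first.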
But this reveals another problem, in read-xref-and-trailer:
(defun read-xref-and-trailer (position)
  (let (first-trailer)
    (loop
      (format T "xref position: ~s~%" position)
      (read-cross-reference-subsections position)
      (let* ((trailer (read-trailer)))
        (unless first-trailer (setf first-trailer trailer))
        (let ((prev-position (get-dict-value trailer "/Prev")))
          (if prev-position
              (setq position prev-position)
              (return first-trailer)))))))
If I read it correctly, it reads the trailers from the most recent to the
oldest and returns the oldest one instead of the first one read? So in
read-pdf the document gets incorrect information.
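One way to check which trailer such a loop hands back is to run the same control flow on toy data. Everything below is a simulation under stated assumptions: *toy-trailers*, toy-read-trailer and walk-trailers are hypothetical names, the trailers are plain property lists, and :prev stands in for the "/Prev" entry. The three positions are the xref positions from the log above:

```lisp
;; Toy trailers keyed by xref position; :prev plays the role of /Prev.
(defparameter *toy-trailers*
  '((89502 :label "first read"  :prev 116)
    (116   :label "second read" :prev 85210)
    (85210 :label "third read"  :prev nil)))

(defun toy-read-trailer (position)
  "Return the toy trailer (a plist) stored for POSITION."
  (cdr (assoc position *toy-trailers*)))

(defun walk-trailers (position)
  "Follow :prev links starting at POSITION.
Returns two values: the trailer kept by the UNLESS guard and the
list of positions in the order they were visited."
  (let (first-trailer visited)
    (loop
      (let ((trailer (toy-read-trailer position)))
        (push position visited)
        ;; Same guard as in the quoted function: only set once.
        (unless first-trailer (setf first-trailer trailer))
        (let ((prev (getf trailer :prev)))
          (if prev
              (setq position prev)
              (return (values first-trailer (nreverse visited)))))))))
```

In this toy model the guard keeps the trailer read first, that is, the one found at the startxref position. Whether the real function behaves the same way depends on code not shown here, such as read-trailer and get-dict-value.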
Can someone review the above and tell me whether I am searching in the
right direction or am entirely wrong...
--
Regards,
Piotr Chamera