[cl-typesetting-devel] Serious hyphenation bug
Peter Heslin
pj at heslin.eclipse.co.uk
Wed Mar 29 22:27:47 UTC 2006
While playing with cl-typesetting, I have had the vague feeling that
it did not find as many hyphenation points as TeX, which seemed
strange, since it uses the same patterns. I tested this, using this
function:
(defun show-hyphens (string)
  (concatenate 'string
               (loop
                 ;; compute the hyphenation points once, not on every iteration
                 with points = (tt::hyphenate-string string)
                 for char across string
                 for i upfrom 0
                 appending (if (member i points)
                               (list #\- char)
                               (list char)))))
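To make the index convention explicit: show-hyphens puts a hyphen in front of the character at each index in the list of points. Below is a minimal sketch of the same marking step, with the points passed in by hand so that it runs without cl-typesetting loaded. The name mark-hyphens is hypothetical and not part of the library.

```lisp
;; Hypothetical helper, not part of cl-typesetting: the same marking logic
;; as show-hyphens, but the break points are supplied by the caller.
(defun mark-hyphens (string points)
  (with-output-to-string (out)
    (loop for char across string
          for i upfrom 0
          do (when (member i points)
               (write-char #\- out))   ; hyphen goes before index i
             (write-char char out))))

(mark-hyphens "torture" '(3))        ; => "tor-ture"
(mark-hyphens "unconsumed" '(2 5))   ; => "un-con-sumed"
```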
Here is the output from TeX's \showhyphens command for a piece of text:
but tor-ture with-out end still urges, and a fiery del-uge, fed with
ever-burning sul-phur un-con-sumed. such place eter-nal jus-tice had
pre-pared for those re-bel-lious? here their prison or-dained in
ut-ter dark-ness, and their por-tion set as far re-moved from god and
light of heaven, as from the cen-tre thrice to the ut-most pole. oh,
how un-like the place from whence they fell.
Here is cl-typesetting, using the show-hyphens function defined above:
but tor-ture without end still urges, and a fiery deluge, fed with
ever-burning sulphur un-con-sumed. such place eter-nal justice had
pre-pared for those re-bellious? here their pri-son or-dained in
utter dark-ness, and their por-tion set as far re-moved from god and
light of heaven, as from the centre thrice to the ut-most pole. oh,
how unlike the place from whence they fell.
Note that cl-typesetting misses 8 of the 19 hyphenation points that
TeX finds, and adds one (pri-son) that TeX does not, despite using the
same patterns.
I think I have discovered the cause of this bug.
In the file hyphenation-fp.lisp, the function hyphen-make-trie has
this comment:
;; Build a trie out of a sorted list
;; of pairs (word, hyph-points)
So it is important that the input list is sorted. This is done by this
line in the function read-hyphen-file:
(setq patterns (sort patterns #'hyphen-cmp-char-lists))
Here is that sort predicate:
(defun hyphen-cmp-char-lists (l1 l2)
  (let (result done)
    (loop for c1 = (pop l1)
          for c2 = (pop l2)
          while (and (characterp c1) (characterp c2) (not done))
          do
          (if (char< c1 c2)
              (setq result t done t)
              (if (char> c1 c2)
                  (setq done t)))
          finally (return (if done result nil)))))
It seems to me that this function will fail to sort the lists of chars
correctly when one of the lists represents an initial substring of the
other string, which is not uncommon.
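A minimal sketch of the failure case follows. The function old-cmp is a transcription of the predicate above under a hypothetical name, and it assumes that the finally clause is meant to return its value. When one list is a prefix of the other, the predicate answers NIL in both directions, so SORT treats the two lists as equal and is free to leave them in either order.

```lisp
;; Transcription of the quoted predicate under a hypothetical name,
;; assuming the FINALLY clause is meant to return its value.
(defun old-cmp (l1 l2)
  (let (result done)
    (loop for c1 = (pop l1)
          for c2 = (pop l2)
          while (and (characterp c1) (characterp c2) (not done))
          do (cond ((char< c1 c2) (setq result t done t))
                   ((char> c1 c2) (setq done t)))
          finally (return (if done result nil)))))

(defparameter *ab*  (coerce "ab" 'list))
(defparameter *abc* (coerce "abc" 'list))

(old-cmp *ab* *abc*)                 ; => NIL, "ab" is not before "abc"
(old-cmp *abc* *ab*)                 ; => NIL, and "abc" is not before "ab"
(old-cmp *ab* (coerce "ac" 'list))   ; => T, non-prefix cases still work
```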
The result of this bug is that the contents of the patterns variable
are only partially sorted, and so hyphen-make-trie generates a trie
that excludes many of the patterns. In fact, if you want to check it,
you can see that, at least for some initial letters, the trie
generated only includes patterns that correspond to a line in the
hyphenation file that begins with a dot or a number.
Here is my revised version of the sort predicate:
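(defun nix::hyphen-cmp-char-lists (l1 l2)
  (loop
    for c1 = (pop l1)
    for c2 = (pop l2)
    do (cond
         ;; l2 ran out first: l2 is a proper prefix of l1, so l1 sorts after it
         ((and (characterp c1) (not (characterp c2)))
          (return nil))
         ;; l1 ran out first: l1 is a proper prefix of l2, so l1 sorts before it
         ((and (not (characterp c1)) (characterp c2))
          (return t))
         ;; both ran out together: the lists are equal
         ((and (not (characterp c1)) (not (characterp c2)))
          (return))
         ((char< c1 c2)
          (return t))
         ((char> c1 c2)
          (return nil)))))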
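As a quick check, here is a sketch that applies the same logic to a few short words. The names new-cmp and sort-words are hypothetical and exist only so that the example runs without the nix package. A prefix now sorts before the longer list, which gives plain dictionary order.

```lisp
;; Same logic as the revised predicate, under a hypothetical name.
(defun new-cmp (l1 l2)
  (loop
    for c1 = (pop l1)
    for c2 = (pop l2)
    do (cond
         ((and (characterp c1) (not (characterp c2)))
          (return nil))
         ((and (not (characterp c1)) (characterp c2))
          (return t))
         ((and (not (characterp c1)) (not (characterp c2)))
          (return))
         ((char< c1 c2)
          (return t))
         ((char> c1 c2)
          (return nil)))))

;; Hypothetical helper: sort strings by comparing them as lists of chars.
(defun sort-words (words)
  (mapcar (lambda (chars) (coerce chars 'string))
          (sort (mapcar (lambda (w) (coerce w 'list)) words)
                #'new-cmp)))

(new-cmp (coerce "ab" 'list) (coerce "abc" 'list))   ; => T
(new-cmp (coerce "abc" 'list) (coerce "ab" 'list))   ; => NIL
(sort-words '("abc" "b" "ab" "a"))   ; => ("a" "ab" "abc" "b")
```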
When I use this function and re-load the american hyphen file, and run
my show-hyphens function again, I get a correct result, like so:
but tor-ture with-out end still urges, and a fiery del-uge, fed with
ever-burn-ing sul-phur un-con-sumed. such place eter-nal jus-tice had
pre-pared for those re-bel-lious? here their prison or-dained in
ut-ter dark-ness, and their por-tion set as far re-moved from god and
light of heaven, as from the cen-tre thrice to the ut-most pole. oh,
how un-like the place from whence they fell.
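Now cl-typesetting finds all of the hyphenation points that TeX did,
and the spurious pri-son is gone. The only remaining difference is
ever-burn-ing, which is expected, since TeX does not hyphenate a word
that already contains an explicit hyphen.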
So it looks from a very superficial test that this is the correct fix,
but I should emphasize that I do not understand most of the code in
hyphenation-fp.lisp, so I'm not entirely sure that my sort function is
correct. It would be nice if someone who understands the code fully
could check this out.
--
Peter Heslin (http://www.dur.ac.uk/p.j.heslin)