Proposed improvement to the HTML parser

Elias Mårtenson lokedhs at gmail.com
Mon Sep 5 05:40:36 UTC 2016


A few years back, I posted this request, but at the time I got no reply.
Since there has recently been some activity here, I'm asking this again.

Currently, closure-html throws an error when trying to parse HTML generated
by Microsoft Outlook. The reason for this is that Outlook generates a lot
of tags with a colon (:) in them, which closure-html considers a syntax
error.

I don't know whether it actually is a syntax error, but closure-html should be
lenient here and simply ignore these tags. That's what my fix does.
So far, I have had to advise my users to patch closure-html manually. It
would be very nice if this were integrated into the official version.

Here is the patch:

diff --git a/src/parse/html-parser.lisp b/src/parse/html-parser.lisp
index 1fdd457..4e45b81 100644
--- a/src/parse/html-parser.lisp
+++ b/src/parse/html-parser.lisp
@@ -106,7 +106,10 @@
      for (name value) on plist by #'cddr
      unless
        ;; better don't emit as HAX what would be bogus as SAX anyway
-       (string-equal name "xmlns")
+       (let ((s (string name))
+             (prefix "xmlns:"))
+         (or (string-equal s "xmlns")
+             (string-equal s prefix :end1 (min (length s) (length prefix)))))
      collect
      (let* ((n #+rune-is-character (coerce (symbol-name name) 'rod)
 	       #-rune-is-character (symbol-name name))
diff --git a/src/parse/sgml-parse.lisp b/src/parse/sgml-parse.lisp
index faa9029..a277ece 100644
--- a/src/parse/sgml-parse.lisp
+++ b/src/parse/sgml-parse.lisp
@@ -182,7 +182,8 @@
   (or (name-start-rune-p char)
       (digit-rune-p char)
       (rune= char #/.)
-      (rune= char #/-)))
+      (rune= char #/-)
+      (rune= char #/:)))

 (definline sloopy-name-rune-p (char)
   (or (name-rune-p char)
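
To make the first hunk easier to review, here is a minimal standalone sketch of the attribute-name test it introduces. The function name xmlns-attribute-name-p is hypothetical and not part of closure-html; the body mirrors the expression in the patch and is only meant to show which names would be skipped.

```lisp
;; Standalone sketch of the attribute-name test from the first hunk.
;; XMLNS-ATTRIBUTE-NAME-P is a hypothetical name, not part of closure-html.
(defun xmlns-attribute-name-p (name)
  "Return true if NAME is \"xmlns\" or starts with \"xmlns:\", ignoring case."
  (let ((s (string name))
        (prefix "xmlns:"))
    (or (string-equal s "xmlns")
        ;; Compare only the leading part of S against PREFIX. If S is
        ;; shorter than PREFIX the lengths differ, so STRING-EQUAL fails.
        (string-equal s prefix :end1 (min (length s) (length prefix))))))
```

With this definition, "xmlns" and "xmlns:o" are treated as namespace declarations and skipped, while names such as "href" or "xml" are still collected as ordinary attributes.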