Proposed improvement to the HTML parser
Elias Mårtenson
lokedhs at gmail.com
Mon Sep 5 05:40:36 UTC 2016
A few years back, I posted this request, but at the time I got no reply.
Since there has recently been some activity here, I'm asking this again.
Currently, closure-html throws an error when trying to parse HTML generated
by Microsoft Outlook. The reason for this is that Outlook generates a lot
of tags with a colon (:) in them, which closure-html considers a syntax
error.
I don't know if it actually is a syntax error, but closure-html should be
lenient here and simply ignore these tags. That's what my fix does.
For now, I have had to advise my users to patch closure-html manually. It
would be very nice if this were integrated into the official version.
Here is the patch:
diff --git a/src/parse/html-parser.lisp b/src/parse/html-parser.lisp
index 1fdd457..4e45b81 100644
--- a/src/parse/html-parser.lisp
+++ b/src/parse/html-parser.lisp
@@ -106,7 +106,10 @@
       for (name value) on plist by #'cddr
       unless
         ;; better don't emit as HAX what would be bogus as SAX anyway
-        (string-equal name "xmlns")
+        (let ((s (string name))
+              (prefix "xmlns:"))
+          (or (string-equal s "xmlns")
+              (string-equal s prefix :end1 (min (length s) (length prefix)))))
       collect
         (let* ((n #+rune-is-character (coerce (symbol-name name) 'rod)
                   #-rune-is-character (symbol-name name))
diff --git a/src/parse/sgml-parse.lisp b/src/parse/sgml-parse.lisp
index faa9029..a277ece 100644
--- a/src/parse/sgml-parse.lisp
+++ b/src/parse/sgml-parse.lisp
@@ -182,7 +182,8 @@
   (or (name-start-rune-p char)
       (digit-rune-p char)
       (rune= char #/.)
-      (rune= char #/-)))
+      (rune= char #/-)
+      (rune= char #/:)))
 
 (definline sloopy-name-rune-p (char)
   (or (name-rune-p char)
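
For anyone who wants to try the idea without building closure-html, here is a
minimal standalone sketch of the two changes in the patch above. It uses plain
Common Lisp characters and strings rather than closure-html's runes and rods,
and the function names (xmlns-attribute-p, remove-xmlns-attributes,
lenient-name-char-p, scan-name) are made up for illustration; they are not
part of closure-html. In particular, alpha-char-p only stands in for
name-start-rune-p and is not meant to be equivalent to it.

```lisp
;; Sketch only: these helpers are hypothetical and not part of closure-html.

(defun xmlns-attribute-p (name)
  "True if NAME (a symbol or string) is \"xmlns\" or starts with \"xmlns:\",
ignoring case. Mirrors the test added to html-parser.lisp in the patch."
  (let ((s (string name))
        (prefix "xmlns:"))
    (or (string-equal s "xmlns")
        ;; Compare only the leading part of S against PREFIX. If S is
        ;; shorter than PREFIX the lengths differ and the test fails.
        (string-equal s prefix :end1 (min (length s) (length prefix))))))

(defun remove-xmlns-attributes (plist)
  "Return a copy of the attribute PLIST without namespace declarations."
  (loop for (name value) on plist by #'cddr
        unless (xmlns-attribute-p name)
          collect name and collect value))

(defun lenient-name-char-p (char)
  "True if CHAR may appear inside a tag or attribute name.
The colon is accepted, as in the sgml-parse.lisp hunk of the patch.
ALPHA-CHAR-P is a simplification assumed for this sketch."
  (or (alpha-char-p char)
      (digit-char-p char)
      (char= char #\.)
      (char= char #\-)
      (char= char #\:)))

(defun scan-name (string &optional (start 0))
  "Return the longest run of name characters in STRING beginning at START."
  (let ((end (or (position-if-not #'lenient-name-char-p string :start start)
                 (length string))))
    (subseq string start end)))
```

With these definitions, (scan-name "<o:p>" 1) reads the whole name "o:p"
instead of stopping at the colon, and remove-xmlns-attributes drops both
:xmlns and keys such as :|xmlns:o| from an attribute plist while keeping the
rest.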