From strandh at labri.fr Wed Apr 1 04:43:56 2009
From: strandh at labri.fr (Robert Strandh)
Date: Wed, 1 Apr 2009 06:43:56 +0200
Subject: [closure-devel] Trying to compile from closure-html-2008-11-30.tgz
In-Reply-To:
References:
Message-ID: <18898.61708.707320.733291@serveur5.labri.fr>

Hello,

Andrei Stebakov writes:
 >
 > It gives an error saying that symbol runes:find-output-encoding is not
 > defined. I got the previous closure-html-2007-10-21.tgz and it
 > compiles without problems.

I recently got the CVS version, and it compiles fine.  It has a lot of
quirks in terms of presentation, but it compiles and runs.
--
Robert Strandh

---------------------------------------------------------------------
Greenspun's Tenth Rule of Programming: any sufficiently complicated C
or Fortran program contains an ad hoc informally-specified bug-ridden
slow implementation of half of Common Lisp.
---------------------------------------------------------------------

From lispercat at gmail.com Thu Apr 2 17:27:08 2009
From: lispercat at gmail.com (Andrei Stebakov)
Date: Thu, 2 Apr 2009 13:27:08 -0400
Subject: [closure-devel] Converting html files
Message-ID:

I am new to the library so I have a couple of questions to make sure I
am on the right track.

I have a task of converting html. Say I have an html page (page1.html);
in its depth of html code it has a certain table with a <table
id='content' ...>.
This element holds the important data that I want to include in
another html page (page2.html) as a <div> element with the same
content slightly modified.
Along with the <table> I need to borrow some fragments of javascript
from page1.html to page2.html.
Right now I am doing this job with cl-html-parser, but the problem
with html parser is that it can't serialize lhtml back to html, so I
was using some additional functions to cl-who to do the job. It works,
but I'd like to find a cleaner solution with one package (I hope that
closure-html is the one) with which I could do parsing + serializing.

I see that to parse the page I need to use event methods like
start-element, characters and end-element. Theoretically by using only
those methods it's possible for me to find the parse tree under the
<table>, modify it, and insert it as a block into some other html page.
To be able to do that there needs to be some event which gives the
whole pt along with its name and attrs.

Second, I couldn't find how I can convert <p>nada</p> to its lhtml
form and back to <p>nada</p> form (I always end up with those extra
html and head blocks around it).

Please, let me know if there is a better solution to my problem and
maybe I am missing some functionality or misunderstand the philosophy
of the library.

Thank you,
Andrei

From david at lichteblau.com Thu Apr 2 19:23:43 2009
From: david at lichteblau.com (David Lichteblau)
Date: Thu, 2 Apr 2009 21:23:43 +0200
Subject: [closure-devel] Converting html files
In-Reply-To:
References:
Message-ID: <20090402192343.GF2218@radon>

Hi,

Quoting Andrei Stebakov (lispercat at gmail.com):
> I am new to the library so I have a couple of questions to make sure I
> am on the right track.
> I have a task of converting html. Say I have an html page (page1.html);
> in its depth of html code it has a certain table with a <table
> id='content' ...>.
> This element holds the important data that I want to include in
> another html page (page2.html) as a <div> element with the same
> content slightly modified.
> Along with the <table> I need to borrow some fragments of javascript
> from page1.html to page2.html.
> Right now I am doing this job with cl-html-parser, but the problem
> with html parser is that it can't serialize lhtml back to html, so I
> was using some additional functions to cl-who to do the job. It works,
> but I'd like to find a cleaner solution with one package (I hope that
> closure-html is the one) with which I could do parsing + serializing.

Yes, closure-html can also serialize.

I worked on the closure-html release based on the patches in
cl-html-parser, and I don't think I forgot any features.  So
closure-html would be able to do everything that cl-html-parser could do
(and slightly more perhaps, due to the cxml integration).

The actual parser is unchanged between the two, and hasn't really
changed since Gilbert wrote it, so there should be no difference in that
regard.

> I see that to parse the page I need to use event methods like
> start-element, characters and end-element. Theoretically by using only
> those methods it's possible for me to find the parse tree under the
> <table>, modify it, and insert it as a block into some other html page.
> To be able to do that there needs to be some event which gives the
> whole pt along with its name and attrs.

I wouldn't define HAX methods like start-element in your kind of
application.  You can do it, of course, but I don't see the benefit.

This gives you the PT:
  (chtml:parse "<p>nada</p>" nil)

This gives you the LHTML:
  (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder))

This gives you cxml-stp's representation:
  (chtml:parse "<p>nada</p>" (stp:make-builder))

I recommend using LHTML or STP, unless you are very familiar with PT.

> Second, I couldn't find how I can convert <p>nada</p> to its lhtml
> form and back to <p>nada</p> form (I always end up with those extra
> html and head blocks around it).

That's true.  The parser currently follows the HTML DTD and "repairs"
input wherever it doesn't match the DTD.

It probably wouldn't be hard to change the parser so that it can accept
any elements as-is instead of discarding or augmenting them.

Unfortunately I haven't found an opportunity to work on that yet, so for
now, you would have to take the "repaired" LHTML or STP representation
and extract the child node under BODY after parsing.


For others reading this (you've probably already used it), here is how
to serialize LHTML back to a string:

  (chtml:serialize-lhtml '(:p () "nada") (chtml:make-string-sink))
  => "<p>nada</p>"

> Please, let me know if there is a better solution to my problem and
> maybe I am missing some functionality or misunderstand the philosophy
> of the library.

Other than the suggestions above, I can only suggest to try various
cxml-related libraries to make those steps easier.


For example, here is how to extract the child of BODY from the repaired
HTML using Plexippus XPath:

  (defun first-child-of-body (document)
    (xpath:with-namespaces (("xhtml" "http://www.w3.org/1999/xhtml"))
      (xpath:first-node (xpath:evaluate "//xhtml:body/*" document))))

CL-USER> (first-child-of-body (chtml:parse "<p>nada</p>" (stp:make-builder)))

=> #.(CXML-STP:ELEMENT :LOCAL-NAME "p" ...)


Going further, have you considered using XSLT?  I know many people
aren't XSLT fans, but for the kind of HTML processing you are describing
I have found it very helpful.

Here is how to convert your two HTML documents to XHTML:

(defun html2xml (in out)
  (with-open-file (s out
                     :direction :output
                     :if-exists :supersede
                     :element-type '(unsigned-byte 8))
    (chtml:parse in (cxml:make-octet-stream-sink s))))

(html2xml "<p>this is page1.html</p>"
          #p"page1.xml")
(html2xml "<p>this is page2.html</p>"
          #p"page2.xml")

Then you can use Xuriella XSLT to combine them into a single HTML
document:

(xuriella:apply-stylesheet #p"test.xsl" #p"page2.xml")

=> "<p>this is page2.html</p>
    <p>this is page1.html</p>"

Note that Xuriella uses closure-html's serializer to generate the HTML.


The stylesheet test.xsl could look like this:

----------------------------------------------------------------------
               xmlns:xhtml="http://www.w3.org/1999/xhtml"
               version="1.0">
----------------------------------------------------------------------


d.
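
A rough sketch (not from the thread itself) of the "extract the child
node under BODY" step in LHTML terms, using only the closure-html calls
shown above.  It assumes the lhtml-builder returns a list of the shape
(:HTML attrs (:HEAD ...) (:BODY attrs child ...)), with text nodes as
plain strings:

  (defun first-element-under-body (lhtml)
    ;; LHTML elements look like (:NAME attrs . children);
    ;; text nodes are plain strings.
    (let ((body (find :body (cddr lhtml)
                      :key (lambda (child)
                             (and (consp child) (car child))))))
      (find-if #'consp (cddr body))))

  ;; Round trip for the <p>nada</p> example, without the added
  ;; HTML/HEAD/BODY wrapper:
  ;;
  ;;   (chtml:serialize-lhtml
  ;;    (first-element-under-body
  ;;     (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder)))
  ;;    (chtml:make-string-sink))
  ;;   => "<p>nada</p>"   (modulo case and whitespace)

The extracted element is an ordinary list, so it can also be spliced
into the LHTML of another page before that page is serialized.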

From lispercat at gmail.com Thu Apr 2 19:47:39 2009
From: lispercat at gmail.com (Andrei Stebakov)
Date: Thu, 2 Apr 2009 15:47:39 -0400
Subject: [closure-devel] Converting html files
In-Reply-To: <20090402192343.GF2218@radon>
References: <20090402192343.GF2218@radon>
Message-ID:

Thank you David, for the detailed explanation. I'll definitely give a
try to STP and XSLT.

If start-element, characters and end-element couldn't handle my case,
what's the typical case they are designed for? Something like finding
some element in an html doc and triggering some action based on it?
Maybe we need to extend it so an html doc could be parsed and we had
some events that gave us access to the whole sub-tree under the
current node?

Andrei

On Thu, Apr 2, 2009 at 3:23 PM, David Lichteblau wrote:
> Hi,
>
> Quoting Andrei Stebakov (lispercat at gmail.com):
>> I am new to the library so I have a couple of questions to make sure I
>> am on the right track.
>> I have a task of converting html. Say I have an html page (page1.html);
>> in its depth of html code it has a certain table with a <table
>> id='content' ...>.
>> This element holds the important data that I want to include in
>> another html page (page2.html) as a <div> element with the same
>> content slightly modified.
>> Along with the <table> I need to borrow some fragments of javascript
>> from page1.html to page2.html.
>> Right now I am doing this job with cl-html-parser, but the problem
>> with html parser is that it can't serialize lhtml back to html, so I
>> was using some additional functions to cl-who to do the job. It works,
>> but I'd like to find a cleaner solution with one package (I hope that
>> closure-html is the one) with which I could do parsing + serializing.
>
> Yes, closure-html can also serialize.
>
> I worked on the closure-html release based on the patches in
> cl-html-parser, and I don't think I forgot any features.  So
> closure-html would be able to do everything that cl-html-parser could do
> (and slightly more perhaps, due to the cxml integration).
>
> The actual parser is unchanged between the two, and hasn't really
> changed since Gilbert wrote it, so there should be no difference in that
> regard.
>
>> I see that to parse the page I need to use event methods like
>> start-element, characters and end-element. Theoretically by using only
>> those methods it's possible for me to find the parse tree under the
>> <table>, modify it, and insert it as a block into some other html page.
>> To be able to do that there needs to be some event which gives the
>> whole pt along with its name and attrs.
>
> I wouldn't define HAX methods like start-element in your kind of
> application.  You can do it, of course, but I don't see the benefit.
>
> This gives you the PT:
>   (chtml:parse "<p>nada</p>" nil)
>
> This gives you the LHTML:
>   (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder))
>
> This gives you cxml-stp's representation:
>   (chtml:parse "<p>nada</p>" (stp:make-builder))
>
> I recommend using LHTML or STP, unless you are very familiar with PT.
>
>> Second, I couldn't find how I can convert <p>nada</p> to its lhtml
>> form and back to <p>nada</p> form (I always end up with those extra
>> html and head blocks around it).
>
> That's true.  The parser currently follows the HTML DTD and "repairs"
> input wherever it doesn't match the DTD.
>
> It probably wouldn't be hard to change the parser so that it can accept
> any elements as-is instead of discarding or augmenting them.
>
> Unfortunately I haven't found an opportunity to work on that yet, so for
> now, you would have to take the "repaired" LHTML or STP representation
> and extract the child node under BODY after parsing.
>
>
> For others reading this (you've probably already used it), here is how
> to serialize LHTML back to a string:
>
>   (chtml:serialize-lhtml '(:p () "nada") (chtml:make-string-sink))
>   => "<p>nada</p>"
>
>> Please, let me know if there is a better solution to my problem and
>> maybe I am missing some functionality or misunderstand the philosophy
>> of the library.
>
> Other than the suggestions above, I can only suggest to try various
> cxml-related libraries to make those steps easier.
>
>
> For example, here is how to extract the child of BODY from the repaired
> HTML using Plexippus XPath:
>
>   (defun first-child-of-body (document)
>     (xpath:with-namespaces (("xhtml" "http://www.w3.org/1999/xhtml"))
>       (xpath:first-node (xpath:evaluate "//xhtml:body/*" document))))
>
> CL-USER> (first-child-of-body (chtml:parse "<p>nada</p>" (stp:make-builder)))
>
> => #.(CXML-STP:ELEMENT :LOCAL-NAME "p" ...)
>
>
> Going further, have you considered using XSLT?  I know many people
> aren't XSLT fans, but for the kind of HTML processing you are describing
> I have found it very helpful.
>
> Here is how to convert your two HTML documents to XHTML:
>
> (defun html2xml (in out)
>   (with-open-file (s out
>                      :direction :output
>                      :if-exists :supersede
>                      :element-type '(unsigned-byte 8))
>     (chtml:parse in (cxml:make-octet-stream-sink s))))
>
> (html2xml "<p>this is page1.html</p>"
>           #p"page1.xml")
> (html2xml "<p>this is page2.html</p>"
>           #p"page2.xml")
>
> Then you can use Xuriella XSLT to combine them into a single HTML
> document:
>
> (xuriella:apply-stylesheet #p"test.xsl" #p"page2.xml")
>
> => "<p>this is page2.html</p>
>     <p>this is page1.html</p>"
>
> Note that Xuriella uses closure-html's serializer to generate the HTML.
>
>
> The stylesheet test.xsl could look like this:
>
> ----------------------------------------------------------------------
>                xmlns:xhtml="http://www.w3.org/1999/xhtml"
>                version="1.0">
> ----------------------------------------------------------------------
>
>
> d.

From david at lichteblau.com Fri Apr 3 06:34:05 2009
From: david at lichteblau.com (David Lichteblau)
Date: Fri, 3 Apr 2009 08:34:05 +0200
Subject: [closure-devel] Converting html files
In-Reply-To: <20090402192343.GF2218@radon>
References: <20090402192343.GF2218@radon>
Message-ID: <20090403063405.GA12756@radon>

Quoting David Lichteblau (david at lichteblau.com):
> I worked on the closure-html release based on the patches in
> cl-html-parser, and I don't think I forgot any features.  So
> closure-html would be able to do everything that cl-html-parser could do
> (and slightly more perhaps, due to the cxml integration).

Sorry, I was wrong about this.

When I read "cl-html-parser", I assumed that it was the Closure
repackaging work by Ignas Mikalajūnas.  Turns out that his work was
called "trivial-html-parser", and you probably meant "cl-html-parse",
which is just yet another repackaging of the parser written by Franz,
this time packaged by Gary King.

So these are actually two entirely different parsers.

If you like the API of the Franz parser better, I suggest that you use
this for parsing, and use closure-html for serialization.

d.
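
As a small illustration of that split (not part of the archived thread):
whichever parser produces the tree, the fragment to be emitted can be
rebuilt as plain LHTML lists (keyword tag, list of attribute pairs,
children) and handed to closure-html for serialization.  The helper and
the exact output string below are assumptions for illustration, not
taken from either library's documentation:

  (defun wrap-in-div (id &rest children)
    ;; build an LHTML element by hand: (tag ((attr value) ...) child ...)
    `(:div ((:id ,id)) ,@children))

  ;; (chtml:serialize-lhtml
  ;;  (wrap-in-div "content" '(:p () "borrowed from page1.html"))
  ;;  (chtml:make-string-sink))
  ;; => roughly "<div id="content"><p>borrowed from page1.html</p></div>"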