[closure-devel] Converting html files

Thu Apr 2 19:47:39 UTC 2009

Thank you, David, for the detailed explanation. I'll definitely give
STP and XSLT a try.
If start-element, characters, and end-element can't handle my case,
what's the typical case they are designed for? Something like finding
some element in an html doc and triggering some action based on it?
Maybe we need to extend it so that an html doc could be parsed and we
had some events that gave us access to the whole sub-tree under the
current node?

Andrei

On Thu, Apr 2, 2009 at 3:23 PM, David Lichteblau <david at lichteblau.com> wrote:
> Hi,
>
> Quoting Andrei Stebakov (lispercat at gmail.com):
>> I am new to the library so I have a couple of questions to make sure I
>> am on the right track.
>> I have a task of converting html. Say I have an html page (page1.html);
>> somewhere deep in its html code it has a certain table with a <td
>> id='content ...>.
>> This <td> element holds the important data that I want to include in
>> another html page (page2.html) as a <div> element with the same
>> content slightly modified.
>> Along with the <td id='content ...> I need to borrow some fragments of
>> javascript from page1.html to page2.html.
>
> okay.
>
>> Right now I am doing this job with cl-html-parser, but the problem
>> with html parser is that it can't serialize lhtml back to html, so I
>> was using some additional functions to cl-who to do the job. It works,
>> but I'd like to find a cleaner solution with one package (I hope that
>> closure-html is the one) that I could do parsing + serializing.
>
> Yes, closure-html can also serialize.
>
> I worked on the closure-html release based on the patches in
> cl-html-parser, and I don't think I forgot any features.  So
> closure-html would be able to do everything that cl-html-parser could do
> (and slightly more perhaps, due to the cxml integration).
>
> The actual parser is unchanged between the two, and hasn't really
> changed since Gilbert wrote it, so there should be no difference in that
> regard.
>
>> I see that to parse the page I need to use event methods like
>> start-element, characters and end-element. Theoretically by using only
>> those methods it's possible for me to find the <td id='content ...>
>> element, set a marker in my-lhtml-builder and start to collect all the
>> data inside the <td> element, modify what I need so I come up with the
>> html string at the end of parsing. Along with that I would do the same
>> with bits of javascript.
>> What seems a bit awkward here is that I can't just take the whole
>> parse tree under the <td>, modify it, and insert it as a block to some
>> other html page.
>> To be able to do that there needs to be some event which gives the
>> whole PT along with its name and attrs.
>
> I wouldn't define HAX methods like start-element in your kind of
> application.  You can do it, of course, but I don't see the benefit.
>
> This gives you the PT:
>  (chtml:parse "<p>nada</p>" nil)
>
> This gives you the LHTML:
>  (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder))
>
> This gives you cxml-stp's representation:
>  (chtml:parse "<p>nada</p>" (stp:make-builder))
>
> I recommend using LHTML or STP, unless you are very familiar with PT.
>
>> Second, I couldn't find how I can convert <p>nada</p> to its lhtml
>> form and back to <p>nada</p> form (I always end up with those extra
>> html and head blocks around it).
>
> That's true.  The parser currently follows the HTML DTD and "repairs"
> input wherever it doesn't match the DTD.
>
> It probably wouldn't be hard to change the parser so that it can accept
> any elements as-is instead of discarding or augmenting them.
>
> Unfortunately I haven't found an opportunity to work on that yet, so for
> now, you would have to take the "repaired" LHTML or STP representation
> and extract the child node under BODY after parsing.
>
>
> For others reading this (you've probably already used it), here is how
> to serialize LHTML back to a string:
>
>  (chtml:serialize-lhtml '(:p () "nada") (chtml:make-string-sink))
>  => "<P>nada</P>"
>
>> Please, let me know if there is a better solution to my problem and
>> maybe I am missing some functionality or misunderstand the philosophy
>> of the library.
>
> Other than the suggestions above, I can only suggest trying various
> cxml-related libraries to make those steps easier.
>
>
> For example, here is how to extract the child of BODY from the repaired
> HTML using Plexippus XPath:
>
>  (defun first-child-of-body (document)
>    (xpath:with-namespaces (("xhtml" "http://www.w3.org/1999/xhtml"))
>      (xpath:first-node (xpath:evaluate "//xhtml:body/*" document))))
>
> CL-USER> (first-child-of-body (chtml:parse "<p>nada</p>" (stp:make-builder)))
>
> => #.(CXML-STP:ELEMENT :LOCAL-NAME "p" ...)
>
>
> Going further, have you considered using XSLT?  I know many people
> aren't XSLT fans, but for the kind of HTML processing you are describing
> I have found it very helpful.
>
> Here is how to convert your two HTML documents to XHTML:
>
> (defun html2xml (in out)
>  (with-open-file (s out
>                     :direction :output
>                     :if-exists :supersede
>                     :element-type '(unsigned-byte 8))
>    (chtml:parse in (cxml:make-octet-stream-sink s))))
>
> (html2xml "<table><td id='content'>this is page1.html</td></table>"
>          #p"page1.xml")
> (html2xml "<p>this is page2.html</p>"
>          #p"page2.xml")
>
> Then you can use Xuriella XSLT to combine them into a single HTML
> document:
>
> (xuriella:apply-stylesheet #p"test.xsl" #p"page2.xml")
>
> => "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head><body><p>this is page2.html</p><td id=\"content\">this is page1.html</td></body></html>"
>
> Note that Xuriella uses closure-html's serializer to generate the HTML.
>
>
> The stylesheet test.xsl could look like this:
>
> ----------------------------------------------------------------------
> <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>               xmlns:xhtml="http://www.w3.org/1999/xhtml"
>               version="1.0">
>
>  <!-- copy everything, but strip the XHTML namespace -->
>  <xsl:template match="*">
>    <xsl:element name="{local-name()}">
>      <xsl:apply-templates select="@*|node()"/>
>    </xsl:element>
>  </xsl:template>
>
>  <xsl:template match="@*">
>    <xsl:attribute name="{local-name()}">
>      <xsl:value-of select="."/>
>    </xsl:attribute>
>  </xsl:template>
>
>  <!-- insert the TD from page1 into body -->
>  <xsl:template match="xhtml:body">
>    <body>
>      <xsl:apply-templates select="@*|node()"/>
>      <xsl:apply-templates select="document('page1.xml', .)
>                                   //xhtml:td[@id = 'content']"/>
>    </body>
>  </xsl:template>
> </xsl:transform>
> ----------------------------------------------------------------------
>
>
> d.
>
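For the LHTML route mentioned above (take the "repaired" tree and pull out what sits under BODY), here is a minimal sketch. It assumes LHTML nodes have the shape (name attributes . children) with keyword element names; the function name lhtml-body-children is made up for illustration and is not part of closure-html.

```lisp
;; Hypothetical helper, not part of closure-html.
;; Assumes an LHTML element is a list (name attributes . children),
;; where NAME is a keyword and text nodes are plain strings.
(defun lhtml-body-children (lhtml)
  "Return the list of child nodes of the BODY element in LHTML, or NIL."
  (let ((body (find-if (lambda (node)
                         ;; skip strings; match only the BODY element
                         (and (consp node) (eq (first node) :body)))
                       ;; children of the root HTML element
                       (cddr lhtml))))
    ;; drop the element name and the attribute list
    (cddr body)))
```

Given the result of (chtml:parse "<p>nada</p>" (chtml:make-lhtml-builder)), the first element of the returned list should be the node to pass to chtml:serialize-lhtml. I have not checked this against documents where the parser's repairs put more than one node under BODY.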