[cxml-devel] Problems parsing HTML with embedded JS which itself embeds HTML

Ben Hyde bhyde at pobox.com
Wed Dec 23 16:06:57 UTC 2009


FYI: that page is heavily transformed by its JavaScript.
But this works, in a manner.  (match is from fare-matcher.)

(defun foobar ()
   (let* ((url "http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachschule&street=&city=&area=Bern+%28Kanton%29&phone=&suchen=Suchen#")
          (page (http-request-with-cache url))
          (doc (chtml:parse page (chtml:make-lhtml-builder)))
          (result ()))
     (labels ((recure (node)
                (typecase node
                  (list
                   (match node
                          (`(:div ((:class ,c)) ,@children)
                            (when (string= "entrybox" c)
                              (push node result))))
                   (map nil #'recure (cddr node))))))
       (recure doc)
       (nreverse result))))
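
A minimal sketch of the same tree walk in plain Common Lisp, without
fare-matcher, run against a hand-written LHTML tree rather than the real
page. It assumes the usual LHTML shape (tag attributes . children), with
attributes as a list of (name value) pairs and text nodes as strings. The
names class-of-node, collect-divs-by-class and *sample* are made up for
illustration. One difference from the pattern above: this version looks up
:class with assoc, so it also accepts a div that carries other attributes,
whereas the ((:class ,c)) pattern appears to match only when :class is the
sole attribute.

```lisp
;; LHTML nodes are lists of the form (tag attributes . children).
;; ATTRIBUTES is a list of (name value) pairs; text nodes are strings.

(defun class-of-node (node)
  "Return the value of NODE's :class attribute, or NIL if it has none."
  (second (assoc :class (second node))))

(defun collect-divs-by-class (doc class)
  "Return, in document order, every :div in DOC whose class equals CLASS."
  (let ((result '()))
    (labels ((walk (node)
               (when (consp node)       ; strings (text nodes) are skipped
                 (when (and (eq (first node) :div)
                            (equal (class-of-node node) class))
                   (push node result))
                 ;; Children start after the tag and the attribute list.
                 (mapc #'walk (cddr node)))))
      (walk doc))
    (nreverse result)))

;; Hand-written sample tree (an assumption, not taken from the real page).
(defparameter *sample*
  '(:html ()
    (:body ()
     (:div ((:id "a") (:class "entrybox")) "first")
     (:div ((:class "other"))
      (:div ((:class "entrybox")) "second")))))
```

Calling (collect-divs-by-class *sample* "entrybox") on this sample should
return the two entrybox divs, outer one first.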

Possibly you were using an XML parser rather than an HTML parser.

On Dec 23, 2009, at 8:16 AM, Plamen . wrote:

> Hello all,
>
> for a project I need to extract some information from a web page and
> link it with a database, and I have tried several HTML parsers for CL.
> I would like to use Closure HTML for the task because of its add-ons
> for XPath, but I have a problem parsing the HTML source with it. An
> example page showing the problem is the HTML source of
>
> http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachschule&street=&city=&area=Bern+%28Kanton%29&phone=&suchen=Suchen#start=1
>
> from which I need to extract some of the address/street/phone data. It
> seems that none of the HTML/XML parsers for CL can parse it correctly.
> Most of the parts missing from the parsed representation come from
> HTML source that defines a JavaScript element which itself includes
> HTML as a string parameter in the embedded JS, which is of course
> exactly the text I need from the site :) Of course I could extract the
> data using some regexps, but that is really clumsy, and if possible it
> would be nice to stay in the HTML/JS-parse/STP/XPath data
> representation. I've looked at the source of Closure HTML to try to
> help, but the project seems to have pretty old and deep roots in SGML,
> where I don't want to introduce errors. I don't know SGML, and even if
> I found the places that need correcting for HTML, the change could
> break something in the SGML part of Closure HTML. Also, I think it
> would take me quite some time to understand the inner workings of the
> parsers in Closure HTML, so I would greatly appreciate any help
> getting it working with the described site.
>
> With best regards
> Plamen
>




