[cxml-devel] Problems parsing HTML with embedded JS which itself embeds HTML

Wed Dec 23 13:16:26 UTC 2009

Hello all,

for a project I need to extract some information from a web page and
link it with a database, and I have tried several HTML parsers for CL.
I would like to use Closure HTML for the task because of its add-ons
for XPath, but I have a problem parsing my HTML source with it. An
example page that shows the problem is the HTML source of

http://yellow.local.ch/de/q?ext=1&name=&company=Berufsschule,+Fachschule&street=&city=&area=Bern+%28Kanton%29&phone=&suchen=Suchen#start=1

from which I need to extract some of the address/street/phone data. It
seems that none of the HTML/XML parsers for CL parses it correctly.
Most of the parts missing from the parsed representation come from the
portion of the HTML source that defines a JavaScript element, which in
turn passes HTML as a string parameter in the embedded JS. That is, of
course, exactly the text I need from the site :) I could extract the
data with some regexps, but that is really clumsy, and if possible it
would be nice to stay within the HTML/JS-parse/STP/XPath data
representation.

I have looked at the source of Closure HTML to try to help, but the
project seems to have pretty old and deep roots in SGML, and I don't
want to introduce errors there: I don't know SGML, and even if I found
the places that need to be corrected for HTML, the change might break
something in the SGML part of Closure HTML. I also think it would take
me quite some time to understand the inner workings of the parsers in
Closure HTML, so I would greatly appreciate any help in getting it to
work with the described site.

With best regards
Plamen