[cl-openid-devel] html parsing for html-based discovery

Tue Jun 3 00:02:55 UTC 2008

Hello Maciek.

I've checked how two Java implementations parse HTML:

- the joid library by Verisign uses a very simple approach:
  they read the HTML page line by line. If a line contains
  the "openid.server" string, they look for the value of the
  href attribute on the same line or one of the following
  lines, just by scanning for the "href=" string.

  http://code.google.com/p/joid/source/browse/trunk/src/org/verisign/joid/consumer/Discoverer.java

- openid4java uses a more thorough approach: they define
  an interface for an HTML parser, provide a mechanism to plug
  in a parser implementation, and ship a default implementation
  based on an external HTML parser library.

  http://code.google.com/p/openid4java/source/browse/trunk/src/org/openid4java/discovery/html/HtmlResolver.java
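To make the second approach concrete, here is a minimal sketch of a
pluggable parser. It is in Python rather than Java or Lisp, it uses the
standard html.parser module instead of an external library, and the
names LinkCollector and resolve are hypothetical (they are not taken
from openid4java).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical default parser: collects rel -> href from <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated values,
        # e.g. rel="openid2.provider openid.server".
        for rel in (attr.get("rel") or "").split():
            # Keep the first href seen for each rel value.
            self.links.setdefault(rel.lower(), href)


def resolve(html, parser_factory=LinkCollector):
    """Run a pluggable parser over the page and return its rel -> href map.

    parser_factory is the plug-in point: any callable returning an object
    with feed(), close() and a links dict can replace the default.
    """
    parser = parser_factory()
    parser.feed(html)
    parser.close()
    return parser.links
```

A caller would then look up resolve(page).get("openid2.provider") and
fall back to "openid.server" if it is missing.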

I personally like the joid approach. Although in theory
it may fail on a valid HTML document, it will work in almost any
real-life scenario. Their simple code is pleasant to read.

If it is difficult to decide on the HTML parsing question
right now, we could write the simplest variant of a parser:
just scanning for "openid2.provider", etc. It will be
sufficient for our initial experiments, and I am almost sure it
will work for all the popular providers. We can create a ticket
to improve HTML parsing and fix it in the future, according to
its priority.
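
As a rough illustration of the simplest variant, here is a sketch in
Python (only to show the idea; it is neither joid's code nor the Lisp
we would write). The function name find_endpoint, the lookahead of two
lines and the href regular expression are my assumptions.

```python
import re

# Assumed pattern: a quoted href value. Unquoted values are not handled.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def find_endpoint(html, rel_names=("openid2.provider", "openid.server"),
                  lookahead=2):
    """Return the first href found near one of rel_names, or None.

    rel_names are tried in order, so "openid2.provider" wins over
    "openid.server" when both are present.
    """
    lines = html.splitlines()
    for rel in rel_names:
        for i, line in enumerate(lines):
            if rel not in line:
                continue
            # Scan the matching line, then up to `lookahead` following lines.
            for candidate in lines[i:i + lookahead + 1]:
                match = HREF_RE.search(candidate)
                if match:
                    return match.group(1)
    return None
```

Known weak spots of this sketch: it takes the first href on the
matching line even if that href belongs to another tag, and it also
matches the strings inside comments or scripts.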

What do you think?

Best regards,
-Anton