[drakma-devel] Bug handling bad html?

Jeffrey Cunningham jeffrey at cunningham.net
Sun Feb 25 16:26:45 UTC 2007


On Sun Feb 25, 2007 at 11:25:04AM +0100, Edi Weitz wrote:
> On Sat, 24 Feb 2007 16:39:54 -0800, Jeffrey Cunningham <jeffrey at cunningham.net> wrote:
> 
> > In this case I set it to make a substitution for the 'bad'
> > character. Is it possible for there to be more than one?
> 
> Not yet.  See current discussion on the FLEXI-STREAMS mailing list.
> 
> > And more generally, should there not be a way to set drakma so it
> > may take a performance hit but is guaranteed not to die on any html
> > that is thrown at it?
> 
> It's not dying, it just signals an error.
> 
> And, no, I don't think there's a way to provide meaningful results and
> at the same time to be prepared to accept whatever bogus data or
> headers the server chooses to send.  If you find something like that,
> send patches, but it sounds like magic (or at least very good AI) to
> me.

I guess I disagree. 

If I try to access a page like that using links, lynx, wget, mozilla,
firefox, or any other HTML-parsing program I can think of, they don't
stop functioning, signal an error, or whatever you want to call it. They
give me their best approximation of the content. It seems like that
ought to be the goal here, or at least a possibility.

In an automated process, signaling an error means that processing has
stopped (or 'died'). The source of the error signal may be in
flexi-streams (I have read the discussions on that list), but it's
drakma that has to deal with the consequences.

How do the above-mentioned applications manage this problem? Certainly
not by magic. And I doubt the AI in links or lynx is very
sophisticated.
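
As an illustration only of what "best approximation" decoding can look
like, here is a minimal sketch in Python. It is not Drakma or
FLEXI-STREAMS code and makes no claim about how links, lynx, or any
browser actually works. The function name decode_best_effort and the
Latin-1 fallback for an unknown charset are assumptions made up for this
example.

```python
def decode_best_effort(raw, charset="utf-8"):
    """Decode bytes without ever raising on bad input.

    Hypothetical helper for illustration: try a strict decode first,
    and only degrade when the data or the declared charset is bogus.
    """
    try:
        # Normal case: the server told the truth about the encoding.
        return raw.decode(charset, errors="strict")
    except LookupError:
        # The declared charset name is unknown. Latin-1 maps every
        # byte to a character, so this fallback cannot fail.
        return raw.decode("latin-1")
    except UnicodeDecodeError:
        # Some bytes are invalid for the declared charset. Replace each
        # undecodable sequence with U+FFFD instead of stopping.
        return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    page = b"caf\xe9 <b>ok</b>"  # Latin-1 bytes mislabeled as UTF-8
    print(decode_best_effort(page, "utf-8"))
```

The strict attempt comes first so that well-formed pages are unaffected.
The lossy path is only taken when the input is already broken, which is
the trade-off being asked for above: an approximate result rather than
a halted process.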


--Jeff