[hunchentoot-devel] googlebot revisitation rate excessive?

Jeff Cunningham jeffrey at cunningham.net
Fri Jul 4 21:50:39 UTC 2008


Hans Hübner wrote:
> It means that googlebot presented a session identifier string as a
> hunchentoot-session parameter that is not valid.  You are probably
> using sessions very frequently and the Google crawler managed to hit
> one of the URLs of your server that starts a session.  As the crawler
> did not accept the Cookie that Hunchentoot sent, Hunchentoot fell back
> to attaching the session identifier to all URLs in the outgoing HTML
> as a parameter.  The crawler saved the URLs it saw including the
> session identifier and now tries to crawl using these identifiers,
> which are probably old and no longer valid.
>
> First off, I would recommend that you switch off URL-REWRITE
> (http://weitz.de/hunchentoot/#*rewrite-for-session-urls*).  I am not
> using it myself precisely because it confuses simple crawlers.  If a
> user does not accept the cookies my site sends, they will not be able
> to use it with sessions.  For me, this has never been a problem.  This
> will probably not help you with your current problem, but it will make
> things easier in the future.
>
> In general, crawlers do not support cookies or session ids in GET
> parameters.  Thus, if you want to support crawlers, you need to make
> them work without sessions.  Note that if you just do nothing except
> switching off URL-REWRITE, every request from a crawler will create a
> new session.  This may or may not be a problem.
>
> I guess that Google now has a lot of your URLs it wants to crawl
> because the different session identifiers made it think that all of
> them are pointing to different resources.  I am kind of wondering
> whether that is standard googlebot behaviour.
>
> Lastly, I would vote for switching off URL-REWRITE by default.
>   
Thanks for the excellent explanation. It fits all the available facts. 
I've turned off *REWRITE-FOR-SESSION-URLS*, so presumably Google should
eventually figure out that the URLs it has are bad and drop them in
favor of the sessionless ones (I hope).
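
To illustrate why the rewritten URLs look like distinct resources to a
crawler, here is a minimal Python sketch (not Hunchentoot code) that
strips the hunchentoot-session parameter from a URL. The example URLs and
the helper name strip_session_param are hypothetical; only the parameter
name comes from the discussion above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAM = "hunchentoot-session"  # parameter name taken from the thread


def strip_session_param(url: str) -> str:
    """Return url with the session parameter removed from its query string."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != SESSION_PARAM]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs a crawler might have stored for the same page
crawled = [
    "http://example.com/page?hunchentoot-session=1%3AAAAA",
    "http://example.com/page?hunchentoot-session=2%3ABBBB",
    "http://example.com/page",
]

# Once the session parameter is dropped, all three collapse to one URL
canonical = {strip_session_param(u) for u in crawled}
print(canonical)
```

Without that normalization, each session identifier yields a URL that
differs only in its query string, which is presumably what the crawler
stored.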

I switched to a non-googlebotted site to experiment with, and for some
reason, even when I'm not using sessions, I see a message about "No
session for session identifier..." when I browse a page myself. I
cleared my cache; here's an example:

[2008-07-04 14:46:34 [WARNING]] Fake session identifier '1:D5C66E2968BE2162C3164B39B9029F13' (User-Agent: 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)', IP: '127.0.0.1')

That error message corresponds to this access log entry and this header 
output:

127.0.0.1 (192.168.1.1) - [2008-07-04 14:46:34] "GET / HTTP/1.1" 200 9195 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)"


GET / HTTP/1.1
Host: 127.0.0.1:4242
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Cookie: hunchentoot-session=1%3AD5C66E2968BE2162C3164B39B9029F13
Max-Forwards: 10
X-Forwarded-For: 192.168.1.1
X-Forwarded-Host: cunningham.homeip.net
X-Forwarded-Server: test.com
Connection: Keep-Alive

HTTP/1.1 200 OK
Content-Length: 9195
Date: Fri, 04 Jul 2008 21:46:34 GMT
Server: Hunchentoot 1.0.0
Keep-Alive: timeout=20
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
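
As a quick check, the cookie value in the request above can be
percent-decoded and compared with the identifier in the warning. A
minimal Python sketch, using only the two strings from the log and the
header dump:

```python
from urllib.parse import unquote

cookie_value = "1%3AD5C66E2968BE2162C3164B39B9029F13"  # from the Cookie header
logged_id = "1:D5C66E2968BE2162C3164B39B9029F13"       # from the warning

# %3A is the percent-encoding of ':'; decode it before comparing
decoded = unquote(cookie_value)
print(decoded == logged_id)
```

The two strings match, which suggests the browser is still sending a
cookie from an earlier session; clearing the cache does not necessarily
remove cookies.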

--Jeff




