[hunchentoot-devel] googlebot revisitation rate excessive?
Jeff Cunningham
jeffrey at cunningham.net
Fri Jul 4 21:50:39 UTC 2008
Hans Hübner wrote:
> It means that googlebot presented a session identifier string as a
> hunchentoot-session parameter that is not valid. You are probably
> using sessions very frequently, and the Google crawler managed to hit
> one of the URLs of your server that starts a session. As the crawler
> did not accept the cookie that Hunchentoot sent, Hunchentoot fell back
> to attaching the session identifier to all URLs in the outgoing HTML
> as a parameter. The crawler saved the URLs it saw, including the
> session identifier, and now tries to crawl using these identifiers,
> which are probably old and no longer valid.
>
> First off, I would recommend that you switch off URL-REWRITE
> (http://weitz.de/hunchentoot/#*rewrite-for-session-urls*). I am not
> using it myself, precisely because it confuses simple crawlers. If a
> user does not accept the cookies my site sends, they will not be able
> to use it with sessions. For me, this has never been a problem. This
> will probably not help you with your current problem, but it will make
> things easier in the future.
>
> In general, crawlers do not support cookies or session ids in GET
> parameters. Thus, if you want to support crawlers, you need to make
> them work without sessions. Note that if you just do nothing except
> switching off URL-REWRITE, every request from a crawler will create a
> new session. This may or may not be a problem.
>
> I guess that Google now has a lot of your URLs it wants to crawl
> because the different session identifiers made it think that all of
> them are pointing to different resources. I am kind of wondering
> whether that is standard googlebot behaviour.
>
> Lastly, I would vote for switching off URL-REWRITE by default.
>
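The "make them work without sessions" advice could be sketched roughly like this; CRAWLERP and MY-PAGE are made-up names for illustration, while START-SESSION and USER-AGENT are Hunchentoot functions:

```lisp
;; Sketch only: very naive bot detection via the User-Agent header.
;; CRAWLERP and MY-PAGE are hypothetical; START-SESSION and USER-AGENT
;; come from the Hunchentoot API.
(defun crawlerp (user-agent)
  (and user-agent
       (search "bot" (string-downcase user-agent))))

(defun my-page ()
  ;; Only start a session for clients that look like real browsers,
  ;; so each crawler request does not spawn a fresh session.
  (unless (crawlerp (hunchentoot:user-agent))
    (hunchentoot:start-session))
  "<html><body>Hello</body></html>")
```

A handler like this would still need to be hooked into the dispatch table as usual.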
Thanks for the excellent explanation. It fits all the available facts.
I've turned off *REWRITE-FOR-SESSION-URLS*, so presumably Google should
eventually figure out that the URLs it has are bad and drop them in
favor of the sessionless ones (I hope).
I switched to a non-googlebotted site to experiment with, and for some
reason, even when I'm not using sessions, I see a message about "No
session for session identifier..." when I browse a page myself. I
cleared my cache; here's an example:
[2008-07-04 14:46:34 [WARNING]] Fake session identifier '1:D5C66E2968BE2162C3164B39B9029F13' (User-Agent: 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)', IP: '127.0.0.1')
That error message corresponds to this access log entry and this header
output:
127.0.0.1 (192.168.1.1) - [2008-07-04 14:46:34] "GET / HTTP/1.1" 200 9195 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)"
GET / HTTP/1.1
Host: 127.0.0.1:4242
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.14) Gecko/20080404 Iceweasel/2.0.0.14 (Debian-2.0.0.14-2)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Cookie: hunchentoot-session=1%3AD5C66E2968BE2162C3164B39B9029F13
Max-Forwards: 10
X-Forwarded-For: 192.168.1.1
X-Forwarded-Host: cunningham.homeip.net
X-Forwarded-Server: test.com
Connection: Keep-Alive
HTTP/1.1 200 OK
Content-Length: 9195
Date: Fri, 04 Jul 2008 21:46:34 GMT
Server: Hunchentoot 1.0.0
Keep-Alive: timeout=20
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
--Jeff
More information about the Tbnl-devel mailing list