John Bokma Perl
freelance Perl programmer

Comments: Googlebot statistics

5 comments

The Perl program below processes an access log file as generated by the Apache HTTP server and creates output that can be imported into a spreadsheet program, for example Calc, which comes with OpenOffice.org, or Microsoft Excel.

Read the rest of Googlebot statistics.

Comments

On SourceForge you will find awstats.pl, which does similar things and has a good per-crawler breakdown. That said, only consider awstats.pl if you want to reuse a few bits and extend the functionality, of course.

All projects begin small.

Posted by Roy Schestowitz at 06:58 GMT on 5 December 2005

Hi Roy,

I only know awstats.pl from the "hack" attempts I see now and then on my site, so I have no idea if it's comparable. My script does a very fine-grained breakdown for Googlebot only, since that is what I am most interested in.

I might add a thing or two to this script, but don't expect too much (it will stay a small tool).

Posted by John Bokma at 04:23 GMT on 6 December 2005

It looks like it doesn't work because the regex isn't matching googlebot.com. In my access log, a Googlebot request looks like this:

66.249.72.36 - - [15/Sep/2006:03:21:21 -0400] "GET /shells/printer.phtml?id=5112 HTTP/1.1" 200 5084 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Posted by Jason at 21:03 GMT on 26 October 2006

Hi Jason, it doesn't work because the Perl program assumes that the first column contains host names, not IP addresses. I will look into a fix for this.

Posted by John Bokma at 02:13 GMT on 27 October 2006

I was able to get it to work by tweaking the regexp:

$line =~ m!
	^.+?
	\[(\d\d)/(\w{3})/(\d{4})(?::\d\d){3}.+?\]
	\s"GET\s(.+?)\sHTTP/\d\.\d"
	\s(\d{3})
	.+
	Googlebot/
!x;
That should work for domain names, too.

Posted by Jason at 12:23 GMT on 27 October 2006
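
For anyone who wants to try this change, here is a minimal sketch that applies Jason's pattern to the log line he posted. It is not taken from the original program: the helper name parse_googlebot_line, the tab-separated output, and the crawl-66-249-72-36.googlebot.com host name are assumptions for illustration only. The pattern is Jason's, with the dot in the HTTP version escaped and the closing !x delimiter written out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns
# (day, month, year, path, status) for a Googlebot GET request,
# or an empty list when the line does not match.
sub parse_googlebot_line {
    my ($line) = @_;

    return $line =~ m!
        ^.+?                                        # host name or IP address
        \[(\d\d)/(\w{3})/(\d{4})(?::\d\d){3}.+?\]   # date, time and zone skipped
        \s"GET\s(.+?)\sHTTP/\d\.\d"                 # requested path
        \s(\d{3})                                   # HTTP status code
        .+
        Googlebot/                                  # matched in the user agent
    !x;
}

# The log line from Jason's comment (first column is an IP address).
my $sample = '66.249.72.36 - - [15/Sep/2006:03:21:21 -0400] '
           . '"GET /shells/printer.phtml?id=5112 HTTP/1.1" 200 5084 "-" '
           . '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"';

# The same line with an illustrative host name in the first column.
my $with_host = $sample;
$with_host =~ s/^\S+/crawl-66-249-72-36.googlebot.com/;

for my $line ($sample, $with_host) {
    my @fields = parse_googlebot_line($line);
    # Tab-separated output can be imported into a spreadsheet program.
    print join("\t", @fields), "\n" if @fields;
}
```

Because the pattern looks for Googlebot/ in the user agent instead of googlebot.com in the first column, both lines yield the same five fields: 15, Sep, 2006, /shells/printer.phtml?id=5112 and 200. The trade-off is that the user agent is whatever the client sends, so this counts any request that claims to be Googlebot.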
