The Perl program below processes an access log file generated by the Apache HTTP server and produces output that can be imported into a spreadsheet program, for example Calc, which comes with, or Microsoft Excel.

On SourceForge you will find, which does similar things and has a good per-crawler breakdown. That said, only consider it if you want to reuse a few bits and extend the functionality, of course.

All projects begin small.

Posted by Roy Schestowitz at 06:58 GMT on 5 December 2005

Hi Roy,

I only know from the "hack" attempts I see now and then on my site, so I have no idea if it's comparable. My script does a very fine break-down for Googlebot only, since that is what I am interested in the most.

I might add a thing or two to this script, but don't expect too much (it will stay a small tool).

Posted by John Bokma at 04:23 GMT on 6 December 2005

It looks like it doesn't work b/c the regex isn't matching. In my access log, a line looks like this:

- - [15/Sep/2006:03:21:21 -0400] "GET /shells/printer.phtml?id=5112 HTTP/1.1" 200 5084 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"

Posted by Jason at 21:03 GMT on 26 October 2006

Hi Jason, it doesn't work because the Perl program assumes that host names are in the first column, not IP addresses. I will look into a fix for this.

Posted by John Bokma at 02:13 GMT on 27 October 2006

I was able to get it to work by tweaking the regexp:

$line =~ m!


That should work for domain names, too.

Posted by Jason at 12:23 GMT on 27 October 2006
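
To illustrate the point discussed in the comments above (the first column of an Apache access log may hold either a host name or an IP address), here is a minimal sketch of a pattern that accepts both. It is neither the original script's regex nor Jason's tweak, which is truncated above. It assumes the log uses Apache's "combined" format; the helper name parse_combined_line, the hash keys, and the sample lines (including the client address, host name, and user agent string) are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): parse one line in
# Apache "combined" log format. The first field is matched as any run of
# non-whitespace, so host names and IP addresses are both accepted.
sub parse_combined_line {
    my ($line) = @_;

    return unless $line =~ m!
        ^(\S+)         \s+   # client: host name or IP address
        (\S+)          \s+   # identd user, usually a dash
        (\S+)          \s+   # authenticated user, usually a dash
        \[([^\]]+)\]   \s+   # timestamp between square brackets
        "([^"]*)"      \s+   # request line
        (\d{3})        \s+   # HTTP status code
        (\d+|-)        \s+   # response size in bytes, or a dash
        "([^"]*)"      \s+   # referer
        "([^"]*)"            # user agent
    !x;

    return {
        client  => $1,
        time    => $4,
        request => $5,
        status  => $6,
        bytes   => $7,
        referer => $8,
        agent   => $9,
    };
}

# Made-up sample lines: one with an IP address, one with a host name.
my @samples = (
    '192.0.2.1 - - [15/Sep/2006:03:21:21 -0400] "GET /shells/printer.phtml?id=5112 HTTP/1.1" 200 5084 "-" "ExampleBot/1.0"',
    'crawler.example.com - - [15/Sep/2006:03:21:22 -0400] "GET / HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"',
);

# Print tab-separated fields, which most spreadsheet programs can import.
for my $line (@samples) {
    my $hit = parse_combined_line($line) or next;
    print join("\t", @{$hit}{qw(client status request)}), "\n";
}
```

Matching the first field with \S+ rather than a host-name-specific pattern is the simplest choice here. It does not validate the value, so anything without whitespace in that column is accepted.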

