The Perl program below processes an access log file as generated by the Apache HTTP server and creates output which can be imported in a spread sheet program, for example Calc, which comes with OpenOffice.org, or Microsoft Excel.
Note: this script is a beta version. Also, it uses the domain name in the first column to detect googlebot. If your access log has only IP addresses in the first column, this Perl program as given doesn't work.
#!/usr/bin/perl
#
# googlebot-stats.pl - Googlebot statistics
#
# © Copyright, 2005 By John Bokma, http://johnbokma.com/
# License: The Artistic License
#
# $Id$
use strict;
use warnings;
sub print_usage_and_exit {
print <<USAGE;
usage: googlebot-stats.pl ACCESS_LOG [prefix]
USAGE
exit;
}
my $filename = shift;
defined $filename or print_usage_and_exit;
my $prefix = shift;
defined $prefix or $prefix = '';
open my $fh, $filename or
die "Can't open '$filename' for reading: $!";
# create a conversion table for Xxx month name to number
my %mon_number;
@mon_number{ qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ) }
= 1..12;
my %stats;
while ( my $line = <$fh> ) {
$line =~ m!
^crawl(?:-\d+){4}\.googlebot\.com
.+?
\[(\d\d)/(\w{3})/(\d{4})(?::\d\d){3}.+?\]
\s"GET\s(.+?)\sHTTP/\d.\d"
\s(\d{3})
!x or next;
my ( $day, $mon, $year, $path, $status ) = ( $1, $2, $3, $4, $5 );
$day *= 1; # convert to number, remove trailing zero
$stats{ $year }{ $mon_number{ $mon } }{ $status }{ $path }{ $day }++;
}
close $fh or die "Can't close '$filename' after reading: $!";
my %status2description = (
200 => 'OK',
206 => 'Partial Content',
301 => 'Moved Permanently',
304 => 'Not Modified',
403 => 'Forbidden',
404 => 'Not Found',
);
for my $year ( sort { $a <=> $b } keys %stats ) {
my $months = $stats{ $year };
for my $month ( sort { $a <=> $b } keys %$months ) {
my $all_status = $months->{ $month };
for my $status ( sort keys %$all_status ) {
printf "%02d-%d status: %d", $month, $year, $status;
print exists $status2description{ $status }
? " ($status2description{ $status })\n"
: "\n";
print "PATH\t", join( "\t", 1 .. 31 ), "\tTOTAL\n";
my $paths = $all_status->{ $status };
for my $path ( sort keys %$paths ) {
print "$prefix$path";
my $total;
for my $day ( 1 .. 31 ) {
my $count = $paths->{ $path }{ $day };
print "\t", defined $count ? $count : '';
defined $count and $total += $count;
}
print "\t$total\n";
}
print "\n";
}
}
}
Example of how the program can be invoked:
googlebot-stats.pl access_log > stats.csv
Since the Perl program only extracts the path part of each URL from the access log, you can specify a prefix in order to get a full URL in the output, for example:
googlebot-stats.pl access_log http://johnbokma.com > stats.csv
Note: don't add a slash after the domain name, since each path in the log starts with a / already.
On my computer (Compaq Presario SR1505LA with an AMD Sempron, 3100+) it takes about 5 seconds to process an access log of about 200 MB.
If you give the output file the extension csv (despite it being tab separated instead of comma separated), the OpenOffice Calc program will show the Text Import dialog window when the output file is dropped onto the main window of the Calc program.
In order to import the Googlebot statistics file untick the Comma option, and tick the Tab option in the "Separated by" option of the Text Import dialog window, and press the OK button.
You might want to sort, for example, the results on the TOTAL column (descending): select all rows in the same section and select Sort... in the Data menu of Calc. Next, set "Sort by" to "Column AG", descending, and "Then by" to "Column A", ascending. Finally press OK. You'll probably see /robots.txt at the first position if you have this file, and the section is the 200 (OK) one (see screenshot).
Note: if you get the following warning when you re-run the Perl program:
The process cannot access the file because it is being used by another process.
this is because Calc locks the file it imports, and you tried to overwrite it. Close the imported file in Calc, or output to a file with different name.