John Bokma Perl
freelance Perl programmer

Googlebot statistics

Analyze your access_log | 5 comments

The Perl program below processes an access log file as generated by the Apache HTTP server and creates output which can be imported in a spread sheet program, for example Calc, which comes with OpenOffice.org, or Microsoft Excel.

Note: this script is a beta version. Also, it uses the domain name in the first column to detect googlebot. If your access log has only IP addresses in the first column, this Perl program as given doesn't work.

#!/usr/bin/perl
#
# googlebot-stats.pl - Googlebot statistics
#
#  Copyright, 2005 By John Bokma, http://johnbokma.com/
# License: The Artistic License
#
# $Id: googlebot-stats.pl 1096 2008-09-30 19:19:12Z john $ 

use strict;
use warnings;


sub print_usage_and_exit {

    print <<USAGE;
usage: googlebot-stats.pl ACCESS_LOG [prefix]
USAGE

    exit;
}


my $filename = shift;
defined $filename or print_usage_and_exit;

my $prefix = shift;
defined $prefix or $prefix = '';

open my $fh, $filename or
    die "Can't open '$filename' for reading: $!";

# create a conversion table for Xxx month name to number
my %mon_number;
@mon_number{ qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ) }
    = 1..12;

my %stats;
while ( my $line = <$fh> ) {

    $line =~ m!

        ^crawl(?:-\d+){4}\.googlebot\.com
        .+?
        \[(\d\d)/(\w{3})/(\d{4})(?::\d\d){3}.+?\]
        \s"GET\s(.+?)\sHTTP/\d.\d"
        \s(\d{3})

    !x or next;

    my ( $day, $mon, $year, $path, $status ) = ( $1, $2, $3, $4, $5 );
    $day *= 1;  # convert to number, remove trailing zero
    $stats{ $year }{ $mon_number{ $mon } }{ $status }{ $path }{ $day }++;
}
close $fh or die "Can't close '$filename' after reading: $!";

my %status2description = (

    200 => 'OK',
    206 => 'Partial Content',
    301 => 'Moved Permanently',
    304 => 'Not Modified',
    403 => 'Forbidden',
    404 => 'Not Found',
);

for my $year ( sort { $a <=> $b } keys %stats ) {

    my $months = $stats{ $year };

    for my $month ( sort { $a <=> $b } keys %$months ) {

        my $all_status = $months->{ $month };

        for my $status ( sort keys %$all_status ) {

            printf "%02d-%d status: %d", $month, $year, $status;

            print exists $status2description{ $status }
                ? " ($status2description{ $status })\n"
                : "\n";

            print "PATH\t", join( "\t", 1 .. 31 ), "\tTOTAL\n";

            my $paths = $all_status->{ $status };

            for my $path ( sort keys %$paths ) {

                print "$prefix$path";
                my $total;

                for my $day ( 1 .. 31 ) {

                    my $count = $paths->{ $path }{ $day };

                    print "\t", defined $count ? $count : '';

                    defined $count and $total += $count;
                }
                print "\t$total\n";
            }
            print "\n";
        }
    }
}

Example of how the program can be invoked:

googlebot-stats.pl access_log > stats.csv

Since the Perl program only extracts the path part of each URL from the access log, you can specify a prefix in order to get a full URL in the output, for example:

googlebot-stats.pl access_log http://johnbokma.com > stats.csv

Note: don't add a slash after the domain name, since each path in the log starts with a / already.

On my computer (Compaq Presario SR1505LA with an AMD Sempron, 3100+) it takes about 5 seconds to process an access log of about 200 MB.

Importing the result in Open Office Calc

If you give the output file the extension csv (despite it being tab separated instead of comma separated), the OpenOffice Calc program will show the Text Import dialog window when the output file is dropped onto the main window of the Calc program.

Importing the Googlebot statistics.
Importing the Googlebot statistics.

In order to import the Googlebot statistics file untick the Comma option, and tick the Tab option in the "Separated by" option of the Text Import dialog window, and press the OK button.

Example of output sorted in Calc (some columns and rows hidden)
Example of output sorted in Calc (some columns and rows hidden)

You might want to sort, for example, the results on the TOTAL column (descending): select all rows in the same section and select Sort... in the Data menu of Calc. Next, set "Sort by" to "Column AG", descending, and "Then by" to "Column A", ascending. Finally press OK. You'll probably see /robots.txt at the first position if you have this file, and the section is the 200 (OK) one (see screenshot).

Note: if you get the following warning when you re-run the Perl program:

The process cannot access the file because it is being used by another process.

this is because Calc locks the file it imports, and you tried to overwrite it. Close the imported file in Calc, or output to a file with different name.

Related

Please post a comment | read 5 comments, latest by Jason | RSS feed