John Bokma Perl
freelance Perl programmer

Google Search Cloud

Look at your access log in a different way

The following Perl program parses an Apache HTTP server access_log in extended format and collects all referers [sic] that might have been generated by a search on Google. The queries visitors used are extracted and grouped per web page and status code. The program then generates a valid HTML 4.01 Strict web page that shows those queries as a cloud, in the way some Web 2.0 websites present tags.
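The query clean-up step described above can be tried on its own. The following is a minimal sketch, not an excerpt from gscloud.pl: the helper name query_from_referer and the sample referer are made up for illustration, and the percent-decoding is done with a plain substitution so that the snippet needs only core modules (the program itself uses uri_unescape from URI::Escape). Unlike the program's pattern, this sketch only accepts q= directly after a '?' or '&'.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of gscloud.pl): extract the search
# phrase from a referer URL. Returns nothing if there is no q= parameter.
sub query_from_referer {
    my $referer = shift;

    my ( $query ) = $referer =~ /[?&]q=([^&"]+)/ or return;

    $query =~ tr/+/ /;                              # '+' encodes a space
    $query =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # undo percent-encoding
    $query = join ' ', split ' ', $query;           # collapse whitespace

    return Encode::decode( 'UTF-8', $query );
}

# Made-up referer, for illustration only
my $referer =
    'http://www.google.com/search?hl=en&q=perl+tag%20cloud&btnG=Search';
print query_from_referer( $referer ), "\n";    # prints "perl tag cloud"
```

The order matters: the '+' characters are turned into spaces before percent-decoding, so that an encoded %2B in the query survives as a literal plus sign.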

Output as generated by the Google Search Cloud Perl program.

The print_cloud_as_html_list Perl sub does the actual cloud rendering for a given reference to a hash with frequencies for each phrase. If you need your own cloud generating function, for example to generate a tag cloud, this Perl sub is a good start.

Copy and paste the following Perl program into your favorite editor and save it as gscloud.pl. Note that the Perl code is followed by a listing of the external stylesheet (CSS) that is required by the HTML page Google Search Cloud outputs. Don't accidentally copy this stylesheet together with the Perl program that follows or else you get several "Bareword found where operator expected" messages when you try to run the program.

#!/usr/bin/perl
#
# gscloud.pl - Google Search Cloud
#
#  Copyright, 2006 by John Bokma, http://johnbokma.com/
# License: The Artistic License
#
# $Id: gscloud.pl 1092 2008-09-30 19:15:23Z john $ 

use strict;
use warnings;

use Carp;
use Encode;
use HTML::Entities;
use URI::Escape;
use Getopt::Long;

my $time = time;
my $steps = 18;
my $mapping = 'log';
my $sort = 'alpha';
my $limit = 75;
my $scale = 0;
my $prefix = '';


sub print_usage_and_exit {

    print <<USAGE;
usage: gscloud.pl [OPTIONS] ACCESS_LOG

options:

    steps   - number of cloud sizes, default $steps
    mapping - log or lin, default $mapping
    sort    - alpha or num, default $sort
    limit   - maximum number of phrases, default $limit
    scale   - scale when fewer phrases than steps, default $scale
    prefix  - prefix for paths (creates links), default none
USAGE

    exit;
}


GetOptions(

    "steps=i"   => \$steps,
    "mapping=s" => \$mapping,
    "sort=s"    => \$sort,
    "limit=i"   => \$limit,
    "scale=i"   => \$scale,
    "prefix=s"  => \$prefix,
);

my $filename = shift;
defined $filename or print_usage_and_exit;

# three-argument open: the filename is never interpreted as a mode
open my $fh, '<', $filename or
    die "Can't open '$filename' for reading: $!";

my %stats;
while ( my $line = <$fh> ) {

    $line =~ m!

        \[\d{2}/\w{3}/\d{4}(?::\d\d){3}.+?\]
        \s"GET\s(\S+)\sHTTP/\d\.\d"
        \s(\S+)
        \s\S+
        \s"https?://w{1,3}\.google\.
        (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/
        [^\"]*q=([^\"&]+)[^\"]*"

    !xi or next;

    my ( $path, $status, $query ) = ( $1, $2, $3 );

    $query =~ s/\+/ /g;
    $query = join ' ' => split ' ', uri_unescape $query;
    $query = Encode::decode_utf8 $query;

    $stats{ "$path:$status" }{ sum }++;
    $stats{ "$path:$status" }{ queries }{ $query }++;
}

close $fh or die "Can't close '$filename' after reading: $!";

print_html_start();

my @ps = sort { $stats{ $b }{ sum } <=> $stats{ $a }{ sum } } keys %stats;
for my $ps ( @ps ) {

    my ( $path, $status ) = $ps =~ /(.*):(\d+)/;
    my $sum = $stats{ $ps }{ sum };

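    # the path comes straight from the log, so escape it before output
    my $section = encode_entities( $path );
    $prefix and $section = '<a href="'
        . encode_entities( "$prefix$path" ) . qq(">$section</a>);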

    print "<h2>$section",
        qq( <span class="small">total: $sum, status: $status</span>),
        "</h2>\n";
    print_cloud_as_html_list(

        frequencies => $stats{ $ps }{ queries },
        steps => $steps,
        mapping => $mapping,
        sort => $sort,
        limit => $limit,
        scale => $scale,
    );
}

print_html_end( time - $time );
exit;


sub print_html_start {

    print <<"START";
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <title>Google Search Cloud (beta)</title>
    <link rel="stylesheet" type="text/css" href="gscloud.css">
</head>
<body>
<h1>Google <span class="blue">Search Cloud</span>
<span class="beta">beta</span></h1>
START
}


sub print_html_end {

    my $delta = shift;
    print <<FOOTER;
<div class="footer">
    <a href="http://johnbokma.com/perl/google-search-cloud.html">Google
    Search Cloud</a>, written by John Bokma, took $delta seconds to
    generate this page.
</div>
FOOTER
}


sub print_cloud_as_html_list {

    my %params = @_;

    my $frequencies = $params{ frequencies }
        or croak "Parameter 'frequencies' not given";

    my $steps = $params{ steps }
        or croak "Parameter 'steps' not given";

    my $mapping = $params{ mapping } || 'log';
    $mapping eq 'log' or $mapping eq 'lin'
        or croak "Parameter 'mapping' has an unsupported value ($mapping)";

    my $sort_method = $params{ sort } || 'alpha';
    $sort_method eq 'alpha' or $sort_method eq 'num'
        or croak "Parameter 'sort' has an unsupported value ($sort_method)";

    my @keys = sort
        { $frequencies->{ $b } <=> $frequencies->{ $a } } keys %$frequencies;

    # if there is a limit, take the top limit frequencies
    $params{ limit } and @keys = splice @keys, 0, $params{ limit };
    @keys or return;    # nothing to do

    $steps = @keys if $params{ scale } and $steps > @keys;
    my $max_step = $steps - 1;

    my ( $max, $min ) = @$frequencies{ $keys[ 0 ], $keys[ -1 ] };

    print qq(<ul class="cloud">\n);

    my $step = $min == $max
        ? sub { 1 }
        : $mapping eq 'log'
            ? sub {

                1 + int( $max_step * (
                    ( log( $frequencies->{ $_[ 0 ] } ) - log( $min )) /
                    ( log( $max ) - log( $min ) ) )
                )
            }
            : sub {

                1 + int( $max_step *
                    ( $frequencies->{ $_[ 0 ] } - $min ) /
                    ( $max - $min )
                )

            };

    $sort_method eq 'alpha' and @keys = sort { lc $a cmp lc $b } @keys;

    print '  <li class="size' . $step->( $_ ) . '">',
        encode_entities( $_ ), "</li>\n" for @keys;

    print "</ul>\n";
}

The Google Search Cloud program generates an HTML page that requires an external stylesheet named gscloud.css. You can tweak this stylesheet and even add more cloud levels if required. Note that in the latter case you have to pass the number of levels via the steps command line option, or update the default in the Perl program.

Copy and paste the following code into your favorite editor and save it as gscloud.css.

/* gscloud.css - external stylesheet for gscloud.pl
 *
 * (c) Copyright, 2006 by John Bokma, http://johnbokma.com/
 * License: The Artistic License
 */

h1, h2 {
    font-family: "Trebuchet MS"; color: #d37;
    margin: 18px 0 2px 0
}
h1 { font-size: 18px }
h2 { font-size: 14px }
h2 a { color: #d37; }

span.beta { color: #aaa; font-size: 12px; vertical-align: super }
span.blue {
    color: #37d; font-size: 24px; font-weight: normal;
    vertical-align: sub;
}
h2 span.small {
    color: #999; font-size: 12px; vertical-align: super;
    font-weight: normal
}
div.footer { 
    font: normal 10px "Trebuchet MS";
    border-top: solid 1px #aaa;
    margin: 10px 0 0 0;
    color: #37d;
    text-align: center;
}

div.footer a { color: #37d }

ul.cloud { margin: 0 14px 0 14px; padding: 0 }
ul.cloud li { display: inline; color: #37d; padding: 0 4px 0 4px }
ul.cloud li.size1{ font: normal 10px  "Trebuchet MS" }
ul.cloud li.size2{ font: italic 11px  "Trebuchet MS" }
ul.cloud li.size3{ font: bold 12px  "Trebuchet MS" }
ul.cloud li.size4{ font: normal 13px  "Trebuchet MS" }
ul.cloud li.size5{ font: italic 14px  "Trebuchet MS" }
ul.cloud li.size6{ font: bold 15px  "Trebuchet MS" }
ul.cloud li.size7{ font: normal 16px  "Trebuchet MS" }
ul.cloud li.size8{ font: italic 17px  "Trebuchet MS" }
ul.cloud li.size9{ font: bold 18px  "Trebuchet MS" }
ul.cloud li.size10{ font: normal 19px / 14px  "Trebuchet MS" }
ul.cloud li.size11{ font: italic 20px / 14px  "Trebuchet MS" }
ul.cloud li.size12{ font: bold 21px / 14px  "Trebuchet MS" }
ul.cloud li.size13{ font: normal 22px / 18px  "Trebuchet MS" }
ul.cloud li.size14{ font: italic 23px / 18px  "Trebuchet MS" }
ul.cloud li.size15{ font: bold 24px / 18px  "Trebuchet MS" }
ul.cloud li.size16{ font: normal 25px / 18px  "Trebuchet MS" }
ul.cloud li.size17{ font: italic 26px / 18px  "Trebuchet MS" }
ul.cloud li.size18{ font: bold 27px / 18px  "Trebuchet MS" }
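If you do add cloud levels beyond the 18 defined above, the extra rules can be generated instead of typed by hand. The following is a minimal sketch under stated assumptions: the helper cloud_css_rule is hypothetical and not part of gscloud.pl, it follows the pattern visible in the stylesheet (normal, italic, bold in rotation, and a font size of 9 + n pixels), and the 18px line height used for the new levels is simply carried over from levels 13 to 18.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gscloud.pl): return one CSS rule for
# cloud level $n, following the pattern of the stylesheet listing.
sub cloud_css_rule {
    my ( $n, $line_height ) = @_;

    my @variants = qw( normal italic bold );
    my $variant  = $variants[ ( $n - 1 ) % 3 ];
    my $size     = 9 + $n;    # size1 is 10px, size18 is 27px

    my $font = "$variant ${size}px";
    $font .= " / ${line_height}px" if $line_height;

    return qq(ul.cloud li.size$n { font: $font "Trebuchet MS" }\n);
}

# Levels 19 to 24; the 18px line height is an assumption
print cloud_css_rule( $_, 18 ) for 19 .. 24;
```

Append the output to gscloud.css and run the program with -steps 24 so that the new classes are actually used.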

How the print_cloud_as_html_list Perl sub renders the cloud can be controlled via command line options.

Examples of usage

Reading the access_log stored inside the logs directory and creating links to http://example.com (note: no trailing '/') for each path:

gscloud.pl -prefix http://example.com logs/access_log > cloud.html

Same as above, but now use linear mapping instead of logarithmic, and also sort the output by frequency (numerically) instead of alphabetically:

gscloud.pl -prefix http://example.com -sort num -mapping lin logs/access_log > cloud.html
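To see what the mapping option changes, the two formulas used by print_cloud_as_html_list can be run side by side. This is a small sketch: the function name size_step and the frequencies are made up for illustration, but the arithmetic is the same as in the program.

```perl
use strict;
use warnings;

# Same formulas as in print_cloud_as_html_list, as a standalone function.
# The name size_step is hypothetical.
sub size_step {
    my ( $freq, $min, $max, $steps, $mapping ) = @_;

    return 1 if $min == $max;    # all phrases equally frequent

    my $ratio = $mapping eq 'log'
        ? ( log( $freq ) - log( $min ) ) / ( log( $max ) - log( $min ) )
        : ( $freq - $min ) / ( $max - $min );

    return 1 + int( ( $steps - 1 ) * $ratio );
}

# Made-up frequencies between a minimum of 1 and a maximum of 100
for my $freq ( 1, 2, 10, 100 ) {
    printf "%3d  lin: %2d  log: %2d\n", $freq,
        size_step( $freq, 1, 100, 18, 'lin' ),
        size_step( $freq, 1, 100, 18, 'log' );
}
```

With 18 steps, a phrase seen 10 times ends up at size 2 with linear mapping but at size 9 with logarithmic mapping. When a few phrases dominate, linear mapping pushes nearly everything else into the smallest sizes, which is presumably why log is the default.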

The final example creates larger clouds by raising the limit from the default of 75 phrases per cloud to 200:

gscloud.pl -prefix http://example.com -limit 200 logs/access_log > cloud.html

On my computer (a Compaq Presario SR1505LA with an AMD Sempron 3100+), processing a 328 MB Apache access_log takes about 22 seconds.
