The following Perl program parses an Apache HTTP server access_log in extended format and collects all referers [sic] that might have been generated by a search on Google. The queries visitors used are extracted and grouped per web page and status code. The program then generates a valid HTML 4.01 Strict web page that shows those queries as a cloud, similar to the tag clouds some Web 2.0 websites use.
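To illustrate the first step, here is a minimal sketch of pulling the requested page, the status code, and the referer out of one log line. It assumes the "extended format" is Apache's combined log format; the sub name parse_log_line and the regular expression are illustrative and are not taken from the program itself.

```perl
use strict;
use warnings;

# Parse one line in Apache "combined" format (an assumption about the
# log layout). Returns (path, status, referer), or an empty list when
# the line does not match.
sub parse_log_line {
    my ($line) = @_;
    return unless $line =~ m{
        ^\S+ \s+ \S+ \s+ \S+ \s+      # host, ident, user
        \[[^\]]+\] \s+                # [timestamp]
        "\S+ \s+ (\S+) [^"]*" \s+     # "METHOD path protocol"
        (\d{3}) \s+ \S+ \s+           # status, bytes
        "([^"]*)"                     # "referer"
    }x;
    return ( $1, $2, $3 );
}
```

Grouping per web page and status code could then use a nested hash keyed on the first two return values.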
In my logs, I found lots of mixed-case results. Wrapping the $3 in lc makes all results lower case and eliminates duplicates that differ only by case. Otherwise, very cool stuff!
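A minimal sketch of the lc suggestion, assuming the queries are counted in a hash. The sub name count_queries is made up for illustration; in the real program the value would come from the $3 capture rather than from an argument list.

```perl
use strict;
use warnings;

# Count queries case-insensitively: lc folds every variant onto one key.
sub count_queries {
    my @queries = @_;
    my %count;
    $count{ lc $_ }++ for @queries;
    return \%count;
}

# Hypothetical sample data; three spellings collapse into one entry.
my $count = count_queries( 'Perl Tag Cloud', 'perl tag cloud', 'PERL tag cloud' );
printf "%s: %d\n", $_, $count->{$_} for sort keys %$count;
```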
Hi Erazmus - in another script I use, I do indeed apply some normalization, and it is even a bit more complicated: I lower-case the phrase, split it into words, sort them, and use the result to group. For presentation I use the most frequently used phrase of each group.
The advantage is that this makes the cloud (in this case) less cluttered. A disadvantage is that you lose some information.
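The normalization described above might look roughly like the sketch below. It is not the code from that other script: the sub names phrase_key and group_phrases are invented here, and breaking ties alphabetically is an assumption.

```perl
use strict;
use warnings;

# Build a grouping key: lower-case, split on whitespace, sort the words.
sub phrase_key {
    my ($phrase) = @_;
    my @words = split ' ', lc $phrase;
    return join ' ', sort @words;
}

# Group phrases by key. Returns a hash ref: key => [display phrase, total].
sub group_phrases {
    my @phrases = @_;
    my ( %total, %variants );
    for my $phrase (@phrases) {
        my $key = phrase_key($phrase);
        $total{$key}++;
        $variants{$key}{$phrase}++;
    }
    my %result;
    for my $key ( keys %total ) {
        my $seen = $variants{$key};
        # The most used variant wins; ties are broken alphabetically
        # (an assumption made for this sketch).
        my ($display) =
          sort { $seen->{$b} <=> $seen->{$a} || $a cmp $b } keys %$seen;
        $result{$key} = [ $display, $total{$key} ];
    }
    return \%result;
}
```

With this, "perl tag cloud" and "Tag Cloud Perl" end up in the same group, which is where the information loss mentioned above comes from: the word order of the less common variants is no longer visible.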
Thanks for the compliment, it's indeed cool stuff and I am very happy with the looks of the output.
Hi Ed - The main reason for rolling my own code is that it is (I hope) lightning fast. Moreover, I would still have to extract the referer myself, since URI::ParseSearchString doesn't do this, despite its description: "parse Apache refferer logs and extract search engine query strings".
Also, I do a stricter check on Google compared to URI::ParseSearchString, which hopefully excludes some malware that hits sites with fake referers. In addition, I do some normalization on whitespace.
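To give an idea of what a stricter check plus whitespace normalization can look like, here is a sketch that uses only core Perl. The host and path pattern is an assumption and not the check the program actually performs, and the sub name google_query is hypothetical.

```perl
use strict;
use warnings;

# Return the normalized search query from a Google referer, or nothing.
sub google_query {
    my ($referer) = @_;

    # Assumed check: the host must be (www.)google.<tld> and the path
    # must be /search, so google.com.example.org is rejected.
    return
      unless $referer =~ m{^https?://(?:www\.)?google\.[a-z.]{2,6}/search\?(.+)}i;
    my $query_string = $1;

    for my $pair ( split /[&;]/, $query_string ) {
        my ( $name, $value ) = split /=/, $pair, 2;
        next unless defined $value && $name eq 'q';
        $value =~ tr/+/ /;                              # + encodes a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # percent-decoding
        $value =~ s/^\s+|\s+$//g;                       # trim both ends
        $value =~ s/\s+/ /g;                            # collapse whitespace runs
        return $value if length $value;
        return;
    }
    return;
}
```

The decoding here works on bytes only; handling UTF-8 encoded queries would need an extra decoding step.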
On the other hand, the Perl module you mentioned looks very useful, so many thanks for sharing. It might help people to expand Google Cloud Search to include other search engines as well.
I'm getting an error when I try to run it.
Bareword found where operator expected at ./gscloud.pl line 218, near
"18px"
(Missing operator before px?)
Number found where operator expected at ./gscloud.pl line 218, near "px
0"
(Do you need to predeclare px?)
Number found where operator expected at ./gscloud.pl line 218, near "0
2"
(Missing operator before 2?)
Bareword found where operator expected at ./gscloud.pl line 218, near
"2px"
(Missing operator before px?)
Number found where operator expected at ./gscloud.pl line 218, near "px
0"
(Do you need to predeclare px?)
Bareword found where operator expected at ./gscloud.pl line 220, near
"18px"
(Missing operator before px?)
Bareword found where operator expected at ./gscloud.pl line 221, near
"14px"
(Missing operator before px?)
syntax error at ./gscloud.pl line 217, near "family:"
syntax error at ./gscloud.pl line 220, near "size:"
syntax error at ./gscloud.pl line 221, near "size:"
syntax error at ./gscloud.pl line 222, near "color:"
Illegal declaration of anonymous subroutine at ./gscloud.pl line 227.
Hi Collin - you accidentally included the CSS (stylesheet) as part of the Perl program. I'll do my best to write a bit more text between the end of the Perl listing and the start of the CSS listing.
Hi, really a great tool =) I love it, keep up the good work.
Btw is it possible to limit the output only to the index.htm or some specific page?
You haven't answered Mike A.'s question. I have a similar concern. I'd like to use it, but would need to modify it considerably, and I don't know where you stand on that. It would be nice if you included a license.
Mark, Mike A - I had to think about which license (there are (too) many). I decided to go with the Artistic License. Thanks to both for pointing this out to me, I will update my other code snippets as well later.
I hope I did this right, if not, let me know.
How would I go about integrating this into a shared server... it's an Apache server, but the ideal setup would be one where I can navigate to it from the browser and have the HTML served.... I don't have ssh access... urg
Bart - A quick and dirty hack that works on my machine:
Comment out the GetOptions call (put a # at the very start of the line) and the eight following lines. Optionally, also comment out the use Getopt::Long; line (close to the top of the program).
Assign the full path to your access_log to the $filename variable, i.e. replace
my $filename = shift;
with
my $filename = '/path/to/your/access_log';
Put the following line before the line containing open my $fh, ...:
print "Content-type: text/html\n\n";
You might want to change the location of the CSS file in the following line (see the print_html_start sub):
<link rel="stylesheet" type="text/css" href="gscloud.css">
For example, if you put gscloud.css in your document root, just adding a / in front of gscloud.css in the above line should make your CSS file work with the output generated by the Perl program.
Note that this probably isn't going to work when your access_log is huge, since the CGI script might be killed by the Apache web server when it takes too long to run.
If you use the CGI Perl module you can pass the command line options to the script via the URL, see the documentation of CGI.pm.
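A rough sketch of the CGI.pm approach, under a few assumptions: the option name min and its default are invented for illustration and are not options of the program, and the sub name options_from_query is hypothetical. Note that CGI.pm has not been bundled with Perl since version 5.22, so it may have to be installed from CPAN.

```perl
use strict;
use warnings;
use CGI;

# Map URL parameters to the values the command line options would set.
# 'min' is a hypothetical option name, used for illustration only.
sub options_from_query {
    my ($cgi) = @_;
    my %opt = ( min => 1 );                                # assumed default
    my $min = $cgi->param('min');
    $opt{min} = $min if defined $min && $min =~ /^\d+$/;   # digits only
    return \%opt;
}

my $cgi = CGI->new;
my $opt = options_from_query($cgi);

# header() prints the Content-Type line followed by an empty line.
print $cgi->header( -type => 'text/html' );
# ... the rest of the program would print the HTML page here ...
```

A request such as gscloud.pl?min=3 would then set the option to 3. Validating each parameter, as done above with the digits-only check, matters because anyone can edit the URL.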
Hope this helps.
Hey there!
Neat script mate, well done.
I've hacked it up and glued it into a Web-based email app to get a feel for, and a history of, the person you're speaking with from previous emails. Works a charm.
Keep up your good work!
@Ben - that's a great idea!
And thanks for the compliment.
I would love to use this, but you haven't licensed it.