John Bokma Perl
freelance Perl programmer

Comments: Google Search Cloud

14 comments

The following Perl program parses an Apache HTTP server access_log in extended format and collects all referers [sic] that might have been generated by a search on Google. The queries visitors used are extracted and grouped per web page and status code. The program then generates a valid HTML 4.01 Strict web page that shows those queries as a cloud, in the style some Web 2.0 websites use to present tags.

Read the rest of Google Search Cloud.

Comments

I would love to use this, but you haven't licensed it.

Posted by Mike A at 18:01 GMT on 5 October 2006

In my logs, I found lots of mixed-case results. Wrapping the $3 in lc makes all results lower case and eliminates duplicates that differ only by case. Otherwise, very cool stuff!

Posted by Erazmus at 18:22 GMT on 5 October 2006

Hi Erazmus - in another script I use, I do indeed apply some normalization, and it's even a bit more complicated: I lower case the phrase, split it into words, sort them, use the result to group the phrases, and use the most used phrase of each group for presentation.

The advantage is that this makes the cloud (in this case) less cluttered. A disadvantage is that you lose some information.
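The normalization described above (lower case, split, sort, group, present the most used phrase) might look roughly like the following sketch. The sample queries and the sub name group_queries are made up for illustration; this is not the code from the other script, and breaking ties alphabetically is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sample queries; in practice these would come from the access_log.
my @queries = (
    'Perl Google cloud',
    'google cloud perl',
    'perl google cloud',
    'perl google cloud',
    'apache access log',
);

# Group phrases that contain the same words, ignoring case and word order.
# Returns a hash: most used phrase of each group => total hits for the group.
sub group_queries {
    my @phrases = @_;

    my %seen;    # normalized key => { original phrase => times seen }
    for my $phrase ( @phrases ) {
        my @words = split ' ', lc $phrase;
        my $key   = join ' ', sort @words;
        $seen{ $key }{ $phrase }++;
    }

    my %cloud;
    for my $variants ( values %seen ) {
        # Most used phrase first; ties are broken alphabetically (an assumption).
        my ( $label ) = sort {
            $variants->{ $b } <=> $variants->{ $a } or $a cmp $b
        } keys %$variants;

        my $total = 0;
        $total += $_ for values %$variants;
        $cloud{ $label } = $total;
    }
    return %cloud;
}

my %cloud = group_queries( @queries );
printf "%-20s %d\n", $_, $cloud{ $_ } for sort keys %cloud;
```

Grouping on the sorted words is what loses information: "perl google cloud" and "google cloud perl" end up as one entry, and only the most used spelling is shown.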

Thanks for the compliment, it's indeed cool stuff and I am very happy with the looks of the output.

Posted by John Bokma at 18:54 GMT on 5 October 2006

you should use

URI::ParseSearchString (CPAN).

Posted by Ed at 18:58 GMT on 5 October 2006

Hi Ed - The main reason for rolling my own code is that this is (I hope) lightning fast. Moreover, I would still have to extract the referer myself, since URI::ParseSearchString doesn't do this, despite its description: "parse Apache refferer logs and extract search engine query strings".

Also, I do a stricter check on Google than URI::ParseSearchString does, which hopefully excludes some malware that hits sites with fake referers. In addition, I do some normalization on whitespace.
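To make the steps above concrete, here is a rough sketch of pulling the referer out of a combined-format log line, checking that it really looks like a Google search URL, and normalizing whitespace in the query. The regular expressions, the sub name google_query, and the sample log line are assumptions for illustration; the actual checks in gscloud.pl may well differ.

```perl
use strict;
use warnings;

# Returns ( path, status, query ) if the line's referer looks like a Google
# search, or an empty list otherwise. Hypothetical helper, not from gscloud.pl.
sub google_query {
    my $line = shift;

    # Combined format: host ident user [time] "request" status bytes "referer" "agent"
    my ( $path, $status, $referer ) = $line =~ m{
        ^\S+ \s+ \S+ \s+ \S+ \s+ \[[^\]]+\] \s+
        "\S+ \s+ (\S+) [^"]*" \s+      # request line; capture the path
        (\d{3}) \s+ \S+ \s+            # status code, bytes sent
        "([^"]*)"                      # referer
    }x or return;

    # Strict-ish check (an assumption): the host must be google.<tld> or
    # www.google.<tld>, with at most two short TLD labels, and the path /search.
    my ( $params ) = $referer =~ m{
        ^https?:// (?:www\.)? google (?:\.[a-z]{2,3}){1,2} /search \? (.*)
    }xi or return;

    my ( $query ) = $params =~ /(?:^|&)q=([^&]*)/ or return;

    $query =~ tr/+/ /;                              # + encodes a space
    $query =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # decode %XX escapes
    $query =~ s/^\s+//;                             # normalize whitespace
    $query =~ s/\s+$//;
    $query =~ s/\s+/ /g;

    return unless length $query;
    return ( $path, $status, $query );
}

# Made-up log line for demonstration.
my $line = '192.0.2.1 - - [05/Oct/2006:18:01:00 +0000] '
         . '"GET /perl/cloud.html HTTP/1.1" 200 5120 '
         . '"http://www.google.com/search?hl=en&q=perl++search+cloud" "Mozilla/5.0"';

if ( my ( $path, $status, $query ) = google_query( $line ) ) {
    print "$path [$status] $query\n";
}
```

A referer such as http://www.google.com.evil.example/search?q=x is rejected by the host check, which is the kind of fake referer a looser match would let through.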

On the other hand, the Perl module you mentioned looks very useful, so many thanks for sharing. It might help people to expand Google Search Cloud to include other search engines as well.

Posted by John Bokma at 19:26 GMT on 5 October 2006

I'm getting an error when I try to run it.

Bareword found where operator expected at ./gscloud.pl line 218, near
"18px"
	(Missing operator before px?)
Number found where operator expected at ./gscloud.pl line 218, near "px
0"
	(Do you need to predeclare px?)
Number found where operator expected at ./gscloud.pl line 218, near "0
2"
	(Missing operator before  2?)
Bareword found where operator expected at ./gscloud.pl line 218, near
"2px"
	(Missing operator before px?)
Number found where operator expected at ./gscloud.pl line 218, near "px
0"
	(Do you need to predeclare px?)
Bareword found where operator expected at ./gscloud.pl line 220, near
"18px"
	(Missing operator before px?)
Bareword found where operator expected at ./gscloud.pl line 221, near
"14px"
	(Missing operator before px?)
syntax error at ./gscloud.pl line 217, near "family:"
syntax error at ./gscloud.pl line 220, near "size:"
syntax error at ./gscloud.pl line 221, near "size:"
syntax error at ./gscloud.pl line 222, near "color:"
Illegal declaration of anonymous subroutine at ./gscloud.pl line 227.

Posted by Colin at 20:46 GMT on 5 October 2006

Hi Colin - you accidentally included the CSS (stylesheet) as part of the Perl program. I'll do my best to write a bit more text between the end of the Perl listing and the start of the CSS listing.

Posted by John Bokma at 21:20 GMT on 5 October 2006

Hi, really a great tool =) I love it; keep up the good work.

Btw is it possible to limit the output only to the index.htm or some specific page?

Posted by easymobile at 22:02 GMT on 5 October 2006

You haven't answered Mike A.'s question. I have a similar concern. I'd like to use it, but would need to modify it considerably, and I don't know where you stand on that. It would be nice if you included a license.

Posted by Mark at 14:17 GMT on 6 October 2006

Mark, Mike A - I had to think about which license to use (there are (too) many). I decided to go with the Artistic License. Thanks to both of you for pointing this out to me; I will update my other code snippets as well later.

I hope I did this right, if not, let me know.

Posted by John Bokma at 16:06 GMT on 6 October 2006

How would I go about integrating this into a shared server... it's an Apache server, but the ideal setup would be one where I can navigate to it from the browser and have the HTML served.... I don't have ssh access... urg

Posted by Bart at 18:49 GMT on 6 October 2006

Bart - A quick and dirty hack that works on my machine:

Comment out the GetOptions call (put a # at the very start of the line) and the eight following lines. Optionally, also comment out the use Getopt::Long; line (close to the top of the program).

Assign the full path to your access_log to the $filename variable, i.e. replace

my $filename = shift;

with

my $filename = '/path/to/your/access_log';

Put before the line containing open my $fh, ... the following line:

print "Content-type: text/html\n\n";

You might want to change the location of the CSS file in the following line (See the print_html_start sub):

<link rel="stylesheet" type="text/css" href="gscloud.css">

For example, if you put gscloud.css in your document root, just adding a / in front of gscloud.css in the above line should make your CSS file work with the output generated by the Perl program.

Note that this probably isn't going to work when your access_log is huge, since the CGI script might be killed by the Apache web server when it takes too long to run.

If you use the CGI Perl module you can pass the command line options to the script via the URL, see the documentation of CGI.pm.

Hope this helps.

Posted by John Bokma at 19:14 GMT on 6 October 2006

Hey there!

Neat script mate, well done.

I've hacked it up and glued it into a Web-based email app to get a feel for/history of the person you're speaking with from previous emails. Works a charm.

Keep up your good work!

Posted by Ben at 08:57 GMT on 9 March 2007

@Ben - that's a great idea!

And thanks for the compliment.

Posted by John Bokma at 04:54 GMT on 29 March 2007
