
Wiki Duplicate External Link finder

Wednesday, January 24, 2007

Quite some time ago I decided to install a local wiki, using the same software as Wikipedia: MediaWiki. Since I had already installed the Apache HTTP server and PHP on my Windows XP system, installing the MediaWiki software back then was a piece of cake.

I use the local wiki mainly for storing bookmarks. Since I discovered the Copy URL+ extension for Firefox this has become much easier. In the past I now and then ran a Perl program that converted the bookmarks file created by Mozilla Firefox to MediaWiki mark-up. I then copied the output to the Miscellaneous page of my local wiki, and when I had some spare time I moved links to the right pages. Since I now and then also added links directly to the right page, after some time I ended up with duplicate links. To save myself from discovering time after time that links on the Miscellaneous page were already on the right page as well, I wrote a Perl program to find all external links and report duplicates.

GETing a Wiki page and parsing the HTML

Two Perl modules I use a lot for web scraping are LWP::UserAgent - to GET the web page - and HTML::TreeBuilder - to parse the HTML in the web page into a tree that's easily accessible. Hence the first lines of the Perl program became:

use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder;

I decided to write a small Perl function that takes care of both the web page fetching and HTML parsing steps and does some basic error checking:

sub get_tree {

    my $url = shift;

    my $ua = LWP::UserAgent->new;

    # GET the page and stop with an error message if the request fails
    my $response = $ua->get( $url );
    $response->is_success
        or die "get_tree: get '$url' returned ", $response->status_line;

    # parse the HTML into a tree and return it
    return HTML::TreeBuilder->new_from_content( $response->content );
}

Getting links to all pages in a Wiki

Next, I somehow needed to get a list of all pages I had added to the wiki. I discovered that the MediaWiki software makes such a list available as a special page aptly named Special:Allpages. Probably, if there are many pages in the wiki this page only shows a small selection, but the pages in my wiki still fit on one page, making the code quite simple:

my $HOST = 'cadwal';

my $all_pages_tree = get_tree(
    "http://$HOST/index.php/Special:Allpages"
);

my @a_elements = $all_pages_tree->look_down(
    _tag => 'a',
    href => qr{/index\.php/[^:]+$},
);

The look_down method returns all a elements with an href attribute ending in /index.php/ followed by one or more characters, none of which is a colon. The latter restriction excludes pages like Special:Recentchanges, Help:Contents, and other meta pages.
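
As an illustration - not part of the original program - the relative links can be pulled out of these elements with the attr method of HTML::Element and turned into full page URLs:

# Illustration: build a full URL for each Wiki page found on
# Special:Allpages. The href attribute holds the relative link,
# for example /index.php/Miscellaneous.
my @page_urls = map {
    "http://$HOST" . $_->attr( 'href' )
} @a_elements;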

Keeping track of external links

In the previous step the relative URL of each Wiki page was obtained.
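
A sketch of how this step could continue - again not the original program: fetch each page with get_tree, collect the external links, which MediaWiki marks with a class attribute containing the word external, and report every external URL that occurs on more than one page:

# A sketch, not the original program: record for each external URL
# the Wiki pages it appears on, and print the URLs that are found
# on more than one page. Assumes external links have a class
# attribute containing the word "external", as MediaWiki generates.
my %pages_for_url;
for my $page_url ( @page_urls ) {

    my $tree = get_tree( $page_url );
    for my $link ( $tree->look_down(
        _tag  => 'a',
        class => qr/\bexternal\b/,
    ) ) {
        push @{ $pages_for_url{ $link->attr( 'href' ) } }, $page_url;
    }
    $tree->delete;
}

for my $url ( sort keys %pages_for_url ) {

    my @pages = @{ $pages_for_url{ $url } };
    @pages > 1 and print "$url\n", map { "    $_\n" } @pages;
}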
