Quite some time ago I decided to install a local Wiki, using the same software as Wikipedia: MediaWiki. Since I had already installed the Apache HTTP server and PHP on my Windows XP system, installing the MediaWiki software back then was a piece of cake.
I use the local Wiki mainly for storing bookmarks. Since I discovered the Copy URL+ extension for Firefox, this has become much easier. In the past I now and then ran a Perl program that converted the bookmarks file created by Mozilla Firefox to MediaWiki markup. I then copied this output to the Miscellaneous page of my local Wiki, and when I had some spare time I moved the links to the right pages. Since I now and then also added links directly to the right page, after some time I ended up with duplicate links. To save myself from discovering time after time that links on the Miscellaneous page were already on the right page as well, I wrote a Perl program to find all external links and report duplicates.
Two Perl modules I use a lot for web scraping are LWP::UserAgent - to GET the web page - and HTML::TreeBuilder - to parse the HTML in the web page into a tree that is easily accessible. Hence the first lines of the Perl program became:
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
I decided to write a small Perl function that takes care of both the web page fetching and HTML parsing steps and does some basic error checking:
sub get_tree {
    my $url = shift;
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get( $url );
    $response->is_success
        or die "get_tree: get '$url' returned ", $response->status_line;
    return HTML::TreeBuilder->new_from_content( $response->content );
}
Next, I needed some way to get a list of all pages I had added to the Wiki. I discovered that the MediaWiki software makes such a list available as a special page aptly named Special:Allpages. If there are many pages in the Wiki, this page probably shows only a small selection at a time, but the pages in my Wiki still fit on one page, making the code quite simple:
my $HOST = 'cadwal';
my $all_pages_tree = get_tree(
    "http://$HOST/index.php/Special:Allpages"
);
my @a_elements = $all_pages_tree->look_down(
    _tag => 'a',
    href => qr{/index\.php/[^:]+$},
);
The look_down method returns all a elements with an href attribute ending in /index.php/ followed by one or more characters that do not include a colon. The latter restriction excludes Special pages like Special:Recentchanges, Help:Contents, and other meta pages.
The previous step obtained the relative URL of each Wiki page, embedded in the href attribute of its a element.
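Just to illustrate what that amounts to, here is a minimal sketch of pulling those relative URLs out of the a elements and turning them into absolute ones; the variable name @page_urls is mine, not necessarily what the actual program uses:

my @page_urls = map {
    # prepend the host to the relative href of each matched a element
    "http://$HOST" . $_->attr( 'href' )
} @a_elements;

Each of these URLs can then be fetched and parsed with the get_tree function shown earlier.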