Need a web scraping or data wrangling solution? Check out my resume (PDF).
John Bokma's Hacking & Hiking

Driving a Perl script via find

March 21, 2019

A week ago I wrote a quick hack to parse all the HTML pages containing blog posts of my old blog, find the first h1 element and print the text contained within along with the filename to stdout. Next, I used search and replace in Emacs to turn the filenames into URLs.

I wrote this hack to create a file with tweets suitable for the tweetfile.pl Perl program I finished the 8th of March, which picks a random tweet from such a file and posts it to Twitter.

I used find to execute the short Perl script once for each file found. Of course starting perl for each file found slows things down considerably. So, today I decided to write a Perl program that accepts multiple filenames and doesn't need search and replace to create URLs to my blog.

Building the find command

My old blog is currently located in a directory named mexit, so finding all HTML files becomes:

find mexit -type f -name '*.html'

Since I use year/month/day directories I have to restrict find to the depth of "day" to avoid including overview pages. Counting mexit this gives a value of 4:

find mexit -type f -mindepth 4 -maxdepth 4 -name '*.html'

I have comments posted by visitors on separate pages. If I have multiple blog posts on a single day there is a directory for comment pages. This directory is at depth 5 and hence skipped with -maxdepth 4. But if there is only one post on a given date there is a single comments.html page at depth 4. So I wanted all HTML files at level 4 not named comments.html:

find mexit -type f -mindepth 4 -maxdepth 4 -name '*.html' \
     -not -name 'comments.html'

Using the above the complete command for my set up becomes:

find mexit -type f-mindepth 4 -maxdepth 4 -name '*.html' \
     -not -name 'comments.html'                          \
     -exec perl ../../scripts/mktweets.pl                \
        --base-uri http://johnbokma.com/                 \
        --input-encoding ISO-8859-1 {} \+

The options given to the mktweets.pl program are described further down this page.

To show the difference between ending using \+ versus \; this is the output of time followed by the above find finding 1060 files:

real    0m1.632s
user    0m0.107s
sys     0m0.693s

And this is the result for ending the find ending with \;:

real    0m35.792s
user    0m23.664s
sys     0m9.043s

The overhead of calling perl 1060 times is significant.

Obtaining the text within an element

I used the HTML::Parse module to obtain the text contained within the first HTML element specified. In order to do so I had to set up three handlers:

The Perl code I wrote for the above recipe is as follows:

my $in_element = 0;
my $p = HTML::Parser->new( api_version => 3 );
$p->handler(
    start => sub {
        my $current_element = shift;
        $current_element eq $element or return;
        $in_element = 1;
        return;
    },
    'tagname'
);
$p->handler(
    text => sub {
        my $text = shift;
        print $text if $in_element;
        return;
    },
    'dtext'
);
$p->handler (
    end => sub {
        my ( $self, $current_element ) = @_;
        $current_element eq $element or return;
        $in_element = 0;
        $self->eof;
        return;
    },
    'self, tagname'
);

Because the actual handlers are very small, anonymous functions are used to handle each state.

Creating a link

In order to create a link we need the base URI given as an argument to the script. I used the Getopt::Long module to read the argument and store it in $base_uri. Inside the loop I create an absolute URI using the base URI and the filename reported by find, which are available in @ARGV. I replace /index.html or /index.htm with a single /.

my $base_uri;
GetOptions(
    'base-uri=s' => \$base_uri,
);

# check if base_uri is actually given as an argument omitted for brevity

for my $filename ( @ARGV ) {
    my $uri = URI->new_abs( $filename, $base_uri );
    $uri =~ s{/index\.html?$}{/};
    print "$uri\n";
}

Input encoding

While testing my script the text "Jo Nesbø" triggered an error (twice):

utf8 "\xF8" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl5/5.22/HTML
/Parser.pm line 95.

Then I recalled that I used ISO-8859-1 encoding for the old parts of my site. And hence I had to add an option to specify the encoding of the input HTML files to the script and open each file explicitly using this encoding:

my $input_encoding = 'UTF-8';
GetOptions(
    'input-encoding=s' => \$input_encoding,
);

my $p = HTML::Parser->new( api_version => 3 );
# setting up of the handlers omitted for brevity

for my $filename ( @ARGV ) {
    open my $fh, "<:encoding($input_encoding)", $filename
        or die "Can't open file '$filename' for reading: $!";
    $p->parse_file( $fh );
}

The complete Perl program

The complete Perl program, which I named mktweets.pl, is as follows:

#!/usr/bin/perl
#
# (c) John Bokma, 2019
#
# This program is free software; you can redistribute it and/or modify it
# under the same terms as Perl itself.

use strict;
use warnings;

use URI;
use Getopt::Long;
use HTML::Parser ();

my $base_uri;
my $element = 'h1';
my $input_encoding = 'UTF-8';
my $help;

GetOptions(
    'base-uri=s'       => \$base_uri,
    'element=s'        => \$element,
    'input-encoding=s' => \$input_encoding,
    'help'             => \$help,
);

if ( $help ) {
    show_help();
    exit;
}

if ( !defined $base_uri ) {
    warn "Error: a --base-uri must be given\n\n";
    show_help();
    exit( 1 );
}

my $in_element = 0;
my $p = HTML::Parser->new( api_version => 3 );
$p->handler(
    start => sub {
        my $current_element = shift;
        $current_element eq $element or return;
        $in_element = 1;
        return;
    },
    'tagname'
);
$p->handler(
    text => sub {
        my $text = shift;
        print $text if $in_element;
        return;
    },
    'dtext'
);
$p->handler (
    end => sub {
        my ( $self, $current_element ) = @_;
        $current_element eq $element or return;
        $in_element = 0;
        $self->eof;
        return;
    },
    'self, tagname'
);

binmode STDOUT, ':encoding(UTF-8)';
for my $filename ( @ARGV ) {
    open my $fh, "<:encoding($input_encoding)", $filename
        or die "Can't open file '$filename' for reading: $!";
    $p->parse_file( $fh );

    my $uri = URI->new_abs( $filename, $base_uri );
    $uri =~ s{/index\.html?$}{/};
    print "\n\n$uri\n%\n";
}

sub show_help {

    print <<'END_HELP';
NAME
        mktweets.pl - Parses HTML files and prints tweets

SYNOPSIS
        mktweets.pl --base_uri http://example.com/ [--element <element>]
            [--input-encoding] <encoding> <file-1> ... <file-n>

DESCRIPTION
        Parses all files given and prints tweets in a format suitable for
        tweetfile.pl.

        The base URI must be specified via the --base-uri argument.

        If the input encoding differs from UTF-8 it must be specified
        via the --input-encoding argument.

        If text must be captured from a different element than h1 it
        can be specified via the --element argument.

        The --help option shows this information.
END_HELP
}

Related