Driving a Perl script via find command
March 21, 2019
A week ago I wrote a quick hack to parse all the HTML pages containing blog posts of my old blog, find the first h1
element and print the text contained within along with the filename to stdout
. Next, I used search and replace in Emacs to turn the filenames into URLs.
I wrote this hack to create a file with tweets suitable for the tweetfile.pl Perl program I finished the 8th of March, which picks a random tweet from such a file and posts it to Twitter.
I used find
to execute the short Perl script once for each file found. Of course starting perl
for each file found slows things down considerably. So, today I decided to write a Perl program that accepts multiple filenames and doesn't need search and replace to create URLs to my blog.
Building the find command
My old blog is currently located in a directory named mexit
, so finding all HTML files becomes:
find mexit -type f -name '*.html'
Since I use year/month/day directories I have to restrict find
to the depth of "day" to avoid including overview pages. Counting mexit
this gives a value of 4:
find mexit -type f -mindepth 4 -maxdepth 4 -name '*.html'
I have comments posted by visitors on separate pages. If I have multiple blog posts on a single day there is a directory for comment pages. This directory is at depth 5 and hence skipped with -maxdepth 4
. But if there is only one post on a given date there is a single comments.html
page at depth 4. So I wanted
all HTML files at level 4 not named comments.html
:
find mexit -type f -mindepth 4 -maxdepth 4 -name '*.html' \
-not -name 'comments.html'
Using the above the complete command for my set up becomes:
find mexit -type f-mindepth 4 -maxdepth 4 -name '*.html' \
-not -name 'comments.html' \
-exec perl ../../scripts/mktweets.pl \
--base-uri http://johnbokma.com/ \
--input-encoding ISO-8859-1 {} \+
The options given to the mktweets.pl
program are described further down this page.
To show the difference between ending using \+
versus \;
this is the output of time
followed by the above find
finding 1060 files:
real 0m1.632s
user 0m0.107s
sys 0m0.693s
And this is the result for ending the find ending with \;
:
real 0m35.792s
user 0m23.664s
sys 0m9.043s
The overhead of calling perl
1060 times is significant.
Obtaining the text within an element
I used the HTML::Parse
module to obtain the text contained within the first HTML element specified. In order to do so I had to set up three handlers:
- the start handler, which as soon as the start of the element specified is encountered sets a flag to true.
- the text handler, which starts printing text found when the flag is true
- the end handler, which sets the flag to false (for the next round) if the end of the specified element is encountered and signals that the parsing process should end.
The Perl code I wrote for the above recipe is as follows:
my $in_element = 0;
my $p = HTML::Parser->new( api_version => 3 );
$p->handler(
start => sub {
my $current_element = shift;
$current_element eq $element or return;
$in_element = 1;
return;
},
'tagname'
);
$p->handler(
text => sub {
my $text = shift;
print $text if $in_element;
return;
},
'dtext'
);
$p->handler (
end => sub {
my ( $self, $current_element ) = @_;
$current_element eq $element or return;
$in_element = 0;
$self->eof;
return;
},
'self, tagname'
);
Because the actual handlers are very small, anonymous functions are used to handle each state.
Creating a link
In order to create a link we need the base URI given as an argument to the script. I used the Getopt::Long
module to read the argument and store it in $base_uri
. Inside the loop I create an absolute URI using the base URI and the filename reported by find
, which are available in @ARGV
. I replace /index.html
or /index.htm
with a single /
.
my $base_uri;
GetOptions(
'base-uri=s' => \$base_uri,
);
# check if base_uri is actually given as an argument omitted for brevity
for my $filename ( @ARGV ) {
my $uri = URI->new_abs( $filename, $base_uri );
$uri =~ s{/index\.html?$}{/};
print "$uri\n";
}
Input encoding
While testing my script the text "Jo Nesbø" triggered an error (twice):
utf8 "\xF8" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl5/5.22/HTML
/Parser.pm line 95.
Then I recalled that I used ISO-8859-1 encoding for the old parts of my site. And hence I had to add an option to specify the encoding of the input HTML files to the script and open each file explicitly using this encoding:
my $input_encoding = 'UTF-8';
GetOptions(
'input-encoding=s' => \$input_encoding,
);
my $p = HTML::Parser->new( api_version => 3 );
# setting up of the handlers omitted for brevity
for my $filename ( @ARGV ) {
open my $fh, "<:encoding($input_encoding)", $filename
or die "Can't open file '$filename' for reading: $!";
$p->parse_file( $fh );
}
The complete Perl program
The complete Perl program, which I named mktweets.pl
, is as follows:
#!/usr/bin/perl
#
# (c) John Bokma, 2019
#
# This program is free software; you can redistribute it and/or modify it
# under the same terms as Perl itself.
use strict;
use warnings;
use URI;
use Getopt::Long;
use HTML::Parser ();
my $base_uri;
my $element = 'h1';
my $input_encoding = 'UTF-8';
my $help;
GetOptions(
'base-uri=s' => \$base_uri,
'element=s' => \$element,
'input-encoding=s' => \$input_encoding,
'help' => \$help,
);
if ( $help ) {
show_help();
exit;
}
if ( !defined $base_uri ) {
warn "Error: a --base-uri must be given\n\n";
show_help();
exit( 1 );
}
my $in_element = 0;
my $p = HTML::Parser->new( api_version => 3 );
$p->handler(
start => sub {
my $current_element = shift;
$current_element eq $element or return;
$in_element = 1;
return;
},
'tagname'
);
$p->handler(
text => sub {
my $text = shift;
print $text if $in_element;
return;
},
'dtext'
);
$p->handler (
end => sub {
my ( $self, $current_element ) = @_;
$current_element eq $element or return;
$in_element = 0;
$self->eof;
return;
},
'self, tagname'
);
binmode STDOUT, ':encoding(UTF-8)';
for my $filename ( @ARGV ) {
open my $fh, "<:encoding($input_encoding)", $filename
or die "Can't open file '$filename' for reading: $!";
$p->parse_file( $fh );
my $uri = URI->new_abs( $filename, $base_uri );
$uri =~ s{/index\.html?$}{/};
print "\n\n$uri\n%\n";
}
sub show_help {
print <<'END_HELP';
NAME
mktweets.pl - Parses HTML files and prints tweets
SYNOPSIS
mktweets.pl --base_uri http://example.com/ [--element <element>]
[--input-encoding] <encoding> <file-1> ... <file-n>
DESCRIPTION
Parses all files given and prints tweets in a format suitable for
tweetfile.pl.
The base URI must be specified via the --base-uri argument.
If the input encoding differs from UTF-8 it must be specified
via the --input-encoding argument.
If text must be captured from a different element than h1 it
can be specified via the --element argument.
The --help option shows this information.
END_HELP
}