Today I updated my RSS feed construction Perl program. Originally I used the title of each HTML page for the description. Way too short of course, and not really attractive to readers of my RSS feed.
So I first replaced the regular expressions that obtained the text of the title element and the text of the first heading (h1 element) with code using HTML::TreeBuilder, a Perl module I love because it makes tasks like locating HTML elements and getting the text contained within those elements very easy.
use File::Slurp;          # provides read_file
use HTML::TreeBuilder;

my $file = read_file( $filename );
my $tree = HTML::TreeBuilder->new_from_content( $file );
# RSS item title: text of the first h1 element, if any
my $h1_elt = $tree->look_down( _tag => 'h1' );
$title = defined $h1_elt ? $h1_elt->as_text : undef;
# Description, for now: text of the title element, if any
my $title_elt = $tree->look_down( _tag => 'title' );
my $long_title = defined $title_elt ? $title_elt->as_text : undef;
$desc = "$long_title [...]" if defined $long_title;
The read_file function is provided by the File::Slurp Perl module, which I strongly recommend
if you need to read an entire file into a scalar or array (slurping). The look_down
method returns the first matching element when it is called in scalar context, that is, when the result is assigned to a scalar.
The RSS item title is extracted from the text contained in the first h1 element found on an HTML page belonging to my site. The text contained in the title element of the same HTML page is used to provide the description, but only for now, because I added some extra Perl code after the above snippet:
my @p_elts = $tree->look_down( _tag => 'p' );
if ( @p_elts ) {
    my $par = $p_elts[ 0 ]->as_trimmed_text;
    if ( $par =~ /^[^.!?]+[.!?]$/ and @p_elts > 1 ) {
        $par = $p_elts[ 1 ]->as_trimmed_text;
    }
    $par =~ s/\s+/ /g;
    $par =~ s/(^.{1,300}\S+\s).*/$1\[...\]/;
    $desc = $par;
}
$tree->delete;
The above Perl snippet first obtains all paragraph (p) elements, if available, in the HTML page.
If there are one or more paragraphs, the text inside the first p element is obtained, with
leading and trailing whitespace removed (as_trimmed_text method). Next a
simple test checks if the first paragraph is just one sentence (one or more
non-end-of-sentence characters followed by one end-of-sentence character) and if there is at least
one more p element. If this is the case the text inside the second p element is used instead (again trimmed).
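To make the single-sentence test a bit more concrete, here is a small standalone sketch that applies the same regular expression to a few made-up strings. The is_single_sentence helper and the sample texts are only for illustration; they are not part of my feed program.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: true if the text is one run of
# non-terminator characters followed by exactly one ., ! or ? at the end.
sub is_single_sentence {
    my ( $text ) = @_;
    return $text =~ /^[^.!?]+[.!?]$/ ? 1 : 0;
}

for my $text (
    'Welcome to my site.',
    'First sentence. Second sentence.',
    'No terminator at all',
) {
    printf "%-35s => %s\n", $text,
        is_single_sentence( $text ) ? 'single sentence' : 'not a single sentence';
}
```

Only the first sample counts as a single sentence. Note that the test is deliberately simple: a sentence containing an abbreviation such as "e.g." has a period in the middle and so is not recognized as a single sentence.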
Finally, some cleanup is done: sequences of arbitrary whitespace are compressed into a single space, and the paragraph length is reduced to roughly 300 characters followed by [...] without cutting a word in half. As always, feedback is welcome.
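One thing to keep in mind about the shortening step: the substitution also matches paragraphs that are shorter than 300 characters, so for example 'one two three' becomes 'one two [...]'. Below is a minimal sketch of a variant that leaves short text untouched. The shorten helper and its $limit parameter are hypothetical names used for this example, not code from my program.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original program: shorten $text to
# roughly $limit characters at a word boundary and append [...], but return
# text that already fits unchanged.
sub shorten {
    my ( $text, $limit ) = @_;
    $text =~ s/\s+/ /g;                       # compress whitespace, as in the snippet
    return $text if length $text <= $limit;   # short enough: nothing to cut
    # Keep up to $limit characters, finish the word in progress, then cut.
    $text =~ s/^(.{1,$limit}\S*)\s.*/$1 \[...\]/;
    return $text;
}

print shorten( 'one two three', 300 ), "\n";     # prints "one two three"
print shorten( 'one two three four', 6 ), "\n";  # prints "one two [...]"
```

The length check up front is the only real difference from the substitution above; whether dropping the last word of a short paragraph matters is a matter of taste.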
Note: $title and $desc are used outside the given code snippets, hence the lack of my.