John Bokma MexIT
freelance Perl programmer

RSS feed update

Tuesday, November 21, 2006

Today I updated the Perl program that builds my RSS feed. Originally I used the title of each HTML page as the description. Way too short of course, and not really attractive to readers of my RSS feed.

So I first replaced the regular expressions that obtained the text of the title element and the text of the first heading (h1 element) with code using HTML::TreeBuilder, a Perl module I love because it makes tasks like locating HTML elements and getting the text contained within those elements very easy.

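use File::Slurp;        # provides read_file
use HTML::TreeBuilder;

# Slurp the HTML file and parse it into an element tree
my $file = read_file( $filename );
my $tree = HTML::TreeBuilder->new_from_content( $file );

# RSS item title: text of the first h1 element, if any
my $h1_elt = $tree->look_down( _tag => 'h1' );
$title = defined $h1_elt ? $h1_elt->as_text : undef;

# Fallback description: text of the title element, if any
my $title_elt = $tree->look_down( _tag => 'title' );
my $long_title = defined $title_elt ? $title_elt->as_text : undef;
$desc = "$long_title [...]" if defined $long_title;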

The read_file function is provided by the File::Slurp Perl module, which I strongly recommend if you need to read an entire file into a scalar or array (slurping). The look_down method returns only the first matching element when called in scalar context, that is, when the result is assigned to a scalar. In list context it returns all matching elements.
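To make the context behavior of look_down concrete, here is a minimal sketch. The HTML string is a made-up sample, not a page from my site, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, used only for illustration.
my $html = '<html><head><title>Sample page title</title></head>'
         . '<body><h1>Heading</h1><p>First.</p><p>Second.</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content( $html );

# Scalar context: only the first matching element is returned.
my $first_p = $tree->look_down( _tag => 'p' );

# List context: all matching elements are returned.
my @all_p = $tree->look_down( _tag => 'p' );

# Copy out what we need before the tree is deleted.
my $first_text = $first_p->as_text;
my $count      = scalar @all_p;

print "first paragraph: $first_text\n";
print "number of paragraphs: $count\n";

# Free the memory used by the parse tree.
$tree->delete;
```

The text and the count are copied into plain scalars before the tree is deleted, because the element objects should not be used after that call.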

The RSS item title is extracted from the text contained in the first h1 element found on an HTML page belonging to my site. The text contained in the title element of the same HTML page now only serves as a fallback description, because I added some extra Perl code after the above snippet that replaces it whenever the page contains paragraphs:

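# Prefer paragraph text over the title as description
my @p_elts = $tree->look_down( _tag => 'p' );
if ( @p_elts ) {

    my $par = $p_elts[ 0 ]->as_trimmed_text;
    # If the first paragraph is a single sentence, use the second one
    if ( $par =~ /^[^.!?]+[.!?]$/ and @p_elts > 1 ) {

        $par = $p_elts[ 1 ]->as_trimmed_text;
    }

    # Collapse whitespace, then truncate near 300 characters at a word boundary
    $par =~ s/\s+/ /g;
    $par =~ s/(^.{1,300}\S+\s).*/$1\[...\]/;
    $desc = $par;
}

# Free the memory used by the parse tree
$tree->delete;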

The above Perl snippet first obtains all paragraph (p) elements, if available, in the HTML page. If there are one or more paragraphs, the text inside the first p element is obtained, with leading and trailing whitespace removed (as_trimmed_text method). Next a simple test checks if the first paragraph is just one sentence (one or more non-end-of-sentence characters followed by one end-of-sentence character) and if there is at least one more p element. If this is the case the text inside the second p element is used instead (again trimmed).
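The single sentence test can be tried on its own with a small sketch like the following. The helper name is_single_sentence and the sample strings are made up for this illustration; only the pattern comes from the snippet.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the snippet:
# one or more characters that are not . ! or ?, followed by
# exactly one of those characters at the end of the text.
sub is_single_sentence {
    my ( $text ) = @_;
    return $text =~ /^[^.!?]+[.!?]$/ ? 1 : 0;
}

# Made-up sample paragraphs.
for my $text (
    'Welcome to my site.',
    'First sentence. Second sentence.',
    'No end punctuation',
) {
    printf "%-35s => %s\n", $text,
        is_single_sentence( $text ) ? 'single sentence' : 'not a single sentence';
}
```

Because the pattern does not allow a period, exclamation mark, or question mark anywhere before the final character, a one-sentence paragraph containing an abbreviation such as "e.g." or a number such as "1.0" is not recognized as a single sentence.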

Finally, some cleanup is done: sequences of arbitrary whitespace are compressed into a single space, and the paragraph is cut off after roughly 300 characters, at the end of the word that crosses that limit so that no word is chopped in half, followed by [...]. As always, feedback is welcome.
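The cleanup step can be sketched in isolation as follows. The helper name shorten and its $limit parameter are assumptions made for this illustration (the snippet itself hard-codes 300), and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: the same two substitutions as in the snippet,
# with the hard-coded 300 replaced by a $limit parameter.
sub shorten {
    my ( $text, $limit ) = @_;

    # Compress each run of whitespace into a single space.
    $text =~ s/\s+/ /g;

    # Keep up to $limit characters plus the rest of the current word
    # and the whitespace after it, then replace the remainder by [...].
    $text =~ s/(^.{1,${limit}}\S+\s).*/$1\[...\]/;

    return $text;
}

# A small limit keeps the example readable.
my $long = shorten( "The quick   brown fox\njumps over the lazy dog", 10 );
print "$long\n";

# Text well below the limit.
my $short = shorten( 'Hello world foo.', 300 );
print "$short\n";
```

Running this sketch shows that the substitution also matches text that is shorter than the limit, as long as it contains whitespace after the first word: the last word is dropped and [...] is appended. One possible way to avoid that, if it is not wanted, is to apply the second substitution only when length( $text ) exceeds the limit.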

Note: $title and $desc are used outside the given code snippets, hence the lack of my.
