John Bokma's Hacking & Hiking

Perl Time::Piece Unicode Issue

June 3, 2021

On the 25th of November 2019, while working on tumblelog, I ran into an issue with the Perl module Time::Piece when generating month names. The result I expected was a Unicode month name. What I got was Mojibake.

To demonstrate this issue consider the following script, which I named tpbug.pl:

#!/usr/bin/perl

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

use Time::Piece;

for my $month ( 1..12 ) {
    my $date = sprintf '2021-%02d-01', $month;
    my $tp = Time::Piece->strptime( $date, '%Y-%m-%d' );
    print $tp->strftime( '%B' ), ' ';
}
print "\n";

This generates a list of the 12 month names. If we change the language using the LANG environment variable, for example to Russian, the bug rears its ugly head:

$ LANG=ru_RU.UTF-8 perl tpbug.pl
(Screenshot: Time::Piece bug 97539 Mojibake.)

This bug was reported as Bug #97539 back in 2014 (not by me).

To fix this issue in tumblelog I wrote a small helper function:

sub decode_utf8 {
    # UTF-8 decoding for the Time::Piece strftime method, see bug #97539
    # https://rt.cpan.org/Public/Bug/Display.html?id=97539
    return decode( 'UTF-8', shift, Encode::FB_CROAK )
}

The helper function decodes the octets returned by Time::Piece, assuming they are UTF-8 data (and croaking if they are not), into a string in Perl's internal format. Since the open pragma was used to set stdout to UTF-8, this gives the desired result: month names printed on the terminal in UTF-8.
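To make the difference between octets and characters concrete, here is a minimal sketch that does not depend on the locale. The hard-coded octets are my stand-in for what strftime( '%B' ) hands back for May under ru_RU.UTF-8, and the explicit encode call only imitates what the :encoding(UTF-8) layer does on output. The variable names and the LEAVE_SRC flag are additions for this illustration; the flag keeps decode from emptying the source variable, which it otherwise does when a check value such as FB_CROAK is passed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed stand-in for the strftime( '%B' ) result: the UTF-8 octets of
# the Russian word for May, not a decoded character string.
my $octets = "\xD0\xBC\xD0\xB0\xD1\x8F";

# Printed as-is through an :encoding(UTF-8) layer, every octet is treated
# as a character of its own and encoded a second time: Mojibake.
my $double_encoded = encode( 'UTF-8', $octets );

# Decoding first yields the characters the locale intended. LEAVE_SRC
# keeps $octets intact so its length can still be reported below.
my $chars = decode( 'UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC );

printf "octets: %d, characters after decode: %d, octets if double encoded: %d\n",
    length $octets, length $chars, length $double_encoded;
```

The character count drops to half the octet count after decoding, while skipping the decode step doubles the number of octets that reach the terminal.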

The updated and working Perl script becomes:

#!/usr/bin/perl

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

use Encode 'decode';
use Time::Piece;

for my $month ( 1..12 ) {
    my $date = sprintf '2021-%02d-01', $month;
    my $tp = Time::Piece->strptime( $date, '%Y-%m-%d' );
    print decode_utf8( $tp->strftime( '%B' ) ), ' ';
}
print "\n";

sub decode_utf8 {
    # UTF-8 decoding for the Time::Piece strftime method, see bug #97539
    # https://rt.cpan.org/Public/Bug/Display.html?id=97539
    return decode( 'UTF-8', shift, Encode::FB_CROAK )
}

Running this script, which I named tpok.pl, results in the expected output:

$ LANG=ru_RU.UTF-8 perl tpok.pl 
января февраля марта апреля мая июня июля августа сентября октября ноября декабря

Edit: in a discussion on Hacker News, Sinan Unur (nanis) recommends the Perl module Unicode::UTF8 over Encode with the hand-rolled helper.
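I have not benchmarked or compared the two, but a sketch of what that swap might look like follows. It assumes Unicode::UTF8 is installed from CPAN (it is not a core module) and that strftime still returns undecoded octets on your version of Time::Piece. The month_names function is a hypothetical wrapper I added so the loop is easy to reuse; decode_utf8 here is the function exported by Unicode::UTF8 and takes the place of the helper of the same name.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

use Time::Piece;
use Unicode::UTF8 'decode_utf8';    # CPAN module, assumed to be installed

# Hypothetical wrapper: returns the 12 month names for the current
# locale as decoded character strings.
sub month_names {
    my @names;
    for my $month ( 1..12 ) {
        my $date = sprintf '2021-%02d-01', $month;
        my $tp = Time::Piece->strptime( $date, '%Y-%m-%d' );
        # Decode the octets returned by strftime, see bug #97539
        push @names, decode_utf8( $tp->strftime( '%B' ) );
    }
    return @names;
}

print join( ' ', month_names() ), "\n";
```

Unlike the Encode helper with FB_CROAK, decode_utf8 from Unicode::UTF8 by default warns on malformed input and substitutes the replacement character, so the two are not strictly equivalent in how they handle bad data.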
