Perl Time::Piece Unicode Issue
June 3, 2021
On the 25th of November 2019, while working on tumblelog, I ran into an issue with the Perl module Time::Piece when generating month names. The result I expected was a Unicode month name. What I got was mojibake.
To demonstrate this issue, consider the following script, which I named tpbug.pl:
#!/usr/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use Time::Piece;

for my $month ( 1..12 ) {
    my $date = sprintf '2021-%02d-01', $month;
    my $tp = Time::Piece->strptime( $date, '%Y-%m-%d' );
    print $tp->strftime( '%B' ), ' ';
}
print "\n";
This generates a list of the 12 month names. If we change the language using the LANG environment variable, for example to Russian, the bug rears its ugly head:
$ LANG=ru_RU.UTF-8 perl tpbug.pl
This bug was reported as Bug #97539 back in 2014 (not by me).
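The mojibake is consistent with double encoding: strftime appears to return the UTF-8 octets of the localized month name, and the :encoding(UTF-8) layer on STDOUT then encodes each of those octets again as if it were a character. Below is a minimal sketch of that effect, assuming this is indeed the mechanism. The month name is hard-coded as a stand-in for the strftime result so the snippet does not depend on an installed Russian locale.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # this file contains a UTF-8 string literal
use Encode qw( encode );

# Stand-in for a localized month name as a character string.
my $name = 'мая';

# Stand-in for what strftime hands back: the UTF-8 octets of that name.
my $octets = encode( 'UTF-8', $name );

# Roughly what an :encoding(UTF-8) output layer does with those octets:
# each one is treated as a code point and encoded once more, so every
# octet >= 0x80 grows to two bytes.
my $twice = encode( 'UTF-8', $octets );

printf "characters: %d, octets: %d, after second encoding: %d\n",
    length $name, length $octets, length $twice;
```

The three Cyrillic characters take six octets in UTF-8, and the second encoding pass turns those into twelve bytes, which a terminal then renders as garbage.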
To fix this issue in tumblelog I wrote a small helper function:
sub decode_utf8 {
    # Decode the UTF-8 octets returned by the Time::Piece strftime method,
    # see bug #97539
    # https://rt.cpan.org/Public/Bug/Display.html?id=97539
    return decode( 'UTF-8', shift, Encode::FB_CROAK );
}
The helper function converts the octets generated by Time::Piece, assuming they are UTF-8 data (and croaking if not), into a string in Perl's internal format. Since the open pragma was used to set stdout to UTF-8, this should give the desired result: month names printed on the terminal in UTF-8.
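To make the octets-versus-characters distinction and the croaking behaviour concrete, here is a small sketch using only the Encode module. The hard-coded month name and the invalid byte "\xFF" are illustrative inputs chosen for this example, not data taken from Time::Piece.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # this file contains a UTF-8 string literal
use Encode qw( decode encode );

# Stand-in for strftime output: UTF-8 octets, not characters.
my $octets      = encode( 'UTF-8', 'мая' );
my $octet_count = length $octets;   # measured first: decode may modify its
                                    # input when a CHECK value is given

# Decoding turns the octets into a Perl character string.
my $chars = decode( 'UTF-8', $octets, Encode::FB_CROAK );
printf "octets: %d, characters: %d\n", $octet_count, length $chars;

# The byte 0xFF never occurs in well-formed UTF-8, so FB_CROAK makes
# decode die instead of silently substituting a replacement character.
my $bad = "\xFF";
my $ok  = eval { decode( 'UTF-8', $bad, Encode::FB_CROAK ); 1 };
print 'invalid input: ', ( $ok ? 'decoded' : 'croaked' ), "\n";
```

Croaking is a deliberate choice here: if the locale ever produced something other than UTF-8, failing loudly is arguably better than printing silently damaged month names.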
The updated and working Perl script becomes:
#!/usr/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use Encode 'decode';
use Time::Piece;

for my $month ( 1..12 ) {
    my $date = sprintf '2021-%02d-01', $month;
    my $tp = Time::Piece->strptime( $date, '%Y-%m-%d' );
    print decode_utf8( $tp->strftime( '%B' ) ), ' ';
}
print "\n";

sub decode_utf8 {
    # Decode the UTF-8 octets returned by the Time::Piece strftime method,
    # see bug #97539
    # https://rt.cpan.org/Public/Bug/Display.html?id=97539
    return decode( 'UTF-8', shift, Encode::FB_CROAK );
}
Running this script, which I named tpok.pl, results in the expected output:
$ LANG=ru_RU.UTF-8 perl tpok.pl
января февраля марта апреля мая июня июля августа сентября октября ноября декабря
Edit: in a discussion on Hacker News, Sinan Unur (nanis) recommends the Perl module Unicode::UTF8 over Encode with the hand-rolled helper.
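For comparison, a variant of the script along those lines might look like the sketch below. It assumes Unicode::UTF8 is installed from CPAN (it is not a core module) and uses the module's exported decode_utf8 function in place of a hand-written helper. As far as I can tell from its documentation, that function warns and substitutes a replacement character on ill-formed input by default rather than croaking, so the error handling is not identical to the Encode::FB_CROAK version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use Time::Piece;
use Unicode::UTF8 qw( decode_utf8 );   # CPAN module, not part of core Perl

for my $month ( 1..12 ) {
    my $date = sprintf '2021-%02d-01', $month;
    my $tp = Time::Piece->strptime( $date, '%Y-%m-%d' );
    # Decode the octets from strftime into characters before printing.
    print decode_utf8( $tp->strftime( '%B' ) ), ' ';
}
print "\n";
```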
Related
- Time::Piece module documentation.
- Encode module documentation.