When a Perl program I had written had some encoding issues with spreadsheet data a customer fed to it, I asked if he could email me the Excel spreadsheet so I could have a look.
After a quick examination I noticed that some cells had their data encoded as UCS2. So I decided to use the excellent Encode module to convert the data to UTF-8. At the top of the Perl program I added:
use Encode;
And to the loop that processed all spreadsheet rows I added a check to see if a cell had a defined Code field, and if so, if the encoding equals 'ucs2'. In which case the cell value is converted from UCS-2BE to UTF-8.
my $cell = $cells->[ $row ][ $col ];
my $value = $cell->{ Val };
if ( defined $cell->{ Code } ) {
if ( $cell->{ Code } eq 'ucs2' ) {
$value = Encode::decode( 'UCS-2BE', $value );
}
else {
die "$row: unknown encoding: $cell->{ Code }";
}
}
In case an unknown encoding is encountered,
the program halts with an error message. Based
on the documentation of Spreadsheet::ParseExcel
the only other value besides undef
,
and ucs2
the character encoding
of a cell can have is _native_
,
which is unlikely to occur in this case. And
since it's unlikely, I decided to handle this
as an exception using die
.