Double encoded UTF-8 strings in C#
This article shows how to convert a string that has been double encoded using UTF-8 back to its original value.
For example, say you have the string MÃ¼ller instead of the string Müller.
How did it happen?
The letter ü is encoded in UTF-8 as 2 bytes: 195 and 188.
If each of those bytes is treated as a character and encoded again, the 195 becomes 195 and 131, which is the UTF-8 encoding of Ã.
Likewise, the 188 becomes 194 and 188, which is the UTF-8 encoding of ¼.
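The byte values above can be reproduced with a short sketch. It assumes the second encoding step treated each UTF-8 byte as one ISO-8859-1 character, which is the usual way this happens; the class and method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text;

public static class DoubleEncodingDemo
{
    // Hypothetical helper: reproduces the double encoding described above.
    public static byte[] DoubleEncode(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // First encoding: "ü" becomes the bytes 195, 188.
        byte[] once = Encoding.UTF8.GetBytes(text);

        // Assumed mistake: each byte is read as one ISO-8859-1 character ("Ã¼").
        string misread = latin1.GetString(once);

        // Second encoding: every misread character is encoded as UTF-8 again.
        return Encoding.UTF8.GetBytes(misread);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Encoding.UTF8.GetBytes("ü"))); // 195, 188
        Console.WriteLine(string.Join(", ", DoubleEncode("ü")));           // 195, 131, 194, 188
    }
}
```

Only the byte values are printed, because how the characters themselves appear depends on the console's own encoding.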
How can it be converted back to what it should look like?
The following function converts the double encoded string back to the original value:
private string decodeUTF8String(String utf8Str)
{
    System.Text.Encoding iso = System.Text.Encoding.GetEncoding("ISO-8859-1");
    System.Text.Encoding utf8 = System.Text.Encoding.UTF8;
    // Bytes of the double encoded string, e.g. 195, 131, 194, 188 for ü.
    byte[] utfBytes = utf8.GetBytes(utf8Str);
    // Converting from UTF-8 to ISO-8859-1 undoes the second encoding,
    // leaving the original UTF-8 bytes (195, 188 for ü).
    byte[] isoBytes = System.Text.Encoding.Convert(utf8, iso, utfBytes);
    // Decode the remaining bytes as UTF-8 to get the intended text.
    return utf8.GetString(isoBytes);
}
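The same repair can be sketched more compactly, since encoding the string as ISO-8859-1 yields the original UTF-8 bytes directly. This is an equivalent sketch rather than the function above; the names Utf8Repair and DecodeDoubleEncoded are hypothetical, and the input is assumed to really be double encoded.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Hypothetical helper name; assumes the input really is double encoded.
    public static string DecodeDoubleEncoded(string doubleEncoded)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Each character of the garbled string maps back to one original UTF-8 byte.
        byte[] originalUtf8Bytes = latin1.GetBytes(doubleEncoded);

        // Decoding those bytes as UTF-8 gives the intended text.
        return Encoding.UTF8.GetString(originalUtf8Bytes);
    }

    public static void Main()
    {
        string garbled = "M\u00C3\u00BCller"; // "MÃ¼ller"
        Console.WriteLine(DecodeDoubleEncoded(garbled) == "M\u00FCller"); // True
    }
}
```

Passing a string that was never double encoded damages it (a correct ü is not a valid UTF-8 byte sequence on its own), so the repair should only be applied to data known to have been encoded twice.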
How does this relate to barcodes?
Some PDF-417 barcodes may contain data that has already been encoded using UTF-8. When we read the barcode we encode it again using UTF-8, giving a double encoded string. In the win32 DLL interface the work-around is simply to set the Encoding property to 0, but in the .NET interface the function above is necessary.