Double encoded UTF-8 strings in C#
This article shows how to convert a string that has been double encoded using UTF-8 back to its original value.
For example, say you have the string MÃ¼ller instead of the string Müller.
How did it happen?
The letter ü is encoded in UTF-8 as 2 bytes: 195 and 188.
If those bytes are misread as two single-byte characters and encoded again, the 195 (which is Ã in ISO-8859-1) becomes 195 and 131.
And the 188 (which is ¼ in ISO-8859-1) becomes 194 and 188.
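The byte values above can be reproduced with a short sketch. It assumes the second encoding step treated the UTF-8 bytes as ISO-8859-1 text; the class and method names (DoubleEncodingDemo, DoubleEncode) are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text;

public static class DoubleEncodingDemo
{
    // Simulates the mistake: UTF-8 bytes are misread as ISO-8859-1
    // characters, and the resulting string is encoded as UTF-8 again.
    // (Hypothetical helper, not part of any library.)
    public static byte[] DoubleEncode(string text)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");

        byte[] encodedOnce = Encoding.UTF8.GetBytes(text);  // ü -> 195, 188
        string misread = iso.GetString(encodedOnce);        // bytes read as Ã and ¼
        return Encoding.UTF8.GetBytes(misread);             // second UTF-8 pass
    }

    public static void Main()
    {
        // "\u00FC" is ü, written as an escape so the source file's own
        // encoding cannot affect the result.
        Console.WriteLine(string.Join(", ", DoubleEncode("\u00FC")));
        // prints: 195, 131, 194, 188
    }
}
```

Printing the byte values rather than the characters keeps the output independent of the console's code page.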
How can it be converted back to what it should look like?
The following function will convert the double encoded string back to the original value…
private string decodeUTF8String(string utf8Str)
{
    System.Text.Encoding iso = System.Text.Encoding.GetEncoding("ISO-8859-1");
    System.Text.Encoding utf8 = System.Text.Encoding.UTF8;

    // Get the UTF-8 bytes of the double encoded string.
    byte[] utfBytes = utf8.GetBytes(utf8Str);

    // Converting to ISO-8859-1 undoes the second encoding pass,
    // leaving the original UTF-8 bytes.
    byte[] isoBytes = System.Text.Encoding.Convert(utf8, iso, utfBytes);

    // Decode those bytes as UTF-8 to recover the original string.
    return utf8.GetString(isoBytes);
}
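The function above assumes its input really is double encoded. If it is given a correctly encoded string such as Müller, the ü ends up as a single byte that is not valid UTF-8 and gets replaced during decoding. The following is a minimal sketch of a more defensive variant, not the method used by any particular library: the names Utf8Repair and RepairIfDoubleEncoded are hypothetical, and it assumes the second encoding pass used ISO-8859-1.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    private static readonly Encoding Iso = Encoding.GetEncoding("ISO-8859-1");

    // Strict decoder: throws DecoderFallbackException on invalid UTF-8
    // instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns the repaired string when the input looks
    // double encoded, otherwise returns the input unchanged.
    public static string RepairIfDoubleEncoded(string input)
    {
        // A double encoded string (under the ISO-8859-1 assumption) contains
        // only characters that fit in one byte.
        foreach (char c in input)
        {
            if (c > '\u00FF') return input;
        }

        try
        {
            // Map each character back to its byte, then decode as UTF-8.
            return StrictUtf8.GetString(Iso.GetBytes(input));
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not valid UTF-8, so the input was not double encoded.
            return input;
        }
    }

    public static void Main()
    {
        string doubleEncoded = "M\u00C3\u00BCller"; // MÃ¼ller
        string correct = "M\u00FCller";             // Müller

        Console.WriteLine(RepairIfDoubleEncoded(doubleEncoded) == correct); // True
        Console.WriteLine(RepairIfDoubleEncoded(correct) == correct);       // True
    }
}
```

This check is a heuristic: a string that legitimately contains a sequence like Ã¼ would also be rewritten, so it is best applied only to data from a source known to double encode.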
How does this relate to barcodes?
Some PDF-417 barcodes may contain data that has already been encoded using UTF-8. When we read the barcode we encode it again using UTF-8, giving a double encoded string. In the Win32 DLL interface the workaround is simply to set the Encoding property to 0, but the function above is necessary in the .NET interface.