How to convert UTF-8 data into a String
If you have an array of UTF-8 bytes and want to convert them into a String then the following may help…
As you may know, UTF-8 is a way of encoding every character in the Unicode character set using a variable number of bytes per character. For example, the letter A needs just 1 byte, but the character 끰 (code point U+B070) requires 3 bytes (EB 81 B0).
In VB.Net this is pretty straightforward…
Start out with the 3 bytes in an array…
Dim bytes() As Byte = New Byte() {&HEB, &H81, &HB0}
Dim str As String = System.Text.Encoding.UTF8.GetString(bytes)
And now str contains 끰
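As a cross-check on the byte values for 끰 (U+B070), here is a rough C++ sketch of how a single code point maps to UTF-8 bytes. EncodeUtf8 is a hypothetical helper name used only for illustration; it is not part of any library mentioned here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as 1 to 4 UTF-8 bytes.
std::string EncodeUtf8(char32_t cp)
{
    // Surrogates and values above U+10FFFF cannot be encoded.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {                // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {      // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    // Every valid code point encodes to between 1 and 4 bytes.
    assert(out.size() >= 1 && out.size() <= 4);
    return out;
}
```

Calling EncodeUtf8(0x41) gives the single byte 41 for the letter A, while EncodeUtf8(0xB070) gives the three bytes EB 81 B0.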
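But what if you started out with an IntPtr to an unmanaged C-style string?
In that case you would need to marshal the data into a byte array and then do the above. The length has to be counted in bytes, by scanning for the terminating null, because the Length of the String returned by PtrToStringAnsi counts characters rather than bytes. The following function does this…
Public Function ConvertUTF8IntPtrtoString(ByVal ptr As System.IntPtr) As String
    If ptr = System.IntPtr.Zero Then Return Nothing
    ' Count the bytes up to (but not including) the terminating null.
    Dim length As Integer = 0
    While System.Runtime.InteropServices.Marshal.ReadByte(ptr, length) <> 0
        length += 1
    End While
    ' VB array bounds are inclusive, so (length - 1) gives exactly length elements.
    Dim utf8data(length - 1) As Byte
    System.Runtime.InteropServices.Marshal.Copy(ptr, utf8data, 0, length)
    Return System.Text.Encoding.UTF8.GetString(utf8data)
End Function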
And the following C++ function will do the same in MFC:
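int CSampleBarcodeReaderDlg::ConvertUTF8Value(LPCSTR in, CString &out)
{
    // First call: get the required buffer size in wide characters,
    // including the terminating null (because the input length is -1).
    int l = MultiByteToWideChar(CP_UTF8, 0, in, -1, NULL, 0);
    if (l == 0)
        return 0; // conversion failed
    wchar_t *str = new wchar_t[l];
    // Second call: perform the conversion into the buffer.
    int r = MultiByteToWideChar(CP_UTF8, 0, in, -1, str, l);
    if (r > 0)
        out = str;
    delete[] str; // memory from new[] must be released with delete[]
    return r;
}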
A frustrating twist on the above is when you have a representation of UTF-8 already in a String and would like to convert it to a normal String. There are probably smarter ways of doing this but here’s a take on it…
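In this example utf8 starts out as a string that happens to contain a representation of UTF-8 data. This is converted, character by character, to a byte array and then back to a String using UTF-8 encoding. Note that this only works when every character has a code of 255 or less; otherwise System.Convert.ToByte throws an OverflowException. In this case str ends up with the value ?.
Dim utf8 As String = "ç?³"
Dim ch() As Char = utf8.ToCharArray()
' VB array bounds are inclusive, so (ch.Length - 1) avoids a stray trailing zero byte.
Dim bytes(ch.Length - 1) As Byte
For i As Integer = 0 To (ch.Length - 1)
    bytes(i) = System.Convert.ToByte(ch(i))
Next
Dim str As String = System.Text.Encoding.UTF8.GetString(bytes)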
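The same idea can be sketched in portable C++, assuming that every character in the input carries exactly one byte value (0 to 255). CharsToBytes and DecodeUtf8 are hypothetical helper names, and the decoder is deliberately simplified: it checks lead and continuation bytes but does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow a string whose characters each hold one
// byte value back into the raw bytes they represent.
std::string CharsToBytes(const std::u16string& s)
{
    std::string bytes;
    bytes.reserve(s.size());
    for (char16_t ch : s) {
        if (ch > 0xFF)
            throw std::out_of_range("character does not fit in one byte");
        bytes += static_cast<char>(ch);
    }
    return bytes;
}

// Hypothetical helper: simplified UTF-8 decoder producing code points.
std::u32string DecodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F); // append the low 6 bits
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

For example, DecodeUtf8(CharsToBytes(u"\u00C3\u00A9")) turns the two characters Ã© into the single code point U+00E9 (é). On Windows, the decoding step could instead be handed to MultiByteToWideChar with CP_UTF8 once the bytes have been recovered.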