Accessing MATLAB's unicode strings from C
How can I access the underlying Unicode data of MATLAB strings through the MATLAB Engine or MEX C interfaces?
Here's an example. Let's put some Unicode characters in a UTF-8 encoded file test.txt, then read it in MATLAB as

    fid=fopen('test.txt','r','l','UTF-8'); s=fscanf(fid, '%s')
Now if I first call feature('DefaultCharacterSet', 'UTF-8') and then, from C, engEvalString(ep, "s"), the output I get is the text from the file as UTF-8. This proves that MATLAB stores it as Unicode internally. However, if I then call mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters are replaced with character code 26. What I need is access to the underlying Unicode data as a C string in any Unicode format (UTF-8, UTF-16, etc.), preserving the non-Latin-1 characters. Is this possible?
My platform is OS X, MATLAB R2012b.
Addendum: the documentation explicitly states that "[mxArrayToString()] supports multibyte encoded characters", yet it still only gives me a Latin-1 approximation of the original data.
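To make the symptom concrete, here is a minimal C sketch that reproduces the kind of lossy result described above. It is only an illustration of the observed behaviour, not MATLAB's actual implementation; the function name latin1_lossy is made up for this example, and the input is assumed to be 16-bit code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: squeeze 16-bit code units into Latin-1 bytes.
 * Anything above 0xFF cannot be represented and is replaced by the
 * SUB control character (code 26, 0x1A), which is the substitution
 * seen in the question. Returns how many characters were replaced. */
size_t latin1_lossy(const uint16_t *units, size_t n, unsigned char *out)
{
    size_t i, replaced = 0;
    assert(units != NULL && out != NULL);
    for (i = 0; i < n; ++i) {
        if (units[i] <= 0xFF) {
            out[i] = (unsigned char)units[i];   /* Latin-1 maps 1:1 to U+0000..U+00FF */
        } else {
            out[i] = 0x1A;                      /* information is lost here */
            ++replaced;
        }
    }
    return replaced;
}
```

For the code units 0x0041, 0x00C0, 0x6C34 this leaves the first two bytes intact and turns the third into 0x1A, so the original character cannot be recovered from the result.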
First, allow me to share a few references I found online:
According to the mxChar description,

    MATLAB stores characters as 2-byte Unicode characters on machines with multi-byte character sets

Still, the term MBCS is somewhat ambiguous to me. I think they meant UTF-16 in this context (although I'm not sure about surrogate pairs, which would probably make it UCS-2 instead).
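The UTF-16 versus UCS-2 distinction only matters for code points above U+FFFF. A rough sketch of standard UTF-16 encoding (plain C, independent of MATLAB; the name utf16_encode is mine) shows where the two diverge:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Writes 1 or 2 units into out[] and returns the count,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* reserved for surrogates */
    if (cp > 0x10FFFF) return 0;                 /* outside the Unicode range */
    if (cp < 0x10000) {                          /* BMP: one unit, identical to UCS-2 */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

All three characters used later in this answer are in the BMP, so they take one 16-bit unit each and the experiment cannot tell UTF-16 and UCS-2 apart. A code point such as U+1F600 would need the pair 0xD83D 0xDE00.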
Update: MathWorks has since changed the wording to:

    MATLAB uses 16-bit unsigned integer character encoding for Unicode characters.
The mxArrayToString page states that it does handle multibyte encoded characters (unlike mxGetString, which only handles single-byte encoding schemes). Unfortunately, there is no illustration of how to do this.
Finally, here is a thread on the MATLAB newsgroup which mentions a couple of undocumented functions related to this (you can find them yourself by loading the libmx.dll library into a tool like Dependency Walker on Windows).
Here's a little experiment I did in MEX:
my_func.c

    #include "mex.h"

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        char str_ascii[] = {0x41, 0x6D, 0x72, 0x6F, 0x00};   // {'A','m','r','o',0}
        char str_utf8[] = {
            0x41,                 // U+0041
            0xC3, 0x80,           // U+00C0
            0xE6, 0xB0, 0xB4,     // U+6C34
            0x00
        };
        char str_utf16_le[] = {
            0x41, 0x00,           // U+0041
            0xC0, 0x00,           // U+00C0
            0x34, 0x6C,           // U+6C34
            0x00, 0x00
        };

        plhs[0] = mxCreateString(str_ascii);
        plhs[1] = mxCreateString_UTF8(str_utf8);        // undocumented!
        plhs[2] = mxCreateString_UTF16(str_utf16_le);   // undocumented!
    }
I create three strings in C code, encoded as ASCII, UTF-8, and UTF-16LE respectively. I then pass them to MATLAB using the mxCreateString MEX function (and other undocumented versions of it).
I got the byte sequences by consulting the Fileformat.info website: A (U+0041), À (U+00C0), and 水 (U+6C34).
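If you would rather derive such byte sequences than look them up, standard UTF-8 encoding is short enough to write by hand. This is a sketch in plain C (the name utf8_encode is mine, nothing here comes from the MATLAB API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out[] and returns the count,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Feeding it U+0041, U+00C0 and U+6C34 yields 41, C3 80 and E6 B0 B4, which matches the str_utf8 array in my_func.c.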
Let's test the above function from inside MATLAB:
    %# call the MEX function
    [str_ascii, str_utf8, str_utf16_le] = my_func()

    %# MATLAB exposes the two strings in decoded form (Unicode code points)
    double(str_utf8)       %# decimal form: [65, 192, 27700]
    assert(isequal(str_utf8, str_utf16_le))

    %# convert them to bytes (in hex)
    b1 = unicode2native(str_utf8, 'UTF-8')
    b2 = unicode2native(str_utf16_le, 'UTF-16')
    cellstr(dec2hex(b1))'  %# {'41','C3','80','E6','B0','B4'}
    cellstr(dec2hex(b2))'  %# {'FF','FE','41','00','C0','00','34','6C'}
                           %# (note that the first two bytes are the BOM marker)

    %# show the string
    view_unicode_string(str_utf8)
I am making use of the embedded Java capability to view the strings:
    function view_unicode_string(str)
        %# create a Swing JLabel
        jlabel = javaObjectEDT('javax.swing.JLabel', str);
        font = java.awt.Font('Arial Unicode MS', java.awt.Font.PLAIN, 72);
        jlabel.setFont(font);
        jlabel.setHorizontalAlignment(javax.swing.SwingConstants.CENTER);

        %# place the Java component inside a MATLAB figure
        hfig = figure('Menubar','none');
        [~,jlabelHG] = javacomponent(jlabel, [], hfig);
        set(jlabelHG, 'Units','normalized', 'Position',[0 0 1 1])
    end
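About the BOM noted in the UTF-16 output: if you ever receive such bytes on the C side, you have to look at the first two of them before reading the rest. A minimal sketch in plain C follows; the function name is mine, and treating BOM-less input as little-endian is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert raw UTF-16 bytes into 16-bit code units, honouring an
 * optional byte-order mark (FF FE = little-endian, FE FF = big-endian).
 * Assumption: input without a BOM is treated as little-endian.
 * Returns the number of code units written (at most cap). */
size_t utf16_bytes_to_units(const uint8_t *bytes, size_t len,
                            uint16_t *out, size_t cap)
{
    size_t i = 0, n = 0;
    int little_endian = 1;
    assert(bytes != NULL && out != NULL);
    if (len >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            i = 2;                          /* skip little-endian BOM */
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            little_endian = 0;
            i = 2;                          /* skip big-endian BOM */
        }
    }
    while (i + 1 < len && n < cap) {
        uint8_t b0 = bytes[i], b1 = bytes[i + 1];
        out[n++] = little_endian ? (uint16_t)(b0 | (b1 << 8))
                                 : (uint16_t)((b0 << 8) | b1);
        i += 2;
    }
    return n;
}
```

With the bytes FF FE 41 00 C0 00 34 6C this gives back the three code units 0x0041, 0x00C0 and 0x6C34, which are the same values MATLAB shows via double(str_utf8).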
Now let's work in the reverse direction (accepting a string from MATLAB into C):
my_func_reverse.c

    #include <string.h>   /* strlen */
    #include "mex.h"

    void print_hex(const unsigned char* s, size_t len)
    {
        size_t i;
        for(i=0; i<len; ++i) {
            mexPrintf("0x%02x ", s[i] & 0xFF);
        }
        mexPrintf("0x00\n");
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        char *str;
        if (nrhs<1 || !mxIsChar(prhs[0])) {
            mexErrMsgIdAndTxt("mex:error", "Expecting a string");
        }
        str = mxArrayToString_UTF8(prhs[0]);   // get UTF-8 encoded string from Unicode (undocumented)
        print_hex((const unsigned char*)str, strlen(str));   // print bytes
        plhs[0] = mxCreateString_UTF8(str);    // create Unicode string from UTF-8 (undocumented)
        mxFree(str);
    }
And I test this from inside MATLAB:
    >> s = char(hex2dec(['0041';'00C0';'6C34'])');   %# "\u0041\u00C0\u6C34"
    >> ss = my_func_reverse(s);
    0x41 0xc3 0x80 0xe6 0xb0 0xb4 0x00                %# UTF-8 encoding
    >> assert(isequal(s,ss))
Finally, I should say that if for some reason you are still having problems, the easiest thing would be to convert the non-ASCII strings to the uint8 datatype before passing them from MATLAB to your engine program.
So inside the MATLAB process do:
    %# read the contents of a UTF-8 file
    fid = fopen('test.txt', 'rb', 'native', 'UTF-8');
    str = fread(fid, '*char')';
    fclose(fid);
    str_bytes = unicode2native(str,'UTF-8');   %# convert to bytes

    %# or simply read the file contents as bytes to begin with
    %fid = fopen('test.txt', 'rb');
    %str_bytes = fread(fid, '*uint8')';
    %fclose(fid);
And access the variable using the Engine API as:
    mxArray *arr = engGetVariable(ep, "str_bytes");
    uint8_t *bytes = (uint8_t*) mxGetData(arr);
    // now you decode this UTF-8 string on your end ...
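For the "decode on your end" part, here is a minimal UTF-8 decoder sketch in plain C. It is not tied to the Engine API and the name utf8_decode is mine. It checks lead and continuation bytes but, to stay short, does not reject overlong encodings or encoded surrogates, so treat it as a starting point rather than a validating decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 bytes into Unicode code points.
 * Returns the number of code points written (at most cap),
 * or SIZE_MAX if the input is malformed or truncated. */
size_t utf8_decode(const uint8_t *bytes, size_t len,
                   uint32_t *out, size_t cap)
{
    size_t i = 0, n = 0, k;
    assert(bytes != NULL && out != NULL);
    while (i < len && n < cap) {
        uint8_t b = bytes[i];
        uint32_t cp;
        size_t extra;                                   /* continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return SIZE_MAX;                           /* invalid lead byte */
        if (extra > len - i - 1) return SIZE_MAX;       /* sequence cut short */
        for (k = 1; k <= extra; ++k) {
            uint8_t c = bytes[i + k];
            if ((c & 0xC0) != 0x80) return SIZE_MAX;    /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Running it over the bytes 41 C3 80 E6 B0 B4 gives the code points 65, 192 and 27700, the same values MATLAB reported for double(str_utf8). With the Engine you would pass the bytes pointer from above together with the element count of the array.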
All tests were done on WinXP running R2012b with the default charset:
    >> feature('DefaultCharacterSet')
    ans =
    windows-1252
Hope this helps.
EDIT: In MATLAB R2014a, many undocumented C functions were removed from the libmx library (including the ones used above), and replaced with equivalent C++ functions exposed under the namespace matrix::detail::noninlined::mx_array_api.
It should be easy to adjust the examples above (as explained here) to run on the latest R2014a version.
matlab unicode encoding mex matlab-engine