php - Multi-byte strings and look-around weird bug



Why does the following code behave differently for different multi-byte strings?

echo preg_replace('@(?=\pL)@u', '*', 'م'); // prints: '*م' ✓
echo preg_replace('@(?=\pL)@u', '*', 'ض'); // prints: '*ض' ✓
echo preg_replace('@(?=\pL)@u', '*', 'غ'); // prints: '*�*�' ✗
echo preg_replace('@(?=\pL)@u', '*', 'ص'); // prints: '*�*�' ✗

See: http://3v4l.org/fvab1
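The `�` characters in the failing output suggest that the `*` landed between the two bytes of a single UTF-8 character. One way to check that is to dump the bytes of the subject and the result. The sketch below is only an illustration: `hex_bytes()` is a hypothetical helper that is not part of the original post, and the subject is written with byte escapes so the example does not depend on the file encoding.

```php
<?php
// Hypothetical helper (not from the original post): renders a string as
// space-separated hex bytes so a split multi-byte sequence becomes visible.
function hex_bytes($s) {
    return implode(' ', str_split(bin2hex($s), 2));
}

$subject = "\xD8\xBA"; // 'غ' (U+063A) encoded as UTF-8
$result  = preg_replace('@(?=\pL)@u', '*', $subject);

echo hex_bytes($subject), "\n";
echo hex_bytes((string) $result), "\n";
// An intact result reads "2a d8 ba" ('*' followed by the whole character).
// The broken output shown above would read "2a d8 2a ba", with a '*'
// inserted in the middle of the character.

// preg_replace() returns null on failure; preg_last_error() gives the reason.
var_dump($result === null, preg_last_error() === PREG_NO_ERROR);
```

Running this on the affected and unaffected versions listed on 3v4l shows at which byte offsets the replacement was inserted.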

You need to include modifier letters (Lm). See the following script, which iterates over the whole standard Arabic Unicode block (U+0600 to U+06FF):

<?php
// Encode a code point from the two-byte UTF-8 range (U+0080 to U+07FF).
function uchar_2($dec) {
    $utf = chr(192 + (($dec - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
    return $utf;
}

$issues = 0;
$count = 0;
// 1536..1791 is U+0600..U+06FF, the Arabic block.
for ($dec = 1536; $dec <= 1791; $dec++) {
    $char = uchar_2($dec);
    if (preg_replace('@^(?=\pLm)$@u', '*', $char) !== $char) {
        printf("issue %s (%s)\n", $dec, $char);
        $issues++;
    }
    $count++;
}
printf("found %d issues in %d rows\n", $issues, $count);

Without Lm, it fails for around half of the characters.
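When experimenting with the property names, it may help to check how PCRE reads them. As far as the PCRE documentation describes it, the brace-less form of `\p` takes exactly one letter, so a two-letter category such as Lm has to be written `\p{Lm}`. The sketch below is an illustration only, assuming a PHP build with PCRE Unicode property support. The two sample characters are my own choice: U+0640 ARABIC TATWEEL (category Lm) and U+0645 ARABIC LETTER MEEM (category Lo).

```php
<?php
$tatweel = "\xD9\x80"; // U+0640 ARABIC TATWEEL, general category Lm
$meem    = "\xD9\x85"; // U+0645 ARABIC LETTER MEEM, general category Lo

// \p{L} covers every letter category (Lu, Ll, Lt, Lm, Lo), so it matches both.
var_dump(preg_match('@^\p{L}$@u', $tatweel));  // int(1)
var_dump(preg_match('@^\p{L}$@u', $meem));     // int(1)

// \p{Lm} matches only the modifier letter.
var_dump(preg_match('@^\p{Lm}$@u', $tatweel)); // int(1)
var_dump(preg_match('@^\p{Lm}$@u', $meem));    // int(0)

// Without braces, \pLm is read as \p{L} followed by a literal "m".
var_dump(preg_match('@^\pLm$@u', 'am'));       // int(1)
var_dump(preg_match('@^\pLm$@u', $tatweel));   // int(0)
```

This only shows how the patterns are parsed. It does not reproduce the version-specific behaviour from the question.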

php regex
