This Week on perl5-porters (8-14 March 2004)

This week was the can-of-Unicode-worms-festival week for the Perl 5 porters. Regular expressions were another recurrent topic. Read on for details.

Unicode and UTF-8 coding

The Big Topic of this week was UTF-8, Unicode, and how Perl deals with it.

This all started with a report about seemingly innocuous UTF-8 failures. Digging deeper, Chip Salzenberg pointed out a flaw in Perl's handling of Unicode strings: conversions from byte strings (with "regular" eight-bit chars) to UTF-8 currently map high-bit characters to Unicode without translation (or, depending on how you look at it, by implicitly assuming the byte strings are in Latin-1). This is potentially wrong, because Perl assumes the C locale by default. Thus upgrading a string to UTF-8 may change the meaning of its contents with regard to character classes, case mapping, etc. This behaviour was nevertheless chosen in perl 5.8.x for backwards compatibility.
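A minimal sketch of the effect, using only the core utf8:: functions. It assumes a perl without the unicode_strings feature enabled and without use locale; the variable names are purely illustrative.

```perl
use strict;
use warnings;

# One byte with the high bit set (e-acute, if read as Latin-1).
my $bytes = "\xE9";

# As a byte string, the character is not treated as a word character.
my $before = ($bytes =~ /\w/) ? 1 : 0;

# Upgrade a copy to Perl's internal UTF-8 representation.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# The two strings still compare equal...
my $same = ($bytes eq $upgraded) ? 1 : 0;

# ...but the upgraded copy is now matched with Unicode rules.
my $after = ($upgraded =~ /\w/) ? 1 : 0;

print "equal: $same, \\w before: $before, \\w after: $after\n";
```

The two strings are equal according to eq, yet they answer the \w question differently, which is the change of meaning discussed in the thread.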

Jarkko Hietaniemi, former 5.8 pumpking and Unicode guru, stepped into the discussion and provided insight. Various solutions were proposed and discussed.

Should upgrading byte strings that contain characters in the range 0x80-0xFF be forbidden, or should it emit a warning? Autrijus Tang, deciding to speak in code, released a module on CPAN that implements the latter solution, and wishes it to be integrated into the core at some point in the 5.9 development track. This would also require turning the encoding pragma into a lexically scoped one (as locale currently is).
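To make the "emit a warning" option concrete, here is a rough sketch. The helper upgrade_with_warning is hypothetical and written for this summary only; it is not the CPAN module mentioned above, which hooks into implicit conversions rather than requiring an explicit call.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade a copy of a string to UTF-8, warning first
# if it is a byte string holding characters in the range 0x80-0xFF.
sub upgrade_with_warning {
    my ($string) = @_;
    if (!utf8::is_utf8($string) && $string =~ /[\x80-\xFF]/) {
        warn "high-bit bytes are being treated as Latin-1\n";
    }
    utf8::upgrade($string);
    return $string;
}

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    upgrade_with_warning("plain ASCII");   # nothing in 0x80-0xFF
    upgrade_with_warning("caf\xE9");       # contains a high-bit byte
}
print scalar(@warnings), " warning(s)\n";
```

Only the second call should trigger the warning, since pure ASCII data means the same thing in either representation.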

While we're at it, Nick Ing-Simmons wonders what's the proper method for XS coders to get UTF-8 data (without converting an SV to UTF-8 in place, which is considered a Bad Thing). Sadahiro Tomoyuki provides some answers.

substr() lvalues

Ton Hospel reported (some time ago) bug #24346, concerning the behaviour of the return value of substr() when it is used as an lvalue. He points out, with examples, that the current situation is not satisfactory, because the lvalue acts as a fixed-length window. In some cases this causes surprising action at a distance: a variable aliased to the result of substr() can end up holding a value different from the one just assigned to it.
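The following sketch shows the kind of code at stake. It is not taken from the bug report; it assumes a perl in which the lvalue adjusts its length after each assignment.

```perl
use strict;
use warnings;

my $x = '1234';
my @seen;

# foreach aliases $_ to the lvalue returned by substr(), covering "23".
for (substr($x, 1, 2)) {
    $_ = 'a';          # replacement shorter than the window
    push @seen, $x;
    $_ = 'xyz';        # replacement longer than the window
    push @seen, $x;
}

print "$_\n" for @seen;
```

With an adjusting lvalue, the second assignment replaces only the "a" written by the first one. With a strictly fixed two-character window it would instead overwrite "a4", which is the sort of surprise described above.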

Graham Barr fixed this problem. Nicholas is apparently still undecided about whether the fix should go into perl 5.8, in the absence of any good argument for or against.

Regular expression bugs

Hugo reports that Damian reported that use re 'eval' is not seen in patterns interpolated at run-time via /(??{...})/. Yitzchak Scott-Thoennes explains that this comes from the fact that this compile-time pragma setting is no longer seen at run-time (and this is one more reason to rewrite the support for pragmas in the core.)
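For readers unfamiliar with the pragma, here is a small sketch of what use re 'eval' normally permits. It illustrates the pragma's lexical scope only and does not reproduce the reported bug; the pattern and variable names are made up.

```perl
use strict;
use warnings;

# A code block that reaches the regex engine inside a run-time string.
my $pattern = '(?{ 1 })a';

# Without the pragma, compiling the interpolated pattern dies,
# so the eval block returns false.
my $allowed_without = eval { my $m = ("a" =~ /$pattern/); 1 } ? 1 : 0;

my $allowed_with;
{
    use re 'eval';     # lexically scoped opt-in for run-time code blocks
    $allowed_with = ("a" =~ /$pattern/) ? 1 : 0;
}

print "without pragma: $allowed_without, with pragma: $allowed_with\n";
```

The pragma is recorded at compile time for the enclosing block, which is why code that runs later, outside that compile-time context, may not see it.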

Hugo also reports a case of an incorrect regexp compilation warning (bug #27603) with /(??{...})/ blocks.

Jamie Lokier found a bug in the regular expression engine, more precisely in the optimisation pass (bug #27515), leading to a wrong interpretation of the regular expression /^(.*)(?=x)x/. Hugo confirmed that this was a known bug, possibly difficult to fix.
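As a reference point, this sketch shows what the pattern is expected to do on a correctly working engine. The test string is an assumption chosen for illustration, not the input from the bug report.

```perl
use strict;
use warnings;

my $string = "abx";
my $captured;

# (.*) first grabs everything, then backtracks until the lookahead
# (?=x) sees an "x", which the final literal x then consumes.
if ($string =~ /^(.*)(?=x)x/) {
    $captured = $1;
}

print defined $captured ? "captured '$captured'\n" : "no match\n";
```

Here the capture group should end up holding everything before the last "x".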

Jamie also found that using return() from a /(?{...})/ block may lead to a segmentation fault (bug #27595). Such blocks are considered completely broken by the higher authorities (Dave Mitchell) and will hopefully be reimplemented.

Other bugs (and fixes)

Rafael reports that the source-filter-based Switch module is confused by occurrences of the ($) function prototype in the filtered source. (Bug #27472.)

Chip Salzenberg fixed the line-buffering problem noticed by Stas Bekman last week.

Paul Kramer remarks that one can't change the ownership of a symlink with perl's chown() built-in. Rafael suggests adding lchown() to the POSIX module (which already contains chown()). (Bug #27547.)

Nicholas Clark proposed a load of patches for Storable: fixes for storing restricted hashes, references to undef, plus a space optimization. (Bug #27616.)


Arthur Bergman released the second development release of Ponie, which seems to be impressive so far.

Tels released new versions of his math packages, Math::BigInt v1.70, bignum 0.15, and Math::BigRat 0.12.

About this summary

This summary was written by Rafael Garcia-Suarez. Weekly summaries are published online and posted to a mailing list. Comments and corrections are welcome.