Unicode Text Limits
Posted: 30 Jul 2009, 00:45
I've been looking at the latest protocol spec to tighten the conformance of Netfraction. When I looked at the NI and DE fields of the INF message and saw that valid text is "all characters in the Unicode character set with code point equal to or greater than 32 [U+0020]", I went about crafting a regular expression to match this.
I'm using C# and the .NET Framework, which defines a number of character classes that correspond to the Unicode general categories (http://msdn.microsoft.com/en-us/library ... egory.aspx and http://www.unicode.org/Public/UNIDATA/U ... ory_Values).
The category 'Cc' in the Unicode specification covers the 'control' characters U+0000-U+001F (C0 controls, http://www.unicode.org/charts/PDF/U0000.pdf) and U+007F-U+009F (DEL, plus the C1 controls U+0080-U+009F, http://www.unicode.org/charts/PDF/U0080.pdf), which originate in ISO/IEC 6429.
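As a quick sanity check of those ranges, here is a minimal sketch that asks the Framework which UTF-16 code units it classifies as Cc. The class and method names (CcRanges, CountControls) are made up for illustration, and it assumes char.GetUnicodeCategory uses the same category data as the regex classes.

```csharp
using System;
using System.Globalization;

public static class CcRanges
{
    // Hypothetical helper: counts the code units whose general category is Cc (Control).
    public static int CountControls()
    {
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++)
        {
            if (char.GetUnicodeCategory((char)cp) == UnicodeCategory.Control)
                count++;
        }
        return count;
    }

    public static void Main()
    {
        // 32 C0 controls (U+0000-U+001F) + DEL (U+007F) + 32 C1 controls (U+0080-U+009F)
        Console.WriteLine(CountControls());                    // 65
        Console.WriteLine(char.GetUnicodeCategory('\u007F'));  // Control
        Console.WriteLine(char.GetUnicodeCategory('\u0085'));  // Control
    }
}
```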
Thus, the ADC Protocol currently restricts the use of C0 controls, plus the space character U+0020. Unicode also defines a Separator category 'Z' which includes Line (Zl), Paragraph (Zp) and Space (Zs) components. I'm having trouble finding the exact definitions of these on the official Unicode site, but they appear to be listed here: http://www.fileformat.info/info/unicode ... /index.htm
I propose that the ADC Protocol exclude, for the purposes of nicknames and descriptions, not just C0 controls and U+0020, but all controls in the Cc category and all separator characters in the Z category.
Also, I believe the Protocol should specify a minimum Nickname length of at least one character.
I include here a C# function containing a regular expression which verifies text against this restriction:
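private static bool ContainsUnicodeControlOrSeparator(string text)
{
    // True if text is empty or contains any Cc (control) or Z (separator) character.
    // \z anchors at the very end of the string; $ would also match before a trailing newline.
    return !System.Text.RegularExpressions.Regex.IsMatch(text, @"^[^\p{Cc}\p{Z}]+\z");
}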
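For comparison, the same check can be sketched without a regular expression by looking at each character's Unicode category directly. This is only an illustration of the proposal, not anything from the spec; the names NickCheck and IsValidNickText are invented here. It also folds in the one-character minimum. Every Cc and Z character lies in the Basic Multilingual Plane, so walking the string one UTF-16 code unit at a time should be enough.

```csharp
using System;
using System.Globalization;

public static class NickCheck
{
    // Hypothetical helper: true when text is non-empty and contains
    // no Cc (control) or Z (separator) characters.
    public static bool IsValidNickText(string text)
    {
        if (string.IsNullOrEmpty(text))
            return false; // enforces the proposed one-character minimum

        foreach (char c in text)
        {
            switch (char.GetUnicodeCategory(c))
            {
                case UnicodeCategory.Control:            // Cc
                case UnicodeCategory.SpaceSeparator:     // Zs
                case UnicodeCategory.LineSeparator:      // Zl
                case UnicodeCategory.ParagraphSeparator: // Zp
                    return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsValidNickText("Netfraction"));  // True
        Console.WriteLine(IsValidNickText("two words"));    // False (U+0020 is Zs)
        Console.WriteLine(IsValidNickText("nick\u0085"));   // False (C1 control)
        Console.WriteLine(IsValidNickText("nick\u00A0x"));  // False (no-break space is Zs)
        Console.WriteLine(IsValidNickText(""));             // False (too short)
    }
}
```

Note that U+00A0 and U+0085 would both pass the current "code point 32 or greater" rule, which is the gap the proposal is meant to close.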