4-byte utf-8 and DC software

Post by **klondike** » 16 Nov 2014, 02:57

Hi!

I noticed whilst updating my hub's MOTD (for reference I have hung it at http://klondike.es/motd.txt) that messages containing 4-byte UTF-8 characters, whilst sent properly, are filtered by many DC engines.

Ironically the character I was testing was the one at http://www.fileformat.info/info/unicode ... /index.htm (this forum's SQL database rejects inserting it).

So far my results are these:

Hub software:
* ADCH: Can send messages with the character but filters incoming ones
* uhub: Can send messages with the character but filters incoming ones
* Flexhub: Replaces the character in the message with a '?' (question mark)

Client software:
* eiskaltdc: Can send messages with the character but filters incoming ones

I'm unsure what's causing it, as the character is valid UTF-8 and shouldn't be filtered.
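
For anyone wanting to reproduce this, here is a minimal Python sketch of what such characters look like on the wire. The helper name `describe` is made up for this example, and U+1F34D is used only as a sample character outside the BMP; it is not necessarily the character tested above.

```python
def describe(ch: str) -> str:
    """Return the code point and the UTF-8 bytes of a single character."""
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    return f"U+{ord(ch):04X} -> {len(encoded)} bytes: {hex_bytes}"


# One sample per UTF-8 sequence length (1, 2, 3 and 4 bytes).
for sample in ("A", "\u00e9", "\u20ac", "\U0001F34D"):
    print(describe(sample))
```

Only the last sample needs four bytes (lead byte `F0`), and that is the form that seems to get filtered.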

Post by **klondike** » 17 Nov 2014, 04:27

Pretorian asked if this could be caused by an old Unicode engine, since the character is fairly recent. I didn't test as deeply with the pineapple codepoint http://www.fileformat.info/info/unicode ... /index.htm as I did with the other one, but it seems to trigger similar behaviour where I tested.

Post by **cologic** » 17 Nov 2014, 21:14

MySQL's manual probably explains why the forum has rejected your 4-byte "NO PIRACY" codepoint:

https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html wrote:The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters

I haven't checked uhub or FlexHub, but adchpp's Text.cpp doesn't support 4-byte UTF-8 encodings in either Text::utf8ToWc or Text::wcToUtf8, which explains your observations.

4-byte UTF-8 shouldn't be filtered, no. RFC 3629 defines 4-byte UTF-8; however, it was simply never implemented.
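
To make the gap concrete, here is a minimal Python sketch of a decoder that accepts all four sequence lengths RFC 3629 defines. It is not adchpp's Text.cpp code; `decode_utf8` and its lookup table are hypothetical names used only for illustration.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (1- to 4-byte sequences)."""
    # (lead-byte mask, expected lead bits, payload mask, length, smallest valid code point)
    forms = [
        (0x80, 0x00, 0x7F, 1, 0x0000),
        (0xE0, 0xC0, 0x1F, 2, 0x0080),
        (0xF0, 0xE0, 0x0F, 3, 0x0800),
        (0xF8, 0xF0, 0x07, 4, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, bits, payload, length, minimum in forms:
            if (lead & mask) == bits:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        cp = lead & payload
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"code point U+{cp:04X} is not allowed in UTF-8")
        out.append(cp)
        i += length
    return out


print([hex(cp) for cp in decode_utf8(b"a\xF0\x9F\x8D\x8D")])
```

A decoder that stops at the three-byte row of the table would reject the `F0` lead byte, which would match the filtering described above.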

Post by **klondike** » 18 Nov 2014, 01:27

Yeah I found the code doing the filtering in uhub too, see https://github.com/janvidar/uhub/blob/m ... til/misc.c

We should probably explain this in the ADC standard and maybe provide some use cases (if possible with something other than that stupid "no piracy" icon).

The main question this raises is: should we lay the groundwork in case they decide to expand Unicode to 5- or even 6-byte sequences, or just stick with the standard?

Post by **cologic** » 18 Nov 2014, 02:02

klondike wrote:Yeah I found the code doing the filtering in uhub too, see https://github.com/janvidar/uhub/blob/m ... til/misc.c

We probably should explain this into the ADC standard and maybe provide some use cases (if possible with something other than that stupid "no piracy" icon).

The main question this raises is, should we leave the ground set in case they decide to expand to 5 or even 6 bytes unicode or just stick with the standard?

Hopefully you've alerted janvidar to the issue in uhub.

I agree, this many hubs getting it wrong in the same way suggests a widespread misunderstanding with an unambiguously correct fix. It's worth mentioning somewhere, with example use cases.

I'm disinclined to encourage preparation for 5+ byte UTF-8 support. The previous UTF-8 RFC did allow for "sequences of 1 to 6 octets". However, not only does the superseding UTF-8 RFC actually enumerate exactly 4 possible codepoint representation lengths, but, after five years' accumulated experience between 1998 and 2003, it deliberately "Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range)" versus RFC 2279, all representable within 4 bytes. I'd suggest just sticking with the current standard.
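
As a rough illustration of why 4 bytes suffice, here is a small Python sketch of the length boundaries under RFC 3629. The function name `utf8_length` is an assumption made for this example and is not taken from any DC codebase.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes RFC 3629 UTF-8 uses for a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} cannot be encoded in UTF-8")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# The last code point of each length class, ending with the Unicode maximum.
for cp in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

The highest code point, U+10FFFF, encodes as `F4 8F BF BF`, so nothing in the permitted range ever needs a fifth byte.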

Post by **klondike** » 18 Nov 2014, 06:20

cologic wrote:Hopefully you've alerted janvidar to the issue in uhub.

Nah, I'll just fix the code myself and send a merge request his way instead.

cologic wrote:I agree, this many hubs getting it wrong in the same way suggests a widespread misunderstanding with an unambiguously correct fix. It's worth mentioning somewhere, with example use cases.

A good example would probably be the emoticon block, which can be found at http://www.fileformat.info/info/unicode ... s/list.htm: most are 4-byte UTF-8, likely to be used, and provide a good baseline for what is historically an evolution of IRC.

cologic wrote:I'm disinclined to encourage preparation for 5+ byte UTF-8 support. The previous UTF-8 RFC did allow for "sequences of 1 to 6 octets". However, not only does the superceding UTF-8 RFC actually enumerate exactly 4 possible codepoint representation lengths, but, after five years' accumulated experience between 1998 and 2003, deliberately "Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range)" versus RFC 2279, all representable within 4 bytes. I'd suggest just sticking with the current standard.

Sounds like a good idea. I'd like to see what the others say, so that if this bites us again in the far future (let's hope not) whoever has to fix it can know the reasons why we didn't take another approach.

Post by **cologic** » 19 Nov 2014, 17:23

klondike wrote:A good example would probably the emoticon block which can be found on http://www.fileformat.info/info/unicode ... s/list.htm most are utf-8, likely to be used and providing a good baseline for what is historically an evolution of IRC.

Works for me.

klondike wrote:Sounds like a good idea, I'd like to see what the others say so if this bites us again on the far future (let's hope not) whoever has to fix this can know the reasons why we didn't take another approach.

Agree that documenting rationales for this sort of decision is important.

Additionally, both XML 1.0 (2000) and XML 1.1 (2006) define supported character ranges topping out at #x10FFFF, just as RFC 3629 does. For filelists, this likewise restricts valid Unicode codepoints to those fitting in 4-byte UTF-8 representations.
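
For reference, a minimal Python sketch of the XML 1.0 `Char` production as a range check. The function name `is_xml10_char` is hypothetical, and XML 1.1 differs slightly at the low end of the range, so treat this as an illustration rather than a filelist validator.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# A supplementary-plane character is allowed; anything past U+10FFFF is not.
print(is_xml10_char(0x1F34D), is_xml10_char(0x110000))
```

The top of the last range is the same U+10FFFF ceiling, and the surrogate range U+D800 to U+DFFF is excluded here too.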

Post by **klondike** » 24 Nov 2014, 09:53

cologic wrote:Agree that documenting rationales for this sort of decision is important.

Additionally, both XML 1.0 (2000) and XML 1.1 (2006) define supported character ranges topping out at #x10FFFF, just as RFC 3629. For filelists, this likewise restrains valid Unicode codepoints to those fitting in 4-byte UTF-8 representations.

Cool, I suppose Pretorian knows better where we should keep this noted down.

For the record (and in case it helps those needing a reference), here is the patch for uhub: https://github.com/janvidar/uhub/pull/27

Post by **cologic** » 25 Nov 2014, 06:11

klondike wrote:Cool, I suppose Pretorian knows better where should we keep this noted down.

For the record (and in case it helps those needing a reference) here is the patch for uhub https://github.com/janvidar/uhub/pull/27

Yes, Pretorian would be the person to ask here.

Your commit to add support for 4-byte UTF-8 characters and stricter character checking, including a specific check for what appears to be CESU-8 or some similar UTF-16 surrogate encoding, puzzles me; RFC 3629 notes that:

https://tools.ietf.org/html/rfc3629#section-3 wrote:The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above. This contrasts with CESU-8, which is a UTF-8-like encoding that is not meant for use on the Internet. CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values (16-bit quantities) instead of the character number (code point). This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8.

While your code indeed appears to treat this as invalid input, I'm puzzled why it should have to be specifically checked for and prohibited at all; to the extent it's invalid, it should be invalid without any specific reference to this not-meant-for-internet-usage surrogate encoding. What is the use case here; has some ADC software historically used such encodings?

I realize that adchpp's Text::utf8ToWc has a similar "Ugly utf-16 surrogate catch" and I'm mostly confused why it's there at all. I was intending to remove it, so I'm especially curious why you specifically (re-)added it.

Post by **klondike** » 25 Nov 2014, 07:01

The reason why they recommend treating surrogates as invalid is the same as why they recommend checking for character ranges: to prevent various ways of encoding the same thing, which may (for example) allow bypassing filters.

The reason why some encodings allow encoding surrogates is that there are systems using UTF-16 (back when they thought 16 bits would be enough for everybody) which encode each 16-bit value directly into UTF-8 instead of handling the surrogates, for example for storage in a database.

The problem with such an approach is that a system using UTF-16 as its internal representation will decode the surrogates and then interpret them as the symbol they represent, instead of converting the 3 or 4 byte UTF-8 sequence with the symbol into the appropriate surrogate set. This causes information to be representable in two ways, which is a bad thing for filters.
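
A minimal Python sketch of the two representations, assuming a lenient decoder that passes surrogates through. The helpers `cesu8_encode` and `lenient_decode` are hypothetical names for illustration and are not taken from uhub or adchpp.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode CESU-8 style: every UTF-16 code unit is encoded on its own."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # "surrogatepass" makes Python emit the 3-byte forms strict UTF-8 forbids.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


def lenient_decode(data: bytes) -> str:
    """Mimic a UTF-16 based system that accepts encoded surrogates and pairs them up."""
    units = data.decode("utf-8", "surrogatepass")
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


ch = "\U0001F34D"
utf8 = ch.encode("utf-8")   # F0 9F 8D 8D
cesu8 = cesu8_encode(ch)    # ED A0 BC ED BD 8D

print(utf8 in cesu8)                # a byte-level filter for the UTF-8 form misses it
print(lenient_decode(cesu8) == ch)  # yet a lenient receiver sees the same character
try:
    cesu8.decode("utf-8")
except UnicodeDecodeError:
    print("a strict UTF-8 decoder rejects the surrogate form")
```

Rejecting the surrogate form outright, as a strict decoder does, leaves only one valid byte sequence per character for a filter to look for.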
