ADC character encoding

Poll: Should we use UTF-8 or just US-ASCII?

We should; everybody has the right to use their own language. (9 votes, 100%)
We shouldn't, because I don't like ADC and I don't have any argument for it. (no votes)
We shouldn't; everybody should learn English so we can all understand each other. (no votes)
I'm American, I don't give a shit about the rest, so long live US-ASCII. (no votes)
I don't know. (no votes)
I don't care. (no votes)

Total votes: 9

ADC character encoding

Post by Pietry » 12 Mar 2008, 17:26

In the beginning, before God created DC and computers, when the first Turing machines were created, a little problem appeared: the need to store characters.
Characters are letters, digits, punctuation marks, anything you use to communicate every day.

Since the first machines had something like 16 KiB of memory, characters had to take up as little space as possible. The bit (0 or 1) is the smallest unit of information, and the byte (8 bits) is the smallest unit a machine actually addresses. So what could be better than representing each character by a single byte?

OK, but that only worked for a while. A byte, 8 bits, can only hold values from 0000 0000 to 1111 1111. What does that mean? Converted to decimal, the smallest value is 0 and the biggest is 255, which gives 256 possible values. Oh my, does that mean we can have only 256 characters? Pretty much, yes.
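A quick sanity check of that arithmetic, as a minimal Python sketch (it shows nothing more than the numbers above):

[code]
# One byte = 8 bits, so it can represent 2**8 = 256 distinct values: 0 through 255.
print(2 ** 8)                             # 256
print(min(range(256)), max(range(256)))   # 0 255

# Python's bytes type enforces exactly this range:
bytes([0, 255])                   # fine
try:
    bytes([256])                  # one too many: does not fit in a byte
except ValueError as error:
    print("256 does not fit in a byte:", error)
[/code]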

For a while those 256 characters were enough, and a code called ASCII was created. In this code each symbol gets a number: 'a' became 97, 'A' became 65 and so on. (Strictly speaking, ASCII itself only defines the first 128 values, 0 to 127; the upper half was later filled in differently by the various "extended ASCII" code pages.)
After some time people realised that this was in fact a very restrictive code, and they tried creating extensions that could encode more characters.
Every language has its own special characters. Just think about Chinese!
Characters like şţăî simply cannot be represented in ASCII. The code was created by Americans, for Americans, and as usual nobody thought about the rest of the world.
The best solution to this problem was an extension that could represent any character people needed.
That solution is called Unicode, and the most widely used encoding for Unicode is UTF-8. UTF-8 completely includes the ASCII codes and extends them.
The fun part is that ASCII characters still take exactly one byte in UTF-8, so plain ASCII text pays no performance penalty. The best part comes when encoding the rest of the characters.
According to Wikipedia:
"UTF-8 encodes each character in one to four octets (8-bit bytes):

1. One byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
2. Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
3. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
4. Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice."
Isn't that just better?
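You can check those one-to-four byte lengths for yourself. Here is a minimal Python sketch; the sample characters are just ones I picked from each range:

[code]
# -*- coding: utf-8 -*-
# How many bytes does UTF-8 need for characters from different Unicode ranges?
samples = [
    ("A",  "US-ASCII letter (U+0041)"),
    ("ş",  "Latin letter with a diacritic (U+015F)"),
    ("Ж",  "Cyrillic letter (U+0416)"),
    ("中", "CJK character in the Basic Multilingual Plane (U+4E2D)"),
    ("😀", "character outside the BMP (U+1F600)"),
]

for character, description in samples:
    encoded = character.encode("utf-8")
    print(f"{description}: {len(encoded)} byte(s) -> {encoded!r}")

# ASCII compatibility: an ASCII-only string is byte-for-byte identical
# whether it is encoded as ASCII or as UTF-8.
assert "ADC".encode("ascii") == "ADC".encode("utf-8")
[/code]

Running it prints 1, 2, 2, 3 and 4 bytes respectively, exactly the pattern described in the quote.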

What does this have to do with ADC? To answer that, we need to go back to the age of NMDC. NMDC never defined a text encoding: each client just sends raw 8-bit bytes in whatever code page its system locale uses. So if you send characters beyond plain ASCII, the other side usually shows weird symbols or squares, because it interprets those bytes with a different code page. (If both clients happen to use the same locale the text may come through, but that is pure luck, and it is still a mess.)
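Here is roughly what goes wrong, as a minimal Python sketch. The two code pages are just an illustrative pair (a Central European one for the sender, a Western European one for the receiver); NMDC itself mandates none of this, which is exactly the problem:

[code]
# -*- coding: utf-8 -*-
# The sender types Romanian text and their system encodes it with a
# Central European code page such as Windows-1250.
message = "şţăî"
raw_bytes = message.encode("cp1250")

# The receiver has no idea which code page was used and decodes the same
# bytes with its own locale, for example Western European Windows-1252.
garbled = raw_bytes.decode("cp1252", errors="replace")

print("sent:    ", message)
print("received:", garbled)   # classic mojibake: the wrong letters come out

# With UTF-8 both sides agree on the encoding, so the text survives intact.
assert message.encode("utf-8").decode("utf-8") == message
[/code]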

ADC uses UTF-8 exclusively. That makes ADC an internationalized (i18n) protocol: multilingual and non-discriminatory.
Let's give everybody a chance.
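To make that concrete, here is a rough sketch of how a client could put a chat line on the wire. The SID "AAAB" and the helper names are made up for the example; the escaping rules (space as \s, newline as \n, backslash as \\) and the trailing newline separator come from the ADC specification, but this is an illustration, not a real client:

[code]
# -*- coding: utf-8 -*-

def adc_escape(text: str) -> str:
    """Escape an ADC parameter: backslash, space and newline are special."""
    return (text.replace("\\", "\\\\")
                .replace(" ", "\\s")
                .replace("\n", "\\n"))

def build_bmsg(sid: str, text: str) -> bytes:
    """Build a broadcast chat message and encode it as UTF-8, as ADC requires."""
    command = f"BMSG {sid} {adc_escape(text)}\n"
    return command.encode("utf-8")

wire = build_bmsg("AAAB", "Salut! Caracterele şţăî merg fără probleme.")
print(wire)
# Every client decodes these bytes as UTF-8, so the Romanian letters
# arrive exactly as they were typed.
print(wire.decode("utf-8"))
[/code]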


References:
http://en.wikipedia.org/wiki/ASCII
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/UTF-8

Just someone
