
Character sets - Unicode and encodings

I read a great article by Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)": http://www.joelonsoftware.com/articles/Unicode.html

I work with text in a large number of scripts, from English and French to Yiddish and Mandarin, and I want to store it in the minimum number of bytes.

Common English characters sit in the 0-127 range, which is ASCII (see table below). French accented characters are found in the 128-255 range of single-byte extensions of ASCII such as ISO-8859-1 (Latin-1); strictly speaking, ASCII itself stops at 127. This is important because storing French in a single-byte character set takes fewer bytes than the same words in UTF-8, where each accented character needs 2 bytes. Yiddish and Mandarin can't be represented in ASCII or Latin-1 at all.
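As a quick check of that claim, here is a minimal Python sketch using the built-in str.encode. The sample words and the helper name fits_in are my own choices for illustration, and Latin-1 stands in for the single-byte 128-255 range.

```python
# Compare the storage cost of a French word in a single-byte
# character set (Latin-1) and in UTF-8.
word = "caf\u00e9"  # "cafe" with an acute accent on the e; sample word only

latin1_bytes = word.encode("latin-1")  # every character fits in 1 byte
utf8_bytes = word.encode("utf-8")      # the accented e takes 2 bytes

print(len(latin1_bytes), len(utf8_bytes))  # prints: 4 5


def fits_in(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


hebrew_word = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew script, as used for Yiddish
chinese_word = "\u4f60\u597d"             # Chinese characters

print(fits_in(word, "ascii"))            # False: the accented e is above 127
print(fits_in(hebrew_word, "latin-1"))   # False
print(fits_in(chinese_word, "latin-1"))  # False
```

All three words encode without trouble in UTF-8, which is the trade-off: wider coverage in exchange for extra bytes on anything above 127.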

Character number range   ASCII/Latin-1 (bytes)   UTF-8 (bytes)   UTF-16/UCS-2 (bytes)   UCS-4 (bytes)
0-127                    1                       1               2                      4
128-255                  1 (Latin-1 only)        2               2                      4
256-2'047                na                      2               2                      4
2'048-65'535             na                      3               2                      4
65'536-1'114'111         na                      4               4 (UTF-16 only)        4
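The points where the byte counts change can be checked directly with Python's built-in codecs. This is only a sketch: the function name encoded_sizes is my own, and UTF-32 is used as the stand-in for UCS-4.

```python
def encoded_sizes(code_point):
    """Return the byte count of one character in UTF-8, UTF-16 and UTF-32."""
    ch = chr(code_point)
    return {
        # The "-le" variants are used so that no byte order mark is added.
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# First and last character number of each range with constant byte counts.
ranges = [(0, 127), (128, 255), (256, 2047), (2048, 65535), (65536, 1114111)]

for first, last in ranges:
    print(first, last, encoded_sizes(first), encoded_sizes(last))
```

Character numbers 55'296-57'343 are reserved for UTF-16 surrogates and cannot be encoded on their own, which is why the sketch only samples the ends of each range.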

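There are 8 usable bits in a byte. For ASCII and other character sets that use a fixed number of bytes, B bytes give 2^(B*8) distinct values, so the formula becomes B = roundup( log2(n+1) / 8 ), where n is the highest character number (counting from 0) and B is the number of bytes. For example, Latin-1 (n = 255) fits in 1 byte and UCS-2 (n = 65'535) needs 2. UCS-4 uses 4 bytes even though the formula gives 3 for n = 1'114'111, because 4 is the more convenient width.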
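For a fixed-width character set, B bytes give 2^(B*8) distinct values, so the byte count is the smallest B that covers the highest character number n. A minimal sketch, assuming n counts from 0; the function name fixed_width_bytes is my own, and integer arithmetic is used instead of log2 to avoid floating-point rounding.

```python
def fixed_width_bytes(n):
    """Smallest number of whole bytes that can hold the character number n."""
    if n < 0:
        raise ValueError("character numbers start at 0")
    # int.bit_length() is the number of bits needed to write n in binary;
    # even n = 0 still occupies one byte, hence the minimum of 1 bit.
    bits = max(n.bit_length(), 1)
    return (bits + 7) // 8  # round up to whole bytes


for highest in (127, 255, 256, 65535, 65536, 1114111):
    print(highest, fixed_width_bytes(highest))
```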

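UTF-8 works differently. The leading bits of the first byte tell us how many bytes the character uses, and every following byte starts with the bits 10. A 1-byte character has the form 0xxxxxxx (7 usable bits), a 2-byte character 110xxxxx 10xxxxxx (11 bits), a 3-byte character 1110xxxx 10xxxxxx 10xxxxxx (16 bits) and a 4-byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 bits). Therefore the number of bytes B used by the character n is the smallest B whose usable bits can hold n: B = 1 if n < 2^7, B = 2 if n < 2^11, B = 3 if n < 2^16, and B = 4 otherwise. The pattern could be extended to longer sequences, but Unicode stops at 1'114'111 (U+10FFFF), so 4 bytes is the maximum.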
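In standard UTF-8, sequences of 1, 2, 3 and 4 bytes carry 7, 11, 16 and 21 bits of the character number, so the byte count follows from three thresholds. A minimal sketch; the function name utf8_length and the choice of sample values are my own.

```python
def utf8_length(n):
    """Number of bytes UTF-8 uses for the character number n."""
    if n < 0 or n > 0x10FFFF:
        raise ValueError("outside the Unicode range 0..1114111")
    if n < 2 ** 7:    # 0xxxxxxx
        return 1
    if n < 2 ** 11:   # 110xxxxx 10xxxxxx
        return 2
    if n < 2 ** 16:   # 1110xxxx 10xxxxxx 10xxxxxx
        return 3
    return 4          # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


# Compare against Python's own encoder at each threshold.
samples = [0, 127, 128, 2047, 2048, 65535, 65536, 1114111]
for n in samples:
    print(n, utf8_length(n), len(chr(n).encode("utf-8")))
```

The two counts printed on each line should agree; the surrogate range 55'296-57'343 is left out of the samples because Python refuses to encode those values on their own.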

... still working on those formulas

 

 

Author

Andrew Dodson
Since: Feb 2007
