A look at UTF-8 std. for Indian Lang
A few days back I was in a discussion with Dr. Vineet Chaitanya (Lang. Tech.
Research Centre, IIIT-Hyd). Only then did I come to know about some
strange facts of the UTF-8 std. Here is Dr. VC's statement about UTF-8:
o UTF-8 not good for Indian languages
- 0% overhead for English
- Possibly 10% overhead for European languages
- No overhead for CJK
--> 100% overhead for IL <--
UTF-8 converts Unicode two-byte codes to a byte sequence of one to
four bytes. In the process it makes sure that the ASCII part of
Unicode is transmitted as a single byte only. So for a language like
English, and a few more which use only the 0-127 part of the code, there
is no overhead. European languages need some character codes
in the region 128-255 in addition to the 0-127 part, so we estimate that
transmitting this portion will incur some overhead, say of the
order of 10%. CJK has about 25000 codes, so the average information
encoded per Unicode character code is much larger, and this will
compensate for the increased size of the byte sequence in their case.
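The byte counts discussed above can be checked with a small sketch. The
function name utf8_length and the sample characters are my own choices for
illustration, not part of Dr. VC's statement; the ranges follow the
standard UTF-8 encoding rules, and each result is compared against
Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point."""
    if code_point < 0x80:       # 0-127: the ASCII part, one byte
        return 1
    if code_point < 0x800:      # 128-2047: covers accented Latin letters
        return 2
    if code_point < 0x10000:    # rest of the 16-bit range, incl. CJK ideographs
        return 3
    return 4                    # code points beyond the 16-bit range


# Illustrative samples: "A", e-acute, and one CJK ideograph.
for ch in ("A", "\u00e9", "\u4e2d"):
    # Cross-check the hand-written ranges against the built-in encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

With these samples the ASCII letter takes one byte, the accented letter
two, and the CJK ideograph three.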
In contrast to the above cases, Indian languages use no part of the code
in the region 0-127, and moreover their character codes occupy fewer
than 127 codes for each language. So what could have been transmitted
in one byte will be transmitted in a sequence of two to four bytes;
thus the minimum extra overhead will be 100%!
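As a rough check of the overhead argument, here is a minimal sketch that
measures UTF-8 size against a fixed-width baseline. The helper
utf8_overhead and the sample word are my own assumptions for
illustration; the one-byte baseline stands for a hypothetical
single-byte code per Indian language, as the argument supposes.

```python
def utf8_overhead(text: str, baseline_bytes_per_char: int) -> float:
    """Percentage by which UTF-8 exceeds a fixed-width baseline encoding."""
    utf8_size = len(text.encode("utf-8"))
    baseline_size = len(text) * baseline_bytes_per_char
    return 100.0 * (utf8_size - baseline_size) / baseline_size


# "namaste" written in Devanagari: six code points in U+0900-U+097F.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"
english = "namaste"

print(utf8_overhead(english, 1))  # ASCII text vs. one byte per character
print(utf8_overhead(hindi, 1))    # Devanagari vs. hypothetical one-byte code
print(utf8_overhead(hindi, 2))    # Devanagari vs. two-byte Unicode codes
```

For this sample the English word shows 0% overhead, while the Devanagari
word comes out at 200% against the one-byte baseline and 50% against
two-byte codes, because every Devanagari code point takes three bytes in
UTF-8. That is above the 100% minimum quoted here.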
-- mks --
---
Visit our home page at: www.chennailug.org
Send e-mail to 'ilugc-request@xxxxxxxxxxxxxxxxxx' with 'unsubscribe'
in either the subject or the body to unsubscribe from this list.