
A look at UTF-8 std. for Indian Lang



A few days back I was in a discussion with Dr. Vineet Chaitanya (Lang. Tech.
Research Centre, IIIT-Hyd).  Only then did I come to know about some
strange facts about the UTF-8 standard.  Here are Dr. VC's statements about UTF-8:

o  UTF-8 not good for Indian languages
   -  0% overhead for English
   -  Possibly 10% overhead for European languages
   -  No overhead for CJK

   -->  100% overhead for IL  <--


	UTF-8 converts Unicode two-byte codes into byte sequences of one to
	four bytes. In the process it makes sure that the ASCII part of
	Unicode is transmitted as a single byte only. So for a language like
	English, and a few others that use only the 0-127 part of the code,
	there is no overhead. European languages need some character codes
	in the 128-255 region in addition to the 0-127 part, so we estimate
	that transmitting this portion incurs some overhead, say of the
	order of 10%. CJK has about 25000 codes, so the average information
	encoded per Unicode character code is much larger, and this
	compensates for the increased size of the byte sequence in their case.
	In contrast to the above cases, Indian languages use no part of the
	code in the 0-127 region, and moreover their character codes occupy
	fewer than 127 codes for each language. So what could have been
	transmitted in one byte will be transmitted in a sequence of two to
	four bytes; thus the minimum extra overhead will be 100%!
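
The figures above are estimates, so it may help to measure them. Below is a
minimal Python sketch that counts UTF-8 bytes per character and the overhead
relative to a fixed-width baseline. The sample strings, the function names and
the choice of a one-byte baseline are my own assumptions for illustration, not
part of Dr. VC's statement. On these samples each Devanagari code point comes
out at three bytes, so against a one-byte encoding the overhead is well above
the 100% minimum quoted above.

```python
# Sketch only: sample strings and helper names are illustrative assumptions.
# Strings are written with \u escapes so the code points are unambiguous.

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    return len(text.encode("utf-8")) / len(text)


def overhead_percent(text: str, baseline_bytes_per_char: int = 1) -> float:
    """Extra UTF-8 size, in percent, over a fixed-width baseline encoding."""
    baseline = len(text) * baseline_bytes_per_char
    return 100.0 * (len(text.encode("utf-8")) - baseline) / baseline


SAMPLES = {
    "English": "hello world",
    "German": "Gr\u00fc\u00dfe aus K\u00f6ln",           # u-umlaut, sharp s, o-umlaut
    "Chinese": "\u4f60\u597d\u4e16\u754c",                # four CJK ideographs
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",      # "namaste" in Devanagari
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(
            f"{name:8s} bytes/char = {utf8_bytes_per_char(text):.2f}  "
            f"overhead vs 1 byte/char = {overhead_percent(text, 1):.0f}%  "
            f"vs 2 bytes/char = {overhead_percent(text, 2):.0f}%"
        )
```

The second baseline (two bytes per character) is there because the result
depends heavily on what UTF-8 is compared against: a one-byte national
encoding or the two-byte Unicode codes mentioned above.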




-- mks --
