Saturday, July 3, 2010


Unicode, what is it?

Unicode is increasingly being accepted as a standard for information interchange worldwide, as most of the major IT companies have declared their support for it. The Unicode encoding for Indian languages is based on ISCII-88 and not on ISCII-91, which is the latest official standard. It was felt necessary that the Indian Government be represented in the Unicode Consortium to seek the necessary modifications to the code for Indian language scripts, and hence the Department of Information Technology became a full member of the Unicode Consortium with voting rights.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
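To make the conflict concrete, here is a minimal sketch in Python. It decodes the same single byte under three legacy encodings; the byte value 0xE9 and the three codecs are only illustrative choices, not something the text above prescribes.

```python
import unicodedata

# One byte value, as it might arrive in a file or over the network.
raw = bytes([0xE9])

# The same number means a different character in each legacy encoding.
as_latin1 = raw.decode("latin-1")    # Western European
as_cp1251 = raw.decode("cp1251")     # Cyrillic (Windows)
as_greek = raw.decode("iso8859_7")   # Greek

for codec, ch in [("latin-1", as_latin1),
                  ("cp1251", as_cp1251),
                  ("iso8859_7", as_greek)]:
    # Print the Unicode code point and name instead of the glyph itself,
    # so the output is readable on any terminal.
    print(f"{codec:10} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

If the receiver guesses the wrong encoding, the text is silently corrupted, which is exactly the risk described above.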

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.
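As a small illustration of "a unique number for every character", the sketch below uses Python's standard `unicodedata` module to show the code point and name of a few characters. The helper name `describe` and the sample characters are assumptions made for this example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# A Latin, a Devanagari and a Tamil letter, written as escapes
# so that the source file itself stays plain ASCII.
samples = ["A", "\u0905", "\u0B95"]

for ch in samples:
    print(describe(ch))
    # The mapping goes both ways: number -> character -> number.
    assert chr(ord(ch)) == ch
```

The numbers are the same on every platform and in every program; only the byte-level encoding form (UTF-8, UTF-16 and so on) varies.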

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

16 Bit (2 Byte) UNICODE 
The Unicode Standard is the universal character encoding standard used to represent text for computer processing. It provides the capacity to encode all of the characters used for the written languages of the world, and it gives information about each character and its use. The standard is very useful for computer users who deal with multilingual text, including business people, linguists, researchers, scientists, mathematicians and technicians. Unicode was originally designed as a 16-bit encoding, which provides code points for 65,536 characters. The Unicode Standard assigns each character a unique numeric value and name. The Unicode Standard and ISO/IEC 10646 provide an extension mechanism called UTF-16 that allows for encoding as many as a million additional characters. As of version 3.0, the Unicode Standard provided codes for 49,194 characters; later versions encode many more.
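The UTF-16 extension mechanism works by representing each character beyond the first 65,536 code points as a pair of 16-bit "surrogate" values. The following is a minimal sketch of that arithmetic; the function names `to_surrogates` and `from_surrogates` are made up for this example, and Python's built-in codec is used only as a cross-check.

```python
from typing import Tuple

def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Rebuild the code point from a surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert chr(0x1F600).encode("utf-16-be") == bytes.fromhex(f"{high:04X}{low:04X}")

# 10 bits + 10 bits give 1,048,576 extra code points: "as many as a million".
print(0x10FFFF - 0x10000 + 1)
```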

Character planes and blocks


The Unicode codespace is divided into seventeen planes, each comprising 65,536 code points or 256 rows of 256 code points:

Plane 0: Basic Multilingual Plane

Plane 1: Supplementary Multilingual Plane

Plane 2: Supplementary Ideographic Plane

Plane 3: Tentatively designated as the Tertiary Ideographic Plane (TIP), but no characters had been assigned to it at the time of writing.

Planes 4 to 13: currently unassigned

Plane 14: Supplementary Special-purpose Plane

Plane 15: Supplementary Private Use Area-A

Plane 16: Supplementary Private Use Area-B
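Because each plane holds exactly 65,536 (0x10000) code points, the plane of a character can be found by dropping the low 16 bits of its code point. The sketch below assumes the plane names listed above; the names `plane_of` and `plane_name` are hypothetical helpers for this example.

```python
# Plane names as listed above; plane 3 is shown as tentative and
# planes 4 to 13 fall through to "unassigned".
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    3: "Tertiary Ideographic Plane (tentative)",
    14: "Supplementary Special-purpose Plane",
    15: "Supplementary Private Use Area-A",
    16: "Supplementary Private Use Area-B",
}

def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return cp >> 16  # same as cp // 65536

def plane_name(cp: int) -> str:
    return PLANE_NAMES.get(plane_of(cp), "unassigned")

for cp in (0x0915, 0x1F600, 0x20000, 0x50000, 0x10FFFF):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, {plane_name(cp)}")

# Seventeen planes of 65,536 code points each.
print(17 * 0x10000)
```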



Unicode, policy for encoding characters

The Unicode Consortium has laid down a character encoding stability policy under which no character may be deleted and no character name may be changed; only annotations may be updated:
1. Once a character is encoded, it will not be moved or removed. 
2. Once a character is encoded, its character name will not be changed. 
3. Once a character is encoded, its canonical combining class and decomposition (either canonical or compatibility) will not be changed in a way that would affect normalization. 
4. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character. 
5. The structure of certain property values in the Unicode character database will not be changed. 
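The properties named in points 2 and 3 can be inspected with Python's standard `unicodedata` module. This is only a sketch of what the stability policy protects; it does not prove stability, and the sample characters are arbitrary choices for this example.

```python
import unicodedata

# Point 2: names are permanent, so looking a character up by name is safe.
dev_a = unicodedata.lookup("DEVANAGARI LETTER A")
print(f"U+{ord(dev_a):04X}")

# Point 3: canonical combining class and decomposition drive normalization.
acute = "\u0301"  # COMBINING ACUTE ACCENT
print(unicodedata.combining(acute))  # its canonical combining class

decomposed = "e" + acute             # two code points
composed = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in composed])

# NFD reverses the composition, giving back the original two code points.
assert unicodedata.normalize("NFD", composed) == decomposed
```

Since these values may not change in a way that affects normalization, text normalized under one version of the standard stays normalized under later versions.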

Unicode for some scripts

Devanagari (Newsletter Jan 2002 pdf) (For Devanagari & Devanagari based languages)

Gujarati, Malayalam (Newsletter April 2002 pdf) (For Gujarati & Malayalam)

Oriya, Gurmukhi & Telugu (Newsletter April 2002 pdf) (For Oriya, Gurmukhi & Telugu)

Bangla (Newsletter July 2002 pdf) (For Bangla & Bangla based languages)

Tamil, Kannada (Newsletter Oct 2002 pdf) (For Tamil & Kannada)

Arabic-Urdu, Sindhi, Kashmiri (Newsletter Oct 2002 pdf) (For Arabic-Urdu, Sindhi, Kashmiri)

Vedic (Newsletter Oct 2002 pdf) (For Vedic Sanskrit)

You must have the latest version of Acrobat Reader to read these PDFs
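For the Indian scripts listed above, each script occupies its own 128-code-point block in the Basic Multilingual Plane. Because the blocks follow the ISCII layout, corresponding letters sit at the same offset in every block. The sketch below illustrates this with the letter KA; the table `INDIC_BLOCKS` and the helper `block_of` are assumptions for this example, and the Unicode block name "Bengali" corresponds to Bangla in the list above.

```python
import unicodedata

# (block name, first code point, last code point)
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali",    0x0980, 0x09FF),
    ("Gurmukhi",   0x0A00, 0x0A7F),
    ("Gujarati",   0x0A80, 0x0AFF),
    ("Oriya",      0x0B00, 0x0B7F),
    ("Tamil",      0x0B80, 0x0BFF),
    ("Telugu",     0x0C00, 0x0C7F),
    ("Kannada",    0x0C80, 0x0CFF),
    ("Malayalam",  0x0D00, 0x0D7F),
]

KA_OFFSET = 0x15  # position of the letter KA inside each block

def block_of(ch: str) -> str:
    """Return the name of the Indic block containing ch, or 'other'."""
    cp = ord(ch)
    for name, first, last in INDIC_BLOCKS:
        if first <= cp <= last:
            return name
    return "other"

for name, first, _last in INDIC_BLOCKS:
    ka = chr(first + KA_OFFSET)
    print(f"U+{ord(ka):04X} {unicodedata.name(ka)}")
```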


