Unicode Standard Gets an Extreme Makeover

The standard that specifies how text is encoded in modern software products got an upgrade this week.

The Unicode Consortium said Wednesday it has published Version 4.0 of the Unicode Standard. The fundamental spec assigns a unique number to every character and is at the core of all modern software, programming languages, and standards, including Windows, Java, C#, Perl, XML, HTML, DB2, and Oracle.
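To make the "unique number for every character" idea concrete, here is a minimal Python sketch. The helper name `code_points` and the sample string are illustrative assumptions, not part of the standard, and Python ships a later Unicode version than 4.0.

```python
# Illustrative sketch: each character corresponds to one integer code point.
# The helper name and sample text are assumptions made for this example.

def code_points(text):
    """Return the code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Latin capital A, e with acute accent, and a CJK ideograph.
sample = "A\u00e9\u4e2d"
labels = code_points(sample)
print(labels)  # ['U+0041', 'U+00E9', 'U+4E2D']

# The mapping runs both ways: chr() turns a number back into its character.
round_trip = "".join(chr(ord(ch)) for ch in sample)
```

The same number identifies the same character regardless of platform or program, which is why the spec can sit underneath so many different products.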

The Mountain View, Calif.-based non-profit is backed by the likes of software giants Adobe Systems, Apple Computer, Basis Technology, Hewlett-Packard, IBM, Justsystem, Microsoft, Oracle, PeopleSoft, RLG, SAP, Sun Microsystems, and Sybase. The group also includes the Government of India (Ministry of Information Technology) and Government of Pakistan (National Language Authority).

The organization says the latest upgrade strengthens Unicode support for worldwide communication, software availability, and publishing. The text has been extensively rewritten, and incorporates specifications that were previously only available as separate documents.

Version 4.0 encodes over 96,000 characters (twice as many as Version 3.0) and includes two record-breaking collections of encoded characters: The largest encoded character collection for Chinese characters in the history of computing, and the largest set of characters for mathematical and technical publishing.

The organization says the spec also now contains more than 2,000 years of Chinese, Japanese, Korean, and Vietnamese literary usage, including all the main classical dictionaries of these languages. For math and tech languages, the character repertoire of Version 4.0 is now completely compliant with International Standard ISO/IEC 10646.

The Consortium also says the expanded capabilities of the Unicode Standard help close the “digital divide” because the standard is meant to meet the needs of all languages.

“Small linguistic communities all over the world have the opportunity to get mainstream software working right out of the box, instead of waiting years for special adaptations that may never come,” the group said in a statement.

The group says the Unicode Standard and its associated standards are continually being extended, not only through the addition of characters, but also by specifying how those characters work.

Challenges that have now been addressed include laying out text correctly for East Asian languages (e.g. vertically) and for Middle Eastern languages (from right to left). Version 4.0 also looks at how text should be converted to upper- or lowercase, how it should be broken into lines or words, and how it behaves in Regular Expressions (a key tool used in a vast number of Web servers).
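The character behaviors mentioned above can be seen from Python's standard library, which exposes some of the Unicode character properties. This is only a sketch: the sample strings are assumptions chosen for illustration, and Python tracks a newer Unicode version than 4.0.

```python
import re
import unicodedata

# Directionality: each character carries a bidirectional class.
# 'L' is left-to-right, 'R' and 'AL' are right-to-left classes.
latin_dir = unicodedata.bidirectional("A")         # Latin capital A
hebrew_dir = unicodedata.bidirectional("\u05d0")   # Hebrew letter alef
arabic_dir = unicodedata.bidirectional("\u0627")   # Arabic letter alef

# Case mapping is not always one character to one character:
# the German sharp s uppercases to two letters.
upper = "stra\u00dfe".upper()

# Regular expressions: in Python 3, \w matches Unicode word characters,
# so accented letters stay inside their words.
words = re.findall(r"\w+", "na\u00efve caf\u00e9")

print(latin_dir, hebrew_dir, arabic_dir)  # L R AL
print(upper)                              # STRASSE
print(words)                              # ['naïve', 'café']
```

Software uses properties like these, rather than per-language special cases, when it decides how to order, case, or match text.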

The latest version is published in book form by Addison-Wesley and is available from the Unicode Consortium or through the book trade. The text and code charts of Version 4.0 are also available on the Consortium’s Web site.
