RealTime IT News

W3C Advances XML 1.1 to Candidate Draft

More than four years after the World Wide Web Consortium (W3C) advanced Extensible Markup Language (XML) 1.0 -- the foundational technology of Web services -- as a recommendation, the standards organization is nearly ready to go forward with XML 1.1.

On Tuesday, the W3C released the XML 1.1 specification as a candidate recommendation, just a step away from release as an official recommendation. But the draft is not without controversy in the XML community, where many members argue that it breaks backward compatibility with XML 1.0 without good reason.

XML 1.0, with errata, has persisted for so long because its stability has proved useful for maintaining interoperability, according to the W3C's XML Core Working Group. But the current lineup of changes actually alters the definition of well-formed documents within XML, which the working group suggested necessitates advancing the specification to version 1.1. Detractors of the draft say some well-formed XML 1.0 documents will become malformed XML 1.1 documents.

Largely, the new changes revolve around bringing the XML specification up to speed with the evolution of Unicode, a standard for representing characters as integers. Unlike ASCII, which defines only 128 characters in 7 bits (usually stored in an 8-bit byte), Unicode was originally designed around 16 bits per character, which lets it represent more than 65,000 unique characters, and later versions extend the range further. The move is considered important in the increasingly globalized world of technology. While supporting 65,000 characters is unnecessary for English and most other western European languages, it is essential for representing languages like Amharic, Burmese, Canadian aboriginal languages, Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian (traditional script), Oromo, Syriac, Tigre, Yi, Chinese, Japanese, Korean (Hangul script), etc.
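To make "representing characters as integers" concrete, here is a minimal Python sketch. The helper name describe and the sample characters are illustrative choices, not anything from the W3C draft. It reports each character's code point and how many 16-bit units it occupies when encoded as UTF-16.

```python
import unicodedata

# Sample characters (illustrative): Latin A, an Ethiopic syllable used for
# Amharic, a Cherokee letter, and a code point beyond the 16-bit range.
SAMPLES = ["A", "\u1200", "\u13A0", "\U00010000"]


def describe(ch):
    """Return (code point, number of 16-bit units needed in UTF-16)."""
    code_point = ord(ch)
    # utf-16-be emits no byte-order mark, so every 2 bytes is one code unit.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return code_point, utf16_units


for ch in SAMPLES:
    cp, units = describe(ch)
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{cp:04X} {name}: {units} UTF-16 unit(s)")
```

Characters up to U+FFFF fit in a single 16-bit unit; anything above that needs two.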

When XML 1.0 was released, Unicode was on version 2.0. It has since progressed to version 3.2 and continues to evolve. While XML 1.0 does allow characters not present in Unicode 2.0 to be used in character data, they cannot be used in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, etc.

This is because XML 1.0 provides a rigid definition of names, forbidding everything that is not explicitly permitted. The candidate recommendation the W3C put on the table Tuesday takes an entirely different tack, permitting everything that is not forbidden. The design should let XML accept almost any character in names without the need for further changes.
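The "permit everything that is not forbidden" approach can be sketched in Python as a handful of broad code point ranges with gaps left for the excluded characters. The ranges below follow the NameStartChar production of the XML 1.1 Recommendation as eventually published; the assumption is that the candidate draft was substantially the same, though it may have differed in detail. The function name is_xml11_name_start_char is hypothetical.

```python
# NameStartChar ranges from the published XML 1.1 Recommendation.
# Assumption: the candidate draft discussed here may have differed slightly.
XML11_NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6),
    (0xD8, 0xF6),
    (0xF8, 0x2FF),
    (0x370, 0x37D),
    (0x37F, 0x1FFF),     # covers Ethiopic, Cherokee, Khmer, Mongolian, ...
    (0x200C, 0x200D),
    (0x2070, 0x218F),
    (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),  # includes code points not yet assigned
]


def is_xml11_name_start_char(ch):
    """Return True if ch may begin an XML 1.1 name under the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML11_NAME_START_RANGES)


print(is_xml11_name_start_char("\u13A0"))  # Cherokee letter A
print(is_xml11_name_start_char("1"))       # a digit cannot start a name
```

Because the ranges include code points that Unicode has not yet assigned, characters added in later Unicode versions become usable in names without revising the table, which is the point of the new design.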

But critics of the XML 1.1 draft argue that the working group is addressing a paper tiger.

"If this were true, it would be a very serious criticism of XML 1.0," an editorial on the XML news site Café con Leche noted Wednesday. "...Only the markup, that is, the tags, would have to be written in another script. Given that there aren't even localized operating systems in most of these languages, and that today's software effectively requires users to have a solid knowledge of at least the ASCII characters, I don't think the need to write markup (as opposed to text) in Cherokee justifies breaking backwards compatibility."

The editorial continued, "But wait! It's not even that bad. Several of the languages listed are total red herrings. You most certainly can write markup in Cantonese, Japanese, Korean, Mandarin, and Vietnamese today. The new characters Unicode has added to these scripts are very obscure. In fact, experts often disagree over whether some of them exist at all, or are merely typographical variations of existing characters. Since the 1700s Vietnamese has been written in a Latin-based alphabet that is fully available in XML and that can write any Vietnamese word. Vietnamese only uses the Han ideographs for classical documents and occasional signage or decoration, and it seems very unlikely that a Vietnamese speaker would write their markup using Han ideographs. Japanese has not one but two phonetic alphabets that can write any Japanese word if the right Han ideograph character is not encoded. Chinese speakers can use either Latin characters or the native Bopomofo phonetic system for the very rare cases where a character they need is not encoded. The fact is most native speakers of Chinese, Japanese, Korean and Vietnamese do not recognize the vast majority of these new characters, and the need for them in markup (again, as opposed to text) is non-existent."

Perhaps the most controversial change in XML 1.1 is one that specifically addresses a problem IBM's mainframe systems have in using the language. IBM's mainframes use a special character to designate the end of a line of text, but XML 1.0 does not recognize that character as a line end, forcing XML 1.0 documents generated on mainframes to either violate the local line-end conventions or go through translation phases before parsing and after generation. To fix this problem, XML 1.1 adds the character to the list of supported line-end characters (and also adds support for the Unicode line separator character).
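The difference in line-end handling can be sketched in Python as follows. The article does not name the mainframe character; this sketch assumes it is NEL (U+0085), which is what the published XML 1.1 Recommendation lists alongside LINE SEPARATOR (U+2028). The function names are hypothetical.

```python
import re

# XML 1.0: a CR LF pair, or a lone CR, is normalized to a single LF.
_XML10_LINE_END = re.compile("\r\n|\r")

# XML 1.1 (per the published Recommendation; assumed to match the draft):
# CR LF, CR NEL, lone NEL, LINE SEPARATOR, and lone CR all become LF.
# Two-character sequences come first so they are matched as a unit.
_XML11_LINE_END = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_ends_xml10(text):
    """Normalize line ends the way an XML 1.0 processor would."""
    return _XML10_LINE_END.sub("\n", text)


def normalize_line_ends_xml11(text):
    """Normalize line ends the way an XML 1.1 processor would."""
    return _XML11_LINE_END.sub("\n", text)


mainframe_text = "first line\x85second line"
print(repr(normalize_line_ends_xml10(mainframe_text)))
print(repr(normalize_line_ends_xml11(mainframe_text)))
```

Under the XML 1.0 rules the U+0085 character passes through untouched, so the two "lines" are seen as one; under the XML 1.1 rules it is turned into an ordinary line feed.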

The change has led to allegations in XML discussion circles that IBM is unduly using its power in the XML Working Group to serve its own ends and force others to conform to its needs.

"The concern with respect to IBM is that one of the world's largest corporations, with thousands of patents, legions of programmers, billions of dollars in revenue, and resources pouring out of every orifice is somehow unable to handle documents where lines end with carriage returns and line feeds, as documents do on every non-IBM system on the planet," the Café con Leche editorial said. "The only reason there's a problem here at all is because IBM tried to go it alone as a monopoly and set standards by fiat for years rather than working with the rest of the industry."

Finally, XML 1.1 adds support for character references to several previously forbidden control characters.
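As a rough illustration of the control-character change, the Python sketch below checks whether a numeric character reference such as &#x1; names a legal character under each version. The ranges follow the Char productions of the published XML 1.0 and XML 1.1 Recommendations, which is an assumption about the candidate draft. The function names are hypothetical.

```python
def is_char_xml10(cp):
    """Char production of XML 1.0: tab, LF, CR, and everything from U+0020 up,
    minus surrogates and U+FFFE/U+FFFF."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_char_xml11(cp):
    """Char production of XML 1.1: everything from U+0001 up (so the C0
    controls are included), still excluding U+0000, surrogates, U+FFFE/U+FFFF."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def char_ref_allowed(ref, version):
    """Return True if a reference like '&#27;' or '&#x1B;' is legal."""
    if not (ref.startswith("&#") and ref.endswith(";")):
        raise ValueError("not a numeric character reference: %r" % ref)
    body = ref[2:-1]
    # XML uses a lowercase "x" prefix for hexadecimal references.
    cp = int(body[1:], 16) if body.startswith("x") else int(body, 10)
    return is_char_xml11(cp) if version == "1.1" else is_char_xml10(cp)


for ref in ("&#x1;", "&#x1B;", "&#x0;", "&#9;"):
    print(ref, char_ref_allowed(ref, "1.0"), char_ref_allowed(ref, "1.1"))
```

Note that in the published XML 1.1 text most of these control characters are legal only when written as character references, not as literal bytes in the document, and U+0000 remains forbidden in both versions.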

The candidate recommendation will be open for comment through mid-February.