More than four years after the World Wide Web Consortium (W3C) advanced
Extensible Markup Language (XML) 1.0 — the foundational technology of Web
services — as a recommendation, the standards organization is nearly ready
to go forward with XML 1.1.
On Tuesday, the W3C released the XML 1.1
specification as a candidate recommendation, just a step away from
release as an official recommendation. But the draft is not without
controversy in the XML community, where many members argue that it
breaks backward compatibility with XML 1.0 without good reason.
XML 1.0, with errata, has persisted for so long because its stability has
proved useful for maintaining interoperability, according to the W3C’s XML
Core Working Group. But the current lineup of changes actually alters the
definition of well-formed documents within XML, a change the working group
suggested makes it necessary to advance the specification to version 1.1.
Detractors of the draft say some well-formed XML 1.0 documents will become
malformed XML 1.1 documents.
Largely, the new changes revolve around bringing the XML specification up
to speed with the evolution of Unicode, a standard for
representing characters as integers. Unlike ASCII, which fits each character
in a single 8-bit byte, Unicode was originally designed around 16-bit values,
which means that it can represent more than 65,000 unique characters. The move is considered
important in the increasingly globalized world of technology. While
supporting 65,000 characters is unnecessary for English and most other
western European languages, it is essential for representing languages like
Amharic, Burmese, Canadian aboriginal languages, Cantonese (Bopomofo
script), Cherokee, Dhivehi, Khmer, Mongolian (traditional script), Oromo,
Syriac, Tigre, Yi, Chinese, Japanese, Korean (Hangul script), etc.
When XML 1.0 was released, Unicode was on version 2.0. It has since
progressed to version 3.2 and continues to evolve. While XML 1.0 does allow
for characters not present in Unicode 2.0 to be used in character data,
they cannot be used in XML names like element type names, attribute names,
enumerated attribute values, processing instruction targets, etc.
This is because XML 1.0 provides a rigid definition of names, forbidding
everything that is not explicitly permitted. The candidate recommendation
the W3C put on the table Tuesday takes an entirely different tack, permitting
everything that is not forbidden. The design should let XML accept
almost any character in names without the need for further changes.
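The difference between the two approaches can be sketched in a few lines of Python. This is an illustration, not a validator: TOY_ALLOWED is a hypothetical, heavily shortened stand-in for the XML 1.0 tables, and the XML 1.1 ranges are written out from the name-start production of the published XML 1.1 specification as best recalled here (the candidate recommendation text may differ), so they should be checked against the specification before use.

```python
# Approach A (XML 1.0 style): only characters that are explicitly listed pass.
# TOY_ALLOWED is a hypothetical, shortened stand-in for the real XML 1.0 tables.
TOY_ALLOWED = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0x3041, 0x3094)]

# Approach B (XML 1.1 style): broad ranges that leave out only excluded
# characters, so code points assigned by later Unicode versions already pass.
# Assumption: ranges follow the XML 1.1 NameStartChar production; verify them.
XML11_NAME_START = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def in_ranges(ch, ranges):
    """Return True if the code point of ch falls inside any (low, high) pair."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def toy_allowlist_name_start(ch):
    """Strict check: reject anything not explicitly listed."""
    return in_ranges(ch, TOY_ALLOWED)


def xml11_name_start(ch):
    """Permissive check: accept anything inside the broad XML 1.1 ranges."""
    return in_ranges(ch, XML11_NAME_START)


cherokee_a = chr(0x13A0)  # CHEROKEE LETTER A, added to Unicode after version 2.0
print(toy_allowlist_name_start(cherokee_a), xml11_name_start(cherokee_a))
```

Under the strict list the Cherokee letter is rejected simply because nobody listed it, while the broad ranges accept it without any change to the rules.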
But critics of the XML 1.1 draft argue that the working group is addressing
a paper tiger.
“If this were true, it would be a very serious criticism of XML 1.0,” an
editorial on the XML news site Café con Leche noted Wednesday.
“…Only the markup, that is, the tags, would have to be written in another
script. Given that there aren’t even localized operating systems in most of
these languages, and that today’s software effectively requires users to
have a solid knowledge of at least the ASCII characters, I don’t think the
need to write markup (as opposed to text) in Cherokee justifies breaking
backwards compatibility.”
The editorial continued, “But wait! It’s not even that bad. Several of the
languages listed are total red herrings. You most certainly can write
markup in Cantonese, Japanese, Korean, Mandarin, and Vietnamese today. The
new characters Unicode has added to these scripts are very obscure. In
fact, experts often disagree over whether some of them exist at all, or are
merely typographical variations of existing characters. Since the 1700s
Vietnamese has been written in a Latin-based alphabet that is fully
available in XML and that can write any Vietnamese word. Vietnamese only
uses the Han ideographs for classical documents and occasional signage or
decoration, and it seems very unlikely that a Vietnamese speaker would
write their markup using Han ideographs. Japanese has not one but two
phonetic alphabets that can write any Japanese word if the right Han
ideograph character is not encoded. Chinese speakers can use either Latin
characters or the native Bopomofo phonetic system for the very rare cases
where a character they need is not encoded. The fact is most native
speakers of Chinese, Japanese, Korean and Vietnamese do not recognize the
vast majority of these new characters, and the need for them in markup
(again, as opposed to text) is non-existent.”
Perhaps the most controversial change in XML 1.1 is one that specifically
addresses a problem IBM’s mainframe systems have in using the language.
IBM’s mainframes use a special character to designate the end of a line of
text, but XML 1.0 can’t recognize the character, forcing XML 1.0 documents
generated on mainframes to either violate the local line-end conventions or
utilize translation phases before parsing and after generation. To fix this
problem, XML 1.1 adds the character to the list of supported line-end
characters (and also adds support for the Unicode line separator
character).
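The line-end change is easiest to see as a normalization step. The sketch below is a minimal illustration in Python, not parser code: the function names are hypothetical, and it assumes that the mainframe character is U+0085 (NEXT LINE) and that the set of sequences folded to a line feed matches the end-of-line rules in the XML 1.1 specification.

```python
import re

NEL = "\x85"    # U+0085 NEXT LINE, assumed here to be the mainframe line end
LS = "\u2028"   # U+2028 LINE SEPARATOR

# XML 1.0: only CR LF pairs and lone CRs are folded into a single LF.
_XML10_EOL = re.compile("\r\n|\r")

# XML 1.1 (assumed rule set): CR LF, CR NEL, NEL, LS and lone CR all become LF.
# Two-character sequences come first so they are consumed as one line end.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_xml10(text):
    """Fold line ends the way an XML 1.0 processor would (hypothetical helper)."""
    return _XML10_EOL.sub("\n", text)


def normalize_xml11(text):
    """Fold line ends using the wider XML 1.1 set (hypothetical helper)."""
    return _XML11_EOL.sub("\n", text)


mainframe_text = "first" + NEL + "second"
print(repr(normalize_xml10(mainframe_text)))  # NEL is left in place
print(repr(normalize_xml11(mainframe_text)))  # NEL becomes a line feed
```

With the XML 1.0 rules the mainframe line end passes through untouched and ends up in the parsed data, which is why a separate translation phase is needed there.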
The change has led to allegations in XML discussion circles that IBM is
unduly using its power in the XML Working Group to serve its own ends and
force others to conform to its needs.
“The concern with respect to IBM is that one of the world’s largest
corporations, with thousands of patents, legions of programmers, billions
of dollars in revenue, and resources pouring out of every orifice is
somehow unable to handle documents where lines end with carriage returns
and line feeds, as documents do on every non-IBM system on the planet,” the
Café con Leche editorial said. “The only reason there’s a problem here
at all is because IBM tried to go it alone as a monopoly and set standards
by fiat for years rather than working with the rest of the industry.”
Finally, XML 1.1 adds support for the use of character references to several
previously forbidden control characters.
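As a rough illustration of what that means for software that writes XML, the Python sketch below turns control characters into numeric character references. The function name is hypothetical, and the ranges assume the restricted-character definition in the XML 1.1 specification; the sketch does not escape markup characters such as "&" or "<".

```python
def escape_restricted(text):
    """Replace control characters with numeric character references.

    Hypothetical helper. Assumption: the ranges below mirror the restricted
    characters of XML 1.1, which may appear only as character references.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 is assumed to stay forbidden, even as a reference.
            raise ValueError("U+0000 cannot be represented in XML 1.1")
        restricted = (
            0x1 <= cp <= 0x8
            or cp in (0xB, 0xC)
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F
        )
        # Tab, line feed, carriage return and NEL fall outside the ranges
        # above and are passed through unchanged.
        out.append("&#x%X;" % cp if restricted else ch)
    return "".join(out)


print(escape_restricted("bell\x07 and escape\x1b"))
```

An XML 1.0 document cannot carry such characters at all, so a reference like the ones produced here makes a document malformed under the older rules.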
The candidate recommendation will be open for comment through mid-February.