Does XML Suck?
Or: Why XML is Technologically Terrible, but You Have to Use It Anyway
Aaron Crane
aaron.crane@gbdirect.co.uk
GBdirect Ltd.
http://xmlsucks.org/
14 May 2002
1. What is XML?
- “XML is a giant step in no direction at all.” (Erik
Naggum)
- XML purports to be a simple, vendor-neutral textual external
representation for hierarchically-structured data
- Reasonably accurate — except for the simplicity bit
- Any arbitrary data can be expressed with a hierarchical
representation
- So XML can represent any data whatsoever
- The big win claimed for XML:
- One piece of software can process any XML document, and
therefore any data
- Thus making XML ideal for data interchange
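The "one piece of software for any document" claim can be illustrated with a minimal Python sketch. The function name `element_paths` and both sample documents are made up for illustration; the point is only that the same code walks two unrelated vocabularies without knowing either.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the slash-separated path of every element, in document order."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.append(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return paths


# Two unrelated (hypothetical) vocabularies, handled by the same code.
invoice = "<invoice><line><sku>A1</sku></line></invoice>"
playlist = "<playlist><track/><track/></playlist>"

print(element_paths(invoice))
print(element_paths(playlist))
```

Generic code like this knows the structure of the data, not its meaning; deciding what a `sku` or a `track` is for remains the application's job.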
2. What XML Looks Like
3. Parts of an XML Document
- <?xml version="1.0"?> is an XML declaration
- <email>python@different.org</email> is an element
- <email> is the element’s start tag
- </email> is the element’s end tag
- The text python@different.org is the element’s content
- date="2002-05-14" in the memo start tag is an attribute of the memo
element
- There’s exactly one top-level element in a document: the root
element
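The parts listed above can be picked out programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the memo text is a reconstruction assumed from the fragments quoted on these slides, not the original listing.

```python
import xml.etree.ElementTree as ET

# Reconstructed memo: an assumption based on the fragments quoted in the slides.
MEMO = """<?xml version="1.0"?>
<memo date="2002-05-14">
  <from><name>Monty Python</name><email>python@different.org</email></from>
  <to><name>Frank Sinatra</name><email>chairman@board.net</email></to>
  <message>Stop that, it's silly.</message>
</memo>"""

root = ET.fromstring(MEMO)        # the single top-level (root) element
email = root.find("from/email")   # one element inside it

print("root element:", root.tag)
print("attribute date:", root.attrib["date"])
print("element:", email.tag, "content:", email.text)
```

Note that the parser exposes elements, attributes and content, but not the start and end tags themselves; those are purely syntax.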
4. Well-Formedness and Validity
- Every XML document is well-formed (by definition)
- Means essentially that all the elements are properly nested within
each other, without overlapping
- Some XML documents are also valid
- A valid document declares a Document Type Definition (DTD) and
conforms to it
- The DTD can impose additional constraints on the (logical) structure
of the document:
- What elements are permissible
- What attributes they can have
- What elements and other content a given element can contain
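A well-formedness check is easy to sketch in Python, because the standard `xml.etree.ElementTree` parser rejects anything that is not well-formed. The helper name `is_well_formed` is made up for this example. Checking validity against a DTD is a separate matter: the standard-library parser is non-validating, so that would need a validating parser from elsewhere.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>text</b></a>"))  # properly nested elements
print(is_well_formed("<a><b>text</a></b>"))  # overlapping elements
print(is_well_formed("<a/><b/>"))            # two top-level elements
```

Only the first of the three inputs is accepted.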
5. XML, SGML, HTML, …
- People with experience of HTML may think XML looks like a souped-up
version
- If only it were that simple
- First came SGML — the Standard Generalized Markup Language
- SGML describes a way of defining markup languages (‘SGML
applications’) that can be processed by an SGML processor
- HTML is an SGML application designed for doing simple hypertext
- XML is like SGML, but with some of the worst excrescences removed
- Every XML document is also an SGML document (kind of)
- XHTML is an XML syntax for HTML
6. Problems with XML
- Verbosity
- XML documents are frustratingly and unnecessarily verbose for human
authors
- Also implies high storage and bandwidth costs
- Complexity
- What little XML offers is more complex than it could be
- Oversimplification
- XML per se is too simplistic to handle what
people actually need
- This leads to a huge number of other technologies to work around
deficiencies in XML
- All of them put together are insanely more complex than a
reasonable solution would have been
7. Verbosity
- XML is hideously verbose for humans — just like SGML
- The earlier memo example could be written as a plain text email:
Date: Tue, 14 May 2002 11:58:41 +0100 (BST)
From: Monty Python <python@different.org>
To: Frank Sinatra <chairman@board.net>
Stop that, it's silly
- Needs a special-purpose processing tool
- Doesn’t encode as much information
- Or you could use an alternative bracketed notation:
<memo <date 2002-05-14>
<from <name Monty Python><email python@different.org>>
<to <name Frank Sinatra><email chairman@board.net>>
<message Stop that, it's silly.>>
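To put a rough number on the difference, the following Python sketch renders an XML document in the bracketed notation and compares lengths. The function name `to_bracketed` is made up, the memo is a reconstruction assumed from the fragments on these slides, and treating attributes as child entries is an assumption based on how `date` appears in the bracketed example.

```python
import xml.etree.ElementTree as ET


def to_bracketed(element):
    """Render an element as '<tag ...>' with attributes written as children."""
    parts = [f"<{name} {value}>" for name, value in element.attrib.items()]
    text = (element.text or "").strip()
    if text:
        parts.append(text)
    # Text following a child element (its tail) is ignored in this sketch.
    parts.extend(to_bracketed(child) for child in element)
    return f"<{element.tag} " + "".join(parts) + ">"


# Reconstructed memo, written without whitespace between elements.
xml_memo = (
    '<memo date="2002-05-14">'
    "<from><name>Monty Python</name><email>python@different.org</email></from>"
    "<to><name>Frank Sinatra</name><email>chairman@board.net</email></to>"
    "<message>Stop that, it's silly.</message>"
    "</memo>"
)

bracketed = to_bracketed(ET.fromstring(xml_memo))
print(bracketed)
print(len(xml_memo), "characters as XML,", len(bracketed), "bracketed")
```

The saving comes almost entirely from not repeating each element name in an end tag.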
8. Complexity: Attributes and Content
9. Complexity: Parsing
- One of the design goals of XML was that XML processing programs be
easy to write
- The working group wanted CS graduates to be able to write an XML
processor in a week
- Quite clear that XML does not meet this goal
- Many items of XML syntax not mentioned here: empty element tags,
DOCTYPE declarations, comments, CDATA
sections, parsed entities, processing instructions, default attribute
values, DTD syntax, parameter entities, external/internal DTD
subsets, …
- Extremely hard to find an XML parser which combines all of:
- Completeness (including validation)
- Correctness
- Efficiency
10. Oversimplification
- The base XML specification explicitly refuses to consider any issues
except data representation
- For example, it has nothing to say about:
- Appropriate ways of processing XML documents
- How to constrain and validate an XML document in ways that can’t be
described with DTDs
- Combining content from multiple XML vocabularies into a single
document
- Creating links from one XML document to another
- Extracting bits of information from an XML document
- How to transform one XML document into another
- Rendering XML data using existing technologies
11. Oversimplification Causes Acronym Proliferation
- Many other people have noticed these limitations in XML
- And apparently they’ve all written at least one add-on
specification for XML…
- Learning the over-complex XML itself isn’t enough for using XML to
conduct your business
- Also have to acquire familiarity with some or all of a bewildering
array of additional technologies
- Some of which overlap in scope or even contradict each other
12. Processing Technologies
- Two semi-standard interfaces for XML processors (parsers) to present
to applications
- DOM (Document Object Model)
- The parser reads the entire document into an in-memory tree
structure
- It provides a wide variety of methods for examining or changing
individual parts of the tree
- SAX (Simple API for XML)
- A streaming model: the parser reads the data bit by bit
- It tells the application whenever it finds a meaningful unit
(start tag, end tag, chunk of text, etc.)
- Also DOM2, SAX2 (Because One Is Never Enough)
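The two models can be compared side by side with Python's standard library, which ships `xml.dom.minidom` and `xml.sax`. This is a minimal sketch: the sample document and the `EventLogger` class are made up for illustration.

```python
import xml.dom.minidom
import xml.sax

DOC = "<memo><to>Frank</to><message>Stop that</message></memo>"

# DOM: the whole document becomes an in-memory tree that can be queried.
dom = xml.dom.minidom.parseString(DOC)
message = dom.getElementsByTagName("message")[0]
dom_text = message.firstChild.data


# SAX: the parser calls back for each meaningful unit as it reads.
class EventLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may deliver one run of text in several chunks.
        self.events.append(("text", content))


handler = EventLogger()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_text)
for event in handler.events:
    print(event)
```

With DOM the application asks questions of a finished tree; with SAX it has to keep whatever state it needs between callbacks, which is the price of not holding the whole document in memory.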
13. Improved Validity Checking
- DTDs really suck
- Non-XML syntax, so you can’t manipulate them with XML tools
- Extremely limited in what constraints they can specify
- Incredibly complex rules for how non-validating parsers
should handle DTDs
- Approximately 71,386 alternative validity-checking technologies:
- Schemas (pretty popular)
- RELAX (Regular Language Description for XML; obsolete)
- RELAX NG (Next Generation)
- TREX (Tree Regular Expressions for XML)
- Schematron (has a really cool name)
- But don’t forget: only DTDs will let you use entity references
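The dependence of entity references on DTDs can be seen with a small Python sketch using the standard `xml.etree.ElementTree` parser. The entity name `company` and both documents are made up for illustration: the same reference is expanded when a DTD internal subset declares the entity, and rejected when nothing declares it.

```python
import xml.etree.ElementTree as ET

# The entity is declared in a DTD internal subset.
WITH_DTD = """<!DOCTYPE memo [
  <!ENTITY company "GBdirect Ltd.">
]>
<memo>From &company;</memo>"""

# The same reference with no declaration anywhere.
WITHOUT_DTD = "<memo>From &company;</memo>"

expanded = ET.fromstring(WITH_DTD).text
print(expanded)

try:
    ET.fromstring(WITHOUT_DTD)
    rejected = False
except ET.ParseError as error:
    rejected = True
    print("rejected:", error)
```

None of the alternative schema languages listed above offers a replacement for that `<!ENTITY ...>` declaration, so a document that wants its own entities still carries a DTD fragment.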
14. Combining Multiple Vocabularies
15. Linking Between XML Documents
- XLink: <a href="...">...</a> on acid
- A quick example:
<my:crossReference
xmlns:my="http://xmlsucks.org/xmlns/my"
my:lastEdited="2000-06-10"
xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:type="simple"
xlink:href="students.xml">
Current List of Students
</my:crossReference>
- This is broadly equivalent to the HTML:
<a href="students.xml">Current List of Students</a>
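Reading the link back out of the XLink version takes namespace-aware code. The sketch below uses Python's standard `xml.etree.ElementTree`, which writes qualified names as `{namespace-uri}local-name`; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
MY_NS = "http://xmlsucks.org/xmlns/my"

SNIPPET = """<my:crossReference
    xmlns:my="http://xmlsucks.org/xmlns/my"
    my:lastEdited="2000-06-10"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="simple"
    xlink:href="students.xml">
  Current List of Students
</my:crossReference>"""

link = ET.fromstring(SNIPPET)

# Attributes are looked up by namespace URI, not by the xlink: prefix.
href = link.get(f"{{{XLINK_NS}}}href")
link_type = link.get(f"{{{XLINK_NS}}}type")
label = link.text.strip()

print(link.tag)
print(link_type, href, label)
```

The prefixes `my:` and `xlink:` are only local abbreviations; code that matches on the prefix text rather than the namespace URI breaks as soon as a document chooses different prefixes.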
16. Data Extraction
- XPath lets you find specific parts of an XML document, assuming you
know something of its structure
- Used by several other important standards
- Not unreasonable in its more basic usages:
/doc/chapter[5]/section[2]
- XQuery is similar but, unfortunately, not quite the same
- An entire programming language
- The bastard offspring of XPath, SQL, and a host of earlier XML
querying technologies
- But at least it uses the word ‘atomization’ as a technical
term
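The basic XPath expression shown above can be tried with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. In this sketch the document is generated on the spot, and the leading `/doc` step is dropped because the search already starts at the `doc` root element.

```python
import xml.etree.ElementTree as ET

# Build a made-up document: five chapters with three sections each.
doc = ET.Element("doc")
for c in range(1, 6):
    chapter = ET.SubElement(doc, "chapter")
    for s in range(1, 4):
        section = ET.SubElement(chapter, "section")
        section.text = f"chapter {c}, section {s}"

# Equivalent of /doc/chapter[5]/section[2]; positions count from 1.
match = doc.find("chapter[5]/section[2]")
print(match.text)  # chapter 5, section 2
```

Anything beyond such simple paths (functions, arbitrary axes, XQuery) is outside what this standard-library subset handles.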
17. Transformation
- Extremely common to need to transform one XML document into
another
- Exchange invoices, purchase orders, etc. as XML documents
conforming to a standardised vocabulary
- But leave your internal data in some company-specific format
- Two options for transformations:
- Write a program in your language of choice that uses a SAX or DOM
parser and manipulates the document as necessary
- Or if you feel masochistic, use XSLT (XSL Transformations)
- Simple transformations in XSLT are trivial (if obscenely
verbose)
- More complex ones can be absurdly hard
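The first option, a program in a general-purpose language, might look like the following Python sketch. Both vocabularies are hypothetical: `order`, `cust` and `item` stand in for a company-specific format, and `PurchaseOrder` and its children stand in for a standardised one.

```python
import xml.etree.ElementTree as ET


def to_purchase_order(internal_xml):
    """Convert the (hypothetical) internal order format to the exchange format."""
    source = ET.fromstring(internal_xml)
    target = ET.Element("PurchaseOrder")
    ET.SubElement(target, "OrderID").text = source.get("id")
    ET.SubElement(target, "Buyer").text = source.findtext("cust")
    lines = ET.SubElement(target, "Lines")
    for item in source.findall("item"):
        line = ET.SubElement(lines, "Line")
        ET.SubElement(line, "SKU").text = item.get("sku")
        ET.SubElement(line, "Quantity").text = item.get("qty")
    return ET.tostring(target, encoding="unicode")


internal = '<order id="42"><cust>ACME</cust><item sku="A1" qty="2"/></order>'
print(to_purchase_order(internal))
```

The mapping is ordinary code: it can be tested, debugged and extended with the usual tools, which is much of the argument for this option over XSLT.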
18. Rendering
- CSS (Cascading Style Sheets) can be used directly with XML documents
- Don’t forget to specify a display: property for each element in the
document
- Alternatively, transform your XML document into something else
- XSLT was designed for translating arbitrary XML documents into
XSL-FO (XSL Formatting Objects) documents
- XSL-FO is a fairly complete XML-based document rendering
language
- XSL-FO processing applications typically handle conversion into
HTML and/or PDF
19. (Some of) The Rest
- A few selected XML standards (but spot the two fakes…)
- CML (Chemical Markup Language), DC (Dublin Core metadata), ebXML (XML
for e-business), MathML, RDF (Resource Description Framework), SMIL
(Synchronized Multimedia Integration Language), SVG (Scalable Vector
Graphics), VoiceXML, ADS, DISCO, DSD (Document Structure Description),
SOAP (Simple Object Access Protocol), UDDI (Universal Description,
Discovery, and Integration), WDDL, WRDL (Web Resource Description
Language), WSCL (Web Services Conversation Language), X2EE, WSDL (Web
Services Description Language), WSIL (Web Services Inspection Language),
XBase, XFL (XML Framework Language), XForms, XInclude, XKMS (XML Key
Management Specification), XML Signature, XML-RPC, XPointer, XQL (XML
Query Language), XML-QL (XML Query Language, again), Quilt, XSL
(Extensible Stylesheet Language), …
20. Other Thoughts
- “Doing more than skimming the XML specs would require far longer
than I have; and they’ve also now fallen through my good strong 19th
century floor and killed several innocent bystanders in the floors below
before finally coming to rest, smoking, embedded in the bedrock a few
hundred yards under my flat.” (Tim Bradshaw)
- “Structure is nothing if it is all you’ve got. Skeletons spook
people if they try to walk around on their own; I really wonder why XML
does not.” (Erik Naggum)
21. Doing It Anyway
- XML qua technology seems to be a bad joke
- But you have to do it anyway
- Two main reasons:
- The technological problems can be managed
- More importantly: everyone else is doing it
22. Managing the Technological Problems
- The verbosity is merely an annoyance
- XML’s one big advantage is that most of the innate complexity can be
dealt with once and for all
- The hard part is the complexity-through-oversimplification
- Even here, the situation is improving
- It’s becoming more obvious which XML technologies are really
important and which are dead ends
- Programming tools like Perl’s SAX Machines are extremely
powerful
23. A Bigger Boy Made Me Do It
- Huge numbers of businesses already rely on XML
- Not least because it is mandated by governments and by large
corporations at the top of industry-wide supply chains
- All the evidence suggests that XML will become ubiquitous (if not
unavoidable) in the near future
- XML is already being used to accomplish hard tasks — XSL-FO and
SAX Machines come to mind
- Though adopting XML technologies implies significant costs for
businesses, the costs of not using XML may well be bigger
24. Conclusions
- XML does suck technologically, but:
- The technological problems can be managed
- The political and commercial advantages of working with the same
open standard as everyone else are enormous
- XML is here to stay