Coping with XML
Aaron Crane
aaron.crane@gbdirect.co.uk
GBdirect Ltd.
http://xmlsucks.org/
26 September 2002
1. Overview
- What is XML?
- Problems with XML
- You have to use it anyway
- Tools to cope with XML
2. What is XML?
- XML purports to be a simple, vendor-neutral textual external
representation for hierarchically-structured data
- Reasonably accurate — except for the simplicity bit
- Any arbitrary data can be expressed with a hierarchical
representation
- So XML can represent any data whatsoever
- The big win claimed for XML:
- One piece of software can process any XML document, and
therefore any data
- Thus making XML ideal for data interchange
3. What XML Looks Like
4. Parts of an XML Document
- <?xml version="1.0"?> is an XML declaration
- <email>python@different.org</email> is an element
- <email> is the element’s start tag
- </email> is the element’s end tag
- The text python@different.org is the element’s content
- date="2002-05-14" in the memo start tag is an attribute of the memo
element
- There’s exactly one top-level element in a document: the root
element
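A minimal sketch of how those parts look to a program, using Python's standard xml.etree.ElementTree module. The memo document here is an assumption: it is rebuilt from the bracketed notation on the Verbosity slide, not copied from the original example.

```python
import xml.etree.ElementTree as ET

# Reconstructed memo (an assumption, rebuilt from the bracketed notation
# shown on the Verbosity slide).
MEMO = """<?xml version="1.0"?>
<memo date="2002-05-14">
  <from><name>Monty Python</name><email>python@different.org</email></from>
  <to><name>Frank Sinatra</name><email>chairman@board.net</email></to>
  <message>Stop that, it's silly.</message>
</memo>"""

root = ET.fromstring(MEMO)          # the root element
memo_date = root.get("date")        # an attribute of the memo element
email = root.find("from/email")     # an element nested inside <from>
email_content = email.text          # that element's content

print(root.tag, memo_date, email.tag, email_content)
```

Note that the XML declaration and the start and end tags never reach the application as text; the parser hands over element names, attributes and content.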
5. Well-Formedness and Validity
- Every XML document is well-formed (by definition)
- Means essentially that all the elements are properly nested within
each other, without overlapping
- Some XML documents are also valid
- A valid document declares a Document Type Definition (DTD) and
conforms to it
- The DTD can impose additional constraints on the (logical) structure
of the document:
- What elements are permissible
- What attributes they can have
- What elements and other content a given element can contain
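A rough illustration of the well-formedness half, assuming Python's standard xml.etree.ElementTree parser. The helper name is_well_formed is hypothetical. This parser does not validate against a DTD, so the sketch says nothing about validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Properly nested elements.
nested = "<memo><to><name>Frank Sinatra</name></to></memo>"
# <name> is opened inside <to> but closed outside it: overlapping elements.
overlapping = "<memo><to><name>Frank Sinatra</to></name></memo>"
# Two top-level elements: there must be exactly one root.
two_roots = "<memo/><memo/>"

for sample in (nested, overlapping, two_roots):
    print(is_well_formed(sample), sample)
```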
6. XML, SGML, HTML, …
- People with experience of HTML may think XML looks like a souped-up
version
- If only it were that simple
- First came SGML — the Standard Generalized Markup Language
- SGML describes a way of defining markup languages (‘SGML
applications’) that can be processed by an SGML processor
- HTML is an SGML application designed for doing simple hypertext
- XML is like SGML, but with some of the worst excrescences removed
- Every XML document is also an SGML document (kind of)
- XHTML is an XML syntax for HTML
7. Problems with XML
- Verbosity
- XML documents are frustratingly and unnecessarily verbose for human
authors
- Also implies high costs for storage and bandwidth (in the absence
of compression)
- Complexity
- What little XML offers is more complex than it could be
- Oversimplification
- XML per se is too simplistic to do what people
actually need
- This leads to a huge number of other technologies to work around
deficiencies in XML
- All of them put together are insanely more complicated than a
reasonable solution would have been
8. Verbosity
- XML is hideously verbose for humans — like SGML, but worse
- The earlier memo example could be written as a plain text email:
Date: Tue, 14 May 2002 11:58:41 +0100 (BST)
From: Monty Python <python@different.org>
To: Frank Sinatra <chairman@board.net>
Stop that, it's silly
- Needs a special-purpose processing tool
- Doesn’t encode as much information
- Or you could use an alternative bracketed notation, such as this:
<memo <date 2002-05-14>
<from <name Monty Python><email python@different.org>>
<to <name Frank Sinatra><email chairman@board.net>>
<message Stop that, it's silly.>>
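The plain-text email version needs a special-purpose processing tool. A sketch of what that looks like, assuming Python's standard email package. The blank line between headers and body is an assumption added here, since the mail format requires one.

```python
from email import message_from_string
from email.utils import parseaddr

# The memo as a plain-text email; the blank line separating headers
# from body is assumed.
EMAIL = (
    "Date: Tue, 14 May 2002 11:58:41 +0100 (BST)\n"
    "From: Monty Python <python@different.org>\n"
    "To: Frank Sinatra <chairman@board.net>\n"
    "\n"
    "Stop that, it's silly\n"
)

msg = message_from_string(EMAIL)
sender_name, sender_addr = parseaddr(msg["From"])   # split name and address
body = msg.get_payload().strip()

print(sender_name, sender_addr, body)
```

The format is compact, but the parser knows about mail headers and nothing else; a generic XML tool would be no use here.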
9. Complexity
- Three main areas of unnecessary complexity in XML:
- No guidance on data design — should you use attributes or
content?
- Ridiculously hard to parse
- No clear view of what XML is meant to accomplish
- Is it for humans or machines?
10. Data Design
11. Attributes or Content?
- Many people try to use content for ‘data’ and attributes for
‘metadata’
- So metadata is never structured?
- Others, recognising that problem, tend to use content universally
- For them, attributes clutter up the standard without offering any
benefit
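One practical consequence, sketched with Python's standard xml.etree.ElementTree: the same fact needs different access code depending on which way the author jumped. Both sample documents and the helper memo_date are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same date, once as an attribute and once as element content.
as_attribute = ET.fromstring(
    '<memo date="2002-05-14"><message>Stop that</message></memo>')
as_content = ET.fromstring(
    '<memo><date>2002-05-14</date><message>Stop that</message></memo>')

def memo_date(memo):
    """Return the memo date, whichever encoding the author chose."""
    value = memo.get("date")            # attribute style
    if value is None:
        value = memo.findtext("date")   # content style
    return value

print(memo_date(as_attribute), memo_date(as_content))
```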
12. Parsing
- One of the design goals of XML was that XML processing programs be
easy to write
- The working group wanted a typical Computer Science graduate to be
able to write an XML processor in a week
- Quite clear that XML does not meet this goal
- Many items of XML syntax not mentioned here: empty element tags,
attribute value normalisation, comments, CDATA sections, parsed
entities, processing instructions, default attribute values, parameter
entities, external/internal DTD subsets, …
- Hard to find an XML parser which combines all of:
- Completeness (including validation)
- Correctness
- Run-time efficiency
13. Humans and Machines
- Many of my complaints are about the suitability of XML for
humans
- Some people have countered that “XML is for programs, not
humans”
- Debatable
- Design goal that “XML documents should be
human-legible”
- Some physical structures that can only be useful for human
authors: general entities, empty-element tags
- Too much syntactic latitude for truly simple programs
- But too little for human authors
- Lack of clear goals makes it hard to decide whether and how XML
should be used in your organisation
- XML is optimised neither for humans nor for computers
14. Oversimplification
- The base XML specification explicitly refuses to consider any issues
except data representation
- For example, it has nothing to say about:
- Appropriate ways of processing XML documents
- How to constrain and validate an XML document in ways that can’t be
described with DTDs
- Combining content from multiple XML vocabularies into a single
document
- Extracting bits of information from an XML document
- How to transform one XML document into another
- Rendering XML data using existing technologies
15. Oversimplification Causes Acronym Proliferation
- Many other people have noticed these limitations in XML
- And apparently they’ve all written at least one add-on
specification for XML…
- Learning the over-complex XML itself isn’t enough for using XML to
conduct your business
- Also have to acquire familiarity with some or all of a bewildering
array of additional technologies
- Some of which overlap in scope or even contradict each other
16. Processing Technologies
- Two semi-standard interfaces for XML processors (parsers) to present
to applications
- DOM (Document Object Model)
- The parser reads the entire document into an in-memory tree
structure
- It provides a wide variety of methods for examining or changing
individual parts of the tree
- SAX (Simple API for XML)
- A streaming model: the parser reads the data bit by bit
- It tells the application whenever it finds a meaningful unit
(start tag, end tag, chunk of text, etc.)
- Also DOM2, SAX2 (Because One Is Never Enough)
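The two models side by side, as a sketch using Python's standard library bindings (xml.dom.minidom and xml.sax). The sample document and the EventLogger class are hypothetical.

```python
import xml.sax
from xml.dom import minidom

DOC = "<memo><to>Frank Sinatra</to><message>Stop that</message></memo>"

# DOM: the whole document becomes an in-memory tree that we then walk.
dom = minidom.parseString(DOC)
recipient = dom.getElementsByTagName("to")[0].firstChild.data

# SAX: the parser calls back as it meets each meaningful unit.
class EventLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A run of text may arrive in more than one chunk.
        self.events.append(("text", content))

logger = EventLogger()
xml.sax.parseString(DOC.encode("utf-8"), logger)

print(recipient)
for event in logger.events:
    print(event)
```

The DOM version is easier to write against; the SAX version never holds more than the current event, which matters for large documents.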
17. Improved Validity Checking
- DTDs really suck
- Non-XML syntax, so you can’t manipulate them with XML tools
- Extremely limited in what constraints they can specify
- Incredibly intricate rules for how non-validating parsers
should handle DTDs
- Approximately 71,384 alternative validity-checking technologies:
- W3C Schemas (pretty popular)
- RELAX (Regular Language Description for XML; obsolete)
- RELAX-NG (Next Generation)
- TREX (Tree Regular Expressions for XML)
- Schematron (has a really cool name)
- But don’t forget: only DTDs will let you use entity references
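A sketch of that last point, assuming Python's standard xml.etree.ElementTree parser: an entity reference resolves only when a DTD (here an internal subset) declares the entity. The entity name sender is hypothetical.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the entity, so &sender; can be expanded.
WITH_DTD = """<!DOCTYPE memo [
  <!ENTITY sender "Monty Python">
]>
<memo><from>&sender;</from></memo>"""

# Same document without the declaration.
WITHOUT_DTD = "<memo><from>&sender;</from></memo>"

expanded = ET.fromstring(WITH_DTD).findtext("from")

try:
    ET.fromstring(WITHOUT_DTD)
    undeclared_accepted = True
except ET.ParseError:
    # The parser rejects a reference to an entity nobody declared.
    undeclared_accepted = False

print(expanded, undeclared_accepted)
```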
18. Combining Multiple Vocabularies
19. Data Extraction
- XPath lets you find specific parts of an XML document, assuming you
know something of its structure
- Used by several other important standards
- Not unreasonable in its more basic usages:
/doc/chapter[5]/section[2]
- XQuery is similar but, unfortunately, not quite the same
- An entire programming language
- The bastard offspring of XPath, SQL, and a host of earlier XML
querying technologies
- But at least it uses the word ‘atomization’ as a technical
term
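The basic XPath usage above, sketched with Python's standard xml.etree.ElementTree. That module implements only a subset of XPath and does not accept absolute paths on an element, so the search below is relative to the doc element. The document contents are made up for the example.

```python
import xml.etree.ElementTree as ET

# Build a hypothetical document: five chapters of three sections each.
doc = ET.Element("doc")
for c in range(1, 6):
    chapter = ET.SubElement(doc, "chapter")
    for s in range(1, 4):
        section = ET.SubElement(chapter, "section")
        section.text = f"chapter {c}, section {s}"

# Equivalent of /doc/chapter[5]/section[2], relative to the root element.
# XPath positions count from 1, not 0.
hit = doc.find("chapter[5]/section[2]")
print(hit.text)
```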
20. Transformation
- Extremely common to need to transform one XML document into
another
- Exchange invoices, purchase orders, etc. as XML documents
conforming to a standardised vocabulary
- But leave your internal data in some company-specific format
- Two options for transformations:
- Write a program in your language of choice that uses a SAX or DOM
parser and manipulates the document as necessary
- Or if you feel masochistic, use XSLT (XSL Transformations)
- Simple transformations in XSLT are trivial (if obscenely
verbose)
- More interesting ones can be absurdly hard
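A sketch of the first option, with Python's standard xml.etree.ElementTree standing in for a DOM-style tree API. Both vocabularies (the internal order format and the PurchaseOrder format) and the function name are invented for illustration; a real standardised vocabulary would be considerably larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical company-specific format.
INTERNAL = """<order ref="A-17">
  <line sku="X100" qty="2"/>
  <line sku="Y200" qty="1"/>
</order>"""

def to_purchase_order(internal_xml):
    """Rewrite the internal order format as a (made-up) exchange format."""
    src = ET.fromstring(internal_xml)
    po = ET.Element("PurchaseOrder")
    ET.SubElement(po, "OrderNumber").text = src.get("ref")
    items = ET.SubElement(po, "Items")
    for line in src.findall("line"):
        item = ET.SubElement(items, "Item")
        # Attributes in the source become element content in the target.
        ET.SubElement(item, "ProductCode").text = line.get("sku")
        ET.SubElement(item, "Quantity").text = line.get("qty")
    return ET.tostring(po, encoding="unicode")

result = to_purchase_order(INTERNAL)
print(result)
```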
21. Rendering
- CSS (Cascading Style Sheets) can be used directly with XML documents
- Don’t forget to specify a display: property for each element in the
document
- Alternatively, transform your XML document into something else
- XSLT was designed for translating arbitrary XML documents into
XSL-FO (XSL Formatting Objects) documents
- XSL-FO is a fairly complete XML-based document rendering
language
- Would that any of the available implementations were equally
complete…
- XSL-FO processing applications typically handle conversion into
HTML and/or PDF
22. (Some of) The Rest
- A few selected XML standards (but spot the two fakes…)
- BEEP (also called BXXP), CML (Chemical Markup Language), DC (Dublin
Core metadata), ebXML (XML for e-business), MathML, RDF (Resource
Description Framework), SMIL (Synchronized Multimedia Integration
Language), SVG (Scalable Vector Graphics), VoiceXML, ADS, DISCO, DSD
(Document Structure Description), SOAP (Simple Object Access Protocol),
UDDI (Universal Description, Discovery, and Integration), WRDL (Web
Resource Description Language), WSCL (Web Services Conversation
Language), X2EE, WSDL (Web Services Description Language), WSIL (Web
Services Inspection Language), XBase, XFL (XML Framework Language),
XForms, XInclude, XKMS (XML Key Management Specification), XML
Signature, XML-RPC, XPointer, XQL (XML Query Language), XML-QL (XML
Query Language, again), Quilt, XSL (Extensible Stylesheet
Language), …
23. Doing It Anyway
- XML qua technology seems to be a bad joke
- But you have to do it anyway
- Two main reasons:
- It’s (just) good enough technologically
- More importantly: everyone else is doing it
24. Worse is Better
- XML has many flaws
- But the aesthetic appeal of the technology is comparatively
uninteresting
- In practice, XML’s flaws can be lived with
25. A Bigger Boy Made Me Do It
- Huge numbers of businesses already rely on XML
- Not least because it is mandated by governments and by large
corporations at the top of industry-wide supply chains
- All the evidence suggests that XML will become ubiquitous (if not
unavoidable) in the near future
- XML is already being used to accomplish hard tasks — XSL-FO and
SAX Machines come to mind
- Though adopting XML technologies implies significant costs for
businesses, the costs of not using XML may well be bigger
26. How to Cope with XML
- The technological problems can be managed
- The verbosity is merely an annoyance
- XML’s one big advantage is that most of the innate complexity can be
dealt with once and for all
- The hard part is the complexity-through-oversimplification
- Even here, the situation is improving
- It’s becoming more obvious which XML technologies are really
important and which are dead ends
- Several specific tools and technologies for getting stuff done
27. Choosing Technologies
- Which (if any) XML add-ons should you use?
- Namespaces are vital for combining data from disparate sources
- Reused over and over in the useful technologies
- DTDs are a waste of effort except in certain specific situations
- Good to let human authors use DTD-aware editing tools
- Stick to W3C Schemas for validation
- XSLT is fairly flexible and conceptually elegant
- But your staff will need functional-programming experience to do
interesting transformations
- And the syntax still bites…
- What domain-specific XML vocabularies and standards are used in your
field?
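To make the namespace recommendation concrete, a sketch using Python's standard xml.etree.ElementTree. Two vocabularies each define a title element, and namespaces keep them apart in one document. The namespace URIs and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# One document mixing two (made-up) vocabularies that both use <title>.
DOC = """<memo xmlns="http://example.org/memo"
      xmlns:inv="http://example.org/invoice">
  <title>Stop that</title>
  <inv:invoice><inv:title>Invoice 42</inv:title></inv:invoice>
</memo>"""

# Prefixes used in queries are local to this program; only the URIs matter.
NS = {"m": "http://example.org/memo", "inv": "http://example.org/invoice"}

root = ET.fromstring(DOC)
memo_title = root.findtext("m:title", namespaces=NS)
invoice_title = root.findtext("inv:invoice/inv:title", namespaces=NS)

print(root.tag)
print(memo_title, invoice_title)
```

The parser reports each element name qualified by its namespace URI, which is why an unprefixed search for title would find nothing here.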
28. Choosing Fundamental Libraries
- Avoid most of the Java tools unless you have a burning desire to buy
lots of very fast hardware
- Use Gnome’s libxml as the underlying parser in your programs
- Written in fast, portable C
- Reasonable DOM- and SAX-like interfaces
- Bindings for many popular languages (Python, Perl, etc.)
- Reads legacy HTML documents as XHTML
- Similarly, use Gnome’s libxslt as an XSLT engine
- Screamingly fast — especially when compared to the well-known
Java equivalents
- Also written in C, with bindings available for the language of your
choice
29. Additional Tools
- Barrie Slaymaker’s SAX-Machines library for Perl
- Reminiscent of the Unix pipeline/filter approach
- Write small, simple processors using the SAX interface
- Connect them together in arbitrary ways using SAX-Machines
- Similar tools include AxKit, a web application framework using
Apache and mod_perl
- XSL-FO looks extremely promising as a way of automating document
production, but:
- Currently only two open-source implementations: one in Java, one in
(of all things) TeX
- Both incomplete and somewhat slow
- Probably advisable to find an alternative in the short term
30. Conclusions
- XML does suck technologically, but:
- The technological problems can be managed
- The political and commercial advantages of working with the same
open standard as everyone else are enormous
- There are open-source and Unix-friendly tools which actually help you
get your work done in an XML-besotted world, including:
- Gnome libxml and libxslt
- SAX-Machines or similar systems
- XML is here to stay