Doing Useful Work with XML and Open-Source Software
Aaron Crane
aaron.crane@gbdirect.co.uk
GBdirect Ltd.
http://xmlsucks.org/
Presented at LinuxWorld 2003
1. Overview
- What is XML?
- Problems with XML
- You have to use it anyway
- Tools to cope with XML
2. What is XML?
- XML purports to be a simple, vendor-neutral textual external
representation for hierarchically-structured data
- Reasonably accurate — except for the simplicity bit
- Any arbitrary data can be expressed with a hierarchical
representation
- So XML can represent any data whatsoever
- The big win claimed for XML:
- One piece of software can process any XML document, and
therefore any data
- Thus making XML ideal for data interchange
3. What XML Looks Like
4. Parts of an XML Document
- <?xml version="1.0"?> is an XML declaration
- <email>python@different.org</email> is an element
- <email> is the element’s start tag
- </email> is the element’s end tag
- The text python@different.org is the element’s content
- date="2003-01-22" in the memo start tag is an attribute of the memo element
- There’s exactly one top-level element in a document
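The parts named above can be picked out programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The memo document is a hypothetical reconstruction: only the declaration, the email element and the date attribute come from the slide text, and the rest of the original example is not known.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo: only the XML declaration, the <email> element and the
# date attribute are taken from the slide; the overall shape is an assumption.
MEMO = """<?xml version="1.0"?>
<memo date="2003-01-22">
  <email>python@different.org</email>
</memo>"""

root = ET.fromstring(MEMO)       # the single top-level element
email = root.find("email")       # a child element of memo

top_level_name = root.tag        # element name from the start tag
memo_date = root.get("date")     # attribute on the memo start tag
email_content = email.text       # content between start tag and end tag
```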
5. Well-Formedness and Validity
- Every XML document is well-formed (by definition)
- Means essentially that all the elements are properly nested within
each other, without overlapping
- Some XML documents are also valid
- A valid document declares a Document Type Definition (DTD) and
conforms to it
- The DTD can impose additional constraints on the (logical) structure of
the document:
- What elements are permissible
- What attributes they can have
- What elements and other content a given element can contain
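A well-formedness check can be sketched in a few lines of Python. This assumes the standard xml.etree.ElementTree parser, which rejects text that is not well-formed but does not validate against a DTD; checking validity would need a validating parser. The helper name is_well_formed and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Illustrative inputs (not from the original slides).
nested = "<memo><to>Bruce</to></memo>"        # properly nested
overlapping = "<memo><to>Bruce</memo></to>"   # elements overlap
two_roots = "<memo/><memo/>"                  # more than one top-level element

results = [is_well_formed(s) for s in (nested, overlapping, two_roots)]
```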
6. Problems with XML
- Verbosity
- XML documents are frustratingly and unnecessarily verbose for human
authors
- Also implies high costs for storage and bandwidth (in the absence
of compression)
- Complexity
- What little XML offers is more complex than it could be
- Oversimplification
- XML per se is too simplistic to do
what people actually need
- This leads to a huge number of other technologies to work around
deficiencies in XML
- All of them put together are vastly more complicated than a
reasonable solution would have been
7. Verbosity
- XML is hideously verbose for humans — like SGML, but worse
- The earlier memo example could be written as a plain text email
- Would be less verbose
- But needs a special-purpose processing tool
- Also doesn’t encode as much information
- Many alternative bracketed notations have been proposed
- Avoid repeating the element name in the closing bracket
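To give a feel for the kind of bracketed notation meant here, the sketch below converts an element tree into a made-up S-expression-like form in which the closing bracket does not repeat the element name. The notation itself is hypothetical (it is not one of the published proposals), and the converter ignores text that follows a child element.

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem: ET.Element) -> str:
    """Render an element as (name @attr="value" "text" children...)."""
    parts = [elem.tag]
    parts += [f'@{key}="{value}"' for key, value in elem.attrib.items()]
    if elem.text and elem.text.strip():
        parts.append(f'"{elem.text.strip()}"')
    parts += [to_sexpr(child) for child in elem]
    return "(" + " ".join(parts) + ")"

XML_FORM = '<memo date="2003-01-22"><email>python@different.org</email></memo>'
BRACKET_FORM = to_sexpr(ET.fromstring(XML_FORM))
```

The bracketed form is shorter than the XML form mainly because each element name appears once rather than twice.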
8. Complexity
- Three main areas of unnecessary complexity in XML:
- No guidance on data design — should you use attributes or
content?
- Ridiculously hard to parse
- No clear view of what XML is meant to accomplish
- Is it for humans or machines?
9. Data Design
10. Attributes or Content?
- Many people try to use content for ‘data’ and attributes for
‘metadata’
- So metadata is never structured?
- Others, recognizing that problem, tend to use content universally
- For them, attributes clutter up the standard without offering any
benefit
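The structural difference behind this argument can be shown with a small sketch. An attribute value is a flat string, so any structure inside it has to be decoded by the application, while element content can nest. Both documents below are hypothetical encodings of the same date, parsed with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# The same (hypothetical) metadata, once as an attribute, once as content.
as_attribute = ET.fromstring('<memo date="2003-01-22"/>')
as_content = ET.fromstring(
    "<memo><date><year>2003</year><month>01</month><day>22</day></date></memo>"
)

# Attribute: the parser hands back one string; the application splits it.
year_from_attribute = as_attribute.get("date").split("-")[0]

# Content: the structure is visible to the parser and can be addressed directly.
year_from_content = as_content.findtext("date/year")
```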
11. Parsing
- One of the design goals of XML was that XML processing programs be
easy to write
- The working group wanted a typical Computer Science graduate to be
able to write an XML processor in a week
- Quite clear that XML does not meet this goal
- Many items of XML syntax not mentioned here
- Hard to find an XML parser which combines all of:
- Completeness (including validation)
- Correctness
- Run-time efficiency
12. Humans and Machines
- Many of my complaints are about the suitability of XML for
humans
- Some people have countered that “XML is for programs, not
humans”
- Debatable
- Design goal that “XML documents should be
human-legible”
- Some physical structures that can only be useful for human
authors: general entities, empty-element tags
- Too much syntactic latitude for truly simple programs
- But too little for human authors
- Lack of clear goals makes it hard to decide whether and how XML
should be used in your organization
- XML is optimized neither for humans nor for computers
13. Oversimplification
- The base XML specification explicitly refuses to consider any issues
except data representation
- For example, it has nothing to say about:
- Appropriate ways of processing XML documents
- How to constrain and validate an XML document in ways that can’t be
described with DTDs
- Combining content from multiple XML vocabularies into a single
document
- How to transform one XML document into another
- Rendering XML data using existing technologies
14. Oversimplification Causes Acronym Proliferation
- The existence of these limitations is old news
- So others have written add-on specifications to deal with XML’s
deficiencies
- Learning the over-complex XML itself isn’t enough to use XML in
your business
- Also have to acquire familiarity with some or all of a bewildering
array of additional technologies
- Some of which overlap in scope or even contradict each other
15. Processing Technologies
- Two semi-standard interfaces for XML processors (parsers) to present
to applications
- DOM (Document Object Model)
- The parser reads the entire document into an in-memory tree
structure
- It provides a wide variety of methods for examining or changing
individual parts of the tree
- SAX (Simple API for XML)
- A streaming model: the parser reads the data bit by bit
- It tells the application whenever it finds a meaningful unit
(start tag, end tag, chunk of text, etc.)
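The two models can be compared side by side. This is a minimal sketch using Python's standard xml.dom.minidom and xml.sax modules as stand-ins for DOM and SAX implementations in general; the sample document and the EventLog class name are illustrative.

```python
import xml.sax
from xml.dom import minidom

DOC = '<memo date="2003-01-22"><email>python@different.org</email></memo>'

# DOM: the whole document becomes an in-memory tree that can be queried.
dom = minidom.parseString(DOC)
email_node = dom.getElementsByTagName("email")[0]
dom_email = email_node.firstChild.data

# SAX: the parser calls back as it meets each meaningful unit.
class EventLog(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Text may be delivered in more than one chunk.
        self.events.append(("text", content))

handler = EventLog()
xml.sax.parseString(DOC.encode("utf-8"), handler)

tag_events = [event for event in handler.events if event[0] != "text"]
sax_text = "".join(data for kind, data in handler.events if kind == "text")
```

With DOM the application asks questions of a finished tree; with SAX it has to keep whatever state it needs as the events go by.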
16. Improved Validity Checking
- DTDs are extremely problematic
- Non-XML syntax, so you can’t manipulate them with XML tools
- Extremely limited in what constraints they can specify
- Incredibly intricate rules for how non-validating parsers
should handle DTDs
- Many alternative validity-checking technologies:
- W3C XML Schema (probably the most widespread)
- RELAX (Regular Language Description for XML; obsolete)
- RELAX NG (Next Generation)
- TREX (Tree Regular Expressions for XML)
- Schematron
- But only DTDs will let you use entity references
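The dependence of entity references on DTDs can be seen in a short sketch. It assumes Python's standard xml.etree.ElementTree parser, which expands general entities declared in an internal DTD subset; the entity name company and both documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# The entity is declared in an internal DTD subset, so the reference resolves.
WITH_DTD = """<!DOCTYPE memo [
  <!ENTITY company "GBdirect Ltd.">
]>
<memo>From &company;</memo>"""

# Same reference with no declaration anywhere: not well-formed.
WITHOUT_DTD = "<memo>From &company;</memo>"

expanded = ET.fromstring(WITH_DTD).text

try:
    ET.fromstring(WITHOUT_DTD)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
```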
17. Combining Multiple Vocabularies
18. Transformation
- Extremely common to need to transform one XML document into
another
- Exchange invoices, purchase orders, etc. as XML documents
conforming to a standardized vocabulary
- But leave your internal data in some company-specific format
- Two options for transformations:
- Write a program in your language of choice that uses a SAX or DOM
parser and manipulates the document as necessary
- Use XSLT: XSL Transformations
- Simple transformations in XSLT are trivial (if obscenely
verbose)
- More interesting ones can be absurdly hard
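The first option, a program in your language of choice, might look like the sketch below. It uses Python's standard xml.etree.ElementTree (a tree API in the spirit of DOM). Both vocabularies are invented for illustration: the terse internal format and the Invoice vocabulary are assumptions, not real standards.

```python
import xml.etree.ElementTree as ET

# Hypothetical company-specific format.
INTERNAL = '<inv no="42" cust="ACME"><ln sku="W1" qty="3"/><ln sku="W2" qty="1"/></inv>'

def to_standard(internal_root: ET.Element) -> ET.Element:
    """Build a document in a (made-up) interchange vocabulary."""
    invoice = ET.Element("Invoice")
    ET.SubElement(invoice, "Number").text = internal_root.get("no")
    ET.SubElement(invoice, "Customer").text = internal_root.get("cust")
    lines = ET.SubElement(invoice, "Lines")
    for ln in internal_root.findall("ln"):
        line = ET.SubElement(lines, "Line")
        ET.SubElement(line, "Item").text = ln.get("sku")
        ET.SubElement(line, "Quantity").text = ln.get("qty")
    return invoice

standard = to_standard(ET.fromstring(INTERNAL))
standard_text = ET.tostring(standard, encoding="unicode")
```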
19. Rendering
- CSS (Cascading Style Sheets) can be used directly with XML
documents
- Alternatively, transform your XML document into something else
- XSLT was designed for translating arbitrary XML documents into
XSL-FO (XSL Formatting Objects) documents
- XSL-FO is a fairly complete XML-based document rendering
language
- Would that any of the available implementations were equally
complete…
- XSL-FO processing applications typically handle conversion into
HTML and/or PDF
- Translating to non-XML markup languages for typesetting (like LaTeX)
can be difficult
20. Doing It Anyway
- XML qua technology almost seems to be a
bad joke
- But you have to do it anyway
- Two main reasons:
- It’s (just) good enough technologically
- More importantly: everyone else is doing it
21. Worse is Better
- XML has many flaws
- But the æsthetic appeal of the technology is comparatively
uninteresting
- In practice, XML’s flaws can be lived with
22. A Bigger Boy Made Me Do It
- Huge numbers of businesses already rely on XML
- Not least because it is mandated by governments and by large
corporations at the top of industry-wide supply chains
- All the evidence suggests that XML will become ubiquitous (if not
unavoidable) in the near future
- XML is already being used to accomplish hard tasks — XSL-FO and
SAX Machines come to mind
- Though adopting XML technologies implies significant costs for
businesses, the costs of not using XML may well be bigger
23. How to Cope with XML
- The technological problems can be managed
- The verbosity is merely an annoyance
- XML’s one big advantage is that most of the innate complexity can be
dealt with once and for all
- The hard part is the complexity-through-oversimplification
- Even here, the situation is improving
- It’s becoming more obvious which XML technologies are really
important and which are dead ends
- Several specific tools and technologies for getting stuff done
24. Choosing Technologies
- Which (if any) XML add-ons should you use?
- Namespaces are vital for combining data from disparate sources
- Reused over and over in the useful technologies
- DTDs are a waste of effort except in certain specific situations
- Good to let human authors use DTD-aware editing tools
- Stick to W3C XML Schema for validation
- XSLT is fairly flexible and conceptually elegant
- But your staff will need functional-programming experience to do
interesting transformations
- The awkward syntax makes life harder
- What domain-specific XML vocabularies and standards are used in your
field?
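As an illustration of the namespaces point above, the sketch below combines two vocabularies that both define a name element. Namespaces keep the two apart. The namespace URIs and element names are hypothetical, and the parsing uses Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies in one document; both have a <name> element.
DOC = """<order xmlns="http://example.com/ns/order"
       xmlns:addr="http://example.com/ns/address">
  <name>Widget</name>
  <addr:name>GBdirect Ltd.</addr:name>
</order>"""

# Prefixes here are local to the query; only the URIs have to match.
NS = {
    "o": "http://example.com/ns/order",
    "a": "http://example.com/ns/address",
}

root = ET.fromstring(DOC)
product_name = root.findtext("o:name", namespaces=NS)
recipient_name = root.findtext("a:name", namespaces=NS)
```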
25. Choosing Fundamental Libraries
- Avoid most of the Java tools unless you have a burning desire to buy
lots of very fast hardware
- Use GNOME’s libxml as the underlying parser in your programs
- Written in fast, portable C
- Reasonable DOM- and SAX-like interfaces
- Bindings for many popular languages (Python, Perl, etc.)
- Reads legacy HTML documents as XHTML
- Similarly, use GNOME’s libxslt as an XSLT engine
- Screamingly fast — especially when compared to the well-known
Java equivalents
- Also written in C, with bindings available for the language of your
choice
26. Additional Tools
- Barrie Slaymaker’s SAX-Machines library for Perl
- Strongly reminiscent of the Unix pipeline/filter approach
- Write small, simple processors using the SAX interface
- Connect them together in arbitrary ways using SAX-Machines
- Similar tools include AxKit, a web application framework using
Apache and mod_perl
- XSL-FO looks extremely promising as a way of automating document
production, but:
- Currently only two open-source implementations: one in Java, one in
(of all things) TeX
- Both incomplete and somewhat slow
- Probably advisable to find an alternative in the short term
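The pipeline/filter idea behind SAX-Machines can be sketched without Perl. The example below is not the SAX-Machines API; it uses Python's standard xml.sax.saxutils.XMLFilterBase as a stand-in to show small SAX processors chained together. The filter classes, the run_pipeline helper and the sample document are all illustrative.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class RenameElement(XMLFilterBase):
    """Pass events through, renaming one element."""
    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old, self._new = old, new

    def _map(self, name):
        return self._new if name == self._old else name

    def startElement(self, name, attrs):
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))

class UppercaseText(XMLFilterBase):
    """Pass events through, upper-casing character data."""
    def characters(self, content):
        super().characters(content.upper())

def run_pipeline(text: str) -> str:
    out = io.StringIO()
    # parser -> RenameElement -> UppercaseText -> serializer
    stage = xml.sax.make_parser()
    stage = RenameElement(stage, "email", "address")
    stage = UppercaseText(stage)
    stage.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    stage.parse(io.BytesIO(text.encode("utf-8")))
    return out.getvalue()

result = run_pipeline("<memo><email>python@different.org</email></memo>")
```

Each filter knows nothing about the others, so stages can be reordered or reused, much as with commands in a Unix pipeline.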
27. Conclusions
- XML does have a number of technological problems, but:
- The technological problems can be managed
- The political and commercial advantages of working with the same
open standard as everyone else are enormous
- There are open-source and Linux-friendly tools which actually help you
get your work done in an XML-besotted world, including:
- GNOME libxml and libxslt
- SAX-Machines or similar systems
- XML is here to stay