<?xml version="1.0"?>
<!DOCTYPE talk SYSTEM "module.dtd" [
<!ELEMENT talk (title, sub?, date?, author*, affiliation*, location*, slide+)>
<!ELEMENT sub (title)>
<!ELEMENT author (#PCDATA|code|a)*>
<!ELEMENT date (#PCDATA)>
<!ELEMENT location (#PCDATA)>
<!ELEMENT affiliation (#PCDATA|a|img)*>
]>

<talk>
 <title>coping with XML</title>
 <date>26 September 2002</date>
 <author>Aaron Crane</author>
 <author><code>aaron.crane@gbdirect.co.uk</code></author>
 <affiliation><a href="http://training.gbdirect.co.uk/courses/xml/">GBdirect Ltd.</a></affiliation>
 <affiliation><a>http://xmlsucks.org/</a></affiliation>

 <slide>
  <title>overview</title>
  <ul>
   <li>What is XML?</li>
   <li>Problems with XML</li>
   <li>You have to use it anyway</li>
   <li>Tools to cope with XML</li>
  </ul>
 </slide>

 <slide>
  <title>what is XML?</title>
  <ul>
   <li>XML purports to be a simple, vendor-neutral textual external
    representation for hierarchically-structured data
    <ul>
     <li>Reasonably accurate &dash; except for the simplicity bit</li>
    </ul>
   </li>
   <li>Arbitrary data can be expressed with a hierarchical
    representation
    <ul>
     <li>So XML can represent any data whatsoever</li>
    </ul>
   </li>
   <li>The big win claimed for XML:
    <ul>
     <li>One piece of software can process <em>any</em> XML document, and
      therefore any data
      <ul>
       <li>Yeah, right</li>
      </ul>
     </li>
     <li>Thus making XML ideal for data interchange</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>what XML looks like</title>
  <ul>
   <li>A sample XML document:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <memo date="2002-05-14">
      <from>
       <name>Monty Python</name><email>python@different.org</email>
      </from>
      <to>
       <name>Frank Sinatra</name><email>chairman@board.net</email>
      </to>
      <message>
       Stop that, it's silly.
      </message>
     </memo>
    ]]></pre>
   </li>
  </ul>
 </slide>

 <slide>
  <title>parts of an XML document</title>
  <ul>
   <li><code><![CDATA[<?xml version="1.0"?>]]></code> is an <dfn>XML
     declaration</dfn></li>
   <li><code><![CDATA[<email>python@different.org</email>]]></code> is an
    <dfn>element</dfn>
    <ul>
     <li><code>&lt;email&gt;</code> is the element's <dfn>start tag</dfn></li>
     <li><code>&lt;/email&gt;</code> is the element's <dfn>end tag</dfn></li>
     <li>The text <code>python@different.org</code> is the element's
      <dfn>content</dfn></li>
    </ul>
   </li>
   <li><code>date="2002-05-14"</code> in the <code>memo</code> start tag is
    an <dfn>attribute</dfn> of the <code>memo</code> element</li>
   <li>There's exactly one top-level element in a document &dash; the
    <dfn>root element</dfn></li>
  </ul>
 </slide>

 <slide>
  <title>well-formedness and validity</title>
  <ul>
   <li>Every XML document is <dfn>well-formed</dfn> (by definition)
    <ul>
     <li>Means essentially that all the elements are properly nested within
      each other, without overlapping</li>
    </ul>
   </li>
   <li>Some XML documents are also <dfn>valid</dfn></li>
   <li>A valid document declares a Document Type Definition
    (<dfn>DTD</dfn>) and conforms to it</li>
   <li>The DTD can impose additional constraints on the (logical) structure
    of the document:
    <ul>
     <li>What elements are permissible</li>
     <li>What attributes they can have</li>
     <li>What elements and other content a given element can contain</li>
    </ul>
   </li>
  </ul>
 </slide>
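
 <slide>
  <title>a DTD for the memo</title>
  <ul>
   <li>A minimal sketch of a DTD for the earlier memo, assuming that every
    memo has exactly one sender, one recipient and one message (the sample
    doesn't actually say so):
    <pre><![CDATA[
     <?xml version="1.0"?>
     <!DOCTYPE memo [
      <!-- Hypothetical content models, inferred from the sample memo -->
      <!ELEMENT memo    (from, to, message)>
      <!ATTLIST memo    date CDATA #REQUIRED>
      <!ELEMENT from    (name, email)>
      <!ELEMENT to      (name, email)>
      <!ELEMENT name    (#PCDATA)>
      <!ELEMENT email   (#PCDATA)>
      <!ELEMENT message (#PCDATA)>
     ]>
     <memo date="2002-05-14">
      ...
     </memo>
    ]]></pre>
   </li>
   <li>Against that DTD, a memo lacking its <code>date</code> attribute or
    its <code>to</code> element is still well-formed, but no longer
    valid</li>
   <li>The DTD can't insist that <code>date</code> looks like a date:
    <code>CDATA</code> accepts any text</li>
  </ul>
 </slide>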

 <slide>
  <title>XML, SGML, HTML, &ellipsis;</title>
  <ul>
   <li>People with experience of HTML may think XML looks like a souped-up
    version
    <ul>
     <li>If only it were that simple</li>
    </ul>
   </li>
   <li>First came SGML &dash; the Standard Generalized Markup Language
    <ul>
     <li>SGML describes a way of defining markup languages (&lq;SGML
      applications&rq;) that can be processed by an SGML processor</li>
    </ul>
   </li>
   <li>HTML is an SGML application designed for doing simple hypertext</li>
   <li>XML is like SGML, but with some of the worst excrescences removed
    <ul>
     <li>Every XML document is also an SGML document (kind of)</li>
    </ul>
   </li>
   <li>XHTML is an XML syntax for HTML</li>
  </ul>
 </slide>

 <slide>
  <title>problems with XML</title>
  <ol>
   <li>Verbosity
    <ul>
     <li>XML documents are frustratingly and unnecessarily verbose for human
      authors</li>
     <li>Also implies high costs for storage and bandwidth (in the absence
      of compression)</li>
    </ul>
   </li>
   <li>Complexity
    <ul>
     <li>What little XML offers is more complex than it could be</li>
    </ul>
   </li>
   <li>Oversimplification
    <ul>
     <li>XML <cite>per&nbsp;se</cite> is too simplistic to do what people
      actually need</li>
     <li>This leads to a huge number of other technologies to work around
      deficiencies in XML</li>
     <li>All of them put together are insanely more complicated than a
      reasonable solution would have been</li>
    </ul>
   </li>
  </ol>
 </slide>

 <slide>
  <title>verbosity</title>
  <ul>
   <li>XML is hideously verbose for humans &dash; like SGML, but worse</li>
   <li>The earlier memo example could be written as a plain text email:
    <pre><![CDATA[
     Date: Tue, 14 May 2002 11:58:41 +0100 (BST)
     From: Monty Python <python@different.org>
     To: Frank Sinatra <chairman@board.net>

     Stop that, it's silly.
    ]]></pre>
    <ul>
     <li>But that needs a special-purpose processing tool</li>
     <li>And it doesn't encode as much information</li>
    </ul>
   </li>
   <li>Or you could use an alternative bracketed notation, such as this:
    <pre><![CDATA[
     <memo <date 2002-05-14>
      <from <name Monty Python><email python@different.org>>
      <to <name Frank Sinatra><email chairman@board.net>>
      <message Stop that, it's silly.>>
    ]]></pre>
   </li>
  </ul>
 </slide>

 <slide>
  <title>complexity</title>
  <ul>
   <li>Three main areas of unnecessary complexity in XML:
    <ol>
     <li>No guidance on data design &dash; should you use attributes or
      content?</li>
     <li>Ridiculously hard to parse</li>
     <li>No clear view of what XML is meant to accomplish
      <ul>
       <li>Is it for humans or machines?</li>
      </ul>
     </li>
    </ol>
   </li>
  </ul>
 </slide>

 <slide>
  <title>data design</title>
  <ul>
   <li>Designing XML data structures is harder than it ought to be
    <ul>
     <li>Unnatural distinction between element content and attributes</li>
    </ul>
   </li>
   <li>An easy example:
    <pre><![CDATA[
     <invoice id="12345">...</invoice>
     <invoice><id value="12345"></id>...</invoice>
     <invoice><id>12345</id>...</invoice>
     <element name="invoice" id="12345">...</element>
     <element><name>invoice</name><id>12345</id>...</element>
     <element>
       <name>invoice</name>
       <attribute name="id">12345</attribute>
       ...
     </element>
    ]]></pre>
   </li>
  </ul>
 </slide>

 <slide>
  <title>attributes or content?</title>
  <ul>
   <li>Many people try to use content for &lq;data&rq; and attributes for
    &lq;metadata&rq;
    <ul>
     <li>So metadata is never structured?</li>
    </ul>
   </li>
   <li>Others, recognising that problem, tend to use content universally
    <ul>
     <li>For them, attributes clutter up the standard without offering any
      benefit</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>parsing</title>
  <ul>
   <li>One of the design goals of XML was that XML processing programs be
    easy to write
    <ul>
     <li>The working group wanted a typical Computer Science graduate to be
      able to write an XML processor in a week</li>
    </ul>
   </li>
   <li>Quite clear that XML does not meet this goal
    <ul>
     <li>Many items of XML syntax not mentioned here: empty element tags,
      attribute value normalisation, comments, <code>CDATA</code> sections,
      parsed entities, processing instructions, default attribute values,
      parameter entities, external/internal DTD
      subsets,&nbsp;&ellipsis;</li>
    </ul>
   </li>
   <li>Hard to find an XML parser which combines all of:
    <ul>
     <li>Completeness (including validation)</li>
     <li>Correctness</li>
     <li>Run-time efficiency</li>
    </ul>
   </li>
  </ul>
 </slide>
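
 <slide>
  <title>some of that syntax</title>
  <ul>
   <li>A small illustrative document using a few of the items just listed;
    the stylesheet name <code>memo.css</code>, the entity
    <code>sig</code>, the <code>priority</code> attribute and the
    <code>urgent</code> element are all made up for this sketch:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <!DOCTYPE memo [
      <!-- internal DTD subset: one general entity, one default value -->
      <!ENTITY sig "Monty Python">
      <!ATTLIST memo priority (low|high) "low">
     ]>
     <!-- a processing instruction, then the root element -->
     <?xml-stylesheet type="text/css" href="memo.css"?>
     <memo date="2002-05-14">
      <urgent/>
      <message>Stop that, it's silly. &sig;</message>
     </memo>
    ]]></pre>
   </li>
   <li>Even a non-validating parser has to read the internal subset here:
    it must supply <code>priority="low"</code> and expand
    <code>&amp;sig;</code></li>
   <li><code>&lt;urgent/&gt;</code> is an empty-element tag, equivalent to
    <code>&lt;urgent&gt;&lt;/urgent&gt;</code></li>
  </ul>
 </slide>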

 <slide>
  <title>humans and machines</title>
  <ul>
   <li>Many of my complaints are about the suitability of XML for
    humans</li>
   <li>Some people have countered that &lqq;XML is for programs, not
    humans&rqq;</li>
   <li>Debatable
    <ul>
     <li>Design goal that &lqq;XML documents should be
      human-legible&rqq;</li>
     <li>Some physical structures that can only be useful for human
      authors: general entities, empty-element tags</li>
    </ul>
   </li>
   <li>Too much syntactic latitude for truly simple programs
    <ul>
     <li>But too little for human authors</li>
    </ul>
   </li>
   <li>Lack of clear goals makes it hard to decide whether and how XML
    should be used in your organisation
    <ul>
     <li>XML is optimised neither for humans nor for computers</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>oversimplification</title>
  <ul>
   <li>The base XML specification explicitly refuses to consider any issues
    except data representation</li>
   <li>For example, it has nothing to say about:
    <ul>
     <li>Appropriate ways of processing XML documents</li>
     <li>How to constrain and validate an XML document in ways that can't be
      described with DTDs</li>
     <li>Combining content from multiple XML vocabularies into a single
      document</li>
     <li>Extracting bits of information from an XML document</li>
     <li>How to transform one XML document into another</li>
     <li>Rendering XML data using existing technologies</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>oversimplification causes acronym proliferation</title>
  <ul>
   <li>Many other people have noticed these limitations in XML
    <ul>
     <li>And apparently they've all written at least one add-on
      specification for XML&ellipsis;</li>
    </ul>
   </li>
   <li>Learning the over-complex XML itself isn't enough to use XML in
    your business</li>
   <li>You also have to become familiar with some or all of a bewildering
    array of additional technologies
    <ul>
     <li>Some of which overlap in scope or even contradict each other</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>processing technologies</title>
  <ul>
   <li>Two semi-standard interfaces for XML processors (parsers) to present
    to applications
    <ul>
     <li>DOM (Document Object Model)
      <ul>
       <li>The parser reads the entire document into an in-memory tree
        structure</li>
       <li>It provides a wide variety of methods for examining or changing
        individual parts of the tree</li>
      </ul>
     </li>
     <li>SAX (Simple API for XML)
      <ul>
       <li>A streaming model: the parser reads the data bit by bit</li>
       <li>It tells the application whenever it finds a meaningful unit
        (start tag, end tag, chunk of text, etc.)</li>
      </ul>
     </li>
    </ul>
   </li>
   <li>Also DOM2, SAX2 (Because One Is Never Enough)</li>
  </ul>
 </slide>

 <slide>
  <title>improved validity checking</title>
  <ul>
   <li>DTDs <em>really</em> suck
    <ul>
     <li>Non-XML syntax, so you can't manipulate them with XML tools</li>
     <li>Extremely limited in what constraints they can specify</li>
     <li>Incredibly intricate rules for how <em>non</em>-validating parsers
      should handle DTDs</li>
    </ul>
   </li>
   <li>Approximately 71,384 alternative validity-checking technologies:
    <ul>
     <li>W3C Schemas (pretty popular)</li>
     <li>RELAX (Regular Language Description for XML; obsolete)</li>
     <li>RELAX NG (Next Generation)</li>
     <li>TREX (Tree Regular Expressions for XML)</li>
     <li>Schematron (has a really cool name)</li>
    </ul>
   </li>
   <li>But don't forget: only DTDs will let you declare entities of your
    own</li>
  </ul>
 </slide>
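
 <slide>
  <title>a W3C Schema for the memo</title>
  <ul>
   <li>A sketch of a W3C Schema for the earlier memo, assuming the same
    structure as the sample; the type name <code>party</code> is invented
    for this example:
    <pre><![CDATA[
     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="memo">
       <xs:complexType>
        <xs:sequence>
         <xs:element name="from" type="party"/>
         <xs:element name="to" type="party"/>
         <xs:element name="message" type="xs:string"/>
        </xs:sequence>
        <!-- xs:date is a constraint that a DTD cannot express -->
        <xs:attribute name="date" type="xs:date" use="required"/>
       </xs:complexType>
      </xs:element>
      <!-- hypothetical shared type for sender and recipient -->
      <xs:complexType name="party">
       <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="email" type="xs:string"/>
       </xs:sequence>
      </xs:complexType>
     </xs:schema>
    ]]></pre>
   </li>
   <li>It's XML, so XML tools can process it, and it can check that
    <code>date</code> really is a date</li>
   <li>It's also rather longer than the memo it describes</li>
  </ul>
 </slide>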

 <slide>
  <title>combining multiple vocabularies</title>
  <ul>
   <li>There is a Namespaces specification
    <ul>
     <li>Each element (and attribute) in a document can be associated with
      a namespace</li>
     <li>Each namespace is identified by a unique URI (usually a URL)</li>
    </ul>
   </li>
   <li>A simple example:
    <pre><![CDATA[
     <applicant xmlns="http://xmlsucks.org/xmlns/applicant">
      <name>Monty Python</name>
      <position>Cheese Seller</position>
      <cv>
       <html xmlns="http://www.w3.org/1999/xhtml">
        ...
       </html>
      </cv>
     </applicant>
    ]]></pre>
   </li>
   <li>Fairly tricky to combine Namespaces with validity checking</li>
  </ul>
 </slide>
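
 <slide>
  <title>namespace prefixes</title>
  <ul>
   <li>The same applicant document can also be written with explicit
    prefixes instead of default namespaces; the prefixes <code>a</code>
    and <code>h</code> are arbitrary choices for this sketch:
    <pre><![CDATA[
     <a:applicant xmlns:a="http://xmlsucks.org/xmlns/applicant"
                  xmlns:h="http://www.w3.org/1999/xhtml">
      <a:name>Monty Python</a:name>
      <a:position>Cheese Seller</a:position>
      <a:cv>
       <h:html>
        ...
       </h:html>
      </a:cv>
     </a:applicant>
    ]]></pre>
   </li>
   <li>A namespace-aware processor treats the two spellings as equivalent:
    the URI is what counts, not the prefix</li>
   <li>A DTD, on the other hand, sees only the literal names, prefix and
    all</li>
  </ul>
 </slide>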

 <slide>
  <title>data extraction</title>
  <ul>
   <li>XPath lets you find specific parts of an XML document, assuming you
    know something of its structure
    <ul>
     <li>Used by several other important standards</li>
     <li>Not unreasonable in its more basic usages:
      <code>/doc/chapter[5]/section[2]</code></li>
    </ul>
   </li>
   <li>XQuery is similar but, unfortunately, not quite the same
    <ul>
     <li>An entire programming language</li>
     <li>The bastard offspring of XPath, SQL, and a host of earlier XML
      querying technologies</li>
     <li>But at least it uses the word &lq;atomization&rq; as a technical
      term</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>transformation</title>
  <ul>
   <li>Extremely common to need to transform one XML document into
    another
    <ul>
     <li>Exchange invoices, purchase orders, etc. as XML documents
      conforming to a standardised vocabulary</li>
     <li>But leave your internal data in some company-specific format</li>
    </ul>
   </li>
   <li>Two options for transformations:
    <ol>
     <li>Write a program in your language of choice that uses a SAX or DOM
      parser and manipulates the document as necessary</li>
     <li>Or if you feel masochistic, use XSLT (XSL Transformations)</li>
    </ol>
   </li>
   <li>Simple transformations in XSLT are trivial (if obscenely
    verbose)</li>
   <li>More interesting ones can be absurdly hard</li>
  </ul>
 </slide>
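
 <slide>
  <title>a simple XSLT transformation</title>
  <ul>
   <li>A minimal sketch of an XSLT 1.0 stylesheet that turns the earlier
    memo into HTML; the choice of output markup is just an illustration:
    <pre><![CDATA[
     <xsl:stylesheet version="1.0"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <!-- one template, matching the root memo element -->
      <xsl:template match="/memo">
       <html>
        <body>
         <p>From: <xsl:value-of select="from/name"/></p>
         <p>To: <xsl:value-of select="to/name"/></p>
         <p>Date: <xsl:value-of select="@date"/></p>
         <p><xsl:value-of select="normalize-space(message)"/></p>
        </body>
       </html>
      </xsl:template>
     </xsl:stylesheet>
    ]]></pre>
   </li>
   <li>Each <code>select</code> attribute holds an XPath expression,
    evaluated relative to the <code>memo</code> element</li>
   <li>That's eighteen lines to reformat four fields</li>
  </ul>
 </slide>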

 <slide>
  <title>rendering</title>
  <ul>
   <li>CSS (Cascading Style Sheets) can be used directly with XML documents
    <ul>
     <li>Don't forget to specify a <code>display:</code> property for each
      element in the document</li>
    </ul>
   </li>
   <li>Alternatively, transform your XML document into something else
    <ul>
     <li>XSLT was designed for translating arbitrary XML documents into
      XSL-FO (XSL Formatting Objects) documents</li>
     <li>XSL-FO is a fairly complete XML-based document rendering
      language
      <ul>
       <li>Would that any of the available implementations were equally
        complete&ellipsis;</li>
      </ul>
     </li>
     <li>XSL-FO processing applications typically handle conversion into
      HTML and/or PDF</li>
    </ul>
   </li>
  </ul>
 </slide>
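
 <slide>
  <title>a taste of XSL-FO</title>
  <ul>
   <li>A minimal sketch of an XSL-FO document that puts the memo's message
    on a page; the master name <code>page</code> is arbitrary, and a real
    document would also set page dimensions and margins:
    <pre><![CDATA[
     <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
      <!-- describe the page layout -->
      <fo:layout-master-set>
       <fo:simple-page-master master-name="page">
        <fo:region-body/>
       </fo:simple-page-master>
      </fo:layout-master-set>
      <!-- pour content into the body region of that layout -->
      <fo:page-sequence master-reference="page">
       <fo:flow flow-name="xsl-region-body">
        <fo:block>Stop that, it's silly.</fo:block>
       </fo:flow>
      </fo:page-sequence>
     </fo:root>
    ]]></pre>
   </li>
   <li>In practice you'd generate this with XSLT rather than write it by
    hand</li>
  </ul>
 </slide>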

 <slide>
  <title>(some of) The Rest</title>
  <ul>
   <li>A few selected XML standards (but spot the two fakes&ellipsis;)</li>
   <li>BEEP (also called BXXP), CML (Chemical Markup Language), DC (Dublin
    Core metadata), ebXML (XML for e-business), MathML, RDF (Resource
    Description Framework), SMIL (Synchronized Multimedia Integration
    Language), SVG (Scalable Vector Graphics), VoiceXML, ADS, DISCO, DSD
    (Document Structure Description), SOAP (Simple Object Access Protocol),
    UDDI (Universal Description, Discovery, and Integration), WRDL (Web
    Resource Description Language), WSCL (Web Services Conversation
    Language), X2EE, WSDL (Web Services Description Language), WSIL (Web
    Services Inspection Language), XBase, XFL (XML Framework Language),
    XForms, XInclude, XKMS (XML Key Management Specification), XML
    Signature, XML-RPC, XPointer, XQL (XML Query Language), XML-QL (XML
    Query Language &dash; again), Quilt, XSL (Extensible Stylesheet
    Language),&nbsp;&ellipsis;</li>
  </ul>
 </slide>

 <slide>
  <title>doing it anyway</title>
  <ul>
   <li>XML <cite>qua</cite> technology seems to be a bad joke</li>
   <li>But you have to do it anyway</li>
   <li>Two main reasons:
    <ol>
     <li>It's (just) good enough technologically</li>
     <li>More importantly: <em>everyone else is doing it</em></li>
    </ol>
   </li>
  </ul>
 </slide>

 <slide>
  <title>worse is better</title>
  <ul>
   <li>XML has many flaws</li>
   <li>But the &aelig;sthetic appeal of the technology is comparatively
    uninteresting</li>
   <li>In practice, XML's flaws can be lived with
    <ul>
     <li>An example of &lq;worse is better&rq;:
      <a>http://www.jwz.org/doc/worse-is-better.html</a></li>
     <li>Since XML is good enough, it isn't worth replacing it</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>a bigger boy made me do it</title>
  <ul>
   <li>Huge numbers of businesses already rely on XML
    <ul>
     <li>Not least because it is mandated by governments and by large
      corporations at the top of industry-wide supply chains</li>
    </ul>
   </li>
   <li>All the evidence suggests that XML will become ubiquitous (if not
    unavoidable) in the near future</li>
   <li>XML is already being used to accomplish hard tasks &dash; XSL-FO and
    SAX Machines come to mind</li>
   <li>Though adopting XML technologies implies significant costs for
    businesses, the costs of <em>not</em> using XML may well be bigger</li>
  </ul>
 </slide>

 <slide>
  <title>how to cope with XML</title>
  <ul>
   <li>The technological problems can be managed</li>
   <li>The verbosity is merely an annoyance</li>
   <li>XML's one big advantage is that most of the innate complexity can be
    dealt with once and for all</li>
   <li>The hard part is the complexity-through-oversimplification
    <ul>
     <li>Even here, the situation is improving</li>
     <li>It's becoming more obvious which XML technologies are really
      important and which are dead ends</li>
    </ul>
   </li>
   <li>Several specific tools and technologies for getting stuff done</li>
  </ul>
 </slide>

 <slide>
  <title>choosing technologies</title>
  <ul>
   <li>Which (if any) XML add-ons should you use?</li>
   <li>Namespaces are vital for combining data from disparate sources
    <ul>
     <li>Reused over and over in the useful technologies</li>
    </ul>
   </li>
   <li>DTDs are a waste of effort except in certain specific situations
    <ul>
     <li>Good to let human authors use DTD-aware editing tools</li>
     <li>Stick to W3C Schemas for validation</li>
    </ul>
   </li>
   <li>XSLT is fairly flexible and conceptually elegant
    <ul>
     <li>But your staff will need functional-programming experience to do
      interesting transformations</li>
     <li>And the syntax still bites&ellipsis;</li>
    </ul>
   </li>
   <li>What domain-specific XML vocabularies and standards are used in your
    field?</li>
  </ul>
 </slide>

 <slide>
  <title>choosing fundamental libraries</title>
  <ul>
   <li>Avoid most of the Java tools unless you have a burning desire to buy
    lots of very fast hardware</li>
   <li>Use Gnome's libxml as the underlying parser in your programs
    <ul>
     <li>Written in fast, portable C</li>
     <li>Reasonable DOM- and SAX-like interfaces</li>
     <li>Bindings for many popular languages (Python, Perl, etc.)</li>
     <li>Reads legacy HTML documents as XHTML</li>
    </ul>
   </li>
   <li>Similarly, use Gnome's libxslt as an XSLT engine
    <ul>
     <li>Screamingly fast &dash; especially when compared to the well-known
      Java equivalents</li>
     <li>Also written in C, with bindings available for the language of your
      choice</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>additional tools</title>
  <ul>
   <li>Barrie Slaymaker's SAX-Machines library for Perl
    <ul>
     <li>Reminiscent of the Unix pipeline/filter approach</li>
     <li>Write small, simple processors using the SAX interface</li>
     <li>Connect them together in arbitrary ways using SAX-Machines</li>
     <li>Similar tools include AxKit, a web application framework using
      Apache and mod_perl</li>
    </ul>
   </li>
   <li>XSL-FO looks extremely promising as a way of automating document
    production, but:
    <ul>
     <li>Currently only two open-source implementations: one in Java, one in
      (of all things) TeX</li>
     <li>Both incomplete and somewhat slow</li>
     <li>Probably advisable to find an alternative in the short term</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>conclusions</title>
  <ul>
   <li>XML does suck technologically, but:
    <ul>
     <li>The technological problems can be managed</li>
     <li>The political and commercial advantages of working with the same
      open standard as everyone else are enormous</li>
    </ul>
   </li>
   <li>There are open-source and Unix-friendly tools which actually help you
    get your work done in an XML-besotted world, including:
    <ul>
     <li>Gnome libxml and libxslt</li>
     <li>SAX-Machines or similar systems</li>
    </ul>
   </li>
   <li>XML is here to stay</li>
  </ul>
 </slide>

</talk>
