<?xml version="1.0"?>
<!DOCTYPE talk SYSTEM "module.dtd" [
<!ELEMENT talk (title, sub?, date?, author*, affiliation*, location*, slide+)>
<!ELEMENT sub (title)>
<!ELEMENT author (#PCDATA|code|a)*>
<!ELEMENT date (#PCDATA)>
<!ELEMENT location (#PCDATA)>
<!ELEMENT affiliation (#PCDATA|a|img)*>
]>

<talk>
 <title>does XML suck?</title>
 <sub><title>or: why XML is technologically terrible, but you have to use it
   anyway</title></sub>
 <date>14 May 2002</date>
 <author>Aaron Crane</author>
 <author><code>aaron.crane@gbdirect.co.uk</code></author>
 <affiliation><a href="http://training.gbdirect.co.uk/courses/xml/">GBdirect Ltd.</a></affiliation>
 <affiliation><a>http://xmlsucks.org/</a></affiliation>

 <slide>
  <title>what is XML?</title>
  <ul>
   <li>&lqq;XML is a giant step in no direction at all.&rqq;  (Erik
    Naggum)</li>
   <li>XML purports to be a simple, vendor-neutral textual external
    representation for hierarchically-structured data
    <ul>
     <li>Reasonably accurate &dash; except for the simplicity bit</li>
    </ul>
   </li>
   <li>Any arbitrary data can be expressed with a hierarchical
    representation
    <ul>
     <li>So XML can represent any data whatsoever</li>
    </ul>
   </li>
   <li>The big win claimed for XML:
    <ul>
     <li>One piece of software can process <em>any</em> XML document, and
      therefore any data</li>
     <li>Thus making XML ideal for data interchange</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>what XML looks like</title>
  <ul>
   <li>A sample XML document:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <memo date="2002-05-14">
      <from>
       <name>Monty Python</name><email>python@different.org</email>
      </from>
      <to>
       <name>Frank Sinatra</name><email>chairman@board.net</email>
      </to>
      <message>
       Stop that, it's silly.
      </message>
     </memo>
    ]]></pre>
   </li>
  </ul>
 </slide>

 <slide>
  <title>parts of an XML document</title>
  <ul>
   <li><code><![CDATA[<?xml version="1.0"?>]]></code> is an <dfn>XML
     declaration</dfn></li>
   <li><code><![CDATA[<email>python@different.org</email>]]></code> is an
    <dfn>element</dfn>
    <ul>
     <li><code>&lt;email&gt;</code> is the element's <dfn>start tag</dfn></li>
     <li><code>&lt;/email&gt;</code> is the element's <dfn>end tag</dfn></li>
     <li>The text <code>python@different.org</code> is the element's
      <dfn>content</dfn></li>
    </ul>
   </li>
   <li><code>date="2002-05-14"</code> in the <code>memo</code> start tag is
    an <dfn>attribute</dfn> of the <code>memo</code> element</li>
   <li>There's exactly one top-level element in a document &dash; the
    <dfn>root element</dfn></li>
  </ul>
 </slide>
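
 <slide>
  <title>parts of an XML document, seen from code</title>
  <ul>
   <li>A minimal sketch, assuming Python's standard
    <code>xml.etree.ElementTree</code> module (an illustrative choice, not
    something the talk itself uses), of how the root element, an attribute
    and an element's content from the sample memo appear to a program:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# The sample memo, held in a string.
MEMO = """<?xml version="1.0"?>
<memo date="2002-05-14">
 <from>
  <name>Monty Python</name><email>python@different.org</email>
 </from>
 <to>
  <name>Frank Sinatra</name><email>chairman@board.net</email>
 </to>
 <message>
  Stop that, it's silly.
 </message>
</memo>
"""

root = ET.fromstring(MEMO)        # the single top-level (root) element
email = root.find("from/email")   # the first <email> inside <from>

print(root.tag)                   # element name of the root
print(root.attrib["date"])        # attribute from the memo start tag
print(email.tag, email.text)      # element name and its content
```
    ]]></pre>
   </li>
   <li>Start and end tags never reach the application as such: the parser
    turns each pair into one element object</li>
  </ul>
 </slide>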

 <slide>
  <title>well-formedness and validity</title>
  <ul>
   <li>Every XML document is <dfn>well-formed</dfn> (by definition)
    <ul>
     <li>Means essentially that all the elements are properly nested within
      each other, without overlapping</li>
    </ul>
   </li>
   <li>Some XML documents are also <dfn>valid</dfn></li>
   <li>A valid document declares a Document Type Definition
    (<dfn>DTD</dfn>) and complies with it</li>
   <li>The DTD can impose additional constraints on the (logical) structure of
    the document:
    <ul>
     <li>What elements are permissible</li>
     <li>What attributes they can have</li>
     <li>What elements and other content a given element can contain</li>
    </ul>
   </li>
  </ul>
 </slide>
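
 <slide>
  <title>well-formedness, checked in code</title>
  <ul>
   <li>A rough sketch of a well-formedness check, assuming Python's standard
    <code>xml.etree.ElementTree</code> parser; <code>is_well_formed</code>
    is a hypothetical helper name. That parser is non-validating, so
    checking validity against a DTD would need a different tool (not shown):
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested elements.
nested = "<memo><to><name>Frank Sinatra</name></to></memo>"
# <name> and <to> overlap instead of nesting.
overlapping = "<memo><to><name>Frank Sinatra</to></name></memo>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```
    ]]></pre>
   </li>
  </ul>
 </slide>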

 <slide>
  <title>XML, SGML, HTML, &ellipsis;</title>
  <ul>
   <li>People with experience of HTML may think XML looks like a souped-up
    version
    <ul>
     <li>If only it were that simple</li>
    </ul>
   </li>
   <li>First came SGML &dash; the Standard Generalized Markup Language
    <ul>
     <li>SGML describes a way of defining markup languages (&lq;SGML
      applications&rq;) that can be processed by an SGML processor</li>
    </ul>
   </li>
   <li>HTML is an SGML application designed for doing simple hypertext</li>
   <li>XML is like SGML, but with some of the worst excrescences removed
    <ul>
     <li>Every XML document is also an SGML document (kind of)</li>
    </ul>
   </li>
   <li>XHTML is an XML syntax for HTML</li>
  </ul>
 </slide>

 <slide>
  <title>problems with XML</title>
  <ol>
   <li>Verbosity
    <ul>
     <li>XML documents are frustratingly and unnecessarily verbose for human
      authors</li>
     <li>Also implies high storage and bandwidth costs</li>
    </ul>
   </li>
   <li>Complexity
    <ul>
     <li>What little XML offers is more complex than it could be</li>
    </ul>
   </li>
   <li>Oversimplification
    <ul>
     <li>XML <cite>per&nbsp;se</cite> is too simplistic to handle what
      people actually need</li>
     <li>This leads to a huge number of other technologies to work around
      deficiencies in XML</li>
     <li>All of them put together are insanely more complex than a
      reasonable solution would have been</li>
    </ul>
   </li>
  </ol>
 </slide>

 <slide>
  <title>verbosity</title>
  <ul>
   <li>XML is hideously verbose for humans &dash; just like SGML</li>
   <li>The earlier memo example could be written as a plain text email:
    <pre><![CDATA[
     Date: Tue, 14 May 2002 11:58:41 +0100 (BST)
     From: Monty Python <python@different.org>
     To: Frank Sinatra <chairman@board.net>

     Stop that, it's silly.
    ]]></pre>
    <ul>
     <li>Needs a special-purpose processing tool</li>
     <li>Doesn't encode as much information</li>
    </ul>
   </li>
   <li>Or you could use an alternative bracketed notation:
    <pre><![CDATA[
     <memo <date 2002-05-14>
      <from <name Monty Python><email python@different.org>>
      <to <name Frank Sinatra><email chairman@board.net>>
      <message Stop that, it's silly.>>
    ]]></pre>
   </li>
  </ul>
 </slide>
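
 <slide>
  <title>verbosity: the special-purpose tool</title>
  <ul>
   <li>The plain text version needs a parser that knows its particular
    format. A small sketch, assuming Python's standard <code>email</code>
    package stands in for that special-purpose tool (the talk names no
    specific one):
    <pre><![CDATA[
```python
from email import message_from_string
from email.utils import parseaddr

MAIL = """\
Date: Tue, 14 May 2002 11:58:41 +0100 (BST)
From: Monty Python <python@different.org>
To: Frank Sinatra <chairman@board.net>

Stop that, it's silly
"""

msg = message_from_string(MAIL)

# Header values are plain strings; splitting a name from an address
# takes yet another format-specific helper.
sender_name, sender_address = parseaddr(msg["From"])
body = msg.get_payload().strip()

print(sender_name)
print(sender_address)
print(body)
```
    ]]></pre>
   </li>
   <li>Compact to write, but none of this code helps with any other data
    format</li>
  </ul>
 </slide>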

 <slide>
  <title>complexity: attributes and content</title>
  <ul>
   <li>Designing XML data structures is harder than it ought to be
    <ul>
     <li>Unnatural distinction between element content and attributes</li>
    </ul>
   </li>
   <li>An easy example:
    <pre><![CDATA[
     <invoice id="12345">...</invoice>
     <invoice><id value="12345"></id>...</invoice>
     <invoice><id>12345</id>...</invoice>
     <element name="invoice" id="12345">...</element>
     <element><name>invoice</name><id>12345</id>...</element>
     <element>
       <name>invoice</name>
       <attribute name="id">12345</attribute>
       ...
     </element>
    ]]></pre>
   </li>
   <li>Many people use content for &lq;data&rq; and attributes for
    &lq;metadata&rq;
    <ul>
     <li>So metadata is never structured?</li>
    </ul>
   </li>
  </ul>
 </slide>
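
 <slide>
  <title>complexity: attributes and content, in code</title>
  <ul>
   <li>The choice of encoding leaks into every program that reads the data.
    A sketch, assuming Python's standard
    <code>xml.etree.ElementTree</code> module, that fetches the same invoice
    number from three of the encodings shown earlier:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# Three of the invoice encodings, with the "..." content left out.
as_attribute = ET.fromstring('<invoice id="12345"/>')
as_empty_child = ET.fromstring('<invoice><id value="12345"></id></invoice>')
as_child_text = ET.fromstring('<invoice><id>12345</id></invoice>')

# Each encoding needs its own access path to reach the same value.
ids = [
    as_attribute.get("id"),                  # attribute on the element
    as_empty_child.find("id").get("value"),  # attribute on a child element
    as_child_text.findtext("id"),            # text content of a child element
]
print(ids)
```
    ]]></pre>
   </li>
  </ul>
 </slide>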

 <slide>
  <title>complexity: parsing</title>
  <ul>
   <li>One of the design goals of XML was that XML processing programs be
    easy to write
    <ul>
     <li>The working group wanted CS graduates to be able to write an XML
      processor in a week</li>
    </ul>
   </li>
   <li>Quite clear that XML does not meet this goal
    <ul>
     <li>Many items of XML syntax not mentioned here: empty element tags,
      <code>DOCTYPE</code> declarations, comments, <code>CDATA</code>
      sections, parsed entities, processing instructions, default attribute
      values, DTD syntax, parameter entities, external/internal DTD
      subsets,&ellipsis;</li>
    </ul>
   </li>
   <li>Extremely hard to find an XML parser which combines all of:
    <ul>
     <li>Completeness (including validation)</li>
     <li>Correctness</li>
     <li>Efficiency</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>oversimplification</title>
  <ul>
   <li>The base XML specification explicitly refuses to consider any issues
    except data representation</li>
   <li>For example, it has nothing to say about:
    <ul>
     <li>Appropriate ways of processing XML documents</li>
     <li>How to constrain and validate an XML document in ways that can't be
      described with DTDs</li>
     <li>Combining content from multiple XML vocabularies into a single
      document</li>
     <li>Creating links from one XML document to another</li>
     <li>Extracting bits of information from an XML document</li>
     <li>How to transform one XML document into another</li>
     <li>Rendering XML data using existing technologies</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>oversimplification causes acronym proliferation</title>
  <ul>
   <li>Many other people have noticed these limitations in XML
    <ul>
     <li>And apparently they've all written at least one add-on
      specification for XML&ellipsis;</li>
    </ul>
   </li>
   <li>Learning XML itself, over-complex as it is, isn't enough to conduct
    your business with XML</li>
   <li>Also have to acquire familiarity with some or all of a bewildering
    array of additional technologies
    <ul>
     <li>Some of which overlap in scope or even contradict each other</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>processing technologies</title>
  <ul>
   <li>Two semi-standard interfaces for XML processors (parsers) to present
    to applications
    <ul>
     <li>DOM (Document Object Model)
      <ul>
       <li>The parser reads the entire document into an in-memory tree
        structure</li>
       <li>It provides a wide variety of methods for examining or changing
        individual parts of the tree</li>
      </ul>
     </li>
     <li>SAX (Simple API for XML)
      <ul>
       <li>A streaming model: the parser reads the data bit by bit</li>
       <li>It tells the application whenever it finds a meaningful unit
        (start tag, end tag, chunk of text, etc.)</li>
      </ul>
     </li>
    </ul>
   </li>
   <li>Also DOM2, SAX2 (Because One Is Never Enough)</li>
  </ul>
 </slide>
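
 <slide>
  <title>DOM and SAX side by side</title>
  <ul>
   <li>A sketch of both interfaces, assuming Python's standard
    <code>xml.dom.minidom</code> and <code>xml.sax</code> modules; the
    <code>EventLogger</code> class and the replacement name are made up for
    illustration:
    <pre><![CDATA[
```python
import xml.sax
from xml.dom import minidom

DOC = ("<memo><to><name>Frank Sinatra</name></to>"
       "<message>Stop that.</message></memo>")

# DOM: the whole document becomes an in-memory tree that can be changed.
dom = minidom.parseString(DOC)
name = dom.getElementsByTagName("name")[0]
name.firstChild.data = "Francis Albert Sinatra"
dom_result = dom.documentElement.toxml()


# SAX: the parser calls back for each start tag, end tag and chunk of text.
class EventLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, tag, attrs):
        self.events.append(("start", tag))

    def endElement(self, tag):
        self.events.append(("end", tag))

    def characters(self, content):
        self.events.append(("text", content))


logger = EventLogger()
xml.sax.parseString(DOC.encode("utf-8"), logger)

print(dom_result)
print(logger.events[:3])
```
    ]]></pre>
   </li>
   <li>The DOM version can edit the tree in place; the SAX version keeps
    nothing unless the handler stores it</li>
  </ul>
 </slide>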

 <slide>
  <title>improved validity checking</title>
  <ul>
   <li>DTDs <em>really</em> suck
    <ul>
     <li>Non-XML syntax, so you can't manipulate them with XML tools</li>
     <li>Extremely limited in what constraints they can specify</li>
     <li>Incredibly complex rules for how <em>non</em>-validating parsers
      should handle DTDs</li>
    </ul>
   </li>
   <li>Approximately 71,386 alternative validity-checking technologies:
    <ul>
     <li>W3C XML Schema (pretty popular)</li>
     <li>RELAX (Regular Language Description for XML; obsolete)</li>
     <li>RELAX NG (Next Generation)</li>
     <li>TREX (Tree Regular Expressions for XML)</li>
     <li>Schematron (has a really cool name)</li>
    </ul>
   </li>
   <li>But don't forget: only DTDs will let you define your own entity
    references</li>
  </ul>
 </slide>
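
 <slide>
  <title>entities still need a DTD</title>
  <ul>
   <li>A sketch of the entity problem, assuming Python's standard
    <code>xml.etree.ElementTree</code> parser. The replacement text
    <code>-</code> given to <code>dash</code> here is an assumption made for
    the example, not a quotation from any real DTD:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# "dash" is declared in the internal DTD subset, so the reference resolves.
with_dtd = """<!DOCTYPE li [
 <!ENTITY dash "-">
]>
<li>Reasonably accurate &dash; except for the simplicity bit</li>"""

# The same reference with no declaration anywhere is a fatal error.
without_dtd = "<li>Reasonably accurate &dash; except for the simplicity bit</li>"

expanded = ET.fromstring(with_dtd).text
print(expanded)

try:
    ET.fromstring(without_dtd)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```
    ]]></pre>
   </li>
   <li>So switching to another schema language does not remove the
    <code>DOCTYPE</code> declaration if the document uses its own
    entities</li>
  </ul>
 </slide>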

 <slide>
  <title>combining multiple vocabularies</title>
  <ul>
   <li>There is a Namespaces specification
    <ul>
     <li>Each element in a document can be associated with a namespace</li>
     <li>Each namespace is identified by a unique URI</li>
    </ul>
   </li>
   <li>A simple example:
    <pre><![CDATA[
     <applicant xmlns="http://xmlsucks.org/xmlns/applicant">
      <name>Monty Python</name>
      <position>Cheese Seller</position>
      <cv>
       <html xmlns="http://www.w3.org/1999/xhtml">
        ...
       </html>
      </cv>
     </applicant>
    ]]></pre>
   </li>
   <li>Fairly tricky to combine Namespaces with validity checking</li>
  </ul>
 </slide>
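
 <slide>
  <title>namespaces, seen from code</title>
  <ul>
   <li>A sketch, assuming Python's standard
    <code>xml.etree.ElementTree</code> module, of what namespaces do to
    element names. The <code>app</code> and <code>xhtml</code> prefixes are
    chosen here for the lookups and appear nowhere in the document:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

DOC = """<applicant xmlns="http://xmlsucks.org/xmlns/applicant">
 <name>Monty Python</name>
 <position>Cheese Seller</position>
 <cv>
  <html xmlns="http://www.w3.org/1999/xhtml">...</html>
 </cv>
</applicant>"""

# Prefix-to-URI map used only by the lookups below.
NS = {
    "app": "http://xmlsucks.org/xmlns/applicant",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

root = ET.fromstring(DOC)
applicant_name = root.findtext("app:name", namespaces=NS)
html = root.find("app:cv/xhtml:html", NS)
unqualified = root.find("name")   # no namespace given, so no match

print(root.tag)    # the namespace URI in braces, then the local name
print(applicant_name)
print(html.tag)
print(unqualified)
```
    ]]></pre>
   </li>
   <li>Every lookup has to carry the namespace URI, even though the document
    itself never writes a prefix</li>
  </ul>
 </slide>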

 <slide>
  <title>linking between XML documents</title>
  <ul>
   <li>XLink: <code><![CDATA[<a href="...">...</a>]]></code> on acid</li>
   <li>A quick example:
    <pre><![CDATA[
     <my:crossReference
           xmlns:my="http://xmlsucks.org/xmlns/my"
           my:lastEdited="2000-06-10"
           xmlns:xlink="http://www.w3.org/1999/xlink"
           xlink:type="simple"
           xlink:href="students.xml">
      Current List of Students
     </my:crossReference>
    ]]></pre>
   </li>
   <li>This is broadly equivalent to the HTML:
    <pre><![CDATA[<a href="students.xml">Current List of Students</a>]]></pre>
   </li>
  </ul>
 </slide>
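
 <slide>
  <title>following an XLink from code</title>
  <ul>
   <li>A sketch of reading the link target out of the cross-reference
    example, assuming Python's standard
    <code>xml.etree.ElementTree</code> module. It only extracts the
    attributes; it does not implement XLink behaviour:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

DOC = """<my:crossReference
      xmlns:my="http://xmlsucks.org/xmlns/my"
      my:lastEdited="2000-06-10"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xlink:type="simple"
      xlink:href="students.xml">
 Current List of Students
</my:crossReference>"""

link = ET.fromstring(DOC)

# Namespaced attributes are keyed as "{namespace URI}local-name".
href = link.get("{%s}href" % XLINK)
link_type = link.get("{%s}type" % XLINK)
label = link.text.strip()

print(link_type, href, label)
```
    ]]></pre>
   </li>
  </ul>
 </slide>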

 <slide>
  <title>data extraction</title>
  <ul>
   <li>XPath lets you find specific parts of an XML document, assuming you
    know something of its structure
    <ul>
     <li>Used by several other important standards</li>
     <li>Not unreasonable in its more basic usages:
      <code>/doc/chapter[5]/section[2]</code></li>
    </ul>
   </li>
   <li>XQuery is similar but, unfortunately, not quite the same
    <ul>
     <li>An entire programming language</li>
     <li>The bastard offspring of XPath, SQL, and a host of earlier XML
      querying technologies</li>
     <li>But at least it uses the word &lq;atomization&rq; as a technical
      term</li>
    </ul>
   </li>
  </ul>
 </slide>
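
 <slide>
  <title>a basic XPath lookup in code</title>
  <ul>
   <li>A sketch of the path expression above, assuming the limited XPath
    subset built into Python's standard
    <code>xml.etree.ElementTree</code> module. The document is generated for
    the example, and the path is written relative to the root
    <code>doc</code> element:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# Build a hypothetical <doc> holding 5 chapters of 2 sections each.
doc = ET.Element("doc")
for c in range(1, 6):
    chapter = ET.SubElement(doc, "chapter")
    for s in range(1, 3):
        section = ET.SubElement(chapter, "section")
        section.text = "chapter %d, section %d" % (c, s)

# Starting from <doc>, this selects the same node as
# /doc/chapter[5]/section[2]; positions count from 1, as in XPath.
target = doc.find("chapter[5]/section[2]")
print(target.text)
```
    ]]></pre>
   </li>
  </ul>
 </slide>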

 <slide>
  <title>transformation</title>
  <ul>
   <li>Extremely common to need to transform one XML document into
    another
    <ul>
     <li>Exchange invoices, purchase orders, etc. as XML documents
      conforming to a standardised vocabulary</li>
     <li>But leave your internal data in some company-specific format</li>
    </ul>
   </li>
   <li>Two options for transformations:
    <ol>
     <li>Write a program in your language of choice that uses a SAX or DOM
      parser and manipulates the document as necessary</li>
     <li>Or if you feel masochistic, use XSLT (XSL Transformations)</li>
    </ol>
   </li>
   <li>Simple transformations in XSLT are trivial (if obscenely
    verbose)</li>
   <li>More complex ones can be absurdly hard</li>
  </ul>
 </slide>
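
 <slide>
  <title>transformation without XSLT</title>
  <ul>
   <li>A sketch of the first option, assuming Python's standard
    <code>xml.etree.ElementTree</code> module as the tree-based parser. Both
    vocabularies (<code>inv</code> and <code>invoice</code>) and the helper
    <code>to_interchange</code> are invented for the example:
    <pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical company-specific format.
INTERNAL = ('<inv no="12345"><cust>Frank Sinatra</cust>'
            '<amt cur="GBP">99.50</amt></inv>')


def to_interchange(internal):
    """Rebuild an internal <inv> element as a hypothetical shared <invoice>."""
    amt = internal.find("amt")
    invoice = ET.Element("invoice", {"id": internal.get("no")})
    ET.SubElement(invoice, "customer").text = internal.findtext("cust")
    ET.SubElement(invoice, "total", {"currency": amt.get("cur")}).text = amt.text
    return invoice


result = ET.tostring(to_interchange(ET.fromstring(INTERNAL)), encoding="unicode")
print(result)
```
    ]]></pre>
   </li>
   <li>Every renamed element or attribute is one line of ordinary code, in a
    language you already know</li>
  </ul>
 </slide>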

 <slide>
  <title>rendering</title>
  <ul>
   <li>CSS (Cascading Style Sheets) can be used directly with XML documents
    <ul>
     <li>Don't forget to specify a <code>display:</code> property for each
      element in the document</li>
    </ul>
   </li>
   <li>Alternatively, transform your XML document into something else
    <ul>
     <li>XSLT was designed for translating arbitrary XML documents into
      XSL-FO (XSL Formatting Objects) documents</li>
     <li>XSL-FO is a fairly complete XML-based document rendering
      language</li>
     <li>XSL-FO processing applications typically handle conversion into
      HTML and/or PDF</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>(some of) The Rest</title>
  <ul>
   <li>A few selected XML standards (but spot the two fakes&ellipsis;)</li>
   <li>CML (Chemical Markup Language), DC (Dublin Core metadata), ebXML (XML
    for e-business), MathML, RDF (Resource Description Framework), SMIL
    (Synchronized Multimedia Integration Language), SVG (Scalable Vector
    Graphics), VoiceXML, ADS, DISCO, DSD (Document Structure Description),
    SOAP (Simple Object Access Protocol), UDDI (Universal Description,
    Discovery, and Integration), WDDL, WRDL (Web Resource Description
    Language), WSCL (Web Services Conversation Language), X2EE, WSDL (Web
    Services Description Language), WSIL (Web Services Inspection Language),
    XBase, XFL (XML Framework Language), XForms, XInclude, XKMS (XML Key
    Management Specification), XML Signature, XML-RPC, XPointer, XQL (XML
    Query Language), XML-QL (XML Query Language &dash; again), Quilt, XSL
    (Extensible Stylesheet Language),&nbsp;&ellipsis;</li>
  </ul>
 </slide>

 <slide>
  <title>other thoughts</title>
  <ul>
   <li>&lqq;Doing more than skimming the XML specs would require far longer
    than I have; and they've also now fallen through my good strong 19th
    century floor and killed several innocent bystanders in the floors below
    before finally coming to rest, smoking, embedded in the bedrock a few
    hundred yards under my flat.&rqq;  (Tim Bradshaw)</li>
   <li>&lqq;Structure is nothing if it is all you've got.  Skeletons spook
    people if they try to walk around on their own; I really wonder why XML
    does not.&rqq; (Erik Naggum)</li>
  </ul>
 </slide>

 <slide>
  <title>doing it anyway</title>
  <ul>
   <li>XML <cite>qua</cite> technology seems to be a bad joke</li>
   <li>But you have to do it anyway</li>
   <li>Two main reasons:
    <ol>
     <li>The technological problems can be managed</li>
     <li>More importantly: <em>everyone else is doing it</em></li>
    </ol>
   </li>
  </ul>
 </slide>

 <slide>
  <title>managing the technological problems</title>
  <ul>
   <li>The verbosity is merely an annoyance</li>
   <li>XML's one big advantage is that most of the innate complexity can be
    dealt with once and for all</li>
   <li>The hard part is the complexity-through-oversimplification
    <ul>
     <li>Even here, the situation is improving</li>
     <li>It's becoming more obvious which XML technologies are really
      important and which are dead ends</li>
     <li>Programming tools like Perl's SAX Machines are extremely
      powerful</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>a bigger boy made me do it</title>
  <ul>
   <li>Huge numbers of businesses already rely on XML
    <ul>
     <li>Not least because it is mandated by governments and by large
      corporations at the top of industry-wide supply chains</li>
    </ul>
   </li>
   <li>All the evidence suggests that XML will become ubiquitous (if not
    unavoidable) in the near future</li>
   <li>XML is already being used to accomplish hard tasks &dash; XSL-FO and
    SAX Machines come to mind</li>
   <li>Though adopting XML technologies implies significant costs for
    businesses, the costs of <em>not</em> using XML may well be bigger</li>
  </ul>
 </slide>

 <slide>
  <title>conclusions</title>
  <ul>
   <li>XML does suck technologically, but:
    <ul>
     <li>The technological problems can be managed</li>
     <li>The political and commercial advantages of working with the same
      open standard as everyone else are enormous</li>
    </ul>
   </li>
   <li>XML is here to stay</li>
  </ul>
 </slide>

</talk>
