<?xml version="1.0"?>
<!DOCTYPE talk SYSTEM "module.dtd" [
<!ELEMENT talk (title, sub?, date?, author*, affiliation*, location*, slide+)>
<!ELEMENT sub (title)>
<!ELEMENT author (#PCDATA|code|a)*>
<!ELEMENT date (#PCDATA)>
<!ELEMENT location (#PCDATA|a)*>
<!ELEMENT affiliation (#PCDATA|a|img)*>
]>

<talk>
 <title>doing useful work with XML and open-source software</title>
 <author>Aaron Crane</author>
 <author><code>aaron.crane@gbdirect.co.uk</code></author>
 <affiliation><a href="http://www.gbdirect.co.uk/">GBdirect Ltd.</a></affiliation>
 <affiliation><a>http://xmlsucks.org/</a></affiliation>
 <location>Presented at <a href="http://www.linuxworldexpo.com/">LinuxWorld 2003</a></location>

 <slide>
  <title>overview</title>
  <ul>
   <li>What is XML?</li>
   <li>Problems with XML</li>
   <li>You have to use it anyway</li>
   <li>Tools to cope with XML</li>
  </ul>
 </slide>

 <slide>
  <title>what is XML?</title>
  <ul>
   <li>XML purports to be a simple, vendor-neutral textual external
    representation for hierarchically-structured data
    <ul>
     <li>Reasonably accurate &dash; except for the simplicity bit</li>
    </ul>
   </li>
   <li>Any arbitrary data can be expressed with a hierarchical
    representation
    <ul>
     <li>So XML can represent any data whatsoever</li>
    </ul>
   </li>
   <li>The big win claimed for XML:
    <ul>
     <li>One piece of software can process <em>any</em> XML document, and
      therefore any data</li>
     <li>Thus making XML ideal for data interchange</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>what XML looks like</title>
  <ul>
   <li>A sample XML document:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <memo date="2003-01-22">
      <from>
       <name>Monty Python</name><email>python@different.org</email>
      </from>
      <to>
       <name>Frank Sinatra</name><email>chairman@board.net</email>
      </to>
      <message>
       Stop that, it's silly.
      </message>
     </memo>
    ]]></pre>
   </li>
  </ul>
 </slide>

 <slide>
  <title>parts of an XML document</title>
  <ul>
   <li><code><![CDATA[<?xml version="1.0"?>]]></code> is an <dfn>XML
     declaration</dfn></li>
   <li><code><![CDATA[<email>python@different.org</email>]]></code> is an
    <dfn>element</dfn>
    <ul>
     <li><code>&lt;email&gt;</code> is the element's <dfn>start tag</dfn></li>
     <li><code>&lt;/email&gt;</code> is the element's <dfn>end tag</dfn></li>
     <li>The text <code>python@different.org</code> is the element's
      <dfn>content</dfn></li>
    </ul>
   </li>
   <li><code>date="2003-01-22"</code> in the <code>memo</code> start tag is
    an <dfn>attribute</dfn> of the <code>memo</code> element</li>
   <li>There's exactly one top-level element in a document (the
    <dfn>root element</dfn>)</li>
  </ul>
 </slide>

 <slide>
  <title>well-formedness and validity</title>
  <ul>
   <li>Every XML document is <dfn>well-formed</dfn> (by definition)
    <ul>
     <li>Means essentially that all the elements are properly nested within
      each other, without overlapping</li>
    </ul>
   </li>
   <li>Some XML documents are also <dfn>valid</dfn></li>
   <li>A valid document declares a Document Type Definition
    (<dfn>DTD</dfn>) and conforms to it</li>
   <li>The DTD can impose additional constraints on the (logical) structure of
    the document:
    <ul>
     <li>What elements are permissible</li>
     <li>What attributes they can have</li>
     <li>What elements and other content a given element can contain</li>
    </ul>
   </li>
  </ul>
 </slide>
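
 <slide>
  <title>validity: a sketch</title>
  <ul>
   <li>A minimal sketch of a DTD for the earlier memo, written as an internal
    subset; the content models and the required <code>date</code> attribute
    are assumptions made for illustration:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <!DOCTYPE memo [
      <!-- Assumed structure: one sender, one recipient, one message -->
      <!ELEMENT memo    (from, to, message)>
      <!ATTLIST memo    date CDATA #REQUIRED>
      <!ELEMENT from    (name, email)>
      <!ELEMENT to      (name, email)>
      <!ELEMENT name    (#PCDATA)>
      <!ELEMENT email   (#PCDATA)>
      <!ELEMENT message (#PCDATA)>
     ]>
     <memo date="2003-01-22">
      <from>
       <name>Monty Python</name><email>python@different.org</email>
      </from>
      <to>
       <name>Frank Sinatra</name><email>chairman@board.net</email>
      </to>
      <message>
       Stop that, it's silly.
      </message>
     </memo>
    ]]></pre>
   </li>
   <li>Dropping the <code>to</code> element would leave the memo well-formed
    but no longer valid against this DTD</li>
   <li>Overlapping tags, as in
    <code><![CDATA[<from><name>Monty Python</from></name>]]></code>, would
    mean it isn't XML at all</li>
  </ul>
 </slide>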

 <slide>
  <title>problems with XML</title>
  <ol>
   <li>Verbosity
    <ul>
     <li>XML documents are frustratingly and unnecessarily verbose for human
      authors</li>
     <li>Also implies high costs for storage and bandwidth (in the absence
      of compression)</li>
    </ul>
   </li>
   <li>Complexity
    <ul>
     <li>What little XML offers is more complex than it could be</li>
    </ul>
   </li>
   <li>Oversimplification
    <ul>
     <li>XML <cite xml:lang="la">per&nbsp;se</cite> is too simplistic to do
      what people actually need</li>
     <li>This leads to a huge number of other technologies to work around
      deficiencies in XML</li>
     <li>All of them put together are vastly more complicated than a
      reasonable solution would have been</li>
    </ul>
   </li>
  </ol>
 </slide>

 <slide>
  <title>verbosity</title>
  <ul>
   <li>XML is hideously verbose for humans &dash; like SGML, but worse</li>
   <li>The earlier memo example could be written as a plain text email
    <ul>
     <li>Would be less verbose</li>
     <li>But needs a special-purpose processing tool</li>
     <li>Also doesn't encode as much information</li>
    </ul>
   </li>
   <li>Many alternative bracketed notations have been proposed
    <ul>
     <li>Avoid repeating the element name in the closing bracket</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>complexity</title>
  <ul>
   <li>Three main areas of unnecessary complexity in XML:
    <ol>
     <li>No guidance on data design &dash; should you use attributes or
      content?</li>
     <li>Ridiculously hard to parse</li>
     <li>No clear view of what XML is meant to accomplish
      <ul>
       <li>Is it for humans or machines?</li>
      </ul>
     </li>
    </ol>
   </li>
  </ul>
 </slide>

 <slide>
  <title>data design</title>
  <ul>
   <li>Designing XML data structures is harder than it ought to be
    <ul>
     <li>Unnatural distinction between element content and attributes</li>
    </ul>
   </li>
   <li>Which of these structures should you choose?
    <pre><![CDATA[
     <invoice id="12345">...</invoice>
     <invoice><id value="12345"></id>...</invoice>
     <invoice><id>12345</id>...</invoice>
     <element name="invoice" id="12345">...</element>
     <element><name>invoice</name><id>12345</id>...</element>
     <element>
       <name>invoice</name>
       <attribute name="id">12345</attribute>
       ...
     </element>
    ]]></pre>
   </li>
  </ul>
 </slide>

 <slide>
  <title>attributes or content?</title>
  <ul>
   <li>Many people try to use content for &lq;data&rq; and attributes for
    &lq;metadata&rq;
    <ul>
     <li>So metadata is never structured?</li>
    </ul>
   </li>
   <li>Others, recognizing that problem, tend to use content universally
    <ul>
     <li>For them, attributes clutter up the standard without offering any
      benefit</li>
    </ul>
   </li>
  </ul>
 </slide>
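
 <slide>
  <title>attributes or content: a sketch</title>
  <ul>
   <li>An illustration of the structured-metadata problem, reusing the
    earlier memo; treating the sender as &quot;metadata&quot; is an assumption
    made for the example:
    <pre><![CDATA[
     <!-- Sender as an attribute: the name/email structure has to be
          packed into one flat string with an ad-hoc syntax -->
     <memo date="2003-01-22"
           from="Monty Python &lt;python@different.org&gt;">
      <message>Stop that, it's silly.</message>
     </memo>

     <!-- Sender as content: the structure is visible to any XML tool -->
     <memo date="2003-01-22">
      <from>
       <name>Monty Python</name><email>python@different.org</email>
      </from>
      <message>Stop that, it's silly.</message>
     </memo>
    ]]></pre>
   </li>
   <li>Attribute values are plain strings, so they cannot contain
    elements</li>
  </ul>
 </slide>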

 <slide>
  <title>parsing</title>
  <ul>
   <li>One of the design goals of XML was that XML processing programs be
    easy to write
    <ul>
     <li>The working group wanted a typical Computer Science graduate to be
      able to write an XML processor in a week</li>
    </ul>
   </li>
   <li>Quite clear that XML does not meet this goal
    <ul>
     <li>Many items of XML syntax not mentioned here</li>
    </ul>
   </li>
   <li>Hard to find an XML parser which combines all of:
    <ul>
     <li>Completeness (including validation)</li>
     <li>Correctness</li>
     <li>Run-time efficiency</li>
    </ul>
   </li>
  </ul>
 </slide>
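
 <slide>
  <title>parsing: a sketch</title>
  <ul>
   <li>A small, hypothetical document showing some of the syntax a parser has
    to handle beyond elements and attributes (<code>memo.css</code>, the
    <code>priority</code> attribute and the <code>attachment</code> element
    are invented for the example):
    <pre><![CDATA[
     <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
     <!-- comments and processing instructions may appear in the prolog -->
     <?xml-stylesheet type="text/css" href="memo.css"?>
     <!DOCTYPE memo [
      <!-- an internal general entity and a defaulted attribute -->
      <!ENTITY sender "Monty Python">
      <!ATTLIST memo priority (low|high) "low">
     ]>
     <!-- attribute values may be single-quoted -->
     <memo date='2003-01-22'>
      <from>&sender;</from>                         <!-- entity reference -->
      <message>Stop that, it&#39;s silly.</message> <!-- character reference -->
      <attachment/>                                 <!-- empty-element tag -->
     </memo>
    ]]></pre>
   </li>
   <li>Even a non-validating parser must expand <code>&amp;sender;</code> and
    supply the default <code>priority="low"</code>, because both are declared
    in the internal subset</li>
  </ul>
 </slide>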

 <slide>
  <title>humans and machines</title>
  <ul>
   <li>Many of my complaints are about the suitability of XML for
    humans</li>
   <li>Some people have countered that &lqq;XML is for programs, not
    humans&rqq;</li>
   <li>Debatable
    <ul>
     <li>Design goal that &lqq;XML documents should be
      human-legible&rqq;</li>
     <li>Some physical structures are useful only to human authors: general
      entities, empty-element tags</li>
    </ul>
   </li>
   <li>Too much syntactic latitude for truly simple programs
    <ul>
     <li>But too little for human authors</li>
    </ul>
   </li>
   <li>Lack of clear goals makes it hard to decide whether and how XML
    should be used in your organization
    <ul>
     <li>XML is optimized neither for humans nor for computers</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>oversimplification</title>
  <ul>
   <li>The base XML specification explicitly refuses to consider any issues
    except data representation</li>
   <li>For example, it has nothing to say about:
    <ul>
     <li>Appropriate ways of processing XML documents</li>
     <li>How to constrain and validate an XML document in ways that can't be
      described with DTDs</li>
     <li>Combining content from multiple XML vocabularies into a single
      document</li>
     <li>How to transform one XML document into another</li>
     <li>Rendering XML data using existing technologies</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>oversimplification causes acronym proliferation</title>
  <ul>
   <li>The existence of these limitations is old news
    <ul>
     <li>So others have written add-on specifications to deal with XML's
      deficiencies</li>
    </ul>
   </li>
   <li>Learning XML itself, over-complex as it is, isn't enough to use XML in
    your business</li>
   <li>Also have to acquire familiarity with some or all of a bewildering
    array of additional technologies
    <ul>
     <li>Some of which overlap in scope or even contradict each other</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>processing technologies</title>
  <ul>
   <li>Two semi-standard interfaces for XML processors (parsers) to present
    to applications
    <ul>
     <li>DOM (Document Object Model)
      <ul>
       <li>The parser reads the entire document into an in-memory tree
        structure</li>
       <li>It provides a wide variety of methods for examining or changing
        individual parts of the tree</li>
      </ul>
     </li>
     <li>SAX (Simple API for XML)
      <ul>
       <li>A streaming model: the parser reads the data bit by bit</li>
       <li>It tells the application whenever it finds a meaningful unit
        (start tag, end tag, chunk of text, etc.)</li>
      </ul>
     </li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>improved validity checking</title>
  <ul>
   <li>DTDs are extremely problematic
    <ul>
     <li>Non-XML syntax, so you can't manipulate them with XML tools</li>
     <li>Extremely limited in what constraints they can specify</li>
     <li>Incredibly intricate rules for how <em>non</em>-validating parsers
      should handle DTDs</li>
    </ul>
   </li>
   <li>Many alternative validity-checking technologies:
    <ul>
     <li>W3C Schemas (probably the most widespread)</li>
     <li>RELAX (Regular Language Description for XML; obsolete)</li>
     <li>RELAX NG (Next Generation; unifies RELAX and TREX)</li>
     <li>TREX (Tree Regular Expressions for XML; likewise superseded by
      RELAX NG)</li>
     <li>Schematron</li>
    </ul>
   </li>
   <li>But only DTDs let you declare entities, and hence use entity
    references beyond the five predefined ones</li>
  </ul>
 </slide>
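
 <slide>
  <title>improved validity checking: a sketch</title>
  <ul>
   <li>A minimal sketch of a W3C Schema for the earlier memo; the type name
    <code>person</code> and the use of <code>xs:date</code> are assumptions
    made for illustration:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- Hypothetical shared type for the sender and the recipient -->
      <xs:complexType name="person">
       <xs:sequence>
        <xs:element name="name"  type="xs:string"/>
        <xs:element name="email" type="xs:string"/>
       </xs:sequence>
      </xs:complexType>

      <xs:element name="memo">
       <xs:complexType>
        <xs:sequence>
         <xs:element name="from"    type="person"/>
         <xs:element name="to"      type="person"/>
         <xs:element name="message" type="xs:string"/>
        </xs:sequence>
        <!-- The date must be a real calendar date such as 2003-01-22 -->
        <xs:attribute name="date" type="xs:date" use="required"/>
       </xs:complexType>
      </xs:element>

     </xs:schema>
    ]]></pre>
   </li>
   <li>A DTD can only say that <code>date</code> is a string; the schema can
    reject <code>date="yesterday"</code></li>
   <li>The schema is itself an XML document, so XML tools can process it</li>
  </ul>
 </slide>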

 <slide>
  <title>combining multiple vocabularies</title>
  <ul>
   <li>There is a Namespaces specification
    <ul>
     <li>Each element in a document can be associated with a namespace</li>
     <li>Each namespace is identified by a unique URI (in practice, usually a
      URL)</li>
    </ul>
   </li>
   <li>A simple example:
    <pre><![CDATA[
     <applicant xmlns="http://xmlsucks.org/xmlns/applicant">
      <name>Monty Python</name>
      <position>Cheese Seller</position>
      <cv>
       <html xmlns="http://www.w3.org/1999/xhtml">
        ...
       </html>
      </cv>
     </applicant>
    ]]></pre>
   </li>
   <li>Tricky to combine Namespaces with some validity checking
    technologies</li>
  </ul>
 </slide>
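
 <slide>
  <title>namespaces: a sketch</title>
  <ul>
   <li>The same applicant document written with prefixes instead of default
    namespaces; the prefixes <code>app</code> and <code>h</code> are arbitrary
    choices for the example:
    <pre><![CDATA[
     <app:applicant xmlns:app="http://xmlsucks.org/xmlns/applicant"
                    xmlns:h="http://www.w3.org/1999/xhtml">
      <app:name>Monty Python</app:name>
      <app:position>Cheese Seller</app:position>
      <app:cv>
       <h:html>
        ...
       </h:html>
      </app:cv>
     </app:applicant>
    ]]></pre>
   </li>
   <li>To a namespace-aware processor the two forms are equivalent</li>
   <li>A DTD only sees the literal names, so a declaration such as
    <code><![CDATA[<!ELEMENT app:applicant ...>]]></code> matches one spelling
    and not the other</li>
  </ul>
 </slide>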

 <slide>
  <title>transformation</title>
  <ul>
   <li>Extremely common to need to transform one XML document into
    another
    <ul>
     <li>Exchange invoices, purchase orders, etc. as XML documents
      conforming to a standardized vocabulary</li>
     <li>But leave your internal data in some company-specific format</li>
    </ul>
   </li>
   <li>Two options for transformations:
    <ol>
     <li>Write a program in your language of choice that uses a SAX or DOM
      parser and manipulates the document as necessary</li>
     <li>Use XSLT: XSL Transformations</li>
    </ol>
   </li>
   <li>Simple transformations in XSLT are trivial (if obscenely
    verbose)</li>
   <li>More interesting ones can be absurdly hard</li>
  </ul>
 </slide>
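
 <slide>
  <title>transformation: a sketch</title>
  <ul>
   <li>A minimal sketch of an XSLT 1.0 stylesheet that rewrites the earlier
    memo into another vocabulary; the target element names
    (<code>note</code>, <code>sent</code> and so on) are hypothetical:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <xsl:stylesheet version="1.0"
                     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output method="xml" indent="yes"/>

      <!-- Hypothetical target vocabulary: the date attribute becomes
           an element, and only the email addresses are kept -->
      <xsl:template match="memo">
       <note>
        <sent><xsl:value-of select="@date"/></sent>
        <sender><xsl:value-of select="from/email"/></sender>
        <recipient><xsl:value-of select="to/email"/></recipient>
        <!-- normalize-space() trims the whitespace around the message -->
        <body><xsl:value-of select="normalize-space(message)"/></body>
       </note>
      </xsl:template>

     </xsl:stylesheet>
    ]]></pre>
   </li>
   <li>With libxslt's command-line tool, something like
    <code>xsltproc memo2note.xsl memo.xml</code> would apply it (both file
    names are made up for the example)</li>
  </ul>
 </slide>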

 <slide>
  <title>rendering</title>
  <ul>
   <li>CSS (Cascading Style Sheets) can be used directly with XML
    documents</li>
   <li>Alternatively, transform your XML document into something else
    <ul>
     <li>XSLT was designed for translating arbitrary XML documents into
      XSL-FO (XSL Formatting Objects) documents</li>
     <li>XSL-FO is a fairly complete XML-based document rendering
      language
      <ul>
       <li>Would that any of the available implementations were equally
        complete&ellipsis;</li>
      </ul>
     </li>
     <li>XSL-FO processing applications typically handle conversion into
      HTML and/or PDF</li>
    </ul>
   </li>
   <li>Translating to non-XML markup languages for typesetting (like LaTeX)
    can be difficult</li>
  </ul>
 </slide>
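
 <slide>
  <title>rendering: a sketch</title>
  <ul>
   <li>A minimal sketch of an XSL-FO document that sets the memo's message on
    a single page; the A4 page size, the font size and the master name
    <code>page</code> are arbitrary choices for the example:
    <pre><![CDATA[
     <?xml version="1.0"?>
     <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

      <!-- Page geometry: one simple page master with a body region -->
      <fo:layout-master-set>
       <fo:simple-page-master master-name="page"
                              page-height="297mm" page-width="210mm">
        <fo:region-body/>
       </fo:simple-page-master>
      </fo:layout-master-set>

      <!-- Content: flowed into the body region of that page master -->
      <fo:page-sequence master-reference="page">
       <fo:flow flow-name="xsl-region-body">
        <fo:block font-size="14pt">Stop that, it's silly.</fo:block>
       </fo:flow>
      </fo:page-sequence>

     </fo:root>
    ]]></pre>
   </li>
   <li>In practice a document like this would be generated by an XSLT
    stylesheet rather than written by hand</li>
  </ul>
 </slide>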

 <slide>
  <title>doing it anyway</title>
  <ul>
   <li>XML <cite xml:lang="la">qua</cite> technology almost seems to be a
    bad joke</li>
   <li>But you have to do it anyway</li>
   <li>Two main reasons:
    <ol>
     <li>It's (just) good enough technologically</li>
     <li>More importantly: <em>everyone else is doing it</em></li>
    </ol>
   </li>
  </ul>
 </slide>

 <slide>
  <title>worse is better</title>
  <ul>
   <li>XML has many flaws</li>
   <li>But the &aelig;sthetic appeal of the technology is comparatively
    uninteresting</li>
   <li>In practice, XML's flaws can be lived with
    <ul>
     <li>An example of &lq;worse is better&rq;:
      <a>http://www.jwz.org/doc/worse-is-better.html</a></li>
     <li>Since XML is good enough, it isn't worth replacing it</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>a bigger boy made me do it</title>
  <ul>
   <li>Huge numbers of businesses already rely on XML
    <ul>
     <li>Not least because it is mandated by governments and by large
      corporations at the top of industry-wide supply chains</li>
    </ul>
   </li>
   <li>All the evidence suggests that XML will become ubiquitous (if not
    unavoidable) in the near future</li>
   <li>XML is already being used to accomplish hard tasks &dash; XSL-FO and
    SAX-Machines come to mind</li>
   <li>Though adopting XML technologies implies significant costs for
    businesses, the costs of <em>not</em> using XML may well be bigger</li>
  </ul>
 </slide>

 <slide>
  <title>how to cope with XML</title>
  <ul>
   <li>The technological problems can be managed</li>
   <li>The verbosity is merely an annoyance</li>
   <li>XML's one big advantage is that most of the innate complexity can be
    dealt with once and for all</li>
   <li>The hard part is the complexity-through-oversimplification
    <ul>
     <li>Even here, the situation is improving</li>
     <li>It's becoming more obvious which XML technologies are really
      important and which are dead ends</li>
    </ul>
   </li>
   <li>Several specific tools and technologies for getting stuff done</li>
  </ul>
 </slide>

 <slide>
  <title>choosing technologies</title>
  <ul>
   <li>Which (if any) XML add-ons should you use?</li>
   <li>Namespaces are vital for combining data from disparate sources
    <ul>
     <li>Reused over and over in the useful technologies</li>
    </ul>
   </li>
   <li>DTDs are a waste of effort except in certain specific situations
    <ul>
     <li>Good to let human authors use DTD-aware editing tools</li>
     <li>Stick to W3C Schemas for validation</li>
    </ul>
   </li>
   <li>XSLT is fairly flexible and conceptually elegant
    <ul>
     <li>But your staff will need functional-programming experience to do
      interesting transformations</li>
     <li>The awkward syntax makes life harder</li>
    </ul>
   </li>
   <li>What domain-specific XML vocabularies and standards are used in your
    field?</li>
  </ul>
 </slide>

 <slide>
  <title>choosing fundamental libraries</title>
  <ul>
   <li>Avoid most of the Java tools unless you have a burning desire to buy
    lots of very fast hardware</li>
   <li>Use GNOME's libxml as the underlying parser in your programs
    <ul>
     <li>Written in fast, portable C</li>
     <li>Reasonable DOM- and SAX-like interfaces</li>
     <li>Bindings for many popular languages (Python, Perl, etc.)</li>
     <li>Reads legacy HTML documents as XHTML</li>
    </ul>
   </li>
   <li>Similarly, use GNOME's libxslt as an XSLT engine
    <ul>
     <li>Screamingly fast &dash; especially when compared to the well-known
      Java equivalents</li>
     <li>Also written in C, with bindings available for the language of your
      choice</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>additional tools</title>
  <ul>
   <li>Barrie Slaymaker's SAX-Machines library for Perl
    <ul>
     <li>Strongly reminiscent of the Unix pipeline/filter approach</li>
     <li>Write small, simple processors using the SAX interface</li>
     <li>Connect them together in arbitrary ways using SAX-Machines</li>
     <li>Similar tools include AxKit, a web application framework using
      Apache and mod_perl</li>
    </ul>
   </li>
   <li>XSL-FO looks extremely promising as a way of automating document
    production, but:
    <ul>
     <li>Currently only two open-source implementations: one in Java, one in
      (of all things) TeX</li>
     <li>Both incomplete and somewhat slow</li>
     <li>Probably advisable to find an alternative in the short term</li>
    </ul>
   </li>
  </ul>
 </slide>

 <slide>
  <title>conclusions</title>
  <ul>
   <li>XML does have a number of technological problems, but:
    <ul>
     <li>The technological problems can be managed</li>
     <li>The political and commercial advantages of working with the same
      open standard as everyone else are enormous</li>
    </ul>
   </li>
   <li>There are open-source and Linux-friendly tools which actually help you
    get your work done in an XML-besotted world, including:
    <ul>
     <li>GNOME libxml and libxslt</li>
     <li>SAX-Machines or similar systems</li>
    </ul>
   </li>
   <li>XML is here to stay</li>
  </ul>
 </slide>

</talk>
