XML Sucks by Doug Hoyte


UPDATE: Added a new read macro gem: #>

There used to be a long, poorly formulated rant about XML that I wrote and forgot about until I recently received some emails. I have decided to remove the rant for the following reasons:

  1. The XML stack moves so quickly that many criticisms had become dated and new criticisms or improvements had emerged that weren't addressed.
  2. The rant was partially parody but apparently not everybody appreciates the humour. Everyone knows apples are better than oranges!
  3. Its tone was more negative than I intended, detracting from its actually very positive message: XML's complexity is not fundamental to the problems it addresses; it is accidental to the solution and often easy to avoid.

The summary is that, in my opinion, XML is used for many tasks for which it is very poorly suited. Maybe we don't need heavy-weight standards for many of these protocols and can instead use an extensible format to achieve portability and flexibility. I think S-expressions are the best extensible material for serial data formats. (Oh, and XML, don't worry, you're still in the hall!)

These quotes from Glenn are so amusing and informative I couldn't bear to remove them:

Glenn Reid

Glenn Reid, the inventor of Encapsulated PostScript and a renowned desktop publishing expert turned XML heretic, distills it down to these memorable quotes on his site xmlsucks.com:

S-expressions

S-expressions are a light-weight, highly-extensible, non-redundant serial data format that has been more-or-less standardised (though note that they are standard in a different sense than XML is) and stable for decades.

The best discussion of one of the most feature-rich S-expression formats I know of is in Guy Steele's Common Lisp the Language, 2nd Edition (CLtL2). S-expressions and the Common Lisp reader/printer are completely specified in chapter 22: Input/Output. While this might be a daunting 100 pages to somebody not familiar with lisp, to those so acquainted it stands out as an island of simplicity and elegance in an ocean of XML turmoil.

This quote from chapter 22, page 509 of CLtL2 mostly covers what S-expressions are:

A Lightning Tour of S-Expressions (from CLtL2)

Lisp objects in general are not text strings but complex data structures. They have very different properties from text strings as a consequence of their internal representation. However, to make it possible to get at and talk about Lisp objects, Lisp provides a representation of most objects in the form of printed text; this is called [an S-expression], which is used for input/output purposes and in the examples throughout this book. Functions such as print take a Lisp object and send the characters of its [S-expression] to a stream. The collection of routines that does this is known as the (Lisp) printer. The read function takes characters from a stream, interprets them as [an S-expression] of a Lisp object, builds that object, and returns it; the collection of routines that does this is called the (Lisp) reader.

You have a huge amount of control over the printer and the reader. You can change the numeric base in which numbers are read or printed by changing the *read-base* and *print-base* variables. For cases where you don't trust the source of an S-expression you can read it in a "secure mode" by setting *read-eval* to nil, which disables the #. read-time evaluation macro. *print-circle* controls how cyclic/shared data structures are serialised. You can customise all sorts of levels of quotation through pretty printing.
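
As a rough sketch of the variables just mentioned, the following rebinds them around single calls to the printer and reader. The helper names print-in-base, read-in-base and read-untrusted are made up for this illustration; only the special variables and the standard functions they wrap come from Common Lisp itself.

```lisp
;; Hypothetical helpers, for illustration only.

(defun print-in-base (n base)
  "Return the printed representation of the integer N in BASE."
  (let ((*print-base* base)
        (*print-radix* nil))          ; no #x style prefix
    (prin1-to-string n)))

(defun read-in-base (string base)
  "Read one object from STRING, parsing numbers in BASE."
  (let ((*read-base* base))
    (read-from-string string)))

(defun read-untrusted (string)
  "Read one object from STRING with #. disabled."
  (let ((*read-eval* nil))
    (read-from-string string)))

(print-in-base 255 16)        ; => "FF"
(read-in-base "ff" 16)        ; => 255
(read-untrusted "(1 2 3)")    ; => (1 2 3)
;; (read-untrusted "#.(+ 1 2)") signals a READER-ERROR instead of returning 3
```

Note that turning off *read-eval* only disables read-time evaluation; any other checks on untrusted input remain the caller's job.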

Controlling the printer/reader is very powerful but doing so still doesn't really extend S-expressions. It's all right there in CLtL2. Unlike XML, however, S-expressions really can be extended. With S-expressions you can change the meaning of every character, define alternative data representations, and other fun things we will look at shortly. Unlike XML, which has the meaning of most characters permanently decided, lisp lets you extend any character through a construct called a read table. To extend the behaviour of the reader/printer you add functions called read macros to the read table.

And lisp coders stand by their data format and code these read macros in, like everything else, S-expressions.
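
To give a first taste of what that looks like, here is a minimal sketch of a read macro installed in a private copy of the standard read table. The bracket syntax and the function names make-bracket-readtable and read-with-brackets are assumptions chosen for this example, not part of Common Lisp.

```lisp
;; Hypothetical example: make [a b c] read as the list (a b c).

(defun make-bracket-readtable ()
  "Return a copy of the standard read table that understands [ ... ]."
  (let ((rt (copy-readtable nil)))    ; NIL means: copy the standard read table
    (set-macro-character
      #\[
      (lambda (stream char)
        (declare (ignore char))
        ;; Read objects until the matching ] and return them as a list.
        (read-delimited-list #\] stream t))
      nil rt)
    ;; Give ] the same token-terminating behaviour as ).
    (set-macro-character #\] (get-macro-character #\) nil) nil rt)
    rt))

(defun read-with-brackets (string)
  "Read one object from STRING using the bracket read table."
  (let ((*readtable* (make-bracket-readtable)))
    (read-from-string string)))

(read-with-brackets "[1 [2 3] (4)]")   ; => (1 (2 3) (4))
```

Because the macro lives in a copy, the global read table is left untouched and [ keeps its ordinary meaning everywhere else.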

Extending Your S-expressions

This page is not an introduction to S-expressions or read macros. For that, if you are an experienced lisp programmer I recommend either the CLtL2 link above or Paul Graham's On Lisp. If you have little-to-no lisp experience, this article is a good gentle introduction and a great read.

I plan on making this section a resource of S-expression and read macro gems that show ways you can extend S-expressions into being the kind of data representation format you need. The extensions are slightly polished bits of CL code I have created for my own use. They should all conform to ANSI/CLtL2 Common Lisp. The main focus for this article is on techniques that are very difficult or impossible in XML.

WARNING: If you aren't one of those aforementioned "experienced lisp programmers" this is going to get really hard really quickly. :)

Printing Circles

A useful built-in read macro is #=. Together with its companion ##, it lets you create self-referential S-expressions: #n= labels an object and #n# refers back to it. This allows you to do things like represent directed graphs and other interesting data structures with little or no effort.

But most importantly it allows you to serialise data without having to disassemble and reassemble an efficient in-memory data structure where large portions of the data are shared. Here is an example where the 2 lisp lists read in are distinct objects (not eq):

* (defvar not-shared '((t) (t)))

NOT-SHARED
* not-shared

((T) (T))
* (eq (first not-shared) (second not-shared))

NIL

But in the following example, with serialised data using the #= read macro, the 2 lists really are the same list:

* (defvar shared '(#1=(t) #1#))

SHARED
* shared

((T) (T))
* (eq (first shared) (second shared))

T
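
The same sharing can survive a full trip through the printer and back through the reader. The sketch below assumes two made-up helpers, serialise and deserialise, and relies on the *print-circle* variable mentioned earlier to make the printer emit #= and ## labels.

```lisp
;; Hypothetical helpers, for illustration only.

(defun serialise (object &optional (circle t))
  "Print OBJECT to a string, labelling shared structure when CIRCLE is true."
  (with-standard-io-syntax
    (let ((*print-circle* circle))
      (prin1-to-string object))))

(defun deserialise (string)
  "Read one object back from STRING."
  (with-standard-io-syntax
    (let ((*read-eval* nil))
      (read-from-string string))))

(defun sharing-survives-p (circle)
  "Round-trip a list whose two elements are the same cons; are they still EQ?"
  (let* ((cell (list t))
         (copy (deserialise (serialise (list cell cell) circle))))
    (eq (first copy) (second copy))))

(sharing-survives-p t)     ; => T, the labels were written and read back
(sharing-survives-p nil)   ; => NIL, two separate (T) lists were written
```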

As another fun example, here's how you can print an infinite list by pointing the cdr of a cons to itself:

* (print '#1=(hello . #1#))

(HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO
 HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO
 HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO
 HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO
 ...

So unless you want that to happen, be sure you set *print-circle* when serialising data structures with cycles in them as in the following example:

* (let ((*print-circle* t))
    (print '#1=(hello . #1#))
    nil)

#1=(HELLO . #1#)
NIL

What's really interesting about the above example (and some others to come on this page) is the use of CL dynamic scope to "shadow" a global variable just for this one function invocation. This is a remarkably powerful feature of lisp: multiple different environments for reading/printing S-expressions can co-exist (at different times, but transparently) - each with its own unique "shadowed" values for global variables. Even read tables can be shadowed like this since the current read table is also stored in a dynamic scope variable (*readtable*).

Notice that you can't do this in Blub++.NET Enterprise Edition (Interesting thoughts about Scheme):

int global_var = 0;

function do_stuff_that_uses_global_var() {
  ...
}

function whatever() {
  int global_var = 1;
  do_stuff_that_uses_global_var(); // This function will still see global_var as 0!
}
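
For contrast, here is a sketch of the Common Lisp counterpart of the pseudocode above. The variable and function names simply mirror the ones in that pseudocode and are otherwise arbitrary.

```lisp
(defvar *global-var* 0)              ; DEFVAR makes the variable dynamically scoped

(defun do-stuff-that-uses-global-var ()
  *global-var*)

(defun whatever ()
  (let ((*global-var* 1))            ; rebinding is visible to every function called
    (do-stuff-that-uses-global-var)))

(whatever)                           ; => 1
(do-stuff-that-uses-global-var)      ; => 0, the old value is back once LET exits
```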

Oh, and don't evaluate this. :)

(progn
  (defun ouch () #1=(progn #1#))
  (compile 'ouch))

A Random S-Expression

Another handy read macro that comes built in with Common Lisp is the #. read-time eval macro. This lets you embed objects into S-expressions that can't be serialised but can be created with a bit of lisp code. One fun example is making S-expressions that become different values each time they're read:

* '(football-game
     (game-started-at #.(get-internal-real-time))
     (coin-flip #.(if (zerop (random 2)) 'heads 'tails)))

(FOOTBALL-GAME (GAME-STARTED-AT 187) (COIN-FLIP HEADS))
* '(football-game
     (game-started-at #.(get-internal-real-time))
     (coin-flip #.(if (zerop (random 2)) 'heads 'tails)))

(FOOTBALL-GAME (GAME-STARTED-AT 309) (COIN-FLIP TAILS))
* (equal * (eval +))

T

Notice that this is not the same as using quasiquote:

* `(football-game
     (game-started-at ,(get-internal-real-time))
     (coin-flip ,(if (zerop (random 2)) 'heads 'tails)))

(FOOTBALL-GAME (GAME-STARTED-AT 791) (COIN-FLIP HEADS))
* (equal * (eval +))

NIL ; unless you're really fast and lucky :)

Let's see this more clearly by stopping evaluation of these forms (in other words, let's quote them):

* ''(football-game
      (game-started-at #.(get-internal-real-time))
      (coin-flip #.(if (zerop (random 2)) 'heads 'tails)))

'(FOOTBALL-GAME (GAME-STARTED-AT 2207) (COIN-FLIP HEADS))
* '`(football-game
      (game-started-at ,(get-internal-real-time))
      (coin-flip ,(if (zerop (random 2)) 'heads 'tails)))

`(FOOTBALL-GAME (GAME-STARTED-AT ,(GET-INTERNAL-REAL-TIME))
  (COIN-FLIP ,(IF (ZEROP (RANDOM 2)) 'HEADS 'TAILS)))

Understand that quasiquote is itself a read macro but that it works differently from read-time evaluation. Quasiquote reads as code that, when evaluated, becomes the desired list. A form handled by #. really reads in as the desired list/object. We can see this more closely by turning off pretty printing for the quasiquote form above:

* (let ((*print-pretty*))
    (print
      '`(football-game
          (game-started-at ,(get-internal-real-time))
          (coin-flip ,(if (zerop (random 2)) 'heads 'tails))))
    nil)

(LISP::BACKQ-LIST
  (QUOTE FOOTBALL-GAME)
  (LISP::BACKQ-LIST
    (QUOTE GAME-STARTED-AT)
    (GET-INTERNAL-REAL-TIME))
  (LISP::BACKQ-LIST
    (QUOTE COIN-FLIP)
    (IF (ZEROP (RANDOM 2)) (QUOTE HEADS) (QUOTE TAILS))))
NIL

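LISP::BACKQ-LIST is exactly the same as the function list except for its pretty printing behaviour. (How backquote expands is implementation-dependent, so the internal names shown here are specific to the lisp used for this transcript; other implementations will print something different.)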
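
The difference in timing can also be observed from code. This is only a sketch: the counter *reads* and the functions tick and compare-read-time-and-backquote are invented for the illustration, and it assumes the code is loaded and run in the same package.

```lisp
(defvar *reads* 0)                   ; hypothetical counter

(defun tick ()
  (incf *reads*))

(defun compare-read-time-and-backquote ()
  (setf *reads* 0)
  (let* ((*read-eval* t)
         ;; #. calls TICK while the string is being read.
         (data (read-from-string "(tick #.(tick))"))
         ;; Backquote only reads as code; TICK is not called yet.
         (code (read-from-string "`(tick ,(tick))")))
    (list data                       ; the finished list
          *reads*                    ; calls to TICK so far
          (eval code)                ; each evaluation calls TICK again
          (eval code))))

(compare-read-time-and-backquote)    ; => ((TICK 1) 1 (TICK 2) (TICK 3))
```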

Exercise: Where did the commas go in the above "ugly-printed" form? (Once you understand this, you understand quasiquote)

A Common Lisp Reader Extension: #"

Now let's really get down and dirty and extend some S-expressions.

Like most languages, lisp normally delimits strings with the " (double quote) character and allows you to include " characters in the string by using the \ (backslash) character as an escape character. But sometimes we might want to include " characters in strings without going to the trouble of escaping them. Here is a handy S-expression extension:

(defun |#"-reader| (stream sub-char numarg)
  (declare (ignore sub-char numarg))
  (let ((chars))
    (do ((prev (read-char stream) curr)
         (curr (read-char stream) (read-char stream)))
        ((and (char= prev #\") (char= curr #\#)))
      (push prev chars))
    (coerce (nreverse chars) 'string)))

(set-dispatch-macro-character #\# #\" #'|#"-reader|)

The above extension adds a read macro called #" to the current read table. The read macro reads until it sees the two terminating characters "# . We can use it like so:

* #"string with " characters. look, no "escapes"."#

"string with \" characters. look, no \"escapes\"."

Another powerful feature of some read macros (though not #") is that they can be nested - an area XML is very weak in. See the #| nestable comment read macro for an example.
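
A small sketch of that nesting, using only the standard reader (the function name read-past-nested-comment is made up): the inner |# closes only the inner comment, so the outer comment carries on until its own |#. An XML comment, by comparison, cannot contain another comment.

```lisp
(defun read-past-nested-comment ()
  "Read the first object that follows a comment containing another comment."
  (read-from-string
    "#| outer #| inner |# still inside the outer comment |# (visible)"))

(read-past-nested-comment)   ; => (VISIBLE)
```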

Another Common Lisp Reader Extension: #>

This extension is similar to the above but is instead modeled after Perl's << (here-document) operator, which lets you quote a block of text up until a certain string is found in the input:

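(defun |#>-reader| (stream sub-char numarg)
  (declare (ignore sub-char numarg))
  (let ((chars))
    ;; Read the terminating pattern: everything up to the first newline.
    (do ((curr (read-char stream)
               (read-char stream)))
        ((char= #\newline curr))
      (push curr chars))
    ;; CHARS holds the pattern reversed, the same order OUTPUT is built in.
    (let ((len (length chars))
          (output))
      ;; Keep reading until the most recent LEN characters spell the
      ;; pattern. Comparing the tail each time also copes with input such
      ;; as "EEND", where a partial match overlaps the real terminator.
      (do ((n 0 (1+ n)))
          ((and (>= n len)
                (not (mismatch chars output :end2 len))))
        (push (read-char stream) output))
      ;; Drop the pattern from the front of the reversed output.
      (coerce
        (nreverse
          (nthcdr len output))
        'string))))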

(set-dispatch-macro-character
  #\# #\> #'|#>-reader|)

Here is an example use:

* #>END
I can put anything here! ", \, "#, and ># are
no problem! The only thing that will terminate
the reading of this string is...END

"I can put anything here! \", \\, \"#, and ># are
no problem! The only thing that will terminate
the reading of this string is..."

Notes

  1. Sam Steingold mentioned his excellent S-expression extension which lets you parse XML using the CL reader.
  2. Bill Clementson has some interesting code.
  3. Michal Czardybon wrote and told me about his data representation language Harpoon.
  4. Interesting thoughts of Elliotte Rusty Harold on the future of XML: "I know I said that XML in software development was dead, but maybe a spark of life remains."

Questions? Comments? New examples of S-expression extensions you'd like added (with full credit, of course)? Send them to me.

All material is © Doug Hoyte and/or HCSW unless otherwise noted or implied.