Chapter 3. XSLT

As you are surely aware by now, XSLT is a language for transforming XML. XSLT is part programming language and part markup language. You don't need to be afraid of learning it, as it is quite simple to pick up. The first section of this chapter deals with the more technical side of what XSLT is and how it fits into the world of XML, HTML, and SGML; if you are already familiar with those other languages, or you just want to dive straight into making your own templates, feel free to skip past it. Subsequent sections give a brief introduction to writing XSLT. They borrow heavily from the tutorial*.xsl series.

3.1. History, Standards, and Specifications

Decades ago, in the mid- to late 1960s, around the dawn of the Internet, IBM had the vision of a generalized markup language: a single language that could handle the complexity of adding structure to data of all sorts. This GML later evolved into SGML, the Standard Generalized Markup Language. Since SGML is not a specific language as such, but rather a common syntax for writing markup languages, it's helpful to distinguish between the "language" itself and specific "instances" of that language. The two most successful languages to come out of SGML are HTML and XML.

HTML (HyperText Markup Language) is a formatting language for presenting pages served over HTTP (HyperText Transfer Protocol). In other words, it's the markup behind all that stuff you see when you surf the Internet.

One of the limitations of SGML is that its syntax is quite complicated and sometimes quite bizarre. This was one of the reasons for developing XML (eXtensible Markup Language). Like SGML, XML is not a language per se, but rather a syntax for creating languages, and is on the whole a reasonable, sensible subset of SGML. As with SGML, it is helpful to distinguish between XML itself and "instances" of XML. Two other ideas that are essential to XML are the notions of "well-formedness" and "validity". A file is well-formed if it conforms to the syntax of XML. But the XML specification attaches no "meaning" to a file. For a file to have meaning, it must be associated with a definition of a particular type of document (a weblog, a news feed, an article, a web page, and so on). The file is said to be valid if it conforms to that definition.

So by analogy, the sentence "Sera kissed Joachim." is a valid, well-formed sentence in English. On the other hand, "gemorgle floozed the buimfap." may be well-formed in English: it seems by and large to be pronounceable and to follow the rules of English phonetics, and we can break it down and look at its structure, saying it has a subject ("gemorgle") which in the past tense performed some action ("floozed") on some object ("the buimfap"). But those words aren't real and have no understandable meaning, so the sentence can't be called valid English. And finally, sentences like "h~ti%791 gT", "ngrflp'mtk", or even "parlez-vous français?" and "doumo, o-sewa ni narimasita." do not even qualify as well-formed English (though some are valid and well-formed in other languages).
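To make the well-formedness half of that distinction concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree module. The helper name is_well_formed and the two sample documents are invented for this illustration. Note that this parser does not validate: it can tell you whether a file is well-formed, but checking validity against a DTD needs a validating tool, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, i.e. it is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejects anything that breaks XML syntax.
        return False
    return True


# Hypothetical sample documents, made up for this illustration.
good = "<entry><title>Hello</title></entry>"
bad = "<entry><title>Hello</entry></title>"  # tags closed in the wrong order

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Like "gemorgle floozed the buimfap.", the first document passes this check whether or not any definition gives <entry> and <title> a meaning.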

There are a large number of popular instances of XML, such as XHTML, RSS, RDF, and DocBook. RSS you'll recognize as a type of news feed; RDF (Resource Description Framework) is a more general format for describing resources, and RSS 1.0 is built on it. Around the time HTML version 4.01 came out, XHTML version 1.0 was created. XHTML was designed as a redefinition of HTML (an instance of SGML) as an instance of XML. Because of this, files written in XHTML look nearly identical to files in HTML but can be parsed and manipulated with XML tools rather than needing more esoteric SGML-parsing tools.

Another piece of SGML is the DTD (Document Type Definition). DTDs are used to define what "valid" means for a particular type of XML. (Actually, DTDs are used for HTML and other SGML languages as well.) Discussing DTDs is well beyond the depth of this guide, but they are brought up here to help give a more complete picture of the entire SGML/XML family.

It was soon realized that a standard language for transforming XML would be necessary. Much as SGML is used to define valid SGML (via DTDs), XML is used to transform XML via XSLT (eXtensible Stylesheet Language Transformations). XSLT is an instance of XML and so is subject to all the rules of well-formedness. Because XSLT can be used to transform XML into any text format, no single DTD for XSLT is available; rather, a different one must be used depending on what language your output is in.
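Because a stylesheet is itself XML, any XML parser can read it. The sketch below, again in Python with only the standard library, parses a tiny stylesheet and inspects it as ordinary XML. The stylesheet is made up for this illustration (it is not taken from the tutorial*.xsl series), and the Python standard library cannot actually run a transformation; this only shows that XSLT obeys the same well-formedness rules as any other XML file.

```python
import xml.etree.ElementTree as ET

# Namespace that identifies XSLT elements.
XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# A tiny stylesheet, invented for this illustration.
stylesheet = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/foo/bar"/>
  </xsl:template>
</xsl:stylesheet>
"""

# Parsing succeeds only because the stylesheet is well-formed XML.
root = ET.fromstring(stylesheet)

# ElementTree writes namespaced tags as "{namespace}localname".
templates = root.findall("{%s}template" % XSL_NS)

print(root.tag == "{%s}stylesheet" % XSL_NS)  # True
print(len(templates))                         # 1
print(templates[0].get("match"))              # /
```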

One of the essential components of XSLT is XPath (XML Path Language), a simple language for selecting XML nodes. In its most basic form it looks quite a lot like file paths in POSIX; for example, "/foo/bar" selects the element <bar>, which is under <foo>, which is under the document root "/". In addition to matching different XML nodes, XPath also provides some minimal functions which are helpful for manipulations (things like basic mathematical operators, a function to count how many nodes were selected, etc.).
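As a rough illustration of path-style selection, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. The sample document is invented for this example. Two assumptions to keep in mind: the parsed object already stands for the root element <foo>, so the relative path "bar" takes the place of the absolute "/foo/bar"; and this subset has no count() function, so Python's len() stands in for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this illustration.
doc = ET.fromstring("<foo><bar>one</bar><baz/><bar>two</bar></foo>")

# fromstring() returns the root element <foo>, so the relative path "bar"
# selects the same nodes that the XPath "/foo/bar" would.
bars = doc.findall("bar")

texts = [b.text for b in bars]
print(texts)      # ['one', 'two']
print(len(bars))  # 2; stands in for XPath's count() function
```

Inside an XSLT stylesheet the full expression would be written directly, without this workaround.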

Other topics associated with XSLT which are beyond the scope of this guide but which may be of interest are XQuery (XML Query), QNames, and XSL-FO (XSL Formatting Objects).

So that was a quick breeze-through of the history and languages surrounding XSLT. For some more history about SGML, http://www.sgmlsource.com/history/ has some articles that may be of interest. In order to help developers keep all this stuff straight, there's a standards body known as the W3C (the World Wide Web Consortium), which drafts the official specifications for these and other topics. Their website is at http://www.w3.org/, where you can find the specifications for XML 1.0, XSLT 1.0, and XPath 1.0.

If you're looking for less technical, more instructive descriptions, one good source is W3Schools. They have tutorials on XML, XSLT, XPath, XHTML (and HTML), and DTD, as well as a good tutorial for CSS (Cascading Style Sheets). CSS is used to help provide better formatting with (X)HTML and to help with the separation of content and presentation of data. It is syntactically unrelated to SGML.

Warning

Be forewarned: the W3Schools tutorial for XPath is for XPath 2.0. Libxml, and therefore Paperboy, uses XPath 1.0 instead. The general ideas are the same for both; the biggest difference is that XPath 2.0 added many more functions and moved them all under the http://www.w3.org/2005/02/xpath-functions namespace, using the prefix abbreviation "fn:". I've yet to locate a tutorial of equivalent quality that lists the functions available under XPath 1.0 since W3Schools updated theirs.