Give DOM/SAX a Try
The eXtensible Markup Language (XML) has really taken off over the last few years, to the point that you can’t escape it in any visit to the computer book store. But while XML has thoroughly permeated the world of IT, its presence in the scientific world is far less pervasive, which is a shame, because it offers a number of advantages over do-it-yourself formats. For example, because XML is standardized, it is supported by many tools and libraries, making parsing and probing XML documents a breeze. In this short tutorial, I want to show you how you can use XML to develop data interfaces for legacy applications.
The format of XML is actually pretty straightforward. If you have ever seen HTML code, you are already acquainted with an XML format. XML basically allows you to define elements, such as the <p> tag that defines a paragraph element in HTML. Each element can have attributes, like the href attribute in the HTML anchor tag (e.g. <a href="http://www.macresearch.org">), and embedded sub-elements and text.
To demonstrate, I am going to walk through a real example. I recently needed to use output from a Fortran program in some Python scripts. The Fortran program wrote its output in a form that was not very easy to parse, making it difficult to use in other programs and scripts. So I added a second mode of output that generated XML. The XML printed looked like this:
<kffile>
  <section id='General'>
    <variable id='file-ident' length='6' type='3' />
    <variable id='jobid' length='160' type='3' />
    <variable id='title' length='160' type='3' />
    <variable id='Molecular_Weight' length='1' type='2' />
    <variable id='runtype' length='160' type='3' />
    <variable id='nspin' length='1' type='1' />
    <variable id='nspinf' length='1' type='1' />
    <variable id='ldapot' length='1' type='1' />
    <variable id='xcparv' length='1' type='2' />
    <variable id='ldaen' length='1' type='1' />
    <variable id='xcpare' length='1' type='2' />
    <variable id='ggapot' length='160' type='3' />
    <variable id='ggaen' length='160' type='3' />
    <variable id='lhybrid' length='1' type='4' />
    <variable id='hybrid' length='160' type='3' />
    <variable id='iopcor' length='1' type='1' />
    <variable id='ioprel' length='1' type='1' />
    <variable id='electrons' length='1' type='2' />
    <variable id='unit of length' length='1' type='2' />
    <variable id='unit of angle' length='1' type='2' />
    <variable id='lfrozend' length='1' type='4' />
    <variable id='scfmod' length='160' type='3' />
  </section>
  <section id='Geometry'>
    <variable id='grouplabel' length='160' type='3' />
    <variable id='Geometric Symmetry' length='160' type='3' />
    <variable id='symmetry tolerance' length='1' type='2' />
    <variable id='orient' length='12' type='2' />
...
This is a fairly basic XML document. There is an all-enclosing kffile element, which contains nested section elements. Each section element in turn nests a number of variable elements. The section and variable elements each have one or more attributes. For example, a variable has an identifier (id), data length (length), and data type (type). (Note that the data itself is not included in this particular scheme, but could easily be added.)
The advantage of using an XML format to dump structured data is that reusing that data then becomes a breeze. For example, here is a Python program to read in the data above, and print out all of the section and variable names:
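As a rough sketch of the writing side, here is how a document with this layout could be produced from Python with the standard library's xml.etree.ElementTree. The original file was written by a Fortran program whose code is not shown, so the function name build_kffile and the shape of its input are assumptions made purely for illustration; only the element and attribute names are taken from the listing above.

```python
import xml.etree.ElementTree as ET


def build_kffile(sections):
    """Build a kffile element from {section_id: [(var_id, length, type), ...]}.

    The function name and the input layout are hypothetical; only the
    element and attribute names come from the example document.
    """
    root = ET.Element('kffile')
    for section_id, variables in sections.items():
        section = ET.SubElement(root, 'section', id=section_id)
        for var_id, length, type_code in variables:
            # Attribute values must be strings; ElementTree escapes them on output.
            ET.SubElement(section, 'variable', id=var_id,
                          length=str(length), type=str(type_code))
    return root


sections = {
    'General': [('jobid', 160, 3), ('nspin', 1, 1)],
    'Geometry': [('orient', 12, 2)],
}
xml_text = ET.tostring(build_kffile(sections), encoding='unicode')
print(xml_text)
```

Letting a library assemble the elements, rather than printing tags by hand, means that quoting and escaping of attribute values is handled for you.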
#!/usr/bin/env python3
from xml.dom.minidom import parse

# Read the whole document into a DOM tree.
dom = parse('dump.xml')

# Print each section name, followed by its indented variable names.
for section in dom.getElementsByTagName('section'):
    print(section.getAttribute('id'))
    for variable in section.getElementsByTagName('variable'):
        print(' ', variable.getAttribute('id'))
That’s all. Tiny. The parsing itself is a single line of code. And don’t think that this is only possible in Python. Libraries to parse XML are commonplace in nearly all languages, and typically just as easy to use.
You can parse XML in a number of ways. One option is a so-called SAX parser, which reads through the XML document from start to finish and calls a function for each element that it encounters. The advantage of this approach is that you don’t need to read the whole document into memory.
The parser used here is a Document Object Model (DOM) parser. It reads in the whole document, and represents it internally as a tree-like structure called the DOM-tree. Once you have this tree, you can do anything you like with it. You can traverse it, like we have done here, or you can modify and print it back out again. The advantage of a DOM parser is that it is typically much easier to perform operations on the data. A disadvantage is that you have to load the whole document, which could be an issue if you have lots of data.
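For comparison, the same listing of section and variable names can be sketched with the SAX interface in Python's standard library. The class name NameCollector and the inline sample document are assumptions for illustration; to read a file instead, xml.sax.parse('dump.xml', handler) would take the place of parseString.

```python
import xml.sax

# A small inline sample in the same layout as the document above.
SAMPLE = b"""<kffile>
<section id='General'>
<variable id='jobid' length='160' type='3' />
<variable id='nspin' length='1' type='1' />
</section>
<section id='Geometry'>
<variable id='orient' length='12' type='2' />
</section>
</kffile>"""


class NameCollector(xml.sax.ContentHandler):
    """Collects section and variable ids as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        # Called once for every opening tag; no tree is ever built.
        if name == 'section':
            self.lines.append(attrs.getValue('id'))
        elif name == 'variable':
            self.lines.append('  ' + attrs.getValue('id'))


handler = NameCollector()
xml.sax.parseString(SAMPLE, handler)
print('\n'.join(handler.lines))
```

The handler only ever sees one element at a time, which is what keeps memory use low, but it also means any context you need (such as which section you are currently in) has to be tracked by hand.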
That’s it for this brief introduction to XML. In conclusion, adding an XML output option to your legacy C or Fortran application can be a simple way to make it much more useful, by providing better integration with other applications and scripting languages.



Comments
FoX: Fortran/XML
Hi - if you'll excuse the quick plug, you might well be interested in FoX, which is a library in pure Fortran allowing XML input/output from Fortran without requiring any additional dependencies.
It lets you write out XML in a natural Fortran-esque idiom, and guarantees well-formedness (which is easy to get wrong if you're not careful). It also has both a SAX (level 2) and DOM (level 3) input interface.
It's a very quick and easy way of adding XML capabilities to existing Fortran codes, without having to worry about escaping characters and other XML minutiae. And it's in use in several computational physics/chemistry simulation codes.
You'd write the above output something like:
Which can seem verbose but it's easy enough to write a subroutine:
And the above parser could be written:
Anyway - it's freely available with full documentation - see its homepage
Remember standard XML types exist!
Another good thing to remember is that a variety of standardized science/math XML formats already exist. So consider a bit of Google research and see if you can output an existing XML data format, rather than creating a custom one.
For example, in chemistry, there's CML:
CML homepage
For a variety of physical science data, there's CDF in XML (CDFML):
CDF homepage
In short, the idea of XML is to help standardize formats, so if you're considering XML output, take the time to see if there's already something existing. The result will be an improvement for data interchange.
NSXML
...and of course us Cocoa junkies have been very happy since 10.3 introduced (and 10.4 significantly improved) XML support in the Cocoa frameworks. NSXML is pretty easy to use and very powerful. Built on libxml et al., it allows you to read, create and write XML files, either through a parser or by working with the DOM tree directly. Finally, you can use both XPath and XQuery to do specific lookups, and XSLT to convert one XML format to another. Cool stuff! Drew, perhaps a nice subject for your next Cocoa tutorial ;-)
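NSXML itself is an Objective-C API, so the following is only an analogous sketch in Python, the language used in the tutorial: ElementTree supports a limited subset of XPath, which is enough to pick specific variables out of a document in the tutorial's layout. The sample data and the particular query are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<kffile>
<section id='General'>
<variable id='jobid' length='160' type='3' />
<variable id='nspin' length='1' type='1' />
</section>
<section id='Geometry'>
<variable id='orient' length='12' type='2' />
</section>
</kffile>"""

root = ET.fromstring(SAMPLE)

# Select every variable of type 3 inside the section named General.
matches = root.findall("./section[@id='General']/variable[@type='3']")
names = [v.get('id') for v in matches]
print(names)
```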
Standard XML Data Definitions
Good advice. You should certainly see if there is a standard for what you would like to export.
But XML is also useful when there is no standard data type definition. Often when you are working with an application, like the one I was working on, the data you wish to export is fairly specific to the software in question. In that case, there's nothing wrong with creating your own definition.
---------------------------
Drew McCormack
http://www.maccoremac.com
http://www.macanics.net
http://www.macresearch.org
XML in Fortran
Nice link. Thanks!
---------------------------
Drew McCormack
XML is not suitable for everything
XML is good - especially when you don't reinvent the wheel. It is also very verbose, and as such it is not suitable for either the very simple (a Windows INI-style file for simple non-hierarchical configuration information) or the very large. In one project proposal, a contractor proposed to do everything in XML, until we made it clear that the prototype used raw binary Fortran output, 10 GB per file.
The choice was later made between HDF5 and NetCDF. For both, tools exist to translate a file from the native representation into an XML version. We went with HDF5 because it already allows internal compression that is transparent on read.
In the near future NetCDF 4 will use the same data format as HDF5, and the API will handle both.
At present there is no clean Objective-C interface for HDF5, but there is a very good one in Python (PyTables).
Maarten
Large Data Sets in XML
Good point Maarten. I've used HDF5 in some of my projects, and can recommend that route if you have a lot of data.
Another option for large datasets is to use the iTunes/iPhoto approach: describe your data with an XML file, and store the raw binary in separate files that are referred to in the XML document by path.
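A minimal sketch of that idea in Python, using only the standard library, might look like the following. This is not the actual iTunes or iPhoto file layout: the dataset element, the typecode and path attributes, and the function names are all hypothetical, chosen just to show an XML index pointing at raw binary files.

```python
import array
import os
import tempfile
import xml.etree.ElementTree as ET


def write_dataset(directory, name, values):
    """Store values as raw doubles and describe them in an XML index file."""
    data_path = os.path.join(directory, name + '.bin')
    with open(data_path, 'wb') as f:
        array.array('d', values).tofile(f)

    root = ET.Element('dataset')
    # The path is stored relative to the XML file so the pair can be moved together.
    ET.SubElement(root, 'variable', id=name, length=str(len(values)),
                  typecode='d', path=name + '.bin')
    index_path = os.path.join(directory, name + '.xml')
    ET.ElementTree(root).write(index_path)
    return index_path


def read_dataset(index_path):
    """Follow the path attributes in the XML index and load the raw data."""
    directory = os.path.dirname(index_path)
    result = {}
    for var in ET.parse(index_path).getroot().findall('variable'):
        values = array.array(var.get('typecode'))
        with open(os.path.join(directory, var.get('path')), 'rb') as f:
            values.fromfile(f, int(var.get('length')))
        result[var.get('id')] = values.tolist()
    return result


with tempfile.TemporaryDirectory() as tmp:
    index = write_dataset(tmp, 'orient', [1.0, 0.0, 0.5])
    loaded = read_dataset(index)
print(loaded)
```

Note that this sketch writes the binary data in the machine's native byte order, so a real format would also want to record the byte order in the XML index.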
Drew
---------------------------
Drew McCormack
Binary files
While on the subject of binary files: if your files are going to be used by anyone but yourself, save those others the trouble of writing their own reader from your documentation by just using a well-supported format (like HDF5). Only if you store plain homogeneous arrays might you get away with a raw binary file.
Case in point: there is a series of Earth-observing satellites in polar orbit. Each day in the early afternoon (around 13:30) they make an overpass over you. Most of these instruments use HDF (4 or 5, depending on when they were designed) to store and share their data. This common data format makes it a lot easier to use the data from all these sensors. The single instrument that doesn't use a well-supported and self-describing* format is used much less frequently in comparisons. You really need the help of the institute where it originated to use the data. This may be a good idea, since they are aware of all the warts in the data, but having more people investigate the data ensures that problems are spotted earlier. Besides, HDF was designed by people with more experience and knowledge of computer systems than most scientists. That avoids a lot of issues (endianness, …). I tend to consider HDF5 as a binary form of XML, suitable for huge data sets.
Remember that any piece of software you write now as a quick hack will be around far longer than you dare to imagine right now. For that reason alone, you want to be sure that the data it writes is readable and documented.
* Self-describing means that the different fields in a data file are described as well. You do not need external clarification to read an unknown HDF file. You may need some extra instructions to fully understand the data (as with some XML formats), but that just depends on the field you are applying it to. Another advantage is that you can add a field without breaking existing software.
More than just use by others...
You have a great point -- software is around far longer than you dare to imagine! Particularly true for scientific software.
But it's not just the benefit of others who might use your binary data -- it's also to your own benefit to use a common and/or self-describing file format. Imagine you come back to some code 5-10 years later and need to update it for the Xphone Extreme version.
If your binary format is difficult, you might not be able to deal with new programming features. I've cursed myself for using poor file export when I realized I wanted to add feature X or Y and the code for writing the binary file was a hack.
Don't believe me? Ever run into problems with Microsoft Word documents not quite perfectly translating between different versions of ... Word itself?
So formats like HDF and XML are really good for saving yourself in the future.