ABSTRACT

A widespread problem in all branches of information technology is the storage and interchange of data. Each application has its own particular way of storing the information it generates, which is often a problem, especially when we don’t have the application that generated the data. For example, DNA sequencers made by Applied Biosystems generate data and store it in files with the extension .ab1. If we want to access data stored in such a file, we need to know how it is structured internally. In this case, the creator of the format has released the specification of the file,1 and it would be possible, though not necessarily easy, to write code to extract our data from these files. Usually we are not so lucky: it is very common to find data file formats that are poorly documented or not documented at all. In many cases those who have wanted to open these files have had to resort to “reverse engineering,” with mixed results.

To avoid this type of problem and to make the exchange of data between applications from different manufacturers more fluid, the W3C2 developed the eXtensible Markup Language, better known as XML.

XML is a way of representing data. What kind of data? Practically any

type can be represented using XML. Configuration files, databases, web pages, spreadsheets, and even drawings can be represented and stored in XML. For some specific applications, XML dialects have been defined, each prepared for representing a particular type of data. So, mathematical formulas can be stored in an XML dialect called MathML,3 vector graphics in SVG,4 chemical formulas in CML,5 and page printouts can be represented with

In addition to the above formats, many other applications store their data in XML. This means that, by learning to read XML, we can access a multitude of files from the most diverse origins.

Before going into details on how to process this type of file, I want to share a W3C document called “XML in 10 points” that shows the big picture:

1. XML is for structuring data: Structured data includes things like spreadsheets, address books, configuration parameters, financial transactions, and technical drawings. XML is a set of rules (you may also think of them as guidelines or conventions) for designing text formats that let you structure your data. XML is not a programming language, and you don’t have to be a programmer to use it or learn it. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous. XML avoids common pitfalls in language design: it is extensible, platform-independent, and it supports internationalization and localization. XML is fully Unicode-compliant.
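To make the first point concrete, here is a minimal sketch of a program generating structured data as XML and reading it back. It uses the xml.etree.ElementTree module from the Python standard library. The address-book layout, the tag names (addressbook, contact, name, email), and the sample records are assumptions chosen only for illustration; they do not come from the W3C document.

```python
import xml.etree.ElementTree as ET

# Hypothetical address-book records; names and fields are illustrative only.
contacts = [
    {"name": "Ada Lovelace", "email": "ada@example.org"},
    {"name": "José Pérez", "email": "jose@example.org"},
]

# Generate: build an element tree and serialize it to a text string.
root = ET.Element("addressbook")
for contact in contacts:
    entry = ET.SubElement(root, "contact")
    ET.SubElement(entry, "name").text = contact["name"]
    ET.SubElement(entry, "email").text = contact["email"]
xml_text = ET.tostring(root, encoding="unicode")

# Read: parse the text back and rebuild the list of records.
parsed = ET.fromstring(xml_text)
recovered = [
    {"name": c.findtext("name"), "email": c.findtext("email")}
    for c in parsed.findall("contact")
]

print(xml_text)
print(recovered == contacts)
```

The round trip recovers the original records, including the accented characters, because every value sits inside a named element and the text is handled as Unicode throughout.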