Plain text, XML trees and graphs – and why reality takes more than this

Did you ever wonder why plain text files cannot hold structured data very well, for example why the “good old” ini files look so awful and chaotic in larger applications? Of course, it’s because plain text is plain. Not “plain” as in “without colors and nice font variations”, but “plain” as in “flat, linear”. It has lines, one after another, and each line has characters, one after another. And that’s basically all.

Ini files

Ini files, though never formally specified, have a little more structure: section headings in square brackets, and key-value pairs. Maybe even comments. But still, complex data is hard to handle that way. (So is there a formal distinction between plain text files and ini files? Think about it, because the rest of the text offers 11 more opportunities to ask a similar question, which I won’t do.)
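
To make the flatness concrete, here is a minimal sketch using Python's standard `configparser` module. The section and key names are made up for illustration. It tries to fake nesting with dots in a section name, and the parser simply treats that as one more flat name.

```python
import configparser

# Hypothetical configuration; the dotted section name is an attempt
# to express nesting that the format itself does not know about.
INI_TEXT = """
[database]
host = localhost
port = 5432

[database.replica.1]
host = replica1.example
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Sections come back as a flat list of names, and every value is a string.
sections = parser.sections()
port = parser["database"]["port"]

print(sections)
print(repr(port))
```

Any hierarchy, typing or list handling has to be invented by the application on top of this, which is where the chaos in larger applications tends to come from.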

XML

This is where XML comes in. It is completely hierarchical, with a root element and any number of nested child elements. Then there are text nodes and attribute nodes, where text nodes can contain any text they like, and attributes consist of keys and values. This is a lot more structure than a plain text file, and it is suited to represent almost any information you can think of. If you are used to XML, you just think in hierarchical structures, as if the whole world consisted of nodes with subnodes. And this seems only natural, since reality surely doesn’t consist of something linear like a plain text file. Sometimes it’s not exactly clear what goes where: when you put statistics of what has been bought where and when into an XML file, is the <location> a child of <time>, or vice versa? Who cares, as long as it is a tree. If you really need it the other way round, you can always come up with some simple 143-line XSLT to transform it on the fly.
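
The <location>/<time> dilemma can be sketched in a few lines. The following is only an illustration, written in Python with the standard `xml.etree.ElementTree` module instead of XSLT; the `value` attribute, the <item> elements and the sample data are assumptions, not part of any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase statistics, grouped by time first.
XML_TEXT = """
<purchases>
  <time value="2010-05">
    <location value="Berlin"><item>book</item></location>
    <location value="Hamburg"><item>pen</item></location>
  </time>
  <time value="2010-06">
    <location value="Berlin"><item>lamp</item></location>
  </time>
</purchases>
""".strip()


def invert(root):
    """Build a new tree in which <location> is the parent of <time>."""
    new_root = ET.Element(root.tag)
    by_location = {}
    for time in root.findall("time"):
        for location in time.findall("location"):
            name = location.get("value")
            if name not in by_location:
                by_location[name] = ET.SubElement(new_root, "location", value=name)
            new_time = ET.SubElement(by_location[name], "time", value=time.get("value"))
            # Copy the leaf elements into their new position.
            for item in location:
                ET.SubElement(new_time, item.tag).text = item.text
    return new_root


inverted = invert(ET.fromstring(XML_TEXT))
print(ET.tostring(inverted, encoding="unicode"))
```

Both trees carry the same facts. The nesting order is a decision forced by the format, not a property of the data.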

Graphs

So let me tell you something important: the world is not a hierarchical thing, it’s not a tree. There are more than just linear structures and trees; for example, there are graphs. Stepping up from XML to something graph based is like stepping up from plain text to XML. Some of the XML advocates would jump in yelling “But XML can represent graphs, like with RDF or OWL”. That’s true, because an RDF document is just a special form of XML document. Just as an XML document is just a special case of a plain text document. And because the is-a relationship is transitive, an RDF document is also a plain text file, and as such, even plain text is capable of representing a graph.
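
A small sketch shows where a graph stops fitting into a tree. It stores edges as (subject, predicate, object) triples, loosely in the spirit of RDF, and checks whether the result is tree shaped: exactly one root, at most one parent per node, everything reachable from the root. The names and data are invented for illustration.

```python
from collections import defaultdict

# Hypothetical data as (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "alice"),    # closes a cycle
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),    # "acme" now has two incoming edges
]


def is_tree(triples):
    """Return True if the directed edges form a single rooted tree."""
    nodes = set()
    parents = defaultdict(set)
    children = defaultdict(set)
    for subject, _predicate, obj in triples:
        nodes.update((subject, obj))
        parents[obj].add(subject)
        children[subject].add(obj)

    # A tree node has at most one parent.
    if any(len(p) > 1 for p in parents.values()):
        return False

    # A tree has exactly one node without a parent.
    roots = [n for n in nodes if n not in parents]
    if len(roots) != 1:
        return False

    # Every node must be reachable from the root.
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen == nodes


print(is_tree(TRIPLES))
print(is_tree([("a", "has", "b"), ("a", "has", "c")]))
```

To squeeze the first data set into XML, at least one edge has to become an ID reference instead of a parent-child nesting, which is exactly the kind of encoding that RDF/XML performs.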

Even more?

We’ve just seen three levels of structures and higher-order structures. So is an RDF-like graph the highest order of structure? No, because in RDF, you have triples of subject, predicate and object. You can express much with it, but there are lower-level graphs, where connections are not directed, and higher-level ones, where a connection can have more participants than just subject, predicate and object, for example an adverb. Or you can bind the adverb to the triple, which needs reification (treating the relation as an entity, so it may be the endpoint of some other relation), but RDF has only very poor support for reification. Or you can have any number of objects bound by a single relation… the possibilities are endless.
Yet, when a programmer has to deal with information and thinks about graphs, it’s too easy to get stuck at what RDF gives you, just as it is too easy to mistake an RDF graph for its XML tree representation, and that tree for the linearized plain text file. This is a mental hurdle that each programmer has to overcome (and one that even I suffered from). Sometimes it’s even wrong to think in documents, and you have to deal with the entities on another level, which may also be a mental problem.
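
As a rough sketch of what lies beyond triples, the following Python models a relation with any number of named roles, and makes the relation itself a value that can take part in another relation (reification). The class names, role names and the `relate` helper are hypothetical, chosen only for this illustration. This is neither RDF nor how Cubenet does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str


@dataclass(frozen=True)
class Relation:
    """A relation with any number of named roles.

    Because it is an ordinary hashable value, it can itself be a
    participant in another relation.
    """
    kind: str
    roles: tuple  # tuple of (role_name, participant) pairs


def relate(kind, **roles):
    """Hypothetical helper: build a Relation from keyword roles."""
    ordered = tuple(sorted(roles.items(), key=lambda pair: pair[0]))
    return Relation(kind, ordered)


alice, bob, book = Entity("alice"), Entity("bob"), Entity("book")

# More participants than subject, predicate and object.
gives = relate("gives", giver=alice, gift=book, recipient=bob)

# The "adverb" is attached to the relation itself, not to alice or bob.
manner = relate("manner", target=gives, how=Entity("reluctantly"))

print(dict(manner.roles)["target"] is gives)
```

A plain triple is then just the special case of a relation with exactly two roles.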

Limiting technology

On the other hand, the technologies we use to manipulate that data are mostly web based, which means built on top of HTTP, which is a protocol to request – guess what – plain text files. We do not have the technology to address actual entities or relations as nicely as we address plain text files. There are efforts to do this, but I call them “suboptimal” at best.
Then there is tabular data, which is best kept in a relational database. Of course, you can always put it into texts, trees or graphs, but it does not really fit. Just as you can put any text, tree or graph into some database tables.
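
Putting a graph into database tables can look like this. The sketch uses Python's standard `sqlite3` module with an in-memory database; the table layout and the sample triples are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("bob", "works_at", "acme"),
    ],
)

# Following two edges in the graph means joining the table with itself.
rows = conn.execute(
    """
    SELECT t1.subject, t2.object
    FROM triple AS t1
    JOIN triple AS t2 ON t1.object = t2.subject
    WHERE t1.predicate = 'knows' AND t2.predicate = 'knows'
    ORDER BY t1.subject
    """
).fetchall()

print(rows)
conn.close()
```

It works, but every additional hop costs another self-join, which is one way of seeing that the graph "does not really fit".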

Summary

To summarize my key points:

  • XML is better than plain text, because trees are more advanced than linear structures
  • Graphs are even better than trees
  • There’s even something “above” graphs, but this is too complex to be addressed here
  • RDF seems to be a nice representation of graphs, but it only models one special case of general graphs
  • It’s nice to be able to map graphs to trees and map them to linear plain text
  • Tables are just another way to store information, sometimes better than text, trees and graphs, sometimes worse
  • But we need protocols that work on another layer than just addressing plain text documents by their URL

These are some of the things that Cubenet aims to do better. I write this partly because I’m sure I already know how to do better, and partly because I must remind myself that some of this is still a great mystery to me.

One thought on “Plain text, XML trees and graphs – and why reality takes more than this”

  1. Ideally such a format should have the following traits:
    – human-readable
    – ability for fast non-linear parsing
    – flexible
    – extendable
    – validatable
    – resistant to encoding issues
    But in reality you will have to make sacrifices.
    Flat data will never be fast and fast data will usually not be human-readable. At least not if you are working with large amounts of data.
    Internally I would work with some kind of cached binary format while using some kind of XML for the exchange. Transferring data is way slower than parsing it in most cases anyway.
