Chapter 4 XML

Even if this section mostly deals with Xml features, every explained technique can be used with other data-structures as well.

Note: this section is not very up-to-date. In particular, the explanations about genericTraversal functions should be replaced by some strategies.

4.1 Manipulating Xml documents

This example is inspired from a practical study. It involves a simple modeling of the Platform for Privacy Preferences Project (P3P), developed by the World Wide Web Consortium, which is emerging as an industry standard providing a simple, automated way for users to gain more control over the use of personal information on Web sites they visit.

Given a client and a server, the problem consists in verifying that the client's policy is compliant with the server's policy. Both policies preferences are expressed in APPEL (A P3P Preference Exchange Language) and written in Xml. The server's policy is the following file server.xml:

<POLICIES xmlns="http://www.w3.org/2002/01/P3Pv1">
  <POLICY name="mypolicy" discuri="http://www.ibm.com/privacy"
          opturi="http://www.ibm.com/privacy" xml:lang="en">
    <STATEMENT>
      <RECIPIENT> <delivery/> </RECIPIENT>
      <PURPOSE> <contact/> </PURPOSE>
      <DATA-GROUP>
        <DATA ref="#dynamic.clickstream"/>
        <DATA ref="#dynamic.http"/>
        <DATA ref="#dynamic.clientevents"/>
        <DATA ref="#user.home-info.postal.country"/>
        <DATA ref="#dynamic.cookies"> <CATEGORIES> <content/> </CATEGORIES> </DATA>
      </DATA-GROUP>
    </STATEMENT>
  </POLICY>
</POLICIES>

The client's policy is the following file client.xml:

<RULESET appel="http://www.w3.org/2002/04/APPELv1"
         p3p="http://www.w3.org/2000/12/P3Pv1"
         crtdby="W3C" crtdon="2001-02-19T16:21:21+01:00">
  ...
  <RULE behavior="limited1" prompt="yes">
    <POLICY>
      <STATEMENT>
        <PURPOSE connective="and-exact"> <current/> </PURPOSE>
        <RECIPIENT connective="or">
          <other-recipient/>
          <public/>
          <unrelated/>
        </RECIPIENT>
      </STATEMENT>
    </POLICY>
  </RULE>
  ...
  <RULE behavior="request" prompt="yes">
    <POLICY>
      <STATEMENT>
        <DATA-GROUP>
          <DATA ref="#dynamic.clientevents"> </DATA>
          <DATA ref="#dynamic.clickstream"/>
        </DATA-GROUP>
      </STATEMENT>
    </POLICY>
  </RULE>
  ...
</RULESET>

For expository reasons, we consider only a sub-problem in which we say that a client's policy is compatible with the server's policy if all ref attributes from a <DATA></DATA> node also appear on the server side. In the considered examples, the policies are compatible because

<DATA-GROUP>
  <DATA ref="#dynamic.clientevents"> </DATA>
  <DATA ref="#dynamic.clickstream"/>
</DATA-GROUP>

is included in:

<DATA-GROUP>
  <DATA ref="#dynamic.clickstream"/>
  <DATA ref="#dynamic.http"/>
  <DATA ref="#dynamic.clientevents"/>
  <DATA ref="#user.home-info.postal.country"/>
  <DATA ref="#dynamic.cookies"> <CATEGORIES> <content/> </CATEGORIES> </DATA>
</DATA-GROUP>

The problem consists in implementing such a verification tool in Tom and Java.

4.1.1 Loading Xml documents

The first part of the program declares imports, and defines the main and the run methods.

import tom.library.xml.*;
import tom.library.adt.tnode.*;
import tom.library.adt.tnode.types.*;
import aterm.*;
import java.util.*;

public class Evaluator {

  %include{ adt/tnode/TNode.tom }

  private XmlTools xtools;

  public static void main (String args[]) {
    xtools = new XmlTools();
    TNode server = (TNode)xtools.convertXMLToATerm("server.xml");
    TNode client = (TNode)xtools.convertXMLToATerm("client.xml");
    boolean compatible = compareDataGroup(getDataGroup(server.getDocElem()),
                                          getDataGroup(client.getDocElem()));
    System.out.println("result = " + compatible);
  }
  ...
}

As explained in Part II (Tom language description), to each Xml document corresponds a DOM (Document Object Model) representation. In Tom, we have defined a mapping from DOM sorts to abstract algebraic sorts: TNode and TNodeList, which correspond respectively to Node and NodeList, defined by the Java DOM implementation.

This mapping has to be initialized in the following way:

the import of tom.library.adt.tnode.* and tom.library.adt.tnode.types.* defines the sorts TNode and TNodeList and allows us to build objects over these two sorts.
the %include{ TNode.tom } Tom construct is similar to the #include C preprocessor construct. In this case, it imports the definition of Tom algebraic constructors needed to manipulate Xml documents.

In complement to Tom, we provide a runtime library which contains several methods to manipulate Xml documents. In particular, we provide the XmlTools class (defined in the tom.library.xml.* package).

The run method takes two filenames as arguments, reads the associated files and convert their Xml content into TNode terms (using the convertXMLToATerm function).

The getDataGroup function is used to retrieve a <DATA></DATA> node in the considered Xml document. Note that getDocElem is first applied to Xml documents in order to consider subterms of the DocumentNode element. Then, the compareDataGroup function is used to check that the client's policy is compatible with the server's policy. The content of these functions is detailed in the next sections.

4.1.2 Retrieving information

Given an Xml subtree, the problem consists in extracting a <DATA-GROUP></DATA-GROUP> node. For expository reasons, we first consider that such a <DATA-GROUP></DATA-GROUP> node can only appear at two different specific places: one for the client's definition and one for the server's definition.

Thus, in the Tom program we naturally consider two cases:

  private static TNode getDataGroup(TNode doc) {
    %match(doc) {
      <POLICIES>
         <POLICY>
           <STATEMENT>
             datagroup@<DATA-GROUP></DATA-GROUP>
           </STATEMENT>
         </POLICY>
      </POLICIES> -> { return datagroup; }

      <RULESET>
        <RULE>
          <POLICY>
            <STATEMENT>
              datagroup@<DATA-GROUP></DATA-GROUP>
             </STATEMENT>
          </POLICY>
        </RULE>
      </RULESET> -> { return datagroup; }
    }

    return `xml(<DATA-GROUP/>);
  }

The first pattern means that a <DATA-GROUP></DATA-GROUP> node has to be found under a <STATEMENT></STATEMENT> node, which should be under a <POLICY></POLICY> node, which should be under a <POLICIES></POLICIES> node. Once such a pattern is found, the mechanism allows us to give a name (datagroup) to this subterm and reuse it in the action part: return datagroup;.

The Xml notation implicitly extends the given patterns by adding context variables. Thus, the <DATA-GROUP></DATA-GROUP> pattern means that a subterm those head symbol is <DATA-GROUP></DATA-GROUP> is searched. But this subterm can contain attributes and children even if it is not explicitly defined by the notation. To make the definition of patterns more precise, the user can use the explicit notation and define the following pattern: <DATA-GROUP (_*)>(_*)</DATA-GROUP>. A more detailed explanation can be found in Part II (Tom language description).

4.1.3 Comparing two Xml subtrees

Given two <DATA-GROUP></DATA-GROUP> nodes, the problem consists in checking that all <DATA></DATA> nodes from the first tree also appear in the second one. This can be easily implemented by the following couple of functions:

  private static boolean compareDataGroup(TNode server, TNode client) {
    boolean res = true;
    %match(client) {
      <DATA-GROUP><DATA ref=reftext></DATA></DATA-GROUP>
      -> { res = res && appearsIn(reftext,server); }
    }
    return res;
  }

Given a <DATA-GROUP></DATA-GROUP> tree, we look for a <DATA></DATA> subtree which contains an attribute named ref. When such an attribute exists, its content (a string) is stored into the Java reftext variable. The appearsIn function is then used to check that the attribute also appears on the server side.

  private static boolean appearsIn(String refclient, TNode server) {
    %match(server) {
      <DATA-GROUP><DATA ref=reftext></DATA></DATA-GROUP>
      -> {
         if(reftext.equals(refclient)) {
           return true;
         }
       }
    }
    return false;
  }

This piece of code is interesting because it introduces what we call "conditional rules". Given the string refclient we look for a <DATA></DATA> subtree containing a ref attribute with exactly the same string. Since it is not possible to use an instantiated variable in a pattern (something like <DATA ref=refclient>...</DATA>, we have to introduce a fresh variable reftext and check in a condition that this variable is equal to refclient. This is done via a Java condition (the if clause): the action part (return true;) is executed only when the condition is satisfied, and clearly it will end the method. If the condition is not satisfied, then the next <DATA></DATA> subtree is searched. It is exactly the case of a switch statement, when the action part is not exited by a return, break or goto, and the control flow is transferred to the next matching solution or the next matching pattern.

If no such subtree exists, this means that the two policies are not compatible and false is returned.

4.1.4 Retrieving information using traversal functions

The Tom runtime library provides a set of generic functions in order to perform various kinds of traversal over a term. This is useful when searching a pattern somewhere in a term, at any position. As the Xml documents are seen by Tom as terms, the different traversals on terms are also valid when dealing with Xml.

In the current example, the getDataGroup functions looks for a <DATA-GROUP></DATA-GROUP> node. Instead of statically specifying the path (POLICIES, POLICY, STATEMENT, DATA-GROUP and RULESET, RULE, POLICY, STATEMENT, DATA-GROUP) we could have used the generic traversal mechanism provided by the package tom.library.traversal.*.

  private static GenericTraversal traversal = new GenericTraversal();

In the following, we generalize the previous problem by considering that we can have more than one appearance of the <DATA-GROUP></DATA-GROUP> node per Xml document.

We further expose the use of the generic traversal mechanism. The collectDatagroup method creates an inner class Collect1 with one method, apply, and then it calls the genericCollect method of the traversal object created previously with an object of this inner class as a parameter.

  protected static void collectDatagroup(final Collection collection, TNode subject) {
    Collect1 collect = new Collect1() {
        public boolean apply(ATerm t) {
          if(t instanceof TNode) {
            %match(t) {
              <DATA-GROUP> </DATA-GROUP> -> {
                 collection.add(t);
                 return false;
               }
            }
          }
          return true;
        } // end apply
      }; // end new

    traversal.genericCollect(subject, collect);
  }

Let us now analyze the inner class: firstly, it extends the class Collect1, where 1 stands for the number of arguments of the apply method, and secondly it implements the method apply. The method call traversal.genericCollect will call this method for all subtrees of the subject term. Since subject may contain subterms of different sorts (integers or strings for example), the instanceof construct is used as a first filter to select only the subterms which correspond to Xml nodes. When such a subterm is rooted by a <DATA-GROUP></DATA-GROUP>, the subterm is added to the collection (by a side-effect). The return false; statement indicates that it is no longer necessary to traverse the considered term (in our example, a <DATA-GROUP></DATA-GROUP> node cannot be found inside a <DATA-GROUP></DATA-GROUP> itself). Respectively, the return true; statement indicates that the apply function should be recursively applied to the current term in order to continue the traversal.

Given a collection, the getDataGroup method uses an iterator to select a <DATA-GROUP></DATA-GROUP> node.

  private static TNode getDataGroup(TNode doc) {
    HashSet c = new HashSet();
    collectDatagroup(c,doc);
    Iterator it = c.iterator();
    while(it.hasNext()) {
      TNode datagroup = (TNode)it.next();
      return datagroup;
    }

    return `xml(<DATA-GROUP/>);
  }

4.2 Building and sorting Xml/DOM documents

In this section, we consider a DOM mapping for Xml documents. This means that Xml documents are still considered as algebraic terms (built over TNode and TNodeList), but their internal representation is backed by the W3C DOM library.

In the following, we consider a small data-base represented by an Xml document person.xml:

<Persons>
  <Person Age="30"> <FirstName> Paul   </FirstName> </Person>
  <Person Age="28"> <FirstName> Mark   </FirstName> </Person>
  <Person Age="21"> <FirstName> Jurgen </FirstName> </Person>
  <Person Age="21"> <FirstName> Julien </FirstName> </Person>
  <Person Age="24"> <FirstName> Pierre-Etienne </FirstName> </Person>
</Persons>

The problem consists in sorting this document according to different criteria (age or first-name for example).

4.2.1 Loading Xml documents

The first part of the program declares imports, and defines the main method.

import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;

public class PersonSort1 {
  private static Document dom;
  %include{ dom.tom }

  public static void main (String args[]) {
    try {
      dom = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse("person.xml");
      Element e = dom.getDocumentElement();
      dom.replaceChild(sort(e),e);

      Transformer transform = TransformerFactory.newInstance().newTransformer();
      StreamResult result = new StreamResult(new File("Sorted.xml"));
      transform.transform(new DOMSource(dom), result);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  ...
}

The mapping has to be initialized in the following ways:

the import of org.w3c.dom.* and javax.xml.* packages in order to use the DOM library.
the %include{ dom.tom } construct imports the definition of Tom algebraic constructors needed to manipulate DOM objects.
the dom variable has to be instantiated. This variable corresponds to the notion of Document in the DOM terminology.

4.2.2 Comparing two nodes

In order to implement a sorting algorithm, we need to know how to compare two elements.

The following function takes two Xml documents in argument and compares their attribute Age, using the string comparison operator compareTo.

  private static int compare(Node t1, Node t2) {
    %match(t1, t2) {
      <Person Age=a1></Person>, <Person Age=a2></Person> -> {
        return `a1.compareTo(`a2);
      }
    }
    return 0;
  }

In this example, it is interesting to note that an Xml node is seen as an associative operator. This feature, combined with the implicit notation, is such that <Person Age=a1></Person> will match any Xml node headed by Person which contains the attribute Age. This Xml node may contain several sub-nodes, but they will not be stored in any variable.

4.2.3 Sorting a list of nodes

In this section we use a swap sort algorithm which consists in finding two elements that are not in the correct order. Once they are found, they are swapped before calling the algorithm recursively. When no two such elements exist, this means that the list is sorted.

  private static Node sort(Node subject) {
    %match(subject) {
     <Persons>(X1*,p1,X2*,p2,X3*)</Persons> -> {
        if(compare(`p1,`p2) > 0) {
          return sort(`xml(dom,<Persons>X1* p2 X2* p1 X3*</Persons>));
        }
      }
    }
    return subject;
  }

The pattern <Persons>(X1*,p1,X2*,p2,X3*)</Persons> looks for two elements p1 and p2 and stores the remaining contexts in X1*, X2*, and X3*. In order to give a name to contexts (and then retrieve them to build a new list), the pattern is written in explicit notation. This notation prevents Tom to add extra variables. Otherwise, the expanded form of <Persons>X1* p1 X2* p2 X3*</Persons> (written in explicit notation) would have been <Persons>(_*,X1*,_*,p1,_*,X2*,_*,p2,_*,X3*,_*)</Persons>.

Note that the action part is guarded by a condition (if(compare(p1,p2) > 0)). This means that return sort(...) is executed only when two bad-ordered elements are found. When p1 and p2 are correctly ordered, the matching procedure continues and extracts another couple p1 and p2. As mentioned in Part II, when Tom cannot find two elements p1 and p2 such that the condition is satisfied (the list is sorted), the return sort(...) statement is not executed and the control flow is transferred to the following statement: return subject;

As mentioned previously, when two elements have to be swapped, a new list is built. This done via the ``' (backquote) construct. This construct, used before to retrieve instantiated variables, can also be use to build a new term. In order to build an Xml document, the xml(...) function has to be used. The number of parameters depends on the Xml mapping which is used. In our case, when manipulating DOM objects, a new subtree is built is the context of a main document. This extra-parameter is given in the first argument of the xml function. Thus, `xml(dom,<Persons>...<Persons>) builds a new <Person></Person> node, as a child of the dom document. When using the TNode.tom mapping, a library based on Gom is used. In this case, it is no longer necessary (and even not correct) to use this extra argument.

4.2.4 Sorting by side effect

In the previous section, we considered a sorting algorithm expressed in a pure functional programming style: when two elements have to be swapped, a new list is built and returned.

In the following example, we exploit OO-style of the DOM library and perform a swap in place: the list is updated by swapping two elements (with side effect):

  private static void sort(Node subject) {
    %match(subject) {
      r @ <_>p1@<_ Age=a1></_> p2@<_ Age=a2></_></_> -> {
        if(`a1.compareTo(`a2) > 0) {
          `r.replaceChild(`p2.cloneNode(true),`p1);
          `r.replaceChild(`p1,`p2);
          `sort(r);
          return;
        }
      }
    }
  }

In this example, we use the notion of anonymous Xml node (<_>...</_>) which allows to describe more generic algorithms.