Previous Up Next

Chapter 10  Tom

10.1  Notations

The syntax of the language is given in BNF-like notation. Terminal symbols are set in typewriter font (like this). Non-terminal symbols are set in italic font (like that). Square brackets [...] denote optional components. Parentheses with a trailing star sign (...)* denotes zero, one or several repetitions of the enclosed components. Parentheses with a trailing plus sign (...)+ denote one or several repetitions of the enclosed components. Parentheses (...) denote grouping.

10.1.1  Lexical conventions

Identifier::=Letter ( LetterDigit_- )*
Integer::=(Digit )+
Double::=(Digit)+ [.] (Digit)* ∣ . (Digit)+
String::=" (Letter ∣ (\ (ntbrf\") ) )* "
Letter::=A ... Za ... z
Digit::=0 ... 9
Char::= (LetterDigit)

10.1.2  Names


10.2  Tom constructs

A Tom program is a host language program (namely C, Java, or Caml) extended by several new constructs such as %match, %strategy, %include, %gom, or backquote. Tom is a multi-language compiler, and therefore its syntax depends on the host language syntax. But for simplicity, we only present the syntax of its constructs and explain how they can be integrated into the host language.

Using Java as a host-language, the following Tom program is correct:

public class HelloWorld {
  %include { string.tom }

  public final static void main(String[] args) {
    String who = "World";
    %match(who) {
      "World" -> { System.out.println("Hello " + who); }
      _       -> { System.out.println("Don't panic"); }

10.2.1  Tom program

A Tom program is a list of blocks, where each block is either a Tom construct, or a sequence of characters (host language code). When compiling a Tom program, the Tom constructs are transformed into host language code, and the result is a valid host language program. For instance, in the previous example, %include and %match constructs are replaced by function definitions and Java instructions, making the resulting program a correct Java program.

Syntax of a Tom program:

 { BlockList }

10.2.2  Match construct

The %match construct (MatchConstruct) is one of the main contributions of Tom. This construct can be seen as an extension of the SwitchCase construct in C or Java, except that patterns are no longer restricted to constants (chars or integers). Given an object (the subject) and a list of patterns, our goal is to find the first pattern that matches the subjects (i.e. that has a compatible shape). More formally, a pattern is a term built over variables and constructors. The latter ones describe the shape of the pattern, whereas the variables are holes that can be instantiated to capture a value. When we consider the term f(a(),g(b())), it has to be viewed as a tree based data-structure with f as a root and a() the first child. Similarly, b() is the unique child of g, which is the second child of the root f . We say that the pattern f(x,y) matches this term (called subject), because we can give values to x and y such that the pattern and the subject become equal: we just have to assign a() to x and g(b()) to y. Finding this assignment is called matching and instantiating. This is exactly what Tom is supposed to do. A pattern may of course contain subterms. Therefore, f(x,g(b())) or f(a(),g(y)) are valid patterns which match against the subject.

Assuming that s is a Java variable, referencing a term (the tree based object f(a(),g(b())) for example), the following Tom construct is valid:

%match(s) {
  f(a(),g(y)) -> { /* action 1: code that uses y */ }
  f(x,g(b())) -> { /* action 2: code that uses x */ }
  f(x,y)      -> { /* action 3: code that uses x and y */ }

The %match construct is defined as follows:

MatchConstruct::=%match ( MatchArguments ) { ( PatternAction )* }
 %match { ( ConstraintAction )* }
MatchArguments::=[Type] Term ( , [Type] Term )*
PatternAction::=[LabelName:] PatternList -> { BlockList }
PatternList::=Pattern( , Pattern )* [ (&&||) Constraint ]
ConstraintAction::=Constraint -> { BlockList }
Constraint::=Pattern << [Type] Term
 Constraint && Constraint
 Constraint || Constraint
 ( Constraint )
 Term Operator Term

A MatchConstruct is composed of two parts:

Since version 2.6 of the language, an additional syntax, based on constraints, is proposed. In this case, the MatchConstruct can be simply seen as a list of pairs (constraint,action) corresponding to ConstraintAction from the above syntax. For instance, the example we presented before can now be written:

%match {
  f(a(),g(y)) << s -> { /* action 1 : code that uses y */ }
  f(x,g(b())) << s -> { /* action 2 : code that uses x */ }
  f(x,y) << s      -> { /* action 3 : code that uses x and y */ }

The two versions are semantically equivalent. The expression p << s denotes a match constraint between the pattern p and the subject s. The advantage of this new syntax is the modularity that it offers. For instance, multiple constraints can be combined with the boolean connectors && and ||, leading to a greater expressivity, sometimes closed to the one offered by regular expressions.

The two syntaxes are compatible. For instance the following construct is valid:

%match(s) {
  f(a(),g(y)) && ( y << a() || y << b() ) -> { /* action 1 : code that uses y */ }
  f(x,g(b())) -> { /* action 2 : code that uses x */ }  
Note: The right-hand sides of the constraints may contain variables that can be found in the left-hand side of other constraints. For instance we may write:
%match(s) {
  f(a(),g(y)) && ( g(z) << y || f(z,_) << y ) -> { /* action 1 : code that uses y and z */ }
  f(x,g(b())) && f(y,a()) << someHostFunction(x) -> { /* action 2 : code that uses x and y */ }

where someHostFunction can be an arbitrary function from the host language.

Note: Constraints built using an operator are trivially translated into host code. For instance, the constraint x == g(a()) ... is translated into the Java code
if (x == g(a())) {

For expository reasons, we consider that a %match construct is evaluated in the following way:

The semantics of a match construct may remind the switch/case construct. However, there is a big difference. For instance, in Tom we can have patterns (list-patterns) that can match in several ways a subject. Informally, when considering the subject conc(a(),b(),c()), and the pattern conc(_*,x,_*), there are three possible match for this problem: either x=a(), either x=b(), either x=c(). Note that _* is a special hole which can capture any sublist of conc(…). The list-matching is also known as associative matching. Besides this, some other matching theories are supported in Tom.

When taking this new possibility into account, the evaluation of a %match construct gets a little bit more complex:

As mentioned in the BNF-syntax, a label may be attached to a pattern. In that case, in C and Java, it becomes possible to exit from the current PatternAction (using a goto or a break), without exiting from the whole %match construct. The control is transferred to the next pattern which matches the subjects (respectively to the next constraint). This feature is useful to exit from a complex associative-matching for which, for any reason, we are no longer interested in getting other solutions.

Note: the behavior is not determined if inside an action the subjects of the %match instruction under evaluation are modified.

When using the new syntax based on constraints, there are several restrictions that are imposed by the compiler (an error is generated if they are not respected):

10.2.3  Tom pattern

As we can imagine, the behavior of a %match construct strongly depends on the patterns which are involved. The formalism which defines the syntax of a pattern is also an essential component of Tom. But because there exist several ways to define patterns with equivalent behavior, its formal definition is not so simple. Although, the different shortcuts help the programmer to simplify the definitions of patterns.

Informally, a pattern is a term built with constructors and variables (please note that x denotes a variable, whereas a() is a constructor). A variable can also be anonymous, and it is denoted by _. Let’s look at some examples of patterns: x, a(), f(a()), g(a(),x), or h(a(),_,x). When a pattern matches a subject, it may be useful to keep a reference to a matched subterm. The annotation mechanism (z@g(y) for example) can be used for this purpose. Thus, considering the matching between the pattern f(x,z@g(y)) and the subject f(a(),g(h(b()))), y is instantiated by h(b()), and z is instantiated by g(h(b())). This can be useful in C, to free the memory for example.

When identical actions have to be performed for a set of patterns which share a common structure, the disjunction of symbols may be used: pattern (fg)(a()) is equivalent to the set {f(a()), g(a())}. The disjunction of symbols may also be used in subterms, like in h( (fg)(x) ).

More formally, a Tom pattern and a Tom term has the following syntax:

 Name ([Term ( , Term )*])
Pattern::=[ AnnotedName @ ] PlainPattern
PlainPattern::=[!]VariableName [ * ]
 [!] HeadSymbolList (ExplicitTermListImplicitPairList)
HeadSymbolList::=HeadSymbol [ ? ]
 ( HeadSymbol ( HeadSymbol )+ )
ExplicitTermList::=( [ Pattern ( , Pattern )*] )
ImplicitPairList::=[ [ PairPattern ( , PairPattern )*] ]
PairPattern::=SlotName = Pattern

Concerning the Term syntax, both Tom and host-language variables can be used and Name can be either a function symbol declared in Tom or the name of a method from the host language.

Note: since version 2.4, negative patterns, called anti-patterns have been introduced. See section 10.2.4 for a detailed description.

A pattern is a term which could contain variables. When matching a pattern against a subject (a ground term), these variables are instantiated by the matching procedure (generated by Tom). In Tom, the variables do not have to be declared: their type is inferred automatically, depending on the context in which they appear.

As described previously, Tom offers several mechanisms to simplify the definition of a pattern:

The use of ? after a list operator enables real associative with neutral elements (AU) matchings. By using conc? for example, we simply specify that the subject may not start with a conc symbol. For instance, the following code would produce the output matched, because x and respectively y can be instantiated with the neutral element of conc (please see the documentation on Gom for further details about specifying symbols’ type and their neutral elements).

 L l = `a();
 %match(l) {
   conc?(x*,y*) -> { System.out.println("matched"); }

Note that without the use of ?, the subject must start with a conc in order to have a match.

Note: since version 2.5 we can use, inside a list, a subpattern that also denotes a list; moreover, this subpattern can be annotated. For instance, patterns like conc(_*,p@conc(a(),_*,b()),_*) are now valid.

10.2.4  Tom anti-pattern

The notion of anti-pattern offers more expressive power by allowing complement constructs: a pattern can describe what should not be in the matched subject.

The notion of complement is introduced by the symbol !, as illustrated by the following grammar fragment:

PlainPattern::=[!] VariableName
 [!] HeadSymbolList ( ExplicitTermListImplicitPairList )

The semantics of anti-patterns can be best understood when regarding them as complements. For example, a pattern like car[color=blue()] will match all the blue cars. If we add an ! symbol, the anti-pattern !car[color=blue()] will match everything that is not a blue car, i.e all objects that are not cars or the cars that have a color different from blue. The grammar allows also car[color=!blue()] - matches all cars that are not blue, or !car[color=!blue()] - either everything that is not a car, or a blue car.

Using the non-linearity combined with anti-patterns allows to express interesting searches also. For example, car[interiorColor=x,exteriorColor=x] will match the cars that have the same interior and exterior color. By using the anti-pattern car[interiorColor=x,exteriorColor=!x], the result is as one would expect: it will match all the cars with different interior - exterior colors.

It is also possible to use the anti-patterns in list constructions. Please refer to the tutorial for more examples.

Note: The use of annotations (@) is forbidden bellow a ! symbol and is prevented by a compilation error. This is due to the fact that what is under this symbol generally will not match, therefore it cannot be used in the action part of the rules. You will also get an error if you try to put ! before an _ or _* because a construct like !_ in a pattern will never match anything, therefore it is useless.

10.2.5  Backquote construct

Backquote construct () can be used to build an algebraic term or to retrieve the value of a Tom variable (a variable instantiated by pattern-matching).

BackQuoteTerm::=[] CompositeTerm

The syntax of CompositeTerm is not fixed since it depends on the underlying language.

However, CompositeTerm should be of the following form:

In general, it is sufficient to add a backquote before the term you want to build to have the desired behavior. The execution of ‘f(g(a())) will build the term f(g(a)), assuming that f, g, and a are Tom operators. Suppose now that g is no longer a constructor but a function of the host-language. The construction ‘f(g(a())) is still valid, but the semantics is the following: the constant a is built, then the function g is called, and the returned result is put under the constructor f. Therefore, the result of g must be a correct term, which belongs to the right type (i.e. the domain of f).

To simplify the interaction with the host-language, it is also possible to use “unknown symbols” like f(x.g()) or f(1+x). The scope of the backquote construct is determined by the scope of the most external enclosing braces, except in two case: ‘x and ‘x* which allow you to use variables instantiated by the pattern part. In that case the scope is limited to the length of the variable name, eventually extended by the ’*’. Sometimes, when writing complex expression like if(‘x==‘y || ‘x==‘z), it can be useful to introduce extra braces (if( ‘(x==y || x==z) )) in order to extend the scope of the backquote.

Note: since version 2.5, type inference is available when building lists. For instance, when building a list, the following code is now valid: ‘conc(f(…)) if the type of f is known by Tom — if previously declared using a %op. Before version 2.5, supposing that the type of f was L, one had to write: L X = ‘f(…);‘conc(X*).

10.2.6  Meta-quote construct

Tom provides the construct %[]% that allows to build formatted strings without the need to encode special characters such as tabulations and carriage returns as it is usually done in Java. For example, to build a string containing the HelloWorld program, one can simply write:

String hello = %[
  public class Hello {
    public static void main(String[] args) {
      System.out.println("Hello\n\tWorld !");

Additionally, it is possible to insert in this string the content of a String variable, or the result of a function call (if it is a String) using the @ as escape character: the code contained between @ in this construct will be evaluated and the result inserted in the surrounding formatted string. Then, to add to the HelloWorld example a version string, we can use:

String version = "v12";
String hello2=%[
  public class Hello {
    public static void main(String[] args) {
      System.out.println("Hello\n\tWorld   @version@");

Even if the contents of the meta-quote construct is a formatted string, it is required that this string contains correctly balanced braces.

Note: the expression between two @ can contain Tom constructs, like backquote constructs for example.
Note: use @@ to insert the character @

10.3  Tom signature constructs (*)

10.3.1  Sort definition constructs

To define the mapping between the algebraic constructors and their concrete implementation, Tom provides a signature-mapping mechanism composed of several constructs. In addition to predefined mapping for usual builtin sorts (int, long, double, boolean, string, and char), all other algebraic sorts have to be declared using the %typeterm construct.

To use predefined sorts (in a %match construct or in the definition of a new operator), it is sufficient to use the %include construct (%include { int.tom } for example).

When defining a new type with the %typeterm construct, some information has to be provided:

Given a Java class Person we can define an algebraic mapping for this class:

 %typeterm TomPerson {
   implement { Person }
   is_sort(t) { t instanceof Person }
   equals(t1,t2) { t1.equals(t2) }

Here, we assume that the method equals implements a comparison function over instances of Person. Note that we used TomPerson to make a clear distinction between algebraic sorts (defined in Tom) and implementation sorts (defined in Java, via the use of classes). In practice, we usually use the same name to denote both the algebraic sort and the implementation sort.

The grammar is the following:

IncludeConstruct::=%include { FileName }
GomConstruct::=%gom [( optionString )] { GomGrammar }
GoalLanguageBlock::={ BlockList }
TypeTerm::=%typeterm Type {
  KeywordImplement [KeywordIsSort] [KeywordEquals]
KeywordImplement::=implement GoalLanguageBlock
KeywordIsSort::=is_sort GoalLanguageSortCheck
KeywordEquals::= equals ( Name , Name ) GoalLanguageBlock
Note: since version 2.6, the variables used in the host-code may be prefixed by a $ sign. This allows the compiler to inline the definition, making the code often smaller and more efficient.

10.3.2  Constructor definition constructs

Once algebraic sorts are declared (using %typeterm), Tom provides a mechanism to define signatures for constructors of these sorts using %op, %oplist or %oparray constructs. When defining a new symbol with the %op construct, the user should specify the name of the operator, its codomain, and its domain. The later one is defined by a list of pairs (slot-name, sort).

Let us consider again the class Person, and let us suppose that an instance of Person has two fields (name and age), we can define the following operator:

%op TomPerson person(name:String, age:int)

In this example, the algebraic operator person has two slots (name and age) respectively of sorts String and int, where String and int are pre-defined sorts.

In addition to the signature of an operator, several auxiliary functions have to be defined:

Coming back to our example, checking if an object t is rooted by the symbol person can be done by checking that t is an instance of the class Person. Building a person can be done via the Java function new Person(...). Accessing to the slots name and age could be implemented by an access to the variables of the class Person. In practice, the following operator definition should work fine:

%op TomPerson person(name:String, age:int) {
  is_fsym(t)  { t instanceof Person }
  make(t1,t2) { new Person(t1,t2) }
  get_slot(name,t) { } // assuming that 'name' is public
  get_slot(age,t)  { t.age  } // assuming that 'age' is public

When defining a new symbol with the %oplist construct, the user has to specify how the symbol is implemented. In addition, the user has to specify how a list can be built and accessed:

Similarly, when defining a new symbol with the %oparray construct, the user has to specify how the symbol is implemented, how an array can be built, and accessed:

The %oplist or %oparray is complex but not difficult to use. Let us consider the ArrayList Java class, and let us define a Tom mapping over this data-structure. The first thing to do consists in defining the sort for the elements and the sort for the list-structure:

  %typeterm Object {
    implement     { Object        }
    equals(l1,l2) { l1.equals(l2) }
  %typeterm TomList {
    implement     { ArrayList    }
    equals(l1,l2) { l1.equals(l2) }

Once defined the sorts, it becomes possible to define the list-operator TomList conc( Object* ). This operator has a variadic arity: it takes several Object and returns a TomList.

  %oparray TomList conc( Object* ) {
    is_fsym(t)       { t instanceof ArrayList }
    make_empty(n)    { new ArrayList(n)       }
    make_append(e,l) { myAdd(e,(ArrayList)l)   }
    get_element(l,n) { (Object)l.get(n)        }
    get_size(l)      { l.size()                }

  private static ArrayList myAdd(Object e,ArrayList l) {
    return l;

An auxiliary function myAdd is used since the make_append construct should return a new list. The get_element should return an element whose sort belongs to the domain (Object) in this example. Although not needed in this example, in general, a cast ((Object)l.get(n)) is needed.

The grammar for the mapping constructs is the following:

Operator::=%op Type Name
  ( [ SlotName : Type ( , SlotName : Type )* ] )
   { KeywordIsFsym ( KeywordMakeKeywordGetSlot )* }
OperatorList::= %oplist Type Name ( Type * )
   { KeywordIsFsym ( KeywordMakeEmptyList
  KeywordGetTailKeywordIsEmpty )* }
OperatorArray::= %oparray Type Name ( Type * )
   { KeywordIsFsym ( KeywordMakeEmptyArray
  KeywordGetSize )* }
KeywordIsFsym::= is_fsym ( Name ) GoalLanguageBlock
KeywordGetSlot::= get_slot ( Name , Name ) GoalLanguageBlock
KeywordMake::= make [ ( Name ( , Name )* ) ] GoalLanguageBlock
KeywordGetHead::= get_head ( Name ) GoalLanguageBlock
KeywordGetTail::= get_tail ( Name ) GoalLanguageBlock
KeywordIsEmpty::= is_empty ( Name ) GoalLanguageBlock
KeywordMakeEmptyList::= make_empty [ ( ) ] GoalLanguageBlock
KeywordMakeInsert::= make_insert ( Name , Name ) GoalLanguageBlock
KeywordGetElement::= get_element ( Name , Name ) GoalLanguageBlock
KeywordGetSize::= get_size ( Name ) GoalLanguageBlock
KeywordMakeEmptyArray::= make_empty ( Name ) GoalLanguageBlock
KeywordMakeAppend::= make_append ( Name , Name ) GoalLanguageBlock
Note: since version 2.5, is_fsym and make functions are optional (therefore the %op construct could be empty). When not declared, is_fsym default value is false. When nothing is declared, there is a make default value: a call to the function which has the same name as the considered operator. This shortcut is useful to provide type information about a Java function for example.

10.3.3  Predefined sorts and operators

See Section 13.1.

10.3.4  Gom construct

The grammar for the Gom construct is as follows:

GomConstruct::=%gom [( optionString )] { GomGrammar }

It allows to define a Gom signature (for more details about Gom see Chapter 11). The Gom compiler is called on the GomGrammar, and the construct is replaced by the produced Tom mapping. The optionString is composed by a list of command line options to pass to the underlying Gom compiler, as described in section 16.1.

10.4  XML pattern

To deal with XML documents, the XML notation can be used (<A><B attribute="name"/></A> for example).

When manipulating XML documents, we distinguish two main kinds of operations: retrieving information and transforming a document. Tom provides three different XML notations that ought to simplify the definition of patterns: the “standard ” and the “implicit” XML notations are used to define compact (but incomplete) patterns. This notation is well suited to retrieve information. The “explicit” XML notation is used to precisely describe an XML pattern and all the variables that have to be instantiated. This notation is particularly well suited to perform XML transformation since it allows the programmer to precisely describe how variables have to be instantiated.

To make the XML notation understandable, we have to explain how XML documents are handled by Tom. To each XML document corresponds a DOM (Document Object Model) representation. In Tom, we have defined a mapping from DOM sorts to abstract algebraic sorts: TNode and TNodeList, which correspond respectively to Node and NodeList, defined by the Java DOM implementation.

Thus, a Node object becomes a ternary operator Element whose first subterm is the name of the XML node, the second subterm is a list of attributes and the third subterm is a list of subterms (which correspond to XML sub-elements). The second and the third elements are terms of sort TNodeList (because they are implemented by NodeList objects in DOM).

Thus, when considering the <A></A> XML document, the corresponding algebraic term is Element("A",[],[]), where [] denotes the empty list. Similarly, <A><B attribute="name"/></A> is encoded into Element("A",[],[Element("B",[Attribute("attribute","name")],[])]).

When defining an XML pattern, the user has to introduce extra list-variables to precisely describe the XML pattern and capture the different contexts. Suppose that we are interested in finding a node <B></B> which is a subterm of a node <A></A> (but not necessary the first subterm). The algebraic pattern should be Element("A",[_*],[_*,Element("B",[_*],[_*]),_*]). Using the XML notation, this pattern should be expressed as follows: <A(_*)>(_*,<B(_*)>(_*)</B>,_*)</A>. This notation (called explicit) is precise but error prone. This is why we have introduced the explicit notation, where all context variable can be removed (and () are replaced by []): <A[]><B[]>[]</B></A>. The last notation (called standard XML notation) allows the user to remove the [] and replace the list-separator (,) by spaces. The previous pattern can be written: <A><B></B></A>.

These three different notations allow the user to choose the level of control he wants to have on the XML pattern matching algorithm.

The formal description of the syntax is the following:

XMLTerm::=< XMLNameList XMLAttrList />
 < XMLNameList XMLAttrList > XMLChilds </ XMLNameList >
 #TEXT ( IdentifierString )
 #COMMENT ( IdentifierString )
  ( (IdentifierString) , (IdentifierString) )
 ( XMLName ( XMLName )* )
XMLAttrList::=[ [ XMLAttribute (, XMLAttribute)* ] ]
 ( [ XMLAttribute (, XMLAttribute)* ] )
 ( XMLAttribute )*
 VariableName *
 AttributeName = [AnnotedName @] ( IdentifierString )
 [AnnotedName @] _ = [AnnotedName @] ( IdentifierString )
XMLChilds::=( Term )*
 [ Term ( , Term )* ]

Previous Up Next