LLOOP Index | GSP Language | GSP Library | Framework Classes | Component Classes

LLOOP Beginner's Guide and Tutorial

This page explains how to create parsers, in a progressive approach guided through simple examples.


How to create a parser

Steps Overview

  1. Set up your project
  2. Write the GSP specification file(s) describing the syntax of the input to parse
  3. Generate the parser from the GSP file(s)
  4. Optionally, compile and test your parser standalone
  5. Integrate the parser into the target application

The next paragraphs give further details and guidance for the fulfilment of these steps.

For each step, it is always useful to have a look on the provided examples. Open an example project (.lloop files) with the graphical interface, check the GSP files part of the project, use the interface to generate, compile and run the example.

Project Setup

Use preferably the graphical interface, GFE LLOOP. Create your project and configure your project parameters from there. Refer to the page how to use LLOOP.

GSP specification file(s)

The input to parse is described formally in GSP files (text files with .gsp extension) using the GSP language. The syntax and structure of the input to accept is described there with syntax rules according to the Extended BNF notation. There is a tutorial provided later on this page describing how to write syntax rules in BNF, as well as their relationship with the way that parsing actually works.

Parse Code Generation

Refer to the page how to use LLOOP. Ensure that your project is setup properly and that all required GSP file(s) are part of the project. Check the Generated Code Reference page for further information about generated files.

Standalone Compilation and Test

Refer to the page how to use LLOOP. Ensure that your project is setup to build an executable, and that all required external items (additional source code files, libraries...) are specified in the project.

The generator creates the necessary Makefiles and main programs to run the parser. The user has just to click on the appropriate buttons for compiling or executing. Check the Generated Code Reference page for further information about generated Makefiles and main programs.

Integration

A class named Parser is created from the GSP spec file(s). An instance of this class allows to parse an input source (file, string, main arguments...) according to the syntax rules specified in the GSP spec files. See this example. This class is defined in a namespace which is the name of the gsp file, without extension, from which the class was generated.

The GSP language also offers facilities to integrate the parsed data on application level. Further information are provided later on this page here .

Defining Syntax Rules in Extended BNF

Introduction

Contents of structured files are usually made of constant strings on the one hand, and of other strings that are identified by their belonging to a family of strings featuring a common meaning and construction form on the other hand.

A constant string is often referred to with the shortened term constant and a family of string is often referred to as a token.

Whitespaces are those specific chars used to separate from each other constants and tokens.

A syntax rule describes the sequence of constants and tokens as they appear from left to right in the input, regardless of the in-between whitespaces.

Constants

Any random sequence of chars can be considered and used as a constant string. Constants are commonly used as delimiters, for organising and making data distinguishable, or as keywords for commands identification, or more simply to represent fixed values.

Tokens

A token corresponds to a family of strings for which the user specifies the criteria of acceptance. There are numerous well-known string families commonly used by developers, like all strings representing integer numbers, which are made of a random sequence of numeric chars, possibly starting with a minus char for negative values.

Whitespaces

Constants and tokens are usually - but not mandatory - made distinct from each other by introducing whitespaces, i.e. separator chars, like space, new line or tabulation chars. Whitespaces don't count as relevant information and are skipped. Whitespaces are also often useful for organising visually the information and arrange their presentation to make them readable.

Example of constants, tokens and whitespaces

Here's now an example illustrating all this. This example serves only teaching and understanding purposes, regardless of the actual usefulness and suitability for any other purpose. Let's suppose we have the following input giving co-ordinates of successive points of a geometric shape:

Point (x=0, y=0)
Point (x=0, y=4)
Point (x=2, y=2)
Point (x=4, y=4)
Point (x=4, y=0)

Here 'Point' is a constant used as a keyword for declaring a point. The parenthesises '(' ')' are also constants, used to delimit the co-ordinates from the other information. The coma constant ',' is used to make a clear separating between x co-ordinate and y co-ordinate. The assignment chars '=' are constants used to make the co-ordinate values assignments explicit.

The new-line char is used here to make the declared points clearly distinguishable. This whitespace char is not an information in itself, it just helps to arrange visually the declared points.

The actual co-ordinate values are given as integer tokens. These data have a particular meaning and the drawing of the shape depends on their reading and interpretation.

Syntax rules

A syntax rule describes the sequence of constants and tokens as they appear from left to right in the input, regardless of the in-between whitespaces.

Using the BNF notation, a syntax rule describing formally the syntax of a single point for the above mentioned example would be :

'Point' '(' 'x' '=' int ',' 'y' '=' int ')'

Note : Items surrounded by simple quotes correspond to the expected constants. The item 'int' stands here for an integer token.

It is often useful to associate a name to a syntax rule, i.e. for identifying the rule and referring to it through a simple and short reference. This is specified as follows:

Shape ::= 'Point' '(' 'x' '=' int ',' 'y' '=' int ')'

which means the syntax rule named Shape is defined by the sequence of constants and tokens at the right of the meta-symbol ::=.

A rule can refer to another rule using its name, to express that an expected sequence of constants and tokens is defined by the mentioned rule. This allows to define more complex hierarchical syntactical structures.

The name of a rule referred to from another rule is called a non-terminal symbol or more simply a non-terminal, in contrast to constants and tokens which are considered as terminal symbols or terminals, as they do not stand for another syntax rule.

It is also often handy to refer to tokens and non-terminals through the abstract term Symbol.

The syntax rule given above is not complete, because a point can be followed by another point till there is no one left. The exact syntax rule which describes this is actually:

Shape ::= 'Point' '(' 'x' '=' int ',' 'y' '=' int ')' [ Shape ]

The meta-symbols [ ] mean that all symbols enclosed within them are optional.

Generally, complex input can not be represented with just linear sequences. It is often useful to be able to define a set of alternative rules. For example, let's suppose that holes can be defined for geometric shapes as follows:

Point (x=0, y=0)
Point (x=0, y=4)
Hole
Point (x=2, y=2)
Point (x=4, y=4)
Point (x=4, y=0)

meaning that no straight lines shall be drawn between the point defined just before and the one defined just after the hole (here (0,4) and (2,2)). A rule describing formally the syntax of a list of points with holes would be:

Shape ::= 'Point' '(' 'x' '=' int ',' 'y' '=' int ')' [ Shape ]
| 'Hole' [ Shape ]
|

Note :

Now, let's assume that a colour among blue, white and red, respectively identified by the keywords blue, red and white, can optionally be given for a point as illustrated in this example :

Point (x=2, y=2, blue)

To support this, we could introduce the following rule:

Color ::= 'blue' 
| 'red'
| 'white'

and complete the initial Shape rule as follows :

Shape ::= 'Point' '(' 'x' '=' int ',' 'y' '=' int [',' Color] ')' [Shape]
| 'Hole' [ Shape ]
| 

Finally, it is interesting to note that expressing symbols repetitiveness like here is a recurring requirement when specifying syntax rules. Following purely the principles described above and respecting this natural and intuitive process consisting in specifying from left to right the expected symbols, repetitiveness can be introduced by referring to a rule in the rule itself.

BNF defines a special writing that shortens and makes more readable the definitions of repetitive items. For example, the above rule could also be written as follows:

Shapes ::= { Shape }
Shape ::= 'Point' '(' 'x' '=' int ',' 'y' '=' int [',' Color] ')' 
| 'Hole'

The meta-symbols { } mean that all symbols enclosed within them are repetitive (zero or more occurrences).

Syntax rule reduction code

The reduction code is a code which is associated with a syntax rule and is executed whenever the rule could be successfully parsed.

A reduction code is generally used to get and process the parsed values.

While BNF is a programming language-independent notation, a reduction code is written in the target programming language in which the matching parser will be generated.

Grammars

A grammar consists of a top (or main) syntax rule and a set of other syntax rules for all non-terminals. Order of appearance of the other rules is not relevant, provided that they all come after the top rule.

The non-terminal symbol standing for the top syntax rule is called the root non-terminal.

The set of all syntax rules of a grammar forms the syntactical parse tree, because the structure defined by the rules can always be represented with a tree whose leaves are either constants or tokens, and nodes the non-terminals of the grammar. For instance, the syntax rule mentioned in the previous paragraph constitutes by itself a grammar whose syntactical parse tree is the following :

Note : Optional symbols have a dashed frame. Tokens are depicted in red and non-terminals in blue.

LL Grammars

LL Grammars are that class of grammars on which rely LL parsers for their running. LL parsers attempt to read symbols of syntax rules from left to right and top-down in the syntactical parse tree. They perform a leftmost derivation of the input information in that sense that they always attempt to read the leftmost symbol in the parse tree while scanning it from top to bottom till to find tokens or constants. If the leftmost symbol could be read, it attempts to read the next symbol in the rule. If this latter is a non-terminal, it scans down the parse tree below this symbol by performing the same leftmost derivation. It is only once the symbol was successfully read that the next symbol of the rule can be processed.

If the whole rule could be successfully read, the parse engine goes up in the parse tree and attempts to read the symbol of the upper syntax rule that immediately follows the non-terminal standing for the read rule, using the same leftmost derivation. If the parser failed to read the rule, it backtracks up in the parse tree through the previously scanned symbols until to find a valid alternative rule.

The way LL grammars are written must take into account the way LL parsers work. For example, it is obvious that the first symbol of a syntax rule can't be the non-terminal standing for that rule, because this would lead to an infinite left recursion.

The following example shows how a LL Parser does work. Let's assume we have the following input to parse using the above mentioned syntax rules:

Point (x=0, y=0, blue)
Point (x=0, y=4)
Hole
Point (x=2, y=2)

Here's how the parsing process would unwind in LL step by step. The table given below divides into three columns. The first gives the next symbol to be read in the input, the second the expected symbol currently scanned in the grammar and the last the action undertaken by the parser according to the read and expected symbols :

Next token or constant to be read Scanned / Expected symbol in grammar Action undertaken by parser
'Point'Shade (root) - 1st rule Go down in parse tree and attempt to resolve first rule for 'Shade' non-terminal.
'Point''Point'OK : Read and process next symbol in rule.
'(''('OK : Read and process next symbol in rule.
'x''x'OK : Read and process next symbol in rule.
'=''='OK : Read and process next symbol in rule.
'0''int' tokenOK : Read and process next symbol in rule.
','','OK : Read and process next symbol in rule.
'y''y'OK : Read and process next symbol in rule.
'=''='OK : Read and process next symbol in rule.
'0''int' tokenOK : Read and process next symbol in rule.
','','OK : Read and process next symbol in rule.
'blue'ColorGo down in parse tree and attempt to resolve first rule for 'Color'.
'blue''blue'OK : 'Color' non-terminal resolved. Process next symbol in upper rule.
')'')'OK : Read and process next symbol in rule.
'Point'Shade (second) - 1st rule Go down in parse tree and attempt to resolve first rule for 'Shade'.
'Point''Point'OK : Read and process next symbol in rule.
'(''('OK : Read and process next symbol in rule.
'x''x'OK : Read and process next symbol in rule.
'=''='OK : Read and process next symbol in rule.
'0''int' tokenOK : Read and process next symbol in rule.
','','OK : Read and process next symbol in rule.
'y''y'OK : Read and process next symbol in rule.
'=''='OK : Read and process next symbol in rule.
'4''int' tokenOK : Read and process next symbol in rule.
')'','FAILURE : Failed to read optional symbol sequence. Process symbol following optional sequence in rule.
')'')'OK : Read and process next symbol in rule.
'Hole'Shade (third) - 1st rule Go down in parse tree and attempt to resolve first rule for 'Shade'.
'Hole''Point'FAILURE : Constant mismatch. Try an alternative rule of the 'Shade' non-terminal.
'Hole'Shade (third) - 2nd rule Go down in parse tree and attempt to resolve second rule for 'Shade'.
'Hole''Hole'OK : Read and process next symbol in rule.
'Point'Shade (fourth) - 1st rule Go down in parse tree and attempt to resolve first rule for 'Shade'.
'Point''Point'OK : Read and process next symbol in rule.
'(''('OK : Read and process next symbol in rule.
'x''x'OK : Read and process next symbol in rule.
'=''='OK : Read and process next symbol in rule.
'2''int' tokenOK : Read and process next symbol in rule.
','','OK : Read and process next symbol in rule.
'y''y'OK : Read and process next symbol in rule.
'=''='OK : Read and process next symbol in rule.
'2''int' tokenOK : Read and process next symbol in rule.
')'','FAILURE : Failed to read optional symbol sequence. Process symbol following optional sequence in rule.
')'')'OK : Read and process next symbol in rule.
(nothing)Shade (fifth) - 1st rule Go down in parse tree and attempt to resolve first rule for 'Shade'.
(nothing)'Point'FAILURE : Try an alternative rule of the 'Shade' non-terminal.
(nothing)Shade (fifth) - 2nd rule Go down in parse tree and attempt to resolve first rule for 'Shade'.
(nothing)'Point'FAILURE : Try an alternative rule of the 'Shade' non-terminal.
(nothing)'Hole'The fifth 'Shade' (last symbol in rule) is an optional symbol of the fourth 'Shade'. The fourth 'Shade' non-terminal is resolved. Process next symbol in upper rule.
(nothing) The third 'Shade' non-terminal is resolved. Process next symbol in upper rule.
(nothing) The second 'Shade' non-terminal is resolved. Process next symbol in upper rule.
(nothing) The first 'Shade' non-terminal is resolved. This is the root symbol. As there is nothing else to parse and that there is nothing left to parse from the input, the parsing is recognised as successful.

LR Grammars (informative only)

This section has only informative purposes and gives a presentation of the alternative parsing principle (LR parsing). Its intent is primarily to explain how LR parsers work in comparison with LL parsers. It does not provide complete and detailed information about LR parsing nor any example. As LLOOP only generates LL parsers, the following information is not relevant for the strict use of LLOOP and can be skipped.

LR Grammars are that class of grammars on which rely LR parsers for their running. LR parsers attempt to read symbols of syntax rules from left to right and bottom-up in the syntactical parse tree. They perform a rightmost derivation of the input information in that sense that they always attempt to read the rightmost symbol of the rule that will be searched for correspondence with the read symbol sequence.

The way LR grammars are written must take into account the way LR parsers work, which is different of LL parsers. Therefore, LR and LL grammars are most time incompatible. However, the class of grammars that can be accepted for LR parsers is wider than the class of grammars accepted by LL grammars.

Due to their complexity, LR parsers can not be hand-written and their working is difficult to understand. In constrast, LL parser can theorically be designed and written by hand. Their matching grammars are much more natural and easier to understand.

Pre-processsors

It is sometimes useful to pre-process an input before it is actually parsed, e.g to strip the input of useless and irrelevant information (like comments) or to make some string replacements.

The function or object that performs the pre-processing is called a Pre-processor.

Post-processsors

Post-processors are used to scan parsed symbols and process them once the whole input was parsed. The order of processing depends on the post-processor's implementation, but in general it is in the same order as the one symbols were parsed from the input.

The function or object that performs the post-processing is called a Post-processor. In object-oriented language, it is also called a Visitor because the object-oriented postprocessor relies on the Visitor design pattern to visit each symbol.

GSP Grammars and GSP Parsers

Overview

GSP grammars are LL grammars and the generated GSP parsers are LL parsers.

Parsers: Principle

A class named Parser is created from the GSP spec file(s). An instance of this class allows to parse an input (file, string, main arguments...) according to the syntax rules specified in those GSP spec files.

There is class generated for each symbol, respectively each non-terminal and each token. This class defines the following functions:

All generated classes are defined in a C++ namespace which is the name of the gsp file, without extension, from which the class was generated.

When the parser needs to parse an input according to one or several alternative syntax rule(s), it creates an object for the matching non-terminal and calls its parse function.

If the parsing was successful, the reduction code of the matching syntax rule is executed inside the object. Otherwise, the object becomes useless and is destroyed.

Each non-terminal object keeps a reference to all right-hand non-terminal and token objects which were created during parsing according the syntax rule. When a non-terminal object is destroyed, all objects created during the parsing of the non-terminal are also destroyed.

Preprocessors: Principle

There is class generated for each preprocessor. This class defines the following functions:

The pre-processors are invoked by the parser just before the parsing takes place. When preprocessors are defined, parsing is carried out on the preprocessed input, not the original.

Postprocessors: Principle

Post-processors have to be written and handled manually directly in C++.

The framework defines the gsp::Visitor class which should be derived to implement any postprocessor. To carry out post-processing, the post-processor object must be created and the root symbol object must be visited using the visit function.

The Expander and the Backtracer are typical examples of post-processors.

Application-level Integration

A specific set of commands is available to allow integration of the parsed data on application level:

Note that all these commands must be provided prior to the BNF syntax rules definition.

Making inherit a symbol from a user-defined application class is one of the most useful features, because it allows to initialise application objects of that class directly from within the reduction code executed for the symbol object.

The following example is part of the release package. All specification files and generated code can be checked there.

Suppose the following class is used to hold the RGB color values for a pixel:

class CPaletteColor
{
  public:

   CPaletteColor(int r, int g, int b) 
      : m_iRed(r), m_iGreen(g), m_iBlue(b)
   {
   } 

   int red() const { return m_iRed;}
   int green() const { return m_iGreen;}
   int blue() const { return m_iBlue;}

  protected:

    int m_iRed;   // red color
    int m_iGreen; // green color
    int m_iBlue;  // blue color
};

Suppose now we want to read a palette of RGB colors from an input like the following and put the color values into CPaletteColor objects:

white = rgb(255, 255, 255)
black = rgb(0, 0 ,0)
green = rgb(0, 255 ,0)
red = rgb(255, 0, 0)
blue = rgb(0, 0, 255)
yellow = rgb(255, 255, 80)

The following gsp spec allows read such an input. It is the symbol PaletteColor that defines the needed syntax rule. The symbol PaletteColor also inherits from the class CPaletteColor. When a PaletteColor is created and successfully parsed, it initialises directly the matching base CPaletteColor object from within the reduction code.

import word; /* import token defined in word.gsp */
import uint; /* import token defined in uint.gsp */

symbol PaletteColor extends class CPaletteColor(0, 0, 0);

Palette ::= { PaletteColor } 
{{ 
  printf("\nPalette defines %d color(s)\n", $1(#));
}}

PaletteColor ::= word '=' 'rgb' '(' uint ',' uint ',' uint ')' 
{{
  m_iRed = $2;
  m_iGreen = $3;
  m_iBlue = $4;
}}

This gsp spec does not only allow to parse the required input, but it also allows to get the user application objects created and initialised at the same time.

The following C++ code shows how to use the code generated from the above gsp specification in order to parse an input according to this specification and how to get back the created CPaletteColor objects:

#include "palette__Parser.h"
#include "palette__Palette.h"

#include "palette__PaletteColor.h"// Inserted manually to initial generated program

using namespace std;

int main(int argc, char** argv)
{
  /**
   * The main program takes the name of a file to parse
   */

  if (argc != 2) 
    {
      cerr <<  "Usage: " << argv[0] << " " << endl;
      return -1;
    }
      
  /**
   * Run the parsing and check for failures.
   */

  palette::Parser parser(argv[1]);  

  parser.run();

  if (parser.fail())
    {
      if (parser.syntaxerror())
	{
	  parser.outputFailureContext(cerr,4);
	}
      parser.outputFailureMessage(cerr);
      cerr << endl;
      cerr << parser.getStreamName() << ": parse error." << endl;
    }

  /* On success, expand parsed object.
   */

  if (!parser.fail())
    {
      cout << "This is what you have parsed:" << endl;
      parser.root().expand(cout);

      /* This was inserted manually to initial generated program
       * to show how to get the CPaletteColor objects
       */

      unsigned long i = 0;
      palette::Palette* pRoot = &parser.root();
      for (i = 0; i  < pRoot->getPaletteColor()->getNbComponents(); i++)
	{
	  palette::PaletteColor* p = pRoot->getPaletteColor()->getComponent(i);
	  CPaletteColor& color = *p;
	  printf("rgb values: %d %d %d \n", color.red(), color.green(), color.blue());
	}
    }

  /**
   * Destroy parser and root object.
   */
  
  if (!parser.fail())
    {
      palette::Palette* pRoot = &parser.root();
      parser.root().factory().destroy(pRoot);
    }

  return 0;
}

Links

Check the GSP Language Reference page for further information about how to write gsp grammars and all available features.

Check the Generated Code Reference page for further information about generated files and all other functions available for non-terminal and token objects.

Check also the examples.


This file is part of the LLOOP Reversible Object-Oriented Parser Generator. Copyright (c) 2005-2006 Michel MEHL, France. All rights reserved. LLOOP is distributed by the company ERSA SaRL.


Copyright (c) 2005-2006 Michel MEHL, Haguenau, France
LLOOP version 1.1