
Specifying Programming Language Syntax

Lecture Notes for Com Sci 221, Programming Languages

Last modified: Fri Jan 27 15:14:16 1995


Comments on these notes

Continue Monday 16 January

Syntax is a kind of structure

Definition time again, using Webster:

syntax:
    the arrangement of words as elements in a sentence to show their relationship; sentence structure

The Oxford English Dictionary is a tad more abstract:

syntax:
    orderly or systematic arrangement of parts or elements; constitution (of body); a connected order or system of things
In any case, syntax properly refers to the structure of an utterance in some language (it might be natural language or programming language, spoken or written, etc.) in terms of the way that it is constructed to carry meaning. In written language, syntax lies in between orthography or typography, which are concerned with the way a text is laid out on paper or a computer display, and semantics, which is concerned directly with meaning. In English, the interesting syntactic concepts include sentence, subject, predicate, object, prepositional phrase, noun, verb, adjective, etc. In programming languages they typically include program, header, declaration, body, command, expression, identifier, etc.

Programming language syntax has been formalized very successfully into a cookbook discipline. So, we can deal with it briefly and efficiently, and get on to our informal concern with the much more difficult issues of semantics.

Formal notations for programming language syntax

There are three famous methods for specifying programming language syntax: context-free grammars (CFGs), Backus-Naur Form (BNF), and syntax charts.

The good news is that all three are essentially the same: they differ only in minor notational ways. Even if you think that you haven't heard of these, you have almost certainly used at least one of them informally. If you ever diagrammed sentences in grammar school, you should see that the rules for sentence diagrams are essentially the same as CFGs, BNFs, and syntax charts, while the diagrams themselves are essentially the same as the derivations made using those notations. Pages 19-20 in the text show an example of the same information presented in BNF and in syntax charts. The CFG version is shown in Figure 2 below.
E --> T
E --> T + E
E --> T - E
T --> F
T --> F * T
T --> F div T
F --> ( E )
F --> V
F --> C
Figure 2: example of a Context-Free Grammar

In CFG jargon, each line is a production. E, T, F, V, and C are called nonterminal symbols. +, -, *, and div are called terminal symbols. The terminal symbols are those that are actually used in the language. The nonterminals are only intended to be replaced by terminals, following the rules given in the productions. The example in Figure 2 is incomplete: there should be more productions to allow Vs and Cs to be replaced by terminals.
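To make the idea of a production concrete, here is a small sketch (mine, not from the text) of the Figure 2 grammar written as Python data, together with one leftmost derivation checked against it. The productions for V and C are completions of my own choosing, along the lines suggested above.

# The Figure 2 grammar as Python data: each nonterminal maps to the list of
# right-hand sides of its productions.  The productions for V and C are my
# own additions, just to make the grammar complete.
grammar = {
    "E": [["T"], ["T", "+", "E"], ["T", "-", "E"]],
    "T": [["F"], ["F", "*", "T"], ["F", "div", "T"]],
    "F": [["(", "E", ")"], ["V"], ["C"]],
    "V": [["x"], ["y"], ["z"], ["w"]],
    "C": [["0"], ["1"], ["2"], ["3"], ["4"]],
}

# One leftmost derivation of the string  x + 2 * y,  written as a list of
# sentential forms.  Each step replaces the leftmost nonterminal using one
# production.
derivation = [
    ["E"],
    ["T", "+", "E"],            # E --> T + E
    ["F", "+", "E"],            # T --> F
    ["V", "+", "E"],            # F --> V
    ["x", "+", "E"],            # V --> x
    ["x", "+", "T"],            # E --> T
    ["x", "+", "F", "*", "T"],  # T --> F * T
    ["x", "+", "C", "*", "T"],  # F --> C
    ["x", "+", "2", "*", "T"],  # C --> 2
    ["x", "+", "2", "*", "F"],  # T --> F
    ["x", "+", "2", "*", "V"],  # F --> V
    ["x", "+", "2", "*", "y"],  # V --> y
]

def leftmost_step_ok(before, after):
    """Does `after` follow from `before` by expanding the leftmost nonterminal?"""
    for i, sym in enumerate(before):
        if sym in grammar:                      # leftmost nonterminal found
            return any(after == before[:i] + rhs + before[i+1:]
                       for rhs in grammar[sym])
    return False                                # nothing left to expand

assert all(leftmost_step_ok(a, b) for a, b in zip(derivation, derivation[1:]))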

There is no essential difference between a CFG and a BNF. In a BNF, the derivation symbol in each production is written ::=, instead of -->. The nonterminals are written as words or phrases inside pointy brackets (< ... >), instead of capital letters. The terminals are quoted. None of this should be very exciting. Syntax charts look quite different at first glance, but Homework 2 shows by example that they are also essentially the same as CFGs and BNFs. From now on, I will refer to CFGs when I discuss the qualities shared by all three notations.
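Just to make the notational difference concrete, here is my own transcription of the first few productions of Figure 2 into BNF (the text's own BNF for this example is on pages 19-20):

<expression> ::= <term>
<expression> ::= <term> "+" <expression>
<expression> ::= <term> "-" <expression>
<term>       ::= <factor>
<term>       ::= <factor> "*" <term>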

There are some serious notational conveniences, such as the iterative and conditional notations in EBNF, that do not affect the basic power of CFGs and similar systems. Make sure that you understand how each extended notation may be translated into pure CFG notation.
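For instance, the iteration notation (written here with curly braces for "zero or more repetitions"; my own example, not copied from the text) can always be eliminated by adding a recursive production:

EBNF:      E ::= T { "+" T }

pure CFG:  E --> T
           E --> E + T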

What's so great about Context-Free Grammars?

Books and articles in theoretical computer science, and even in programming languages and linguistics, emphasize the use of CFGs to define which strings of symbols are and are not syntactically correct members of a given language. That is not really the important value of CFGs. The truly important use of CFGs is to define a way to parse syntactically correct strings: that is, to associate with a string a tree structure (called a derivation tree or parse tree, or sometimes just a parse) presenting the syntax of the string. The parse of a program is a much better starting point for interpreting or compiling the program than the plain source text.


Make sure that you understand precisely how parse trees are associated with strings by a CFG. See Sections 10.1 through 10.2 of the text for details.

CFGs are wonderful because they are simultaneously readable by humans, and suitable as the basis for completely automatic parsing. In effect, CFGs are a sort of highly self-documenting programming language for parsers. They are included in programming language manuals as the last resort documentation of syntactic issues. And, they are processed by parser generators, such as Yacc and Bison, which compile them into parsing code.
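To give a feeling for what parsing code looks like, here is a small hand-written sketch in Python of a recursive-descent parser following most of the grammar in Figure 2. The code a real parser generator like Yacc or Bison emits looks quite different, but the basic correspondence, one procedure per nonterminal and one case per production, is the same idea.

# A sketch (mine, not the output of any particular parser generator) of
# recursive-descent parsing code for the grammar of Figure 2.  Each
# nonterminal becomes one procedure, and each procedure returns a parse tree
# as a nested tuple, together with the position of the next unread token.

def parse_E(tokens, i):
    # E --> T  |  T + E  |  T - E
    left, i = parse_T(tokens, i)
    if i < len(tokens) and tokens[i] in ("+", "-"):
        op = tokens[i]
        right, i = parse_E(tokens, i + 1)
        return ("E", left, op, right), i
    return ("E", left), i

def parse_T(tokens, i):
    # T --> F  |  F * T  |  F div T
    left, i = parse_F(tokens, i)
    if i < len(tokens) and tokens[i] in ("*", "div"):
        op = tokens[i]
        right, i = parse_T(tokens, i + 1)
        return ("T", left, op, right), i
    return ("T", left), i

def parse_F(tokens, i):
    # F --> ( E )  |  V  |  C
    if tokens[i] == "(":
        inner, i = parse_E(tokens, i + 1)
        assert tokens[i] == ")", "expected a closing parenthesis"
        return ("F", "(", inner, ")"), i + 1
    return ("F", tokens[i]), i + 1   # V and C collapsed into a single leaf

tree, _ = parse_E("x + y * 3".split(), 0)
print(tree)
# ('E', ('T', ('F', 'x')), '+', ('E', ('T', ('F', 'y'), '*', ('T', ('F', '3')))))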

CFGs represent an incredible success story in computer science. In the olden days, when FORTRAN was just being invented, the problem of parsing a program was the subject of Ph.D. dissertations. Now, the automatic processing of CFG specifications allows college students in a compiler writing course to solve parsing problems routinely. The first automatic parser generators were so exciting that people called them "compiler compilers." Of course, a parser generator merely frees the implementor of a compiler to spend her time on the really hard part: generating good code.

The marvellously self-documenting quality of CFGs arises because, when they are constructed wisely, the nonterminal symbols of CFGs represent sensible syntactic categories. For example, in the CFG for arithmetic expressions in Figure 2 above, the nonterminal symbol E represents the category of expressions, V represents variables, and C represents constants. Similarly meaningful categories, such as statements, declarations, etc., also correspond to particular nonterminal symbols in a complete CFG for a whole programming language. But, there are also nonterminal symbols, such as T and F in Figure 2 above, that do not correspond to grammatically meaningful and useful categories. Sure, T is supposed to stand for "term," and F for "factor," but those are not particularly important grammatical categories in a program. Rather, T and F are gimmicks, added to the grammar to enforce the normal rules giving * and div precedence over + and -.

You must learn to distinguish, based on intuition and common sense, the grammatically meaningful parts of the structure determined by a CFG from the gimmicks. There are some extended notations for CFG that reduce the dependence on gimmicks, but a lot of programming language manuals still give the gimmicks equal status with the meaningful symbols.

End Monday 16 January
My lecture in class did not follow the order of the lecture notes very well here, so this cut point is particularly fuzzy.
%<----------------------------------------------
Begin Wednesday 18 January

What is programming language syntax, precisely?

I refuse to be absolutely precise, but I'll come much closer than with other definitions. For almost all purposes, almost all of the syntactic qualities of almost all programming languages may be regarded as a syntax tree. Most people in computer science say "abstract syntax tree" instead of "syntax tree," because they never looked up the definition of syntax, and they think that "syntax" by itself (or "concrete syntax" when they want to be more pedantic) means typography, rather than structure.

So, syntax is a tree. But what tree? Well, given the right CFG for a language, the syntax tree of a program is almost the parse tree, except that the terminal symbols and gimmicks are taken out, and the natural conceptual operators are put in. This is best understood by example. Consider the expression

x + y * 3 + z * (w - 4)
The parse tree, using the grammar in Figure 2 (with some obvious additional productions to get rid of Vs and Cs), is shown in Figure 3. The most usual idea of the syntax tree is shown in Figure 4. Notice that the extra steps involving T and F have been omitted, since they are really just gimmicks to enforce precedence. At each node of the tree, instead of the nonterminal from the parse tree, we have the operator that is being applied. Reasonable people may disagree over fine points in the construction of syntax trees. For example, if + and * are understood as operations combining more than two operands (which is suggested by the EBNF version on p. 19 of the text), then we might prefer the syntax tree of Figure 5, which treats the iteration of the production E --> T + E as a gimmick rather than a structural step.

I wrote out "add," "mult," etc. in this example to emphasize that the operation is not the same thing as the terminal symbol (+ or *) that corresponds to it so naturally. In the future, I will use the most convenient and mnemonic symbols in syntax trees, which will often be the same as the symbols in the "concrete syntax." In other examples, such as the if ... then ... else ... fi example in Figure 6, there is no clear 1-1 correspondence between "concrete" symbols and "abstract" operators. Notice that in popular mathematical notation, there is often no terminal symbol at all to denote multiplication. Also, parentheses do not correspond to anything in the syntax tree. Rather, they are a gimmick involving terminal symbols, used to control the shape of the syntax tree.
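To fix the idea, here is a rough sketch (mine; the precise shapes are in Figures 4 and 5) of the two syntax trees for the expression above, written as nested Python tuples of the form (operator, operand, ...):

# In the spirit of Figure 4: binary operators, with the T, F, and parenthesis
# steps of the parse tree gone.  (How the two additions group is one of the
# fine points reasonable people may disagree over.)
binary_tree = ("add",
               ("add", "x", ("mult", "y", 3)),
               ("mult", "z", ("sub", "w", 4)))

# In the spirit of Figure 5: "add" applied to any number of operands, treating
# the iteration of  E --> T + E  as a gimmick rather than a structural step.
flat_tree = ("add",
             "x",
             ("mult", "y", 3),
             ("mult", "z", ("sub", "w", 4)))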

Entire programs, as well as expressions, have natural syntax trees. In principle, there is nothing at all subtle about associating a syntax tree with a program, but many students confuse syntax trees with the similar-looking, but very different, flow charts. A syntax tree shows the structure of the program as it is constructed from its parts. A flow chart shows the structure of the execution of the program. The best way to understand syntax trees for programs is to study carefully the example in Figure 7, which gives a syntax tree for the program in Figure 8 below.


read(i);
while i>1 do
   if odd(i) then
      i := 3i+1
   else
      i := i/2
   fi
od
Figure 8
Puzzle: does the program above halt for all inputs?
Notice that the syntax tree is more closely connected to the typical indentation of the program than to a flow chart.
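In the same spirit, here is a rough nested-tuple rendering (mine, not Figure 7 itself) of a syntax tree for the program in Figure 8:

# The top node sequences the two statements; the while and if nodes hold their
# conditions and bodies as subtrees, much as the indentation of the program
# suggests.  The operator names are my own choices.
program_tree = (
    "sequence",
    ("read", "i"),
    ("while", (">", "i", 1),
        ("if", ("odd", "i"),
            ("assign", "i", ("add", ("mult", 3, "i"), 1)),   # then branch
            ("assign", "i", ("div", "i", 2)))))              # else branch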


End Wednesday 18 January
My lecture in class did not follow the order of the lecture notes very well here, so this cut point is particularly fuzzy.
%<----------------------------------------------
Begin Friday 20 January


There is a small amount of additional material, which I have no time to type in, regarding precedence, associativity, ambiguity, and in particular the dangling else problem.