Creating a Compiler for your own Language August 2013 University of Catania Andrea Tino How to design a programming language and develop a compiler/interpreter to make it alive. Today’s most common tools and basic theoretical background. Tech article
Andrea Tino - 2013 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http:// creativecommons.org/licenses/by-nc-sa/3.0/. License This work is not complete. You may find errors, mistakes and wrong information inside. Draft
pag 3 Andrea Tino - 2013 About this work This work is the result of a few weeks I spent studying for my exam about compilers and interpreters; I had to struggle to find all the information I needed: this took time both for retrieving good studying material and then sorting and putting all the pieces in the correct place and order. Well, I could have just followed classes, you are definitely right, but I could not, as I was abroad for an internship and once back I had to take an exam without having attended any lesson. I thought I could provide students with a good reference for their exams, something they could use both for studying and for having a quick look when something does not come to mind. Definitely this is not (at the moment) something meant to 100% replace a real book about such a big and complex subject. So do not misunderstand me: do not skip classes, take your notes and use them. Maybe you can just avoid buying one of those holy bibles about compilers with 1000 pages of which you will read only 200. Who is this for? This (quasi)book is for everyone, but especially students in an Engineering course at university. It requires some background in languages, programming and Computer Science. Respect the rights This work is meant to be free and it is licensed accordingly, as reported in the previous page. You can download and read it, you can copy it and distribute it as it is. Please do not take parts from here without attributing them to me (since I spent time on this and nothing inside these pages comes from a cut & paste). You are NOT allowed to sell this book or parts of it. Thank you. Reporting errors This work might contain errors, typos and mistakes. If you detect a problem and you want to help me fix it, please report it to me at andry.tino@gmail.com. Thank you :)
Andrea Tino - 2013 Introduction to compilers Human language appears to be a unique phenomenon, without significant analogue in the animal world. “ “ Noam Chomsky http://www.brainyquote.com/quotes/quotes/n/ noamchomsk398152.html
pag 5 Andrea Tino - 2013 What is a compiler? A compiler(#) is a program allowing the translation from one language to another. So a compiler is not necessarily a program like GCC or G++ that translates C or C++ into machine code. Adobe Acrobat Reader is a compiler, for example: it enables the translation from an Adobe proprietary language to a readable document. A compiler is also a program that transforms a sequence of mathematical calculations into an image; vector graphics work this way. Basically a compiler is nothing more than a translator from one format to another one. The input of a compiler is a language called the Source Language. What a compiler returns is the output language: the Object Language. If something goes wrong during the compilation process, the compiler might return some errors as well, together with a non-working output. Theoretically speaking When building a compiler, a lot of elements get involved in the process. The theoretical basis is quite large, spanning from Memory Management Algorithms and Graph Theory to many mathematical fields like Lattice Theory, Posets, State Machines, Markov Chains, Stochastic Processes and so on. Such knowledge is not optional. You’re not alone But it is not all bad news; there is some good news as well. When designing and developing a compiler, some tools available out there can be of help. Many of them can assist the developer during the most important phases of compiler programming. Here we will see some of them at work and we will also detail how to take advantage of them for our purposes. Classification It is not possible to speak about a programming language and its compiler by keeping them separated. When trying to create categories for compilers, we need to create a classification of languages. There are many ways to classify compilers. Number of steps (#) With the word “compiler“ we also refer to interpreters.
pag 6 Andrea Tino - 2013 When a compiler runs, it processes the source code in order to produce the target language. This operation can be performed in one step or more. A step is one pass over the source code, so the number of steps is the number of times the source code is read by the compiler. One-step compilers are very fast, while multiple-step compilers usually take more time to complete the task. A multiple-step compiler may also need to jump from one part of the code to another; this includes the possibility of going back to some statement read before. Optimization Some compilers are developed with specific targets in mind. When talking about optimization, one must always think about a target function depending on some parameters; the purpose of an optimization problem is to minimize or maximize the value of this function. Depending on the quantity expressed by the target function, the optimization is focused on a specific parameter. • No optimization: No optimization is performed by the compiler. • Memory optimization: The compiler tries to minimize the memory occupation of the output. • Time optimization: The compiler tries to be as fast as possible during the translation process. • Power optimization: The compiler tries to minimize the power usage during the translation process. In the software industry, large source files take hours to compile; using compilers that optimize power allows companies to save energy (and, so, money). Optimization is not a rare feature of compilers. Interpreted vs. compiled languages We can develop a compiler or an interpreter. They both pursue the same goal, but the difference between a compiler and an interpreter lies in how the language is processed before being executed and during its execution. Compiled languages For a compiled language, when the programmer writes a program, this is compiled and turned into some other language. This process occurs only once. After it, the compiler’s output is something that does not require the compiler anymore. When a language is compiled into a program, considering the special case of executable code, this can be executed without calling the compiler anymore; the machine can execute all operations since it has everything it needs. Interpreted languages When a language is interpreted, on the other hand, the compiler will process one language statement at a time, and the output will be provided to the machine (or to some other component) in order to execute some operation. The source code will be inspected many times, one statement at a time; every statement, after being compiled, is passed to the machine. Processing a single statement is fast, but the user will experience a slower output as
pag 7 Andrea Tino - 2013 every new part of the source language is compiled every time it is needed. This also makes the interpretation process much less resource-demanding. Hybrid solutions The world is not black & white. The fight is not a one-on-one between compiled and interpreted languages. In the last 20 years, new hybrid solutions were developed. The comparison below summarizes the three approaches.
Compilation time. Compiled languages: they take much time to be processed; for large sources, it might even require days. Interpreted languages: processing each statement is very fast and takes a few seconds. Hybrid solutions: compilation to bytecode and to machine code happen at different times, which makes the whole compilation faster.
Execution time. Compiled languages: very fast. Interpreted languages: for long sources, execution may cause freezing and low performance. Hybrid solutions: virtual code must be interpreted, which slows down performance; JITs can be used to speed up the process.
Resources at compilation time. Compiled languages: they may need a lot of memory to complete; very resource-demanding. Interpreted languages: the allocated space is not much and is (nearly) freed after every statement is processed. Hybrid solutions: very few resources are used since the bytecode is machine independent.
Output code size. Compiled languages: outputs can be very large depending on sources. Interpreted languages: lightweight outputs. Hybrid solutions: bytecode is very light.
Design complexity. Compiled languages: many components involved, many software patterns can be applied; high complexity. Interpreted languages: very simple to design, few components and loose coupling. Hybrid solutions: the compiler to bytecode is simple to design, but the virtual machine must be designed for each supported architecture.
Implementation complexity. Compiled languages: many components, many tools, more programming languages may be used; high complexity. Interpreted languages: easy to implement, fast algorithms and many tools. Hybrid solutions: the compiler to bytecode is simpler to implement; the virtual machine has a different implementation for each supported architecture.
Target language type: machine-dependence Based on the type of the target language, we can find three different types of compilers: • Pure machine code: These compilers generate machine code for a specific machine instruction set, thus able to run on a specific architecture. Such code makes no assumptions about the system it runs on: no external library, no system function is used. The code is pure, meaning that it is handled by the machine at the lowest level of the architecture. This approach ensures high performance; for this reason it is quite common when developing
pag 8 Andrea Tino - 2013 operating systems. • Augmented machine code: The object code is generated for a specific machine instruction set, but it is enriched with calls to some system functions and libraries (system calls). They are often intended to handle very basic operations such as memory allocation, deallocation, I/O access and so on. Augmenting the code makes the output not purely machine-dependent: a considerably large part of the target code now depends on the operating system too. All calls to the system are compiled as they are and not really translated; in this case, however, the compiler will not produce working code on its own, and an external component is necessary in order to link all system calls to the corresponding system functions (thus, the Linker). Loose coupling between the compiler and the machine guarantees high performance and good levels of maintenance. Many general purpose applications can be developed using this approach. • Virtual machine code: These compilers generate virtual machine code. It is often referred to as Intermediate Code or Bytecode, as it cannot really be handled by the machine; it needs a Virtual Machine to translate this code into real machine code. With this approach a new layer is added between the compiler and the machine: the virtual code is interpreted by a program that must be installed in the operating system. The pro of this solution is that it is possible to have truly machine-independent languages; the machine dependence is moved onto the virtual machine, which is in charge of translating the virtual (machine independent) code into machine code. Today, virtual machine code compilers are very common. Sun Microsystems was the pioneer of this technology and, after Java, many other languages started blooming: Haskell, Erlang, Python, Perl, the .NET Framework and so on. They all use virtual code. Again on virtual code Although such compilers ensure good machine decoupling and maintenance, some cons are to be pointed out. First of all is execution time: virtual machines can slow down execution by 3 to 10 times. It is a considerable performance degradation, but not an unexpected one. To solve the problem, JIT compilers are used. A Just-In-Time compiler is a virtual code compiler that runs in the virtual machine and compiles those parts of the virtual code that might slow down the application. So when the virtual machine executes the bytecode and encounters heavy-work parts, it will find them already compiled into machine code. Everything’s virtual today Modern compilers for general purpose applications, today, nearly all target virtual code. Target language format How does a compiler format the output language? Compilers can create target code in 3 different formats: • Assembly language: Assembly text code is generated. This code provides very little
pag 9 Andrea Tino - 2013 abstraction from the machine code. The final translation to machine code is left to the Assembler. The abstraction level is not that high, but it provides a way to write the assembly manually. Memory addresses, jump destinations and other elements are translated into machine format after assembling. Assembly code is very good especially when targeting several machines; they only need to share the same architecture (cross-compilation). • Relocatable binary: Output code is generated in binary format, but some external links are made to functions and libraries. This makes the output a non-working one, since all components must be compiled separately and then linked together to generate a final executable binary. The linkage is a very important operation and this approach allows the possibility to create modules to be handled separately, though they are tightly coupled. • Memory image: Output is a final executable binary. This approach ensures fast compilation but produces a single output without links and components. Which means that every single change in the source causes the compilation process to be run again and the old binary to be replaced. Memory-image compilers are not very common today. The code format actually defines the level of abstraction of the output code from the machine code. Dissecting a compiler A compiler, in its most general representation, can easily be seen as a chain of sequential components. These components can be divided into two groups: • Source code analysis: Every component is in charge of creating a different representation of the initial source code. Each new representation enables the next representation to be generated. All of them are a translation of the source code; no output code is considered yet. • Output code synthesis: The final component in the analysis chain will return a representation starting from which the synthesis components can generate the output code. We can find 6 different components inside a compiler:
pag 10 Andrea Tino - 2013 Analysis phase: 1. Lexical Analyzer (output: Tokens): Converts the source code into a sequence of recognized tokens. 2. Syntax Analyzer (output: AST): Converts the token sequence into an Abstract Syntax Tree for parsing. 3. Semantic Analyzer (output: AAST): Analyzes the AST, looks for errors and performs type checking. Synthesis phase: 4. Intermediate Code Generator (output: IR): Produces an initial representation of the final target code. 5. Code Optimizer (output: OIR): Optimizes the code given a certain policy. 6. Code Generator (output: Target code): Generates the final code. Each phase transforms the input source from one representation to another until the target output. Phase 1: Lexical analysis The first thing to do is transforming the source code into a uniform and compact representation: a sequence of tokens. This process is called tokenization and is carried out in order to recognize lexemes inside the source stream. In Lexicography, a branch of Linguistics, a lexeme is a unit of morphological analysis. Lexical analysis recognizes valid lexemes inside the source stream and converts it into a list of tokens. Each token is a lexeme and each one of them is associated with a particular meaning (namely, the type of the token). In order to generate the final list of tokens, the scanner (or tokenizer) first removes every comment or unneeded element from the source. Code inclusion directives are processed and a cleaned version of the source stream is ready for tokenization. To better understand the work of a scanner, let us consider the following line of C code and try to scan it: int number = 3 + myvariable; When this statement is evaluated by the C scanner, the returned token list is the following: int (Keyword), number (Identifier), = (Operator), 3 (Literal), + (Operator), myvariable (Identifier), ; (EOS). In order to create a scanner, 2 things are usually needed: • Regular Expressions: They can be used to describe tokens.
pag 11 Andrea Tino - 2013 • Finite State Automata: FSAs can be used to check the lexical correctness of the source stream to process. Using them it is also possible to recognize lexical errors. Generally, a lexer (lexical analyzer) can be fully described by an FSA. Although a lexer can be developed by hand, there are a lot of tools for automatic lexer generation. Phase 2: Syntax analysis When the token list is returned, it can be processed by the parser. The parser is responsible for grouping tokens together in order to form phrases. To create phrases it is necessary to understand what sequences of tokens can be accepted by a language. To do so, grammars are considered. A grammar is nothing more than a structure made of recursive rules that define the syntax of a language. When a grammar is defined, the parser can group tokens together (according to the grammar) and generate the Abstract Syntax Tree (AST) of the source code. For example, consider the following fragment of code: a = (3 * a) / ((b - c) * (b + c)); When the token list is returned, the AST generated by the parser for the reported fragment of code is shown in the figure. Scanner vs. parser When compiling a language, it is not necessary to have both a scanner and a parser; sometimes it is possible to talk about scannerless compilers. Having a unique component for lexical and syntax analysis is not so rare, even though they cover different aspects of the source code. One thing to point out is that lexers can only handle the non-recursive constructs of a language, while recursive ones must be treated by parsers. [Figure: AST for a=(3*a)/((b-c)*(b+c)), with the assignment at the root.] For a very simple expression like a=(3*a)/((b-c)*(b+c)) a parser will generate the AST shown in the figure. Each node of the tree represents an operation and its children represent the operands. In this case, for operators, the AST is very simple. Expressions involving binary operators are parsed into binary trees. When the code gets more generic and does not involve operators only, the AST can become more complex.
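To make the token list and the AST above a bit more concrete, here is a minimal sketch in C++ of the data structures a scanner and a parser might produce; all names (TokenType, Token, Node) are invented for the example, and real compilers use much richer structures.

#include <iostream>
#include <memory>
#include <string>

// The kind of lexeme recognized by the scanner.
enum class TokenType { Keyword, Identifier, Operator, Literal, EndOfStatement };

struct Token {
    TokenType type;
    std::string lexeme; // the matched text, e.g. "int", "number", "="
};

// A node of the AST: a leaf (identifier or literal) or an operator with two children.
struct Node {
    std::string value;                 // e.g. "a", "3", "*", "/"
    std::unique_ptr<Node> left, right; // null for leaves
};

std::unique_ptr<Node> leaf(const std::string& v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}

std::unique_ptr<Node> op(const std::string& v, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->value = v;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

int main() {
    // AST for (3 * a) / ((b - c) * (b + c)), built by hand the way a parser would.
    auto ast = op("/",
        op("*", leaf("3"), leaf("a")),
        op("*", op("-", leaf("b"), leaf("c")),
                op("+", leaf("b"), leaf("c"))));
    std::cout << "root operator: " << ast->value << std::endl; // prints "/"
}

The assignment to a would simply add another operator node ("=") as the new root, with the identifier a as its left child and the tree above as its right child.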
pag 12 Andrea Tino - 2013 Parsing approaches Syntax analysis can be performed in 2 ways: • Top-Down: Also called descending parsers, they start building the AST from the root node down to the final leaves. Being an intuitive way to proceed, these parsers can easily be developed manually with efficient implementations. • Bottom-Up: Also called ascending parsers or shift-reduce parsers, they start from the leaves in order to reach the root of the AST. Efficient bottom-up parsers are not easy to implement without software tools. For each type of parser many algorithms can be considered. Phase 3: Semantic analysis The final step of source analysis is the semantic analyzer. While the parser checks that all tokens are used together to form a proper phrase in the language, the semantic analyzer verifies the sequence of statements to check whether they form a sensible set of instructions in the language. For example, in almost all procedural and OO programming languages, variables must first be declared and then used; if an attempt to access an undefined variable is made in the source code, the analyzer will report it as an error, even though the lexer and the parser did not encounter any problem on their way. Type checking One of the most important tasks carried out by the semantic analyzer is type checking. Almost all languages support explicit or implicit type definition, and the time to deal with types comes at semantic analysis. When scanning the AST, type checking is performed for those nodes involved in type semantics (identifiers, operators, functions, etc.). If the local type check succeeds for a node, then that node is decorated with attributes adding info about types. Sometimes type checking is also performed at parse time: for example, when creating the AST for an operation, the parser can check whether all operands have the expected types. Finally, consider that types are something related exclusively to the source code. Scope and naming Names and scope management are handled during semantic analysis. Some languages introduce concepts similar to namespaces (packages for Java and Python); in this case name resolution is performed to look for undefined identifiers used in the code. Scope as well is analyzed, in order to deal with member shadowing for example. In OO languages, method overriding and overloading are handled by the semantic analyzer as well. The semantic analyzer returns the same AST provided as input by the parser, but with more information added to nodes. We can call it an Augmented AST because of the attributes in each node. If errors are found, they are returned as well. Phase 4: Intermediate code generator After analysis, it is time for output code synthesis. An Intermediate Representation is generated; this
pag 13 Andrea Tino - 2013 is not the final code, but a higher abstraction which is independent from the machine. From the AST the translator extracts elements that can be converted, typically, into one or more instructions of the IR. If an element is semantically correct, it can be translated. For example, consider the following statement in a generic intuitive language: variable1 <= variable2 * variable3 + 1; It is semantically correct, so identifiers are translated with their ids in the symbol table: _id1 <= _id2 * _id3 + 1; And then the statement is translated into the following 3 instructions using a very simple and intuitive 3-address code: MULT _id2, _id3, _id4 ADD _id4, ‘1‘, _id5 MOV _id5, _id1 The IR plays an important role for compilers. It is also important to point out that translation is a process strongly related to semantics. When translating a construct into IR, semantics drives the process; the real meaning of a construct inside the AST is not evident until it is processed by the semantic checker. Furthermore, concerning the IR, the less information regarding the target language it carries, the better for the compiler. Double IR Some compilers can generate 2 IRs. The first one is a high-level IR which is strictly bound to the source language. The second one is a low-level IR, more target-language oriented. This approach can be very efficient, especially when the source language changes syntax or semantics. No-IR compilers Some simple compilers for simple languages can omit the IR step. Direct translation is performed. This approach can be adopted for very simplistic languages only, as the lack of modularization prevents code optimization. Such compilers, in fact, cannot have an optimizer. Components Generally speaking, a code generator is made of some common elements: • Instruction selector: Decides what instruction must be issued. • Instruction scheduler: Arranges chosen instructions in a sequence for execution. • Register allocator: Decides which registers should be allocated for variables. Each component may also work to (lightly) optimize a certain objective function (in case a code optimizer is not included in the compiler). Phase 5: Code optimizer When the IR is submitted to the next component of the compiler, the optimizer tries to re-arrange
pag 14 Andrea Tino - 2013 the code in order to provide, at the end of the process, another IR satisfying the same semantics as the previous code, but optimizing a certain objective function at runtime. Code optimization is not a simple task; on the contrary, it raises a wide range of mathematical problems whose solution is not always guaranteed. An optimizer usually works with the same components as a normal code generator. Considering that the way instructions are arranged plays a very important role in the process, the instruction scheduling algorithm often represents one of the most enhanced elements in the architecture of an optimization block. However, optimization cannot always be carried out in every context. As previously stated, code optimization cannot always be achieved, given the kind of mathematical problems it raises. Some of them can show NP complexity, for example. Among such problems we can find, for example, the Register Allocation problem and the Orphan Code Elimination problem; both of them have NP complexity and sometimes they can be undecidable. Phase 6: Code generator The final step is converting the IR (optimized or not) into the output code. The code generator needs a lot of information regarding the machine onto which the final code will run (in the special case of programs and applications). Some notable compiler architectures The structure of a compiler presented so far is the most generic one. Real scenarios include many compilers built using fewer components. Some of them can be monolithic programs without components. Some of these compilers can be considered valid examples of lightweight compiler design (which does not imply that the compiler itself is a simple structure). One-step compilers A one-step compiler is one which performs lexical, syntax and semantic analysis (nearly) at the same time, and generates the output code as each source code statement is processed. Of course not every language can be processed by such a compiler. In practice, only simple languages can be handled this way. For example, a language for which no statement raises any ambiguity is a valid candidate. Consider the following fragment of C++ code: element1 element2(element3);
pag 15 Andrea Tino - 2013 Without any knowledge regarding the 3 identifiers, can we tell whether this line represents a function declaration or a variable declaration? The answer is no, because we need more information regarding the 3 identifiers. A one-step compiler cannot process the C++ language because of ambiguities raised during each statement evaluation. Two-step compilers Another common architecture is the two-step compiler. Designing such compilers is quite simple because only two components are involved: a Front-End and a Back-End. The source language comes as input to the front-end, which translates it into an Intermediate Language (IL) which is, again, provided as input to the back-end. The back-end will finally translate the IL into the output language. Each block can have a different structure: they can be simple one-step compilers or more complex ones. Such a design has proved really successful in the last 20 years in order to create portable and machine-independent programming languages. The front-end generates the IL, which is strongly bound to the source code. The complexity of this block depends on the source language. In the case of simple languages, a one-step compiler can be considered, but for a programming language like Fortran or C++ this cannot be the case. The back-end, on the other hand, is strongly bound to the output language. The IL usually is a simple language (sometimes resembling Assembly code), so the language parsing can be performed in one step; all the complexity is moved onto the code generator in this case. Typically the back-end has a much higher complexity when compared to the front-end. Machine independence & retargeting Separating the source dependence from the target dependence is a very good approach, especially when considering maintenance: changes to the language will not affect the back-end. Retargeting is also possible. If a compiler is written targeting certain machines or technologies, it is possible to develop different back-ends without touching the front-end. Java and the JVM were among the first languages supporting this philosophy. Back-end reuse If a back-end is well developed, by keeping a strong binding to the target language and making it nearly fully independent from the source code, it is possible to use it to compile many languages on the same machine/technology. Back-end reuse is a key strategy for some languages targeting the same architectures and can save a lot of time when developing compilers. It is possible to have many different front-ends, all of them compiling to the same IL and using the same back-end. Multilanguage platforms By using back-ends and front-ends in different combinations, it is (#) Although the .NET Framework is a Microsoft technology targeting Windows systems, today many projects are trying to rebuild the framework in order to make it open-source and cross-platform, like Mono: http://www.mono-project.com.
pag 16 Andrea Tino - 2013 possible to create a framework of languages for many computer architectures. Today it would be possible to describe Microsoft as the company that invested the most in such an approach. The .NET Framework, in fact, is a collection of different languages, all of them compiled to MSIL (Microsoft’s .NET proprietary IL), which is, again, compiled/interpreted targeting all Windows systems and almost all architectures. No matter the language used in the framework (C#, VB.NET or F#), the CLR (the .NET virtual machine) makes it possible to generate the correct code for the specific architecture(#). Just a bit more on interpreters Although we introduced interpreters at the beginning of this section, now we have a little more knowledge on compilers, allowing us to understand some other details regarding this topic. Actually, when talking about interpreters, we recognize 2 different types: • Machine interpreters: They simulate the execution of a program compiled for a specific architecture. For example, the JVM is a program emulating a machine (but that actually runs on another machine). The Java bytecode is not an IR, but a language targeting the JVM. These interpreters must be running during all the time that the code needs to be executed; they act like fictitious machines. The .NET Framework and the CLR are another example. • Language interpreters: They simulate the effect of the execution of a program without compiling it to a specific instruction set. An IR is used (generally an AST). Javascript in web browsers is a good example. Advantages of using interpreters can be many.
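As a rough illustration of the front-end/back-end split and the back-end reuse just described, here is a small C++ sketch; every name in it (FrontEnd, BackEnd, ToyFrontEnd, ToyBackEnd) is invented for the example, and the "intermediate language" is simply a list of strings.

#include <iostream>
#include <string>
#include <vector>

// The shared intermediate language: here simply a list of instruction strings.
using IL = std::vector<std::string>;

// A front-end translates a source language into the IL.
struct FrontEnd {
    virtual ~FrontEnd() = default;
    virtual IL compile(const std::string& source) = 0;
};

// A back-end translates the IL into code for one target architecture.
struct BackEnd {
    virtual ~BackEnd() = default;
    virtual std::string generate(const IL& il) = 0;
};

// Toy front-end: one IL instruction per source line.
struct ToyFrontEnd : FrontEnd {
    IL compile(const std::string& source) override {
        IL il;
        std::string line;
        for (char c : source) {
            if (c == '\n') { il.push_back(line); line.clear(); }
            else line += c;
        }
        if (!line.empty()) il.push_back(line);
        return il;
    }
};

// Toy back-end: prefixes every IL instruction with a pseudo opcode.
struct ToyBackEnd : BackEnd {
    std::string generate(const IL& il) override {
        std::string out;
        for (const auto& instr : il) out += "EXEC " + instr + "\n";
        return out;
    }
};

int main() {
    ToyFrontEnd frontEnd; // other front-ends could target the same IL
    ToyBackEnd backEnd;   // other back-ends could target other architectures
    std::cout << backEnd.generate(frontEnd.compile("a = 1\nb = a + 2"));
}

Adding a new source language means writing only a new FrontEnd; supporting a new architecture means writing only a new BackEnd, which is exactly the reuse argument made above.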
Andrea Tino - 2013 Theoretical basics on languages I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted. “ “ Alan Turing http://www.brainyquote.com/quotes/quotes/a/ alanturing269232.html
pag 18 Andrea Tino - 2013 Basic definitions In this section we mainly focus on theoretical definitions regarding languages; they are important in order to formulate concise expressions to describe a language. Mathematical definitions for languages An alphabet Σ is defined as a non-empty set of symbols. A string or word ω over a certain alphabet Σ is defined as a finite concatenation (or juxtaposition) of symbols from that alphabet. Given a word ω over an alphabet Σ, |ω| will denote its length, that is the number of symbols (with repetitions) inside that string. ε will denote the empty or null string. Of course we have that |ε| = 0. The set of all strings over an alphabet Σ is denoted as Σ∗. We also define the set of fixed-length strings Σⁿ over an alphabet Σ: Σⁿ = { ω ∈ Σ∗ : |ω| = n } And we finally define the positive closure of the alphabet as: Σ⁺ = ⋃_{n≥1} Σⁿ So that it is possible to express the universal language Σ∗ on Σ as: Σ∗ = Σ⁺ ∪ {ε} Defining a language Now we are ready to define a language. A language L on an alphabet Σ is a subset L ⊆ Σ∗ of the universal language on that alphabet. Language equivalence Two languages are equal or equivalent only if they are defined on the same alphabet and they are equal as sets: ∀L1, L2: L1 ≡ L2 ⇔ L1 ⊆ Σ∗ ∧ L2 ⊆ Σ∗ ∧ L1 ⊆ L2 ∧ L2 ⊆ L1 Language properties The language cardinality |L| is the number of strings contained in the language. It is possible to have languages whose cardinality is not finite: |L| = ∞; for them it is not possible to list all words. Furthermore, consider that the empty language Φ = ∅ (the language having no words) is a proper language. Working with strings Strings and words of a language can be manipulated in order to produce other strings or words. Please note that the term word refers to a string as part of a language.
pag 19 Andrea Tino - 2013 Concatenation Two strings can be joined together to create a new one. x = a; y = bc; x∙y = xy = abc; Of course if the concatenation is performed with the null string, we get the same string. It is possible to say that the empty string is the identity element of the concatenation operator. s∙ε = ε∙s = sε = εs = s; Power The power of a string is simply the concatenation of that string with itself for the specified number of times. s^n = sss...s; For example: s = wqa; s^3 = wqawqawqa; When the exponent is zero, the convention is to have, as result, the empty string. s^0 = ε; Working with languages Like strings, it is possible to combine languages together. There is nothing transcendental in this if we consider that a language is nothing more than a mathematical set. Concatenation For languages, concatenation is the operation of concatenating each word of the first language with each word of the second. L1 ⋅ L2 = { ω1⋅ω2 : ω1 ∈ L1, ω2 ∈ L2 } For example: L1 = {a, b}; L2 = {c, d}; L1∙L2 = {ac, ad, bc, bd}; Given its definition, as for strings, concatenation is not commutative: L1 ⋅ L2 ≠ L2 ⋅ L1. Union As the term suggests, the union of languages is equivalent to the union of the two sets. L1 ∪ L2 = { ω : ω ∈ L1 ∨ ω ∈ L2 } For example: L1 = {a, b}; L2 = {c, d};
pag 20 Andrea Tino - 2013 L1 ∪ L2 = {a, b, c, d}; Power Taking advantage of the notion of concatenation, the n-th power of a language is defined as the concatenation between the (n-1)-th power of the language and the language itself. L^n = L^(n−1) ⋅ L We have that L^0 = Ε = {ε} (the empty string language). Furthermore, the power to zero of the empty language is, again, the empty string language: Φ^0 = ∅^0 = Ε = {ε}. Consider the example: L = {a, b}; L^3 = L^2 ∙ L; L^3 = (L ∙ L) ∙ L; L^3 = {aa, ab, ba, bb} ∙ L; L^3 = {aaa, aba, baa, bba, aab, abb, bab, bbb}; The empty string language The empty string language is denoted by Ε = {ε} and is the language containing the empty string only. Given its definition, it is also the identity element for the language concatenation operator: LΕ = ΕL = L. The empty language The empty language Φ = ∅ is not such an intuitive and simple object. It has some important properties that are to be detailed. First of all, the empty language is not the same as the empty string language: Φ = ∅ ≠ Ε = {ε}. By definition the empty language is the language containing no strings, an empty set. What happens when the null language is concatenated with another one? The empty language is the zero element of the language concatenation operator, so we have that: LΦ = ΦL = Φ. A very important equation relates the empty string language to the empty language: Φ∗ = ∅∗ = Ε = {ε} Again on concatenation Language concatenation is a powerful operator, as it makes it possible to define the universal language and the positive closure of a language as the union of its powers. L∗ = ⋃_{n=0}^{∞} Lⁿ = L^0 ∪ L^1 ∪ L^2 ∪ … ∪ Lⁿ ∪ … The universal language of a language is the language having as words all possible combinations of words from the original language. The positive closure represents the same, without the empty string language: L⁺ = ⋃_{n=1}^{∞} Lⁿ = L^1 ∪ L^2 ∪ … ∪ Lⁿ ∪ … = L∗ ∖ Ε
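Since the operations above are plain set manipulations, they are easy to try out in code. The following C++ sketch (assuming finite languages represented as sets of strings; the names Language, concat and power are invented for the example) computes the concatenation and the powers of a language.

#include <iostream>
#include <set>
#include <string>

using Language = std::set<std::string>;

// L1 . L2 = { w1.w2 : w1 in L1, w2 in L2 }
Language concat(const Language& l1, const Language& l2) {
    Language result;
    for (const auto& w1 : l1)
        for (const auto& w2 : l2)
            result.insert(w1 + w2);
    return result;
}

// L^n, with L^0 = { "" } (the empty string language)
Language power(const Language& l, unsigned n) {
    Language result = {""};
    for (unsigned i = 0; i < n; ++i) result = concat(result, l);
    return result;
}

int main() {
    Language L = {"a", "b"};
    for (const auto& w : power(L, 3)) std::cout << w << " ";
    // prints: aaa aab aba abb baa bab bba bbb
    std::cout << std::endl;
}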
pag 21 Andrea Tino - 2013 The same can be applied to an alphabet (as done before). If we think about it a bit more, an alphabet is nothing more than a language consisting of atomic strings (symbols). Kleene operators A set of strings is, by definition, a language. However, without worrying too much about all the formalism introduced so far, we can consider a simple set V of strings and define two operators on such sets. We consider the following axioms on V: V^0 = {ε}, V^1 = V Together with the following recursive definition: V^(i+1) = { v⋅w : v ∈ V^i ∧ w ∈ V }, ∀i > 0 Kleene star The star operator acts on the original set of strings in order to return a non-finite set of all possible strings that can be generated using the original ones. The null string is included in the final set. V∗ = ⋃_{n=0}^{∞} Vⁿ = V^0 ∪ V^1 ∪ V^2 ∪ … ∪ Vⁿ ∪ … Kleene plus The plus operator acts like the star operator with one difference only: the empty string is not contained in the final set. V⁺ = ⋃_{n=1}^{∞} Vⁿ = V^1 ∪ V^2 ∪ … ∪ Vⁿ ∪ … One important equation relates the star operator to the plus operator (when acting on the same set of strings): V∗ = V⁺ ∪ {ε} Introducing grammars A language is not just a collection of words. If this were the case, we would just define sets of keywords and use them with no further care. However, as we all well know, a language is not only made of some words, it is also made of rules explaining how to create sequences of words to generate valid phrases. That’s why, together with the language definition, a grammar must be
pag 22 Andrea Tino - 2013 considered. A grammar is a way to define all admissible phrases in the language. A grammar alone, however, is not enough; the formal specification of allowed phrases is useless without an algorithm to determine the structure of phrases. Generative grammars A generative grammar for a given language L is a set of rules generating all and only the allowed phrases of a language. It is used to define the syntax of a language. From a mathematical point of view, a grammar is defined as a 4-tuple G = (V, T, P, S) including: • A set of non-terminals: It is denoted by V ⊆ L and includes all non-terminal symbols. • A finite set of terminals: It is denoted by T ⊆ L and includes all terminal symbols. • A finite set of production rules: It is denoted by P and contains the transformation rules between sequences of symbols. • A start symbol: Also called axiom, it is denoted by S ∈ V and is one non-terminal symbol. In this context the term symbol refers to a string of the language. However, the previous definition can be considered also when having an alphabet instead of a language. The production rules set contains functions from sequences of symbols of the language to other sequences. However, such sequences do not include all possible combinations of symbols in the language; it would be more accurate to say that production rules contain sequences of terminal and non-terminal symbols of the language. That is: P ⊆ Λ × Λ, where Λ ≡ (V ∪ T)∗ As every symbol in the grammar is a terminal or a non-terminal. Also, in normal conditions, we have that the terminal and non-terminal sets share no elements: V ∩ T = ∅. A production rule appears in the form α → β and a finite number of them generate all valid phrases in a language. Consider the following language and a grammar with the specified sets: L = {S,A,B,a,b,ε}; V = {S,A,B}; T = {a,b,ε}; The start symbol is S and the production rules of the grammar are defined below: S -> ABS; S -> ε; BA -> AB; BS -> b; Bb -> bb; Ab -> ab;
pag 23 Andrea Tino - 2013 Aa -> aa; Such a grammar defines a language whose phrases are all in the form a{n}b{n}, where n is a generic natural non-zero number. Although the expression used before is a Regular Expression (something we’ll treat later), its meaning is quite intuitive. To understand what types of phrases a grammar can generate, one simply needs to start creating derivations of each rule. A derivation is any sequence of production rule applications. Starting from the start symbol, it is possible to transform a sequence of symbols into another until terminals are reached. When terminals are reached, it is possible to have sequences to which no more rules can be applied. In that case we get a phrase of the language. Now consider another grammar for the same language with the same sets but using the following production rules: S -> aSb; S -> ε; Grammar uniqueness If we start creating and expanding derivations, we will end up getting the same productions for the same language. So this grammar is equivalent to the one before. This is an important aspect related to grammars: several grammars can generate the same syntax for a given language. Backus-Naur forms The way production rules are formed follows Backus-Naur forms. BNFs are transformation rules between sequences of symbols expressed using the following syntax: <symbol> ::= __expression__; The left part of the rule (LHS) is a symbol, while the right part (RHS) consists of a sequence of other symbols. Using recursive applications of all rules, the algorithm must converge to some terminal sequences (sequences whose symbols do not match any rule’s LHS). The following is an example of BNFs for U.S. postal addresses: <postal-address> ::= <name-part> <street-address> <zip-part>; <name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL> | <personal-part> <name-part>; <personal-part> ::= <first-name> <initial> “.”; <street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>; <zip-part> ::= <town-name> “,“ <state-code> <zip-code> <EOL>; <opt-suffix-part> ::= “Sr.“ | “Jr.“ | <roman-numeral> | ““; <opt-apt-num> ::= <apt-num> | ““ To get some productions, it is necessary to specify a start symbol; in our case it is, obviously,
pag 24 Andrea Tino - 2013 <postal-address>. From there, every sequence containing a symbol that can be further expanded (because it appears as the LHS of a rule) will generate other sequences and, eventually, productions. A possible production for the above rules is obtained through the following derivation: 0. <postal-address>; 1. <name-part> <street-address> <zip-part>; 2. <personal-part> <last-name> <opt-suffix-part> <EOL> <house-num> <street-name> <opt-apt-num> <EOL> <town-name> “,“ <state-code> <zip-code> <EOL>; 3. <first-name> <initial> “.” <last-name> “Sr.“ <EOL> <house-num> <street-name> <apt-num> <EOL> <town-name> “,“ <state-code> <zip-code> <EOL>; The last sequence is a production as no more rules can be applied to any symbol, meaning that all symbols in the sequence are terminals. BNFs are a very efficient way to describe the production rules of a grammar; although production rules follow BNFs, the syntax is not exactly the same. The Chomsky-Schützenberger hierarchy Generative grammars define a language’s syntax. So basically, all properties and facts concerning the grammar of a language will reflect on the language itself and vice-versa. In order to group languages having the same aspects, Noam Chomsky(#) and Marcel-Paul Schützenberger(##) created a grammar hierarchy which, still today, is one of the most important means to understand the properties of a language and the techniques to build a compiler for it. The hierarchy has a containment structure and is composed of 4 grammar types, each one being assigned a number. Every grammar of type n is also a grammar of type n-1. Furthermore each grammar defines a type of language and each language type is associated with an abstract computational model used to create its parser. Type 0: Unrestricted grammars; recursively enumerable languages; automaton: Turing Machine (TM); production rules: α → β. (#) Considered today the living father of formal languages: https://en.wikipedia.org/wiki/ Noam_Chomsky. (##) Had a crucial role in the development of the theory of Computational Linguistics and formal languages: https://en.wikipedia.org/wiki/Marcel-Paul_Sch%C3%BCtzenberger.
pag 25 Andrea Tino - 2013 Type 1: Context-sensitive grammars; context-sensitive languages; automaton: Linear-Bounded Automaton (a linear-bounded non-deterministic Turing machine, LBA); production rules: αAβ → αγβ. Type 2: Context-free grammars; context-free languages; automaton: Non-deterministic Push-Down Automaton (PDA); production rules: A → γ. Type 3: Regular grammars; regular languages; automaton: Finite State Automaton (FSA); production rules: A → a ∨ A → aB. When handling generic production rules, there is a common agreement concerning symbols. • Roman lowercase letters denote terminals. • Roman uppercase letters denote non-terminals. • Greek (lowercase) letters denote strings of terminals and non-terminals (can be empty). • The special Greek (lowercase) letter γ usually denotes a non-empty string of terminals and non-terminals. The reason why we call it a hierarchy derives from the fact that a type extends other types. So a regular language also has the properties of a context-free language. But the most important aspect concerning the hierarchy is the possibility to have a systematic approach to create parsers for languages. Once the language type is recognized, by identifying its generative grammar type, the abstract computational model to create the parser is the one specified by the hierarchy. Decision problems and language decidability In Computational Theory, a decision problem is a class of formal systems to ask questions to. A question can be answered with two possible values only: yes or no. The system is made of an algorithm taking as input the question and returning the answer. Solving a decision problem means finding an algorithm which is always able to provide the answer to the given question. Decidability is a property of those computational problems whose resolution can be reduced to solving a decision problem. If the corresponding decision problem is solvable, then the problem is decidable! An example - the primality test: The problem of recognizing prime numbers is a decision problem. In fact we can build a formal system to answer the question “Is this number prime?“ with a binary value. Is the primality test decidable? Yes it is, since many algorithms were found to answer such a question, for example the Pocklington-Lehmer algorithm(#). (#) The test relies on the Pocklington theorem: http://en.wikipedia.org/wiki/Pocklington_primality_test.
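Just to make the notion of a decision procedure concrete, here is a sketch of the simplest possible answer to the primality question, plain trial division (not the Pocklington-Lehmer test mentioned above): for every input it terminates with a yes/no answer, which is exactly what decidability requires.

#include <iostream>

// Decision procedure for the question "is n prime?": it always answers yes or no.
bool isPrime(unsigned long long n) {
    if (n < 2) return false;
    for (unsigned long long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false; // a divisor was found: the answer is "no"
    return true;                      // no divisor found: the answer is "yes"
}

int main() {
    for (unsigned n : {1u, 2u, 15u, 17u, 97u})
        std::cout << n << (isPrime(n) ? " is prime" : " is not prime") << std::endl;
}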
pag 26 Andrea Tino - 2013 Decidability for languages: The problem of designing and implementing a parser for the syntax of a language can be mapped onto a decision problem. The question is: “Can a parser be developed for the given language?“. The answer is yes in almost all cases; it depends on the type of language, which depends on the grammar generating that language. Basically all grammars of type 3, 2 and 1 generate decidable languages. When it comes to languages generated by type-0 grammars, the problem is building a Turing Machine to solve the problem. However, building a generic Turing Machine with no restriction at all is an impossible task, so for such grammars we need to evaluate case by case. If the grammar is too complicated and its recursion rules cannot be normalized somehow, the language might even be undecidable.
Andrea Tino - 2013 Type-3 grammars and regular languages I read one or two other books which gave me a background in mathematics other than logic. “ “ Stephen C. Kleene http://www.brainyquote.com/quotes/quotes/s/ stephencol346123.html
pag 28 Andrea Tino - 2013 What is a regular language? A regular language is a language that can be generated by type-3 grammars. Recalling the definition of type-3 grammars, we have that a grammar whose production rules are all in one of the following forms: A → aB, A → a or B → Ab, B → b is a regular grammar. Linear grammars A linear grammar is a particular type of context-free grammar whose production rules have at most one non-terminal in the RHS. Linear grammars are very important for regular grammars because the latter are special cases of the former. In particular, the set of all regular grammars can be separated into two complementary subsets: • Left-linear grammars: They are regular linear grammars whose production rules are in the form B → Ab, where the non-terminal in the RHS can be empty. • Right-linear grammars: They are regular linear grammars whose production rules are in the form A → aB, where the non-terminal in the RHS can be empty. Such grammars are very restrictive and cannot be used for the syntax of powerful programming languages. However, parsers for these grammars are very fast and efficient. Properties of regular languages Regular languages L on a certain alphabet Σ have the following properties: • The empty language ∅ is a regular language. • A singleton language (a language with cardinality 1) {s} generated by any symbol of the alphabet s ∈ Σ is a regular language. • Given two regular languages L1, L2, their union L1 ∪ L2 and their concatenation L1 ⋅ L2 are still regular languages. • Given a regular language L, the Kleene star applied to it, L∗, is still a regular language. • No other languages over the same alphabet other than those introduced in the previous points are regular. Regular languages have many interesting properties other than those listed so far. Regular languages and finite languages
pag 29 Andrea Tino - 2013 A very important theorem relates finite languages with regular languages. [Theo] Language finiteness theorem: A finite language is also a regular language. Please note that the converse is not true: if a language is regular, it is not necessarily finite. For this reason some algorithms are used to check whether a language is regular or not. Deciding whether a language is regular The problem of telling whether a language is regular or not is a decision problem. Such a problem is decidable and some approaches can be used to check whether or not a language is regular. One of the most common is the Myhill-Nerode theorem. A parser for regular languages Regular languages can be parsed, according to the Chomsky-Schützenberger hierarchy, by FSAs. In fact the decision problem for regular languages can be solved by FSAs. We can have two types of FSAs: • Deterministic: DFA or DFSA. They are fast but can be less compact. • Non-deterministic: NFA or NFSA. They are not as fast, but more compact in most cases. Both of them can be used to recognize phrases of a regular language. When developing lexers for a generic language, DFAs are used. Finite State Automata (FSA) They are also called Finite State Machines (FSM). Since a regular language can be parsed by an FSA, we need to study these abstract computational models. A generic FSA (deterministic or non-deterministic) can be defined as a 5-tuple A = (Σ, S, δ, s₀, F) consisting of: • A finite set of symbols denoted by Σ = {a₁, a₂, …, aₙ} and called the Alphabet. • A finite set of states denoted by S = {s₀, s₁, s₂, …, sₘ}. • A transition function denoted by δ : S × Σ → 2^S, responsible for selecting the next active state of the automaton. The function can be described as a states/symbols table whose entries are subsets of the state set. • A special state denoted by s₀ ∈ S and called the initial state. • A subset of states denoted by F ⊆ S whose members are called final states.
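The definition above maps almost directly to code. The following C++ sketch simulates a generic FSA with δ : S × Σ → 2^S by keeping track of the set of currently active states; the string is accepted if, after consuming all symbols, at least one active state is final. The automaton used in main is a toy one invented for this example (it accepts strings over {a, b} ending with "ab"), not one taken from the text.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>

using State = int;
using StateSet = std::set<State>;

// Transition function delta: (state, symbol) -> set of next states.
using Delta = std::map<std::pair<State, char>, StateSet>;

bool accepts(const Delta& delta, State start, const StateSet& finals, const std::string& word) {
    StateSet current = {start};
    for (char symbol : word) {
        StateSet next;
        for (State s : current) {
            auto it = delta.find({s, symbol});
            if (it != delta.end()) next.insert(it->second.begin(), it->second.end());
        }
        current = std::move(next);
        if (current.empty()) return false; // no active states left: the string is rejected
    }
    for (State s : current)
        if (finals.count(s)) return true;  // at least one run ends in a final state
    return false;
}

int main() {
    // Toy automaton over {a, b} accepting all strings that end with "ab".
    Delta delta = {
        {{0, 'a'}, {0, 1}}, {{0, 'b'}, {0}},
        {{1, 'b'}, {2}},
    };
    StateSet finals = {2};
    std::cout << accepts(delta, 0, finals, "aab") << std::endl; // 1 (accepted)
    std::cout << accepts(delta, 0, finals, "aba") << std::endl; // 0 (rejected)
}

The same function works both for DFAs and NFAs: for a DFA the set of active states simply never contains more than one element.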
pag 30 Andrea Tino - 2013 An FSA is used to read strings of a language. If a string is provided to the automaton and it is able to read it, we say that the automaton is able to recognize the string and it accepts such a sequence of symbols; otherwise the string is rejected. Using more formalism, we say that an FSA A = (Σ, S, δ, s₀, F) accepts a string ω = x₁x₂…xₛ ∈ Σ∗, where xᵢ ∈ Σ ∀i = 1…s, if there exists a finite sequence of states R = {r₀, r₁, r₂, …, rₛ} ∈ 2^S, called a machine run, so that the following conditions hold: 1. The first state of the run corresponds to the initial state of the automaton: r₀ ≡ s₀. 2. Each state in the run is the result of a valid transition: rᵢ₊₁ ∈ δ(rᵢ, xᵢ₊₁), ∀i = 0…s−1. 3. The run ends with a final state: rₛ ∈ F. Otherwise the string is rejected and not recognized by the automaton. Formalism aside, an FSM is a very simple structure, especially when it is evaluated graphically. Starting from the initial state, one only needs to make transitions in order to reach a final state and get a word of the language. Dummy FSA Consider the FSA whose transition table is reported below (rows are states, columns are symbols): s0: a → {s1}, b → {}, c → {}; s1: a → {}, b → {s2}, c → {}; s2: a → {}, b → {}, c → {s3}; s3*: a → {s1}, b → {}, c → {s3}. By using regular expressions (a more concise and powerful way to define a regular language's syntax instead of grammars), it recognizes all phrases in the form (abc+)+. The table describes transitions from each state (rows) to others based on symbols (columns). States decorated with an asterisk are final states. Deterministic and non-deterministic FSAs All definitions so far fit both deterministic and non-deterministic FSAs. Now it is time to separate them. Basically a DFA is an FSA in which all outgoing transitions, for each state, are labeled with a different symbol. Conversely, an NFA is an FSA in which it is possible to find, for some or all states, two or more outgoing transitions labeled with the same symbol. Using formalism, we have that the transition function δ can return more than one state, meaning that the following holds:
pag 31 Andrea Tino - 2013 ∃i = 1…m, ∃x ∈ Σ : |δ(sᵢ, x)| > 1 The example before shows a DFA; now consider this other example showing an NFA, whose transition table is the following: s0: 0 → {s1}, 1 → {}; s1: 0 → {}, 1 → {s1,s2}; s2: 0 → {s3}, 1 → {s3}; s3*: 0 → {}, 1 → {s3}. This automaton recognizes binary strings in the form 01+(0|1)1*. It is evident that some non-determinism can be found here, because some cells in the table cause the automaton to reach different states with the same symbol. In the example, state s1 can reach states s1 or s2 with the same symbol 1. Finally, be careful: the non-determinism of this automaton does not reside in the transition from s2 to s3; in that case different symbols make the automaton pass to the same state, and the row in the transition table referring to s2 has one state (the same) for each symbol. For a DFA, on the contrary, the following equation holds: |δ(s, x)| = 1, ∀s ∈ S, ∀x ∈ Σ In fact, in the transition table of a DFA, each cell is filled with one state only. This is also the reason why NFAs tend to be more compact and more complex than DFAs. Comments in C/C++ We now consider a real case scenario. How do we recognize C-like comments? A DFA with the following transition table is able to do so: s0: / → {s1}, EOL → {}, ^EOL → {}; s1: / → {s2}, EOL → {}, ^EOL → {}; s2: / → {s2}, EOL → {s3}, ^EOL → {s2}; s3*: / → {}, EOL → {}, ^EOL → {}. This DFA can recognize all comments written in C, C++, Javascript and other C-like syntax
pag 32 Andrea Tino - 2013 languages. Recognized phrases are in the form //(^(EOL))*(EOL). Handling ε-transitions: NFA-ε Some NFAs can also have spontaneous transitions. These types of transitions are special and, when supported, require some of the previous definitions to be revised. An ε-transition is defined as a transition which does not consume any symbol (it actually consumes the empty string symbol). The transition function is redefined in this way: δ : S × (Σ ∪ {ε}) → 2^S. We now define the ε-closure function ψ : S → 2^S as the function returning all states reachable from a given one (which is included in the closure) through ε-transitions. More formally, let p and q be two states; we have: p ∈ ψ(q) ⇔ ∃ q₁, q₂, …, qₖ ∈ S : qᵢ₊₁ ∈ δ(qᵢ, ε) ∀i = 1…k−1 ∧ q₁ = q ∧ qₖ = p Namely, there exists a path through ε-transitions connecting state q to state p. The way an NFA-ε works is as follows. We say that an NFA-ε A = (Σ, S, δ, s₀, F) accepts a string ω = x₁x₂…xₛ ∈ Σ∗, where xᵢ ∈ Σ ∀i = 1…s, if there exists a finite sequence of states R = {r₀, r₁, r₂, …, rₛ} ∈ 2^S so that the following conditions hold: 1. The first state of the run corresponds to one of the states in the ε-closure of the automaton's initial state: r₀ ∈ ψ(s₀). That is, the automaton starts with any state which can be reached from the initial state via ε-transitions. 2. Each state in the run is the result of an ordinary transition: rᵢ₊₁ ∈ δ(rᵢ, xᵢ₊₁), ∀i = 0…s−1. 3. A state can be reached through ε-transitions as well: t ∈ δ(rᵢ, xᵢ₊₁) ⇒ rᵢ₊₁ ∈ ψ(t), ∀i = 0…s−1. That is, after reading symbol xᵢ₊₁, the machine experiences a transition from state rᵢ to state t (a possible temporary state); later, the machine is (possibly) led to state rᵢ₊₁ by ε-transitions. Remember that an ε-transition occurs on the empty string, so it cannot be enumerated in the list of symbols of the input string. 4. The run ends with a final state: rₛ ∈ F. Otherwise the string is rejected and not recognized by the automaton. Non-deterministic automata When an FSA allows the possibility to have ε-transitions, that automaton becomes an NFA. It is not possible to have DFAs supporting ε-transitions. Why? Consider a first example (Diag. 1 - Not correct; Diag. 2 - Correct). The first diagram shows a DFA with ε-transitions. So DFAs can actually have ε-transitions? The answer is yes, but if we look closer at that diagram, we can understand that the ε-transition connecting the two
pag 33 Andrea Tino - 2013 states on the right is quite useless, as it can be safely removed (together with its destination state) to get the second diagram. Although we say that the first diagram is not correct, we actually mean that it is redundant. Let us consider another example (Diag. 1 - Not correct; Diag. 2 - Correct). In this case the ε-transitions have a reason to exist, as the first diagram needs them to create an option and then specify a fixed symbol. However, the second diagram shows how to remove the ε-transitions and create a single transition mapping the option between the same symbols as before. Once again, ε-transitions were not strictly necessary. The final conclusion is that DFAs have no reason to use ε-transitions; if they do, then there is a way to remove them (and probably obtain a more compact DFA). When ε-transitions are used, they are used to create more connections from one state to another. So we create non-determinism in the diagram, which is actually the reason why NFAs use ε-transitions! Regular expressions Before getting to the main topic, how to develop parsers using FSAs, we introduce another important subject. Regular expressions, also called regexp or regex, are a more compact and concise way to define a regular language without using regular grammars. When dealing with regular languages, regular expressions can be a very powerful tool. In fact, even before, we have reported next to each FSA the corresponding regexp. A regular expression consists of a sequence of symbols for a given alphabet. [Diagram: regular grammars and regular expressions are used to define the syntax of regular languages; they are translated into finite state automata, which are used to recognize allowed phrases.]
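As a quick taste of how a regex engine is used from code (the individual meta-symbols are described right below), here is a small sketch using C++'s std::regex; the pattern and the sample text are chosen arbitrarily for the example, and the pattern simply matches runs of decimal digits.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Fusce 5 vitae fermentum enim. 12 Sed vitae 456 libero";
    std::regex pattern("[0-9]+"); // one or more digits

    // Iterate over all non-overlapping matches found in the text.
    auto begin = std::sregex_iterator(text.begin(), text.end(), pattern);
    auto end = std::sregex_iterator();
    for (auto it = begin; it != end; ++it)
        std::cout << it->str() << " "; // prints: 5 12 456
    std::cout << std::endl;
}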
pag 34 Andrea Tino - 2013 Each symbol in a regexp can by a meta-symbol, thus carrying a special meaning, or a normal symbol with its literal meaning. Although regular expressions can have some variations depending on the system in use, a common specification will be described here. The list of meta-symbols is reported below: R = {., *, +, , {, }, [, ], (, ), ?, ^, $, |}; When we want a regular expression to match a meta-symbol literally, the meta-symbol gets escaped by a special meta-symbol: . What are they used for? Regexp are used to define all words of a language and/or the syntax of a language. Because of this, regexp can be mapped onto FSAs and vice-versa. Furthermore, because regular grammars define a regular language’s syntax, they can be converted into regular expressions and vice-versa. Everything can be summarized by the following property of regexp: regular expressions have the same expressive power as regular gramars. What is a regex? A regexp is a string of characters (of a certain alphabet) and meta-characters that can match a finite or not-finite set of strings on the same alphabet. Character classes Meta-characters in a regular expression are listed in the following table. Each one of them carries a special meaning. Literals are the opposite, they match themselves and carry no special meaning. Meta-symbol(s) Example Descrirption Literals a,b,... hello Every alphabet symbol that is not a meta-symbol is a literal; thus it matches itself. Ranges [ ] [abc] It matches only one single character among those inside the brackets. [^ ] [^abc] It matches one single character that is not listed inside the brackets. - [a-z]; [a-zA-Z] When used inside square brackets, this matches a known range of alphabetical characters and numbers. Classes . f.fox The dot matches all non line-break characters. Anchors ^ ^Fire The caret matches the position before the first character in the string. $ fox$ The dollar matches the position after the last character in the string.
pag 35 Andrea Tino - 2013 Meta-symbol(s) Example Descrirption Quantifiers + a+; abc+ The Kleene plus will match the preceeding symbol one or more times. * a*; abc* The Kleene star will match the preceeding symbol zero or more times. ? a?; abc? The question mark will match the preceeding symbol zero or one times. {n} a{3}; ab{2} The preceeding symbol is matched exactly n times. {n,m} a{1,4} The precedding symbol is matched between n to m times. {n,} a{3,} The preceeding symbol is matched n times or more. Others () (abc)+ Every sequence of symbols inside is considered as a single symbol. | (a|b) Exclusive or selection. There are many regex engines in the world and many of them come embedded to larger solutions as well. Technology Programming languages Namespace Microsoft .NET Framework C#, VB.NET, F# System.Text.Regex Oracle Java Java java.util.regex Boost libraries C++ boost::regex Perl Perl Part of the language syntax Javascript Javascript RegExp As it is possible to see, some regex utilities are part of a language syntax, this is true for Unix systems in the case of Perl and Bash scripting languages. Examples of regex When using regexes two strings are necessary: a pattern and a source text. A pattern is regular expression used to match strings inside the source text. For example, let us consider the following text we are going to use in our examples here:
pag 36 Andrea Tino - 2013 text = “Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce 5 vitae fermentum enim. 12 Sed vitae consectetur libero, ac3 hendrerit augue. Cras13 auctor 456 lectus eu lacus455 fringilla, a4 auctor leo 23 euismod.“; Now let us consider the following patterns and let us also consider matched elements. Pattern (Some) Matched text “(L|l).l” “Lorem ipsum dol”, “lor sit amet, consectetur adipiscing el“ “[0-9]+“ “5“, “12“, “456“, “455“, “23” “[a-zA-Z]+“ “Lorem“, “ipsum“, “dolor“, “sit“, “amet“, “consectetur“ “[a-zA-Z]+[0-9]“ “ac3“, “Cras1“, “lacus4“, “a4“ “[0-9][a-zA- Zs]+[0-9]“ “5 vitae fermentum enim. 1“, “3 auctor 4“ These examples show very simple patterns, but it is possible to generate very complex matches. Also, due to the large numbers of regex engines available today, not all patterns always return the same matches. Reported examples should cover the most common functionalities and behave mostly the same for all engines. Regular definitions Regex patterns can become very wide and extremely messy structures as they try to match more complex content. Also it is common to encounter lengthy patterns with identical portions inside them like the following for example: [a-zA-Z]+(a|b|c)[a-zA-Z]*[0-9]?[a-zA-Z]+[0-9]. We find the segment [a-zA-Z] 3 times in the expression and the segment [0-9] 2 times. Considering that they match strings that are common to match (sequences of alphabetical or numeric characters), it would probably be reasonable to use aliases to create references. So let us consider the following regular definitions: char -> [a-zA-Z]; digit -> [0-9]; We can rewrite the previous pattern in a more concise way as: char+(a|b|c) char*digit?char+digit. This also enables reuse, as we can take advantage of the same set of definitions for more patterns. Basic properties Regular expressions, no matter the engine, have important properties. Some of them can be used to
pag 37 Andrea Tino - 2013 avoid long patterns or to simplify existing ones. • Operator | is commutative: a|b = b|a. • Operator | is associative: (a|b)|c = a|(b|c). • Concatenation is associative: (ab)c = a(bc). • Concatenation is distributive over operator |: a(b|c) = (ab)|(ac), and: (a|b)c = (ac)|(bc). • The empty string ε is concatenation’s identity element: aε = εa = a. • The following equation holds for the Kleene star: a* = (a|ε)*. • The Kleene star is idempotent: a** = a*. From regular expressions to FSAs Regular languages are defined through regular grammars; however regular languages’ grammar can also be defined through regular expressions in a more concise and simpler way. In the Chomsky-Schützenberger hierarchy, type-3 grammars can be parsed by FSAs; now we are going to see how to create a FSA to develop the parser of a regular language. In particular the transformation process will always start from regular expressions. Thompson’s rules There are many methodologies to construct a FSA from a regular expression. [Theo] Regex2FSA: A FSA can always be constructed from a regular expression to create the corresponding language’s parser. The synthesized FSA is a NFA whose states have one or two successors only and a single final state. Among these, Thompson’s rules are a valid option. [Theo] Thompson’s Construction Algorithm (TCA): A NFA can be generated from a regular expression by using Thompson’s rules. TCA is methodology to build a FSA from a regex but not the only one. Some advantages can be considered but this methodology is not the best. • The NFA can be built following very simple rules. • In certain conditions TCA is not the most efficient algorithm to generate the cirresponding
pag 38 Andrea Tino - 2013 NFA given a regex. • TCA guarantees that the final NFA has one single initial state and one single final state. TCA’s rules are listed following expressions to be translated. The initial regex is split into different subregexes; the basic rule is considering that a whatever regex maps onto a certain NFA. A generic expression A generic regex s is seen as a NFA N(s) with its own initial state and its own final state. Since TCA generates NFAs having only one initial state and only one final state, the previous assumption is valid. Empty string The empty string expression ε is converted into a NFA where the empty string rules the transition between a state and the final one. Symbol of the input alphabet A symbol a of the input alphabet is converted in a similar way for empty string expressions. The symbol rules the transition between a state and the final one. Union expressions A union expression s|t between two expressions s and t, is converted into the following NFA N(s|t). Epsilon transitions rule the optional paths to N(s) and to N(t). Please note that the union expression is also valid for square bracket expressions [st]. Concatenation expressions When concatenating two expressions s and t into expression st, the resulting NFA N(st) is constructed by simply connecting the final state of the first expression to the initial state of the second one. The final state of the first expression and the initial state of the second one are merged into a single crossing state. s_i s_f N(s) q f ɛ q f a s_i s_f N(s) q f ɛ t_i t_f N(t) ɛ ɛ ɛ
pag 39 Andrea Tino - 2013 s_i s_f N(s) s_f t_i N(t) t_f Kleene star expressions When an expression s is decorated with a Kleene star s*, the resulting NFA N(s*) is constructed by adding 4 epsilon transitions. s_i s_f N(s) q f ɛ ɛ ɛ ɛ The final NFA will have a messy shape but it will work as desired. Parenthesized expressions Expressions appearing in parenthesis (s) are converted into the inner expression NFA. So we have that N((s))=N(s). Converting a NFA into a DFA As DFAs are faster than NFAs, very often it can be reasonable to convert a NFA into a DFA to get a more efficient parser for regular languages. This process can sometimes be a little tricky when the NFA being trnsformed has a complex structure. Closures Before diving into the algorithm, the notion of state closure is necessary. [Def] Closure: Given a FSA and one of its states s ∈S , the closure of that state is represented by the set of all states in the automaton that can be reached from it including the original state. The notion of generic closure can be useful, but not as useful as the a-closure’s one. [Def] a-Closure: Given a FSA, a symbol a ∈Σ and one of its states s ∈S , the closure of that state over symbol a is represented by the set of all states in the automaton that can be reached from it through transitions labelled with symbol a including the original state.
pag 40 Andrea Tino - 2013 A particular type of a-closure is necessary for pur purposes. [Def] ε-Closure: Given a FSA and one of its states s ∈S , the ε-closure of that state is represented by the set of all states in the automaton that can be reached from it through ε-transitions including the original state. How to get the ε-closure When having a FSA and willing to get the ε-closure of a state, it is possible to follow a simple procedure. 1. The closure set is initially represented by the original state. 2. Each element in the closure set is considered and all transitions from that state are evaluated. If a ε-transition is found, the destination state is added to the set. 3. The previous point is performed everytime until a complete scan of all states in the set causes the closure not to grow. In pseudo-code the algorithm is very simple to define. procedure get_closure(i,δ) set S <- {i}; for ∀s ∊ S do for ∀x ∊ δ(s,ε) do S <- S ∪ {x}; end end The algorithm can be really time consuming considering its complexity is O n2 ( ). Conversion The conversion from a NFA to a DFA is performed in a 2-step process: 1. All ε-transitions are removed. 2. All non-deterministic transitions are removed. Removing ε-transitions Given the original NFA N = Σ,S,δ,s0,F( ), we are going to build a new one ′N = Σ,S, ′δ ,s0, ′F( ) by safely removing all ε-transitions. The final states subset is: ′F = F ∪ s0{ } Ε s0[ ]∩ F ≠ ∅ ′F = F otherwise ⎧ ⎨ ⎪ ⎩⎪ While the transition function is modified so that ε-transitions are removed in a safe way. In particular, for any given symbol s ∈S , we have that all transitions ruled by non epsilon symbols are kept: δ s,a( )= ′δ s,a( ),∀a ∈Σ :a ≠ , while ε-transitions are replaced by more transitions; in particular the following equation holds.
pag 41 Andrea Tino - 2013 ′δ s,a( )= δ x,a( ) x∈Ε s[ ]  ,∀s ∈S,∀a ∈Σ The trick relies in adding more transitions to replace empty ones. When considering a state and a symbol, new transitions to destination states (labelled with such symbol) are added whenever ε-transitions allow the passage from the original state to destination ones. The result is allowing a direct transition to destination states previously reached by empty connections. Example of ε-transitions removal Let us consider the NFA on the side accepting strings in the form: a*b*c*. The second NFA is the one obtained by removing ε-transitions. To understand how such a result is obtained, let us consider each step of the process. First we consider the final states subset. In the first diagram we have only one final state; however when considering the initial state, we have that its ε-closure contains all other states as well; that’s why we need to make all states as final. We can now consider states and symbols. For example, let us consider couple (s1,b) in the first diagram, the ε-closure of such state is: Ε s1[ ]= s1,s2,s3{ }, by following the definition provided before, we can remove the ε-transition to state s2 and create one transition to s2 labelled with symbol b. Now we consider couple (s2,c) in the first diagram, the ε-closure of such state is: Ε s2[ ]= s2,s3{ }, as before, we can remove the ε-transition to state s3 and create one transition to s3 labelled with symbol c. To understand how transition from s1 to s3 is generated, we need to consider couple (s1,c) and following the same process. The process must be performed for all couples state/symbol; when considering a couple some more states are to be considered. Because of this the process has a complexity equal to: O Σ ⋅ S 2 ( ), which makes the approach lengthy for complex NFAs. NFA integrity Please note that, at the end of the ε-transitions removal process, the resulting NFA is one accepting the same language of the original automaton. The subset construction algorithm After removing ε-transitions, we can proceed removing non-deterministic transitions too from the NFA. We consider a NFA without ε-transitions N = Σ,S,δ,s0,F( ) and transform it into a DFA D = Σ,S D( ) ,δ D( ) ,s0 D( ) ,F D( ) ( ) where: s1 s2 s3 ε ε ca b s3 b c c c a b s2s1 Diag. 2 - Empty transitions removed Diag. 1 - Original NFA
pag 42 Andrea Tino - 2013 • New states identifies groups of the previous ones: S D( ) ⊆ 2S . • The final states subset contains those new states containing at least one final state in the original final states subset: F D( ) = s ∈S D( ) :s ∩ F ≠ ∅{ }. • The new initial state is that set containing the original initial state: s0 D( ) = s0{ }. The new transition function can be obtained by means of the following equation: δ D( ) s,a( )= δ x,a( ) x∈s  ,∀s ∈S D( ) ,∀a ∈Σ The definition implies the evaluation of all subset extractable from the initial state set, which is the reason why this algorithm has an exponential complexity: O 2S ( ). However this is long and time consuming; when performing the process manually, it is better to follow a more sistematic methodology. 1. The states set is initialized with a set containing the initial state only: S D( ) = s0 D( ) = s0{ }{ }. 2. The state transition function δ D( ) is initialized as an empty table with no associations. 3. For each not-marked state s ∈S D( ) , and for each symbol a ∈Σ , we add to S D( ) the set δ x,a( ) ∀x∈s  . At the same time, add the following association s,a( ) δ x,a( ) ∀x∈s  to δ D( ) . Mark the state. 4. Repeat the same process described in the previous point until all states are marked. Using pseudo-code, the transformation process can be modeled as follows: procedure nfa2dfa(δ) set S <- {{i}}, U <- {}, ψ <- {}; for ∀s ∊ S, ∀a ∊ Σ do set R <- {}; for ∀x ∊ s do R <- R ∪ δ(x,a); end S <- S ∪ {R}; ψ <- ψ ∪ {(s,a,R)}; U <- U ∪ {s}; end end Example of NFA2DFA transformation Let us consider the NFA shown below. Following the procedure, we are going to transform it into a DFA accepting the same language. 1. We start by initializing the state set by adding the initial state set to it S D( ) = d0 = s0{ }{ }. Also we make this a final state as it contains a final state in the original NFA. 2. Looking in the state set, we find d0 = s0{ } which is unmarked. The only one state inside the set is s_0, so we consider all symbols. We have δ s0,a( )= d1 = s1,s2{ } and δ s0,b( )= d2 = ∅ . At the same time we add transitions labelled a and b from d_0 to d_1 and from d_0 to d_2.
pag 43 Andrea Tino - 2013 We can now mark d_0 only. 3. State set is S D( ) = d0,d1,d2{ }. We consider state d1 = s1,s2{ }. Again, we apply the same logic and get for the two elements in the state: A new state state is added: d_3. Also new transitions are created as well: d_1=(a)=>d_1, d_1=(b)=>d_3. We can now mark state d_1. 4. The state set is S D( ) = d0,d1,d2,d3{ }. We now proceed on state d2 = ∅. We apply the same logic, however we have no elements in the state, so we cannot create new states in the DFA. However this is not the same for transitions; in fact the union returns the empty set when considering symbols a and b, so the following transitions are created: d_2=(a)=>d_2 and d_2=(b)=>d_2. We can mark state d_2. 5. The state set remains the same as the previous iteration didn’t add new ones. We can now focus on the last state d3 = s1,s3{ }. We follow the same process: New transitions are created and no new states added: d_3=(a)=>d_1 and d_3=(b)=>d_2. Diag. 1 - Original NFA Diag. 2 - Resulting DFA s2 a s0 a a a a b ab s3 s1 d2 a d0 b a a b d3 d1 b a|b δ s1,a( )= s1,s2{ } δ s2,a( )= ∅ δ s1,a( )∪δ s2,a( )= d1 ⎧ ⎨ ⎪ ⎩ ⎪ ∧ δ s1,b( )= ∅ δ s2,b( )= s1,s3{ } δ s1,b( )∪δ s2,b( )= d3 = s1,s3{ } ⎧ ⎨ ⎪ ⎩ ⎪ δ s1,a( )= s1,s2{ } δ s3,a( )= s1,s2{ } δ s1,a( )∪δ s3,a( )= d1 ⎧ ⎨ ⎪ ⎩ ⎪ ∧ δ s1,b( )= ∅ δ s3,b( )= ∅ δ s1,b( )∪δ s3,b( )= d2 ⎧ ⎨ ⎪ ⎩ ⎪
pag 44 Andrea Tino - 2013 Note that the process at every iteration focuses on a state in the DFA. When focusing on a state, new states are created and outgoing transitions might be created from the state in exam. Direct conversion from a regular expression to a DFA To build a function to directly convert a regex into a DFA, we can first remove empty transitions and then applying the subset construction algorithm. The overall complexity of this operation is O 2S ( ). But instead of a real direct approach, this is a 2-step solution. However a straightforward conversion from a regex to a DFA is possible thanks to another algorithm consisting of 3 steps. 1. AST creation: The AST of an augmented version of the regex is created. 2. AST functions generation: Functions operating on nodes of the AST are synthesized. 3. Conversion: From the AST, by means of the functions as well, the final DFA is generated. Creating the AST Given a regex, we augment it with a special symbol (not part of the regex alphabet) at the end of the same. Once we get the augmented regex, we can draw its AST where intermediate nodes are operators and leaves are alphabet characters or the empty string or the augmentation symbol. The augmenting symbol, here, will be character #, and it is to be applied as the last operator in the original regex pattern. For example, given regex ab*(c|d)+, the augmented regex will be: (ab*(c|d)+)#. The augmentation symbol acts like a normal regex character class symbol and must be concatenated to the whole original regex; that’s why sometimes it is necessary to use parentheses. However parentheses can be avoided in some cases like pattern: [a-zA-Z0-1]*, which simply becomes pattern: [a-zA-Z0-1]*#. One key feature of the tree is assigning numbers to leaves by decorating them with position markers. With this, every leaf of the tree will be assigned an index i ∊ N. The way numbers are assigned to leaves is the one defined by the pre-order depth- traverse algorithm; with the difference that intermediate nodes can be removed by the final list. A corollary of this approach is that the augmentation symbol will always be assigned with the last index. Example of creating an AST out of an augmented regex We consider regex (a|b)*abb. The first step is augmentation, so we get regex (a|b)*abb#. Following operators priority (from lowest to highest priority: |, • and *), we use postfix notation and build the AST. The final step is assigning indices to leaves, the process is simple as we only need to order nodes following the pre-order algorithm to get sequence {a,|,b,*,•,a,•,b,•,b•,#}. From the list we remove AST from augmented regex * • | a b #• • b• * a21 3 5 4 6
pag 45 Andrea Tino - 2013 all intermediate nodes: {a,b,a,b,b,#}, so that we can finally assign numbers from left to right as shown in the diagram. Defining AST node functions After building the AST for the augmented regex, some functions are to be defined. Actually it is a matter of defining one function called follopos(i∊N). But to define such a function, 3 more functions are needed. [Def] Function followpos: Given an AST from an augmented regex, function followpos(N):2^N returns the set of all indices following the input index inside the AST. To compute this function, we also define more functions whose behavior is described by the table reported below. Input node n Function firstpos(n) Function lastpos(n) Function nullable(n) The node is a leaf labelled ε {} {} true The node is a leaf with index i {i} {i} false Option node | (S2)(S1) firstpos(s1) ∪ firstpos(s2) lastpos(s1) ∪ lastpos(s2) nullable(s1) OR nullable(s2) Concatenation node · (S2)(S1) if nullable(s1) => firstpos(s1) ∪ firstpos(s2) else => firstpos(s1) if nullable(s2) => lastpos(s1) ∪ lastpos(s2) else => lastpos(s2) nullable(s1) AND nullable(s2) Kleene star node * (S1) firstpos(s1) lastpos(s1) true As it is possible to see, all functions accept a node and return a set of leaf indices, except function nullable which returns a boolean value. Furthermore, a key difference is to be underlined
pag 46 Andrea Tino - 2013 between follopos and the other functions: followpos does not accept a node, but an index. Also remember that only leaves are assigned with an index. At this point, evaluating follopos is possible by following 2 rules: 1. If node n is a contatenation node with left child s1 and right child s2, and i ∊ lastpos(s1) is an index, then all indices in firstpos(s2) are inside followpos(i) as well: followpos(i) ⊇ firstpos(s2). 2. If n is a Kleene star node and i ∊ lastpos(n) is an index in lastpos(n), then all indices in firstpos(n) are inside followpos(i) as well: followpos(i) ⊇ firstpos(n). Using an in-depth traverse algorithm, provided that functions nullable, firstpos and lastpos have already been applied to all nodes in the AST, it is possible to evaluate function followpos for every leaf. Using pseudo-code, it is easy to implement the function: procedure followpos(i,n) set F <- {}; for ∀x ∊ n do /* for each node in the tree whose root is n */ if n.type == CONCAT_N and i ∊ lastpos(n.s1) then for ∀y ∊ firstpos(n.s2) do F <- F ∪ {y}; end elseif n.type == KLEENE_N and i ∊ lastpos(n) for ∀y ∊ firstpos(n) do F <- F ∪ {y}; end end end return F; end The procedure requires an iterative or a recursive approach to be considered in order to browse the AST. Please note that Kleene star nodes have only one child, the AST is not an exact binary tree. Example of applying function to an AST Taking advantage of the same example of before, we can now augment the tree with attributes to nodes. To calculate function followpos for each leaf, we first need to calculate the other functions for all nodes. The best approach is starting from leaves and proceeding up the root. Function nullable should be applied first, then functions firstpos and lastpos can be evaluated for each node. When evaluating each function, leaves are to be considered first as intermediate nodes depend on function values of children, which makes the methodology a recursive one.
pag 47 Andrea Tino - 2013 * • | a b #• • b• * a{1} {2} {3} {4} {5} {6} {1,2} {1,2} {1,2,3} {1,2,3} {1,2,3} {1,2,3} * • | a b #• • b• * afalse false false true false false false false false false false false * • | a b #• • b• * a{1} {2} {3} {4} {5} {6} {1,2} {1,2} {3} {4} {5} {6} Diag. 1 - Applying nullable Diag. 2 - Applying firstpos Diag.3 - Applying lastpos Now we can evaluate function followpos for every leaf index. However the approach is a little bit different, we start considering an index and for each one of them we must consider all nodes in the tree to evaluate the function. Starting from the root is not mandatory, but can be a good approach. So we have the following associations: 1=>{1,2,3}, 2=>{1,2,3}, 3=>{4}, 4=>{5}, 5=>{6}, 6=>{}. Conversion algorithm After all nodes have been decorated with results of function followpos being applied to them, we can synthesize the DFA. The methodology is sistematic and described by the following steps: 1. Initialize the DFA’s states set S ⊆ 2^N with the empty set: S = {}. States of the final DFA are represented by sets of indices labelling leaves of the AST. 2. The first state being considered is the DFA start state s_0. This state is s_0 = firstpos(r), where r is the AST’s root node. State s_0 is to be inserted into S as a not-marked state. 3. For each not-marked state s ∈S and for each symbol a ∈Σ , consider all indices si ∈s in state s that map onto symbol a, and evaluate function followpos for each: si F( ) = followpos si( ). Create a temporary set with all collected indices: ′s = si F( ) ∀i=1…s  . 4. If temporary set ′s is not empty, then it is to be identified as a new valid state and added to the states set S as a not-marked state. However, if ′s is not empty but already in S, then leave it as it is; if it is a not-marked state it will remain so, the same goes in case the state is marked. 5. Let δ :S × Σ  S be the transition function. Here the function does not return a set of states as we are going to create a DFA not a NFA! Create connection δ s,a( )= ′s . 6. Run again for another not-marked state jumping to point 3. If the states set S is full of marked- states, then the algorithm is over.
pag 48 Andrea Tino - 2013 In pseudo-code, the algorithm is more straightforward: procedure synth_dfa(p) set n <- create_ast(augment_regex(p,”#”)); /* step 1 */ n <- decorate_tree(n,followpos); /* step 2, ast represented as its root */ set S <- {firstpos(n)}; U <- {} set δ <- {}; /* dfa initialization */ for ∀s ∊ S, ∀a ∊ Σ do /* for each not-marked state and symbol */ set s1 <- {}; for ∀i ∊ s do if σ(i) == a then s1 <- s1 ∪ followpos(i); end end if s1 != {} then δ <- δ ∪ {(s,a,s1)}; if not s1 ∊ s then S <- S ∪ {s1}; end end U <- U ∪ {s}; end return S,δ; end Function σ :  Σ Returns the symbol associated to the leaf index specified as input. Final states in the DFA are all those ones containing the index of the augmentation symbol. Example of converting a decorated AST into a DFA By means of the same example, we can finally create the DFA. Passages are shown in the table. Working state Working symbol Step description Current s_0 = {1,2,3} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(2); s_1 = {1,2,3,4}; δ(s_0,a) = s_1; a s0 s1 b s1 = {2}; f(2) = {1,2,3}; s_0 = f(2); δ(s_0,b) = s_0; b a s0 s1
pag 49 Andrea Tino - 2013 Working state Working symbol Step description Current s_1 = {1,2,3,4} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(2); δ(s_1,a) = s_1; !!NEW STATE ADDED!! b a s0 s1 a b s1 = {2,4}; f(2) = {1,2,3}; f(4) = {5}; s_2 = f(2) ∪ f(4); s_2 = {1,2,3,5}; δ(s_1,b) = s_2; !!NEW STATE ADDED!! b a s0 s1 a s2 b s_2 = {1,2,3,5} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(3); δ(s_2,a) = s_1; b a s0 s1 a s2 b a b s1 = {2,5}; f(2) = {1,2,3}; f(5) = {6}; s_3 = f(2) ∪ f(5); s_3 = {1,2,3,6}; δ(s_2,b) = s_3; !!NEW STATE ADDED!! b a s0 s1 a s2 b a s3 b
pag 50 Andrea Tino - 2013 Working state Working symbol Step description Current s_3 = {1,2,3,6} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(3); δ(s_3,a) = s_1; !!FINAL STATE ADDED!! b a s0 s1 a s2 b a s3 b a b s1 = {2}; f(2) = {1,2,3}; s_0 = f(2); δ(s_3,b) = s_0; b a s0 s1 a s2 b a b a b s3 In the last passage, final states are defined! As we can see, the final automaton is a deterministic one. The approach allows the direct calculation of the DFA from a regex. Minimizing FSAs A question naturally arises: “Is the automaton in minimal form?“. Minimization is a term used to refer to DFAs which have the minimum number of states to accept a certain language. More DFAs can accept the same language, but there is one (without considering states labelling) that can accept the language with the minimum number of states: that DFA is the minimal DFA. When having a DFA, we can try to make it minimal using a sistematic methodology, if final result is the same, then the original DFA was already in its minimal form. [Theo] Minimal DFA existance: For each regular language that can be accepted by a
pag 51 Andrea Tino - 2013 DFA, there exists a minimal automaton (thus a DFA with minimum number of states) which is unique (except that states can be given different names). [Cor] Minimal DFA’s computational cost: The minimal DFA ensures minimal computational cost for a regular language parsing. Hopcroft’s DFA minimization algorithm DFA minimization can be carried out using different procedures; a very common approach is Hopcroft’s algorithm. Considering a DFA, to obtain the minimal form the following procedure is to be applied: 1. Consider the states set S ⊆  , and split it into two complementar sets: the set of final states S F( ) ⊂ S and the set of non-final states S N( ) ⊂ S . 2. Consider a generic group and create partitions inside that group. The rule to create partitions is having all states inside it having the minimal relation property. 3. Proceed creating more partitions until no more partitions can be created. 4. At the end of the partitioning process, consider all partitions to be the new states of the minimized DFA. Remove internal transitions (inside every partition) and leave all partition- crossing transitions. 5. Make the initial state of the minimal DFA the state containing the initial state of the original DFA. Make final states in the minimal DFA all states containing final states in the original DFA. Inside a partition, all states have the following property: [Theo] State in partition: In a minimal DFA, a state in a partition has at least one transition to another state of the same partition. A direct consequence of the previous result, is a useful approach to locate partitions, or better to locate states that can be removed from a partition in order to be placed into a different one. [Cor] State not in partition: During the partitioning process to get the minimal DFA from a DFA, a state is to be placed into a different partition when it has no transitions to states of that partition. Example of DFA minimization We consider the previous example and try to check whether the DFA is a minimal one or not. By carrying out the partitioning process, we note that state s_0 falls inside the non-final states partition. We also note that in that partition, no transition are directed to s_0 from other states in the group: state s_0 is to be placed into a different partition.
pag 52 Andrea Tino - 2013 After doing so, we note that no more partitioning can be carried on and stop the process. Inside the only one group we remove transitions from/to states s_1 and s_2 and draw remaining transitions to/from the entire partition to the other nodes (without forgetting self-transitions). Diag. 1 - Original DFA Diag. 2 - Minimized DFA b a s0 s1 a s2 b a b a b s3 b as0 s1 s2 a b a b s3 The example also proves that the process of converting a regex into a DFA does not return a minimal DFA.
Andrea Tino - 2013 Lexical analysis and scanners Plurality should not be posited without necessity. “ “William of Occam http://www.brainyquote.com/quotes/quotes/w/ williamofo283623.html
pag 54 Andrea Tino - 2013 About lexicon In a language, lexicon is that part of the language describing all allowed lexems in phrases. Lexems, or tokens, in a language can be divided into some groups or lexical classes. Lexical class Example (C++) Description Keywords void fun(); if (var) {...} Fixed words in the language that cannot be altered or inflexed. Delimiters int i = 0; for (;;) {} char a,s,d; Particular characters used to separate other tokens in a statement. Operators a <= b; if (a==b)... int a = 2; They typically consist in single characters or couple of characters; used to evaluate special functions. Composite structures // Comment /* Comment */ More lexems identified as a single structure, like comments. Open classes int identifier = 1; if (var==1) {...} obj = new type(); All those tokens that cannot be enumerated. They can be altered or inflexed. Identifiers for functions and variables. Literals as well. A lexer for a language is a program whose purpose is recognizing valid tokens inside a phrase provided as input. Such a task is not so difficult, however the main problem is represented by open lexical classes. They cannot be enumerated, but they are valid tokens to be recognized like all the others. That’s why a lexer is implemented as a FSA. If a language were finite, then it would be a regular language, but more importantly all lexems would be enumerable. The problem of recognizing a token (if the language is small) might be handled by simple algorithms checking whether an eleent belongs to a set or not (Blooming filters for example). But if the language has open classes in its lexicon, then it is a whole different matter. To recognize all tokens, it is necessary to generate them. So a grammar is to be used! In particular, tokens for a language are treated as regular expressions. In fact they all constitute a regular language where order of tokens is not important (as a token is to be recognized as a single entity). That’s the reason why lexers are implemented using FSAs. What should a lexer do? A lexer has some important tasks. Basically our compiler will receive as input a string; the first stage is the lexer which takes the string and needs to manipulate it to return tokens. So following we can find all tasks a lexer needs to perform:
pag 55 Andrea Tino - 2013 • A lexer must provide a system to isolate low level structures from higher level ones representing the syntax of the language. • A lexer must subdivide the input string in tokens. The operation is known as tokenization and it represents the most important activity carried out by a lexical analyzer. During this process, invalid sequences might be recognized and reported. • A lexer is also responsible for cleaning the input code. For example white spaces are to be removed as well as line breaks or comments. • A lexer can also perform some little activities concerning the semantic level. Although semantics is handled as the final stage of the compilation process, lexers and parsers can provide a little semantic analysis to make the whole process faster. A typical application, is having lexers insert symbols and values in the symbol table when encountering expressions like variables. Separation is not compulsory It is important to underline a concept. Lexical analysis and syntax analysis can be carried out together into the same component. There is no need to have two components handling lexicon and syntax separately. If a compiler is designed as modular, it is easier to modify it when the language changes. Communications between lexer and parser We will consider a compiler with a modular structure, so the lexer and the parser in our compiler are implemented as two different components. However they interact together. In particular, the parser requests one token at a time to the lexer which provides them. The source code is not seen by the parser, as it comes as input to the lexer. About tokens There a little of naming to understand when considering one language’s lexicon. The first thing to understand is the difference between a lexem and a token in the language. [Def] Lexem: In a language, a lexem is the most basic lexical constituent. A lexem is represented by a word of the language carrying, at most, one single semantic value. [Def] Token: In a language, a token is a lexem or a sequence of lexems. It is the unit of information carrying a certain semantic value which is returned by the lexer to the parser. The difference os not that evident and not so simple to grasp. However that’s when examples come to the rescue. An identifier is a lexem and at the same time a token for example. A keyword is another example of lexem being a token as well. On the other hand, a comment like /* comment */ in C++, is a token, but it is a sequence of different lexems tarting from comment delimiters /* and */. An easy way to differentiate tokens from lexems is placing our point of view right in between the lexer and the parser. Everything which the lexer passes to the parser is a token. If a
pag 56 Andrea Tino - 2013 lexem is never passed, then it is not a token. Semantics We said that both lexems and tokens have semantics associated with them. However, when a lexem is not a token, it is likely to have it passed to the parser as part of a token consisting of more lexems. Semantics is something that, in the end, will be seen by the parser. The parser has no knowledge related to lexems. So the final semantic values are those ones arranged by the lexer working on lexems and attached to tokens. A typical example of semantics handling is when working with literals. Consider the following C/ C++ input code: int n = 0;, The lexer will return 5 different tokens to the parser: a keyword, an identifier, an operator, a literal and a delimiter. They all carry semantics with them except for the operator and the delimiter tokens. The identifier carries an information about the type of the variable declaration being considered in the statement; the identifier provides the name of the variable to be inserted as key in the symbol table and the literal provides a numerical value. When the literal is handled by the lexer as a lexem, it is seen as a string, the lexer needs to parse it as an integral value and associate that value to the literal token that will be sent to the parser. Structure of a token It is evident that a token is not simply passed to the parser as a string. A composite structure, representing the whole token, is passed to the parser. Typically the following fields are included in a token: • Token type: The type of token: identifier, keyword, operator, etc. • Token value: The string representing the sequence of lexems fetched and recognized by the lexer and passed to the parser. • Attributes: Semantic values (if any). The content of this fiend strongly depends on the type of token being considered. • Localization: Typically a field used to locate errors if any. This field is optional. When the compiler implements error handling routines, this field is very useful. Depending on the token type, some fields might get ignored by the parser. For example a keyword does not carry any special meaning. In this specific case, no semantic values are added as well. How to implement a scanner Implementing a scanner is not so difficult after all. But how to proceed? We learned that a scanner is based on regular expressions, and we learned that regexes are evaluated using FSAs. So we should learn how to implement a FSA. This is not necessarly true; in fact there are more options
pag 57 Andrea Tino - 2013 other than FSAs only. Let us consider them all in a brief overview. Methodology Flow Description Procedural approach Regular Grammar => Program Regex => Program This is a hard-coded solution. Starting from a regex or a regular grammar, a pro- gram is synthesized by manually coding the application. DFA implementation approach Regular Grammar => DFA => Program Regex => DFA => Program From a regular grammar or a regex, a DFA accepting the language is obtained. This solution provides sistematic methodologies to hard-code DFAs. Scanner generators Regex => Tool => Program This is the most common solution. Using generative tools, one needs to specify regexes only. The tool will generate the code to be compiler into the final lexer. Today, scanner generators are very common and are usually preferred to hard-coded approaches. Procedural approach: regex 2 program The methodoly provides a sequence of rules to hard-code a scanner accepting a certain regular expression. So the problem is converting a regular expression into a program accepting it. With this solution, there is no need to convert the regex into a DFA, the approach operates directly on the regex. The program will call a function whose body is filled using some rules. Using C/C++, a function is to be created: typedef int idx; /* index for a character in the input string */ enum Char {...}; /* symbols in alphabet */ Char cur(const std::string& str, idx i) {...} /* gets the current char */ bool scanner(const std::string& input) { idx i = 0; /* index for the character to consider */ /* body here */ } Rules Procedural code is added according to symbols and characters in the regex: 1. Write a sequence for each concatenation. 2. Write a test for each union. 3. Write a loop for each Kleene star. 4. Place return points for each accepting position in the regex.
pag 58 Andrea Tino - 2013 In this case as well, only union, concatenation and Kleene-star characters are considered. Every other regex character is to be converted into a more basic form using the afore-mentioned regex symbols. For example regex a(a|b)+ should be converted into a(a|b)(a|b)*. Example of scnner hard-coding from a regex The problem of this approach is the order of operators. Although not necessary, one should figure out about the operators tree, after that, starting from the root, rules are applied. Consider the following example for regex (a(a|b)*)|b. enum Char { CHAR_A = 0x00, /* in alphabet: a */ CHAR_B = 0x01 /* in alphabet: b */ }; Char cur(const std::string& str, idx i) { if (i >= str.length() || i < 0) throw std::exception(); if (str[i] == ‘a‘) return CHAR_A; if (str[i] == ‘b‘) return CHAR_A; } bool scanner(const std::string& input) { /* the scanning function */ idx i = 0; try { if (cur(input,i) == CHAR_A) { /* test */ i++; /* consume */ while (cur(input,i) == CHAR_A || cur(input,i) == CHAR_B) /* kleene */ i++; /* consume */ return true; /* accept */ } else if (c == CHAR_B) return true; /* test */ return false; /* reject */ } catch (std::sxception e) { return false; /* reject */ } } The code above implements a scanner recognizing the provided regex. Be careful about the fact for which the procedure will implement a scanner for a unique regex only. If another regex is to be matched, a new procedure must be written. To use the scanner, the main routine is to be simply written and function scanner to be called inside it. Procedural approach: regular grammar 2 program When having a grammar, it is possible to write a scanner from that. As for regexes, a methodology can be considered to create the scanning program. This time, rules focus on production rules of the grammar to synthesize the scanner. The scanner function will have the same structure and interface from the previous approach. Also, function cur and enum Char are recycled as well. Rules Procedural code is added according to non-terminals and produnction rules in the grammar.
pag 59 Andrea Tino - 2013 1. Write a function for each non-terminal. 2. Write a test for each alternative in the production rule for each production rule. 3. Call corresponding functions for each non-terminal appearing as RHS in production rules. 4. Place return points in all places where a grammar rule is to be verified and when terminals are reached. First, all non-terminals appearing in rules are to be processed and functions are to be created accordingly. After that, one focuses on production rules. For each production rule alternatives are to be evaluated. For each alternative a test is written. In the end, every alternative’s RHS is evaluated and functions called when the corresponding non-terminal is encountered. Example of scanner hard-coding from a regular grammar Consider the following regular grammar: L = {S,A,0,1,2,3,#}; V = {S,A}; T = {0,1,2,3,#}; S -> 0 A | 1 A | 2 A | 3 A; A -> 0 A | 1 A | 2 A | 3 A | #; The code to synthesize is shown below. The enumeration will now host terminal and non-terminal symbols, while function cur will act accordingly: enum Char { /* terminals only */ CHAR_0 = 0x00, CHAR_1 = 0x01, CHAR_2 = 0x02, CHAR_3 = 0x03, CHAR_S = 0xf0 }; Char cur(const std::string& str, idx i) { if (i >= str.length() || i < 0) throw std::exception(); if (str[i]==‘0‘) return CHAR_0; if (str[i]==‘1‘) return CHAR_1; if (str[i]==‘2‘) return CHAR_2; if (str[i]==‘3‘) return CHAR_3; if (str[i]==‘#‘) return CHAR_S; } Before handling function scanner, we need to create all functions to manage with non-terminals: bool _handle_S(const std::string& input, idx& i) { /* non-terminal S */ if (cur(input,i) == CHAR_0) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_1) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_2) { /* test */ i++; /* consume */
pag 60 Andrea Tino - 2013 if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_3) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } return false; /* reject */ } /* _handle_S */ bool _handle_A(const std::string& input, idx& i) { /* non-terminal A */ if (cur(input,i) == CHAR_0) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_1) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_2) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_3) { /* test */ i++; /* consume */ if (_handle_A(input,i)) return true; /* ok */ } if (cur(input,i) == CHAR_#) { /* test */ i++; /* consume */ return true; /* ok */ } return false; /* reject */ } /* _handle_A */ Please note how non-terminal handling procedures accept a reference (pointer) of the index. bool scanner(const std::string& input) { /* the scanning function */ int i = 0; try { return _handle_A(input,i); /* starting */ } catch (std::sxception e) { return false; /* reject */ } } The approach, however, might require in some cases, to change the grammar style. For example recursive rules can be a little problematic. Creating more rules can help to make the scanner work with this approach.
pag 61 Andrea Tino - 2013 DFA implementation approach It doesn’t matter if we start from a regex or from a regular grammar. We need to proceed converting them into a DFA, and after that the following methodology will be considered. So far we can convert a regex into a DFA, but other approaches (we did not cover here) can be used to handle grammar to DFA conversions (for example by generating a regex from a grammar). To implement a DFA by hardcoding it, we need a scanning routine and a function acting as the DFA transition function. typedef int idx; /* index for a character in the input string */ enum State {...}; /* states of the dfa */ enum Char {...}; /* symbols in alphabet */ State deltaf(const State& s, const Char c) {...} /* transition function */ bool is_final(const State& s) {...} /* is the state final? */ Char cur(const std::string& str, idx i) {...} /* gets the current char */ bool scanner(const std::string& input) { /* the scanning function */ State current = STATE_BEGIN; idx i = 0; try { for (;;) { if (i >= input.length()) break; /* reached the end of the string */ current = deltaf(current,cur(input,++i)); /* transition & consume */ if (current == STATE_NONE) return false; /* reject */ } return is_final(current); } catch (std::sxception e) { return false; /* reject */ } } So the problem is not writing the scanning routine, but writing the transition function for the DFA, this is a very simple task actually. The transition routine and the state enum should be built according to the following principles: 1. Enumeration State must provide two compulsory values: STATE_BEGIN, the start state of the DFA, and value STATE_NONE. 2. Function deltaf must always return a single value of type State for each couple state, character. All combinations must be handled. 3. When a transition from a state for a given symbol does not exist in the original DFA, function deltaf will return value STATE_NONE. Mind that in case the transition function is not properly coded, possibilities of infinite loops may arise. A DFA is quite simple to hard-code because transitions are deterministic! This explains the
pag 62 Andrea Tino - 2013 reason why DFAs are faster. Example of scanner hard-coding from a DFA Let us consider the DFA reported here. We can write the transition function following connections and symbols. enum State { STATE_BEGIN = 0x00, STATE_NONE = 0xff, STATE_D1 = 0x01, STATE_D2 = 0x02, STATE_D3 = 0x03 }; enum Char { CHAR_A = 0x00, CHAR_B = 0x01 }; State deltaf(const State& s, const Char c) { if (s == STATE_BEGIN && c == CHAR_A) return STATE_D1; if (s == STATE_BEGIN && c == CHAR_B) return STATE_D2; if (s == STATE_D1 && c == CHAR_A) return STATE_D1; if (s == STATE_D1 && c == CHAR_B) return STATE_D3; if (s == STATE_D2 && c == CHAR_A) return STATE_D2; if (s == STATE_D2 && c == CHAR_B) return STATE_D2; if (s == STATE_D3 && c == CHAR_A) return STATE_D1; if (s == STATE_D3 && c == CHAR_B) return STATE_D2; return STATE_NONE; /* no transition */ } bool is_final(const State& s) { if (s == STATE_BEGIN || s == STATE_D1) return true; return false; } This methodology actually requires very few efforts than those ones considered before. More issues when building scanners When developing a scanners, there are is a certain set of problems which are fairly common. We are going to analyze them without detailing things too much, just for the sake of knowledge. Identifiers vs. reserved keywords How can a scanner recognize when a certain sequence of characters is to be identified as a reserved keyword or as an identifier? The problem comes from the fact that reserved keywords and identifiers usually, in the most general case, fall into the same matches when processed by the various regex patterns while scanning the input. To solve the problem, some solutions can be d2 a d0 b a a b d3 d1 b a|b
pag 63 Andrea Tino - 2013 considered. • When scanning a sequence, regexes mapping identifiers are used; thus no regexes are inserted to evaluate reserved words. However before returning an identifier token, the sequence is searched into a reserved words table. If the sequence is matched by an entry, then returned token’s will be changed into reserved word causing the lexer to return a reserved word token; nothing will be done otherwise and an identifier token will be returned. • Another approach is giving priority to identifiers. Reserved words will be chosen in the complementar set of the one generated by all matches identified by regexes for identifiers. By doing so we avoid identifiers’ regexes to match reserved words. However such an approach can be very limitative as the complementar set leaves can leave very few options for reserved words. Also, reserved words should be easy to associate to the language semantics, with this approach it would be impossible to choose appropriate names for reserved words. For example the if keyword might not be available, what to do? • Lexers today follow this solution: all regexes are assigned a priority. Among all regexes matching a given sequence, the one holding the highest priority will be considered. Solving the identifiers vs. reserved words problem is easy at this point: all regexes assigned to reserved words a arregned the highest priorities. Getting semantic values Some tokens can have semantic values associated with them. The classic example is represented by numerical literals. They should varry the integral or double values they represent. This operation sometimes is not easy. The real problem is always represented by converting a string into a numerical value. The problem is neither simple nor dummy because several machine architectures might support different encodings. Scanner termination The problem of terminating the scanning is a logic issue, but sometimes it can hide some troubles. A typical approach is allowing the existance of an EOF token released when a particular sequence is encountered. Character lookahead There are some languages which introduce particular sequences for which it is necessary to allow the lexer to peek the next character without consuming the current one. This approach can be really powerful. Error recovery Error management in scanners is very important. The problem is not just notifying an error, but recovering from it. If a sequence is not recognized, the scanner should not be stopped. That’s why
pag 64 Andrea Tino - 2013 the current trend is trying to discard a not-recognized sequence and go on to the next one in order to return a token. Current approaches are the following: • Discarding all characters read until the error occurrance and restarting scanning operations. This approach completely discards a token and proceed further. • Starting from the moment the previous token was returned until the error occurrance, discarding the first read character and scanning again from the next one. This approach tries to fix a mispelled keyword. The ratio behind such a logic is explained by the empirical observation for which the most common lexical errors reside in illegal characters or mispelled keywords. In such cases, statistics show that errors typically occur at the beginning of the token; in this case both approaches detailed before are equivalent.
Andrea Tino - 2013 Syntax analysis and context-free grammars Machines take me by surprise with great frequency. “ “Alan Turing http://www.brainyquote.com/quotes/quotes/a/ alanturing269234.html
pag 66 Andrea Tino - 2013 Syntax and CFGs A language, as described before, is made of two important components: • A set of words over an alphabet representing all the recognized and valid lexems. • A syntax. What is the syntax of a language? The syntax is a set of rules which defines how words can be arranged together in order to create valid phrases in the language. A syntax also allows the correct translation of a phrase into a computer readeable structure, this structure is called Abstract Syntax Tree. Syntax handles sequences of tokens which must be recognized during the lexical analysis. Each token is an atomic structure in the language that cannot be decomposed further. What is a CFG? Context Free Grammars are particular kinds of grammars corresponding to type-2 in the Chomsky-Schützenberger hierarchy. Today almost all languages take advantage of context-free syntaxes. Throughout this chapter, unless differently specified, we will consider CFGs only, also when mentioning a generic grammar. What are the pros of using CFGs? CFGs can be parsed (processed and recognized by a syntax analyzer) by using simple, compact, fast and efficient algortihms. Production rules can be defined intuitively. Today even the most complex languages can be described by type-2 grammars. What about the cons? Although very powerful, CFGs cannot really describe all possible phrases in modern languages. However hybrid solutions can be used. For example, some languages are parsed using context-free syntax parsers, everything left unrecognized is, then, processed by other parsers. A context-free model for example is not able to define the rule for which a variable must first be declared before being used. In modern solutions, to handle the variable declaration issue, CFGs are used anyway, the declaration matter is then handled by at semantic analysis. How is parsing performed? Parsing can be performed all at once given the source code or, in a better scheme, over a sequence of tokens evaluated by the scanner one step before. The latter is a better approach because it allows a better separation between lexical and syntax analysis. The parser, when running, is meant to ask for the next token of the sequence to the scanner. Derivations: concepts and definition CFGs enables many features in a language. Differently from regular ones, a CFL (Context Free Language) can have recursive structures. Being type-2 in the Chomsky-Schützenberger hierarchy, such grammars have all rules in the form: A → γ . That is a non-terminal is the only allowed sequence of symbols in the LHS, while the RHS can be a generic sequence of terminals and non- terminals. Although the empty string is not really allowed as rules’ RHS in the original Chomsky-
pag 67 Andrea Tino - 2013 Schützenberger hierarchy, we can also have such a condition. Derivations Production rules in CFGs can also be called rewriting rules. The concept of derivation is very important because it is strictly bound to the output of parsers: ASTs. Following rules in a CFG and starting from a string, it is possible to apply many rules and get new sequences of symbols. A derivation is a whatever sequence of rule applications in a grammar, starting from a given initial string. When it comes to CFGs, it is very simple to apply rules: each non-terminal can be expanded; only when the final result contains no non-terminal the process is over. Derivations in CFGs can be visualized also using trees. Tokens Tokens returned by the lexer are terminals in the grammar handled by the parser. An example: algebraic expressions Consider the following example showing a simple CFG consisting in the following elements: L = {S, x, y, z, +, -, *, /, (, )}; T = {x,y,z}; V = L - T; And having the following rules: S -> x; S-> y; S -> z; S -> (S); S -> S+S; S -> S-S; S -> S*S; S -> S/S; This CFG defines a simple language to create basic algebraic expressions. For example the string (x+y)*(x-z) can be generated by the following sequence of rules: S => S*S => (S)*(S) => (S+S)*(S-S) => (x+y)*(x-z). Each step α => β is a single derivation. The language of derivations Derivations are represented by double arrows between two strings. Since a derivation is the result of applying one or more rules of the grammar one or more times, it is not always possible to always tell how many rules led the LHS of the derivation to the RHS. If a derivation like α => β is considered, we say that it is valid (for a given grammar) only if there exists a rule legitimating such a transition. In a more formal way, we can summarize this property as follows: αAβ ⇒αγβ ⇔ ∃ A → γ( )∈P,∀α ∈ V ∪T( ) ,β ∈ V ∪T( ) For any given context-free grammar G = V,T,P,S( ). This means that the following expression α => β refers to a single step derivation. The following table shows all possible symbols to use when handling derivations and their meanings. Notation Relation Descrirption Single step α => β α derives (directly) from β. β is derived (directly) from α. To reach the RHS from LHS, one derivation step is needed.
pag 68 Andrea Tino - 2013 Notation Relation Description Fixed steps α =>{n} β α derives β in n steps. β is derived from α in n steps. To reach the RHS from the LHS, a fixed number of derivation steps (shown in the relation) is needed. One or more steps α =>+ β α derives β. β is derived from α. To reach the RHS from the LHS, one or more derivation steps are needed. Zero or more steps α =>* β α derives β. β is derived from α. To reach the RHS from the LHS, zero or more derivation steps are needed. In the earlier example, the relation S*S =>* (x+y)*(x-z) holds. LeftMost derivations In CFGs production rules can be applied in a very easy way. Since rewriting rules are applied by simply substituting an expression for a non-terminal symbol, we can perform this operation using certain policies. This is particularly relevant when a string contains more than one non-terminal: which one should be replaced first? Given a string, the LeftMost derivation policy replaces the first non-terminal encountered when scanning the string from left to right. In the earlier example, one possible multi-step LeftMost derivation is: S => S*S => (S)*S => (S+S)*S => (x+S)*S => (x+y)*S => (x+y)*(S) => (x+y)*(S-S) => (x+y)*(x-S) => (x+y)*(x-z). RightMost derivations The counterpart of LeftMost derivations are RightMost ones. The principle is almost the same. Given a string, the RightMost derivation policy replaces the right-most non-terminal, i.e. the last one encountered when scanning the string from left to right. The same multi-step derivation as before, S =>+ (x+y)*(x-z), can be expanded into a sequence of single-step RightMost derivations as follows: S => S*S => S*(S) => S*(S-S) => S*(S-z) => S*(x-z) => (S)*(x-z) => (S+S)*(x-z) => (S+y)*(x-z) => (x+y)*(x-z). Derivations as a tree When starting from a string of symbols and considering a CFG, we can evaluate all possible derivations from that string using that grammar by means of trees. How can we check whether a derivation is correct or not? The derivation tree is built using the grammar; if it is possible to start from the LHS of the derivation in the tree and follow a branch in it until the RHS of the derivation, then the derivation is valid. The example shown here considers a very simple grammar (a smaller version of the algebraic expressions grammar where only one variable, multiplication, grouping and summation are allowed). Derivation S+S =>* x+(x*S) is a valid one as both LHS and RHS can be found in the tree, and they are part of the same branch.
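The same membership test can be performed programmatically. The following is a minimal sketch (not taken from the book) of a single-step derivation check, assuming symbols are single characters as in the example grammar above; a multi-step check would simply chain it along a path of intermediate strings.

#include <string>
#include <vector>

// A rewriting rule "lhs -> rhs" with single-character symbols, e.g. {'S', "S+S"}.
struct Rule { char lhs; std::string rhs; };

// Returns true if the single-step derivation alpha => beta is valid for the given CFG,
// i.e. if beta can be obtained from alpha by one application of one rule.
bool derives_in_one_step(const std::string& alpha, const std::string& beta,
                         const std::vector<Rule>& rules) {
    for (const Rule& r : rules)
        for (std::size_t i = 0; i < alpha.size(); ++i)
            if (alpha[i] == r.lhs) {
                std::string candidate =
                    alpha.substr(0, i) + r.rhs + alpha.substr(i + 1);
                if (candidate == beta) return true; // found the rule legitimating the step
            }
    return false;
}

For instance, with the expression grammar above, derives_in_one_step("S*S", "(S)*S", rules) holds, while derives_in_one_step("S*S", "x+y", rules) does not.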
pag 69 Andrea Tino - 2013 Diag. - Derivations shown for a simple grammar: the tree expands S level by level (S; then x, (S), S+S, S*S; and so on, down to strings like x+(x*S)). The tree is a good way to visualize the grammar, but it is hardly considered in real scenarios as it can be very complex to draw. Productions How far can a derivation process be carried? If the derivations tree had a fixed height, then the language would be a finite one, thus a regular language. Actually, the CFGs of practical interest generate infinite languages. However, the tree will always have leaves. The point is that, once a leaf is reached, it is not possible to get more strings; this happens, in a CFG, when a string is formed by terminals only. When a leaf in the tree is reached, the string contains only terminals and that string is a word of the language. Every leaf in the tree is called a production. Productions of a grammar When defining a CFG, production rules must have a certain set of properties which make that grammar a good grammar, able to generate the syntax of a language. When these properties are not observed, the grammar is said to be malformed, misshapen. • The grammar should not contain redundant rules. For example self-rules like A -> A add nothing to the grammar, so they should not be used. • The grammar should not contain useless rules. For a CFG this means that every non-terminal (except the start symbol) should appear both as the LHS of some production rule and inside the RHS of some other production rule. • The grammar should not contain infinite loops. For example a rule A -> aA alone is not allowed. For a CFG this means having a terminating RHS for each recursive rule. • The grammar should define the syntax using the lowest possible number of rules. Generative grammars Grammars are called generative because they generate all words in a
pag 70 Andrea Tino - 2013 language. Recalling the definitions of language and grammar, we say that a language L ⊆ Σ* (over a certain alphabet) is generated by a grammar G = (V, T, P, S), and we write L(G), when the grammar generates all words in the language. A phrase in the language is also a word inside it. We also say that a string ω ∈ Σ* is a word in L(G) only if that string can be generated as a production of the grammar: ω ∈ L(G) ⇔ S ⇒* ω, with ω ∈ T*. Of course a word of the language is made of terminals only. Equivalence between grammars Recalling the definition of language equivalence, two grammars are equivalent only if they generate the same language: G1 ≡ G2 ⇔ L(G1) = L(G2). Converting an expression into an AST The job of a parser is to convert the sequence of tokens coming from the scanner into another entity that can be easily read by the code generator to create the output code. This particular structure is the Abstract Syntax Tree (AST). A question naturally comes: what is the structure of an AST for a certain grammar? The tree can have different shapes depending on the rules of the grammar. Every production rule generates a subtree with height 1 and width equal to the number of terminals and non-terminals in the RHS. When these subtrees are combined together starting from the axiom of the grammar, we get the final AST. The leaves of an AST are all terminals. AST vs derivations Please do not mix up derivations with ASTs. The derivations tree is an abstract structure ruling word generation; an AST, on the other hand, is a physical structure, strongly related to the derivations tree of that grammar, which is returned by the parser to the semantic analyzer of the compiler. The difference is purely formal, but it is important. That being said, we define an AST as the graphic representation of a production in the grammar (which is originated by sequential applications of different derivations). How a parser works The parser reads tokens in sequence. At a certain point a sequence of tokens must match one or more rules. The point is, how can a parser return the corresponding AST given the input code? A parser needs the following: • Tokens: Typically the parser requests the next token from the lexer. This operation is performed by the parser on its own initiative.
pag 71 Andrea Tino - 2013 • Parsing stack: A stack (whose size is limited by the memory allocated for the parser while running) used by the parser to store symbols. The stack is a data structure containing terminals and non-terminals. • Stack functions: Two functions used to manipulate the stack and the elements inside it, called shift and reduce. When the parser runs, it requests tokens and puts them into the stack. This is called shifting, as the parser uses the shift function to put something into the stack. However, tokens are not always shifted; if, when receiving a token, the stack contains a sequence that matches the RHS of a production rule in the grammar, the parser can remove that sequence and insert the symbol in the LHS of the same rule before shifting the incoming terminal. This operation is called reduction. Bottom-Up parsers What we just described is the structure of a bottom-up parser. There are other types of parsers which do not require the use of a stack and also take advantage of other functionalities. We chose to focus on bottom-up parsers because they can be very efficient, and many parser generators today are implemented using the bottom-up approach. The objective is using sequences of shifts/reductions in order to complete the parsing process. The parsing process ends successfully only when, at the end of the input, the stack contains the start symbol only. Example of parsing a sequence Consider the algebraic grammar introduced before, and consider input sequence x+x*x. What the parser does is represented by the following table (the incoming token is the lookahead, the stack is shown before the operation is performed): Incoming token Stack Operation Description Step 1 x {} shift The incoming token is inserted into the stack. Step 2 + {x} reduce x matches rule S -> x and no larger rule can start with x followed by +: reduce to S. Step 3 + {S} shift No reduction possible, the operator is shifted. Step 4 x {S,+} shift The incoming token is inserted into the stack. Step 5 * {S,+,x} reduce x is reduced to S. Step 6 * {S,+,S} shift S+S matches a rule, but the incoming * has higher priority: shift instead of reducing. Step 7 x {S,+,S,*} shift The incoming token is inserted into the stack. Step 8 EOF {S,+,S,*,x} reduce x is reduced to S.
pag 72 Andrea Tino - 2013 Incoming token Stack Operation Description Step 9 EOF {S,+,S,*,S} reduce End of input reached: reduce starting from the highest priority operator, S*S into S. Step 10 EOF {S,+,S} reduce Reduce S+S into S. Step 11 EOF {S} accept Only the start symbol is left: parsing over. Note that shifting can be performed even when the current sequence of elements in the stack already matches one rule. The strategy is to keep shifting while a larger rule might still be matched (or while precedence requires it), and to reduce only when no further match is possible. This approach allows the parser to reduce larger rules. This is something we will understand better later in this chapter. Ambiguities in a grammar When defining a grammar, ambiguities are the most threatening causes of misshaping. A grammar ambiguity lies in the fact that one word can have more than one syntax tree generating it. More formally, in a grammar, starting from the axiom, when a word of the language can be reached by more than one different branch of the derivations tree, that grammar is ambiguous. Consider the one-variable algebraic expressions grammar: L = {S, x, +, -, *, /, (, )}; T = {x, +, -, *, /, (, )}; V = L - T = {S}; S -> x; S -> (S); S -> S+S; S -> S-S; S -> S*S; S -> S/S; And let us consider the following phrase: x+x*x. The derivation that leads to this result is S =>* x+x*x. However, dissecting this derivation into one-step derivations, we find two different paths leading to the production. One is: S => S+S => x+S => x+S*S => x+x*S => x+x*x. The other one is: S => S*S => S+S*S => x+S*S => x+x*S => x+x*x. Actually, looking closer, more derivations can be found leading to the same result. Diag. 1 - AST 1 for x+x*x. Diag. 2 - AST 2 for x+x*x. Please note that ambiguities are not a consequence of mixing RightMost and LeftMost derivations: the two paths above are both LeftMost. A grammar must always be
pag 73 Andrea Tino - 2013 handled with one single policy! Why threats? Ambiguities are not something good and an ambiguous grammar cannot be parsed correctly (meaning that the language it generates cannot be parsed deterministically). An ambiguity simply puts the parser into the condition of having to choose among more than one rule to proceed, without being able to tell which one is the best. From the parser's point of view, string x+x*x can't be parsed. The parser starts reading symbols: when x is found it is reduced to S, then + and x are shifted and the new x is reduced to S, and so on; when the parser encounters the last character it will reduce the expression into S+S*S, and from here it will not be able to go on. Should it reduce S+S into S first, or S*S? This is the consequence of an ambiguity. Ambiguities at semantic level Well, it looks like ambiguities are just a matter of choosing which rule to apply first; however the problem is not that simple. Threats hidden by ambiguities tend to show their real faces at semantic level. Considering the same ambiguity as before, let us take both ASTs and see what happens when they are returned to the semantic analyzer. In this example, each node carries a semantic value, which is the result of the operation specified by the operators. So imagine that symbol x will be replaced with a numeric value, say 3. The first AST will cause the following semantic values to be calculated: 3+(3*3) => 3+9 => 12. The second AST will generate the following semantic values: (3+3)*3 => 6*3 => 18. The semantic values are not the same! This proves how an ambiguity at syntax level will cause anomalies at semantic level as well. Removing ambiguities A grammar is a good one when no ambiguities exist. In that case there exists only one AST for each phrase in the language. Ambiguities are removed during compiler design. When ambiguities are encountered, rules must be changed accordingly. Typical ambiguities arise when handling operators. The example described before shows a very common ambiguity related to the absence of operator precedence rules. If priorities are assigned to operators, a grammar will not cause the parser to experience ambiguities like the ones seen so far. Taxonomy of ambiguities Ambiguities typically fall into two categories (for bottom-up parsers): • Shift/reduce conflicts: The parser is left undecided whether to perform a shift or a reduction operation on a given sequence. • Reduce/reduce conflicts: The parser is left undecided about which rule, among more than one (all matching), to use to reduce a given sequence. This classification is made depending on which type of conflict is generated during parsing. Shift/reduce conflicts They are the most common conflicts experienced by parsers handling ambiguous grammars, as they are mostly related to precedence rules. Let us consider the following
pag 74 Andrea Tino - 2013 YACC grammar rule for a very dummy C-like language, leading to a very famous shift/reduce conflict, first experienced in literature by the compiler of the ALGOL 60 programming language: stmt: expr | if_stmt; if_stmt: IF_KWORD expr THEN_KEYW stmt | IF_KWORD expr THEN_KEYW stmt ELSE_KEYW stmt; expr: x | y | z | w; Let us consider input string if x then y else z. When the stack contains the first four symbols and the next token is the else symbol, the parser notices that one rule is matching the input, so a reduce would be fine; however, if it waited for the next token, the second rule might be matched as well (and that rule is a more complete one, as it extends the first). This is an example of shift/reduce conflict; this particular case is known as: the dangling else conflict. Although shift/reduce conflicts can threaten a grammar, they do not represent a serious issue. In fact they are usually automatically solved by parsers by always giving priority to shift operations. This policy is meant to allow larger rules to be matched when rules share the initial part of their RHS. However a conflict is a conflict. In fact when a shift/reduce occurrence is experienced, the grammar needs to be adjusted a bit in order to prevent undesired behaviors. Let us consider the dangling else problem again. Consider the following string: if x then if y then z else w and the dummy grammar shown before; the parser will give priority to shifting, so all tokens will be shifted first and then reduced. The result is the following semantics: if(x){if(y){z}else{w}}; the else is attached to the innermost condition. But if the parser had given priority to reductions, the resulting semantics would have been: if(x){if(y){z}}else{w}; the else is attached to the outermost condition. What if the language designer had preferred the latter semantics? The problem is that the grammar is ambiguous. These kinds of conflicts are solved by refining the grammar and specifying where the else clause is to be attached, as shown below. Reduce/reduce conflicts They are not so common; however, if such a conflict is experienced by the parser, it means that the underlying grammar is not only ambiguous, but really malformed. A reduce/reduce conflict is a symptom of having something wrong in the rules. In few words, the parser encounters a situation in which a string is matched by more than one rule. This means that different rules target the same sequence of symbols, which is a contradiction. In such cases, the grammar should be reviewed and edited to remove the conflict. Generally speaking These types of conflicts are experienced by bottom-up parsers only. The key concept, valid for all parsers, is that ambiguities always have the same origin: more than one AST describing the same word/phrase in the language. This condition can generate different types of conflicts in the parser, and it is something the language designer must handle at grammar level.
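As a concrete illustration of such a refinement (this rewrite is not part of the original example; it is the classic textbook solution, shown here with the same token names), the if statement can be split into matched and open statements so that every else binds to the innermost unmatched then:

stmt: matched_stmt | open_stmt;
matched_stmt: IF_KWORD expr THEN_KEYW matched_stmt ELSE_KEYW matched_stmt | expr;
open_stmt: IF_KWORD expr THEN_KEYW stmt | IF_KWORD expr THEN_KEYW matched_stmt ELSE_KEYW open_stmt;
expr: x | y | z | w;

With this grammar the string if x then if y then z else w has one parse only, the one with semantics if(x){if(y){z}else{w}}; a designer preferring the other binding would refine the rules the other way around.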
pag 75 Andrea Tino - 2013 Different parsing methodologies Parsers can process input using two main approaches: • Top-down parsing: Parsing is conducted starting from the root of the derivations subtrees and proceeding to the leaves. Rules are picked and tested; if a rule fails to match the string, backtracking ensures recovery to the latest good position in the tree. The direct consequence of this approach is that ASTs are built starting from the root down to the leaves. • Bottom-up parsing: Parsing is performed starting from the leaves of the derivation subtree up to the root. The concepts of shifting and reduction are needed, as well as a stack to store sequences of symbols possibly matching a rule. It is possible to develop the parser using a state machine. The direct consequence of this approach is that ASTs are built starting from the leaves up to the root. So far we put much focus onto bottom-up parsers because they are very efficient and common today. However we will also see, later, how top-down parsers are structured, but a deeper analysis will be conducted on bottom-up parsers anyway. Classification today Parsing algorithms are today divided into categories depending on the following parameters: • Derivation policy: Parsers can build LeftMost or RightMost derivations. Top-down parsers produce a LeftMost derivation; bottom-up parsers produce a RightMost derivation (in reverse, since they reconstruct it starting from the leaves). • Parsing direction: The direction in which the sequence of tokens is read. Practically all parsers scan the input from left to right (LeftToRight parsers); a RightToLeft parser would consume the input starting from its end and is rarely used. • Lookahead characters: Parsers can take advantage of lookahead characters. It means that, before actions take place on the sequence of tokens/symbols loaded so far, the parser can request some tokens to peek at what comes next; however these lookahead symbols will not be part of the current actions. The more lookahead symbols are used, the more predictive the parser can be; however this number must be small, otherwise the parser will not be efficient. The bottom-up parsing algorithm described before, for example, took advantage of one lookahead symbol. Based on the parameters introduced so far, parsing algorithms are today classified using the following scheme: <Direction><Derivation><Lookahead>, thus: (L|R)(L|R)[k]. Very common algorithms are: • LL parsers: They are LeftToRight LeftMost top-down parsers using one lookahead character.
pag 76 Andrea Tino - 2013 They can also be referred to as: LL(1). • LR parsers: They are LeftToRight RightMost bottom-up parsers using one lookahead character. They can also be referred to as LR(1). • LA parsers: Parsers using lookahead symbols. Typically the type of parser is specified as well, so we can have: LALL (LA + LL) and LALR (LA + LR) parsers. If the number of lookahead characters is to be specified, it is possible to put it at the end: LL(k) (LL parser with k lookahead characters) or LR(k) (LR parser with k lookahead characters). Today's algorithms tend to prefer LR approaches. The bottom-up parsing algorithm described before is, for example, an LR(1). Some subclasses of context-free grammars can be parsed using a mixture of top-down and bottom-up parsing. Noticeable quantities in a CFG When having a CFG, we can consider some important quantities that we are going to use later when analyzing top-down and bottom-up parsers in detail. The common hypothesis is always the same: a CFG G = (V, T, P, S) is considered. First-set Given a CFG, the first-set for a symbol α ∈ V ∪ T, written First(α), is the set of all terminals in the grammar appearing as the first symbol of some string generated through derivations from that symbol: First(α) = {a ∈ T : α ⇒* aβ} ∪ Empty(α). The empty-set of a symbol is the set containing the empty string if the empty string can be reached from that symbol (and only when the empty string is part of the grammar), or the empty set otherwise: Empty(α) = {ε} if α ⇒* ε, Empty(α) = ∅ otherwise. Handy rules So, let α ∈ V ∪ T be a symbol (or a string of symbols) in the grammar: • If the string starts with terminal a ∈ T, then that terminal is part of the first-set: α = aβ ⇒ a ∈ First(α). • If the string starts with a non-terminal A ∈ V, that is α = Aβ, then locate all productions having A as LHS: • If there exists a production A → γ, then the first-set of the RHS is a subset of
pag 77 Andrea Tino - 2013 the first-set of the original symbol: A → γ ⇒ First(γ) ⊆ First(α). • If there exists a production having A as LHS and the empty string as RHS, then the first-set of β is a subset of the first-set of the original symbol: A → ε ⇒ First(β) ⊆ First(α). • Do not forget to evaluate the empty-set and take the union with the set found so far. The first-set construction is recursive; however it can be simply handled by considering the derivations tree generated by the grammar, locating symbol α and following that subtree; all nodes in the subtree starting with a terminal are to be marked and that terminal is to be inserted in the first-set. Always remember to insert the empty string in the first-set of the symbol if the grammar allows a derivation from that symbol to the empty string. Example I Consider the following grammar: T = {a,b,c,d,e}; V = {S,B,C}; S -> a S e; S -> B; B -> b B e; B -> C; C -> c C e; C -> d; We can evaluate first-sets for some words/phrases in the grammar: First(aSe) = {a}; First(C) = First(cCe) ∪ First(d) = {c,d}; First(B) = First(bBe) ∪ First(C) = {b} ∪ {c,d} = {b,c,d}; First(cCe) = {c}; Example II Consider now the following grammar: T = {a,b,c,d,e,ε}; V = {S,B,C,D}; S -> B; S -> C e; S -> a; B -> b C; C -> D c; D -> d; D -> ε; We can evaluate first-sets for some words/phrases in the grammar: First(B) = {b}; First(Ce) = First(C) = First(Dc) = (First(D) - {ε}) ∪ First(c) = {d} ∪ {c} = {d,c}; First(D) = First(d) ∪ First(ε) = {d,ε}; Note that the empty string belongs to the first-set of a symbol only when that symbol can actually derive the empty string (here D can, but Dc cannot, since c always follows). Follow-set Given a CFG, the follow-set for a non-terminal B ∈ V is the set containing all terminal symbols that can appear immediately next to B in some derivation: Follow(B) = {a ∈ T : S ⇒* γ1 B a γ2}. We use symbol π to indicate the end of a string.
pag 78 Andrea Tino - 2013 Handy rules So, let B ∈ V be a non-terminal in the grammar: • It is a typical convention to put the last-token π in the follow-set of B if B is the start-symbol of the grammar. • Consider all production rules having B in their RHS: • Every production of the form A → αBβ causes the follow-set of the non-terminal B to contain the first-set of the expression at its right (empty string excluded): A → αBβ ⇒ Follow(B) ⊇ First(β) - {ε}. • For every production of the form A → αBβ where ε ∈ First(β), we must also put the follow-set of A into the follow-set of B: A → αBβ ∧ ε ∈ First(β) ⇒ Follow(B) ⊇ Follow(A). • Every production of the form A → αB (the non-terminal is the right-most symbol of the RHS) causes the follow-set of A to be contained in the follow-set of B: A → αB ⇒ Follow(B) ⊇ Follow(A). • Do not forget to add the last-token π where the rules above require it, and to take the union with the set found so far. Here the approach is reversed compared with the one considered for the first-set. To get the first-set of a symbol, we focused on all rules having that symbol as LHS. Here, to compute the follow-set of a non-terminal, we must focus on the rules where it appears as part of the RHS. Handling the empty string Please note that the empty string can never be part of any follow-set. Example Consider the following grammar: T = {+,*,(,),id,ε}; V = {S,T,X,Y,F}; S -> T X; X -> + T X; X -> ε; T -> F Y; Y -> * F Y; Y -> ε; F -> ( S ); F -> id; We can evaluate some follow-sets: Follow(S) = {π} ∪ {)} = {π,)}; (π because S is the start symbol, ) because of rule F -> ( S )) Follow(X) = Follow(S) = {π,)}; Follow(T) = (First(X) - {ε}) ∪ Follow(X) = {+} ∪ {π,)} = {+,π,)}; Follow(Y) = Follow(T) = {+,π,)}; Follow(F) = (First(Y) - {ε}) ∪ Follow(Y) = {*} ∪ {+,π,)} = {*,+,π,)}; Please note that the last-token π is never formally part of a grammar. This token is just used for analysis purposes.
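The fixed-point nature of these rules translates directly into code. Below is a minimal sketch, not taken from the book, of the first-set computation in C++; symbols are stored as strings, the string "EPS" stands for the empty string ε, and every symbol that never appears as a LHS is treated as a terminal. Follow-sets can be computed with an analogous loop that applies the handy rules above until nothing changes anymore.

#include <map>
#include <set>
#include <string>
#include <vector>

// A production rule: lhs -> rhs. An empty rhs encodes an epsilon-rule (lhs -> EPS).
struct Rule { std::string lhs; std::vector<std::string> rhs; };

// Computes First(X) for every grammar symbol X by fixed-point iteration.
std::map<std::string, std::set<std::string>> first_sets(const std::vector<Rule>& rules) {
    std::map<std::string, std::set<std::string>> first;
    std::set<std::string> nonterms;
    for (const Rule& r : rules) nonterms.insert(r.lhs);
    for (const Rule& r : rules)                // First(a) = {a} for every terminal a
        for (const std::string& s : r.rhs)
            if (!nonterms.count(s)) first[s].insert(s);
    bool changed = true;
    while (changed) {                          // repeat until no set grows anymore
        changed = false;
        for (const Rule& r : rules) {
            std::set<std::string>& f = first[r.lhs];
            const std::size_t before = f.size();
            bool all_nullable = true;          // can the whole RHS derive the empty string?
            for (const std::string& s : r.rhs) {
                for (const std::string& t : first[s])
                    if (t != "EPS") f.insert(t);
                if (!first[s].count("EPS")) { all_nullable = false; break; }
            }
            if (all_nullable) f.insert("EPS"); // also covers epsilon-rules (empty RHS)
            if (f.size() != before) changed = true;
        }
    }
    return first;
}

For the grammar of Example II, this sketch would return First(D) = {d, EPS} and First(C) = {d, c}, matching the values computed by hand above.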
Andrea Tino - 2013 Top-down parsers Computers make me totally blank out. “ “Dalai Lama http://www.brainyquote.com/quotes/quotes/d/ dalailama446748.html
pag 80 Andrea Tino - 2013 Overview Top-down parsers proceed by creating the AST starting from the root down to the leaves. The class of parsing algorithms used today to implement such parsers is LL(k): LeftToRight LeftMost parsers with k lookahead symbols. The LL choice is not really an option: since the AST is built from the root to the leaves the approach is top-down, and the parser necessarily proceeds from left to right. Taxonomy of top-down parsers When considering top-down parsers, there are several algorithms to implement them. The Chomsky-Schützenberger hierarchy tells us how to implement such parsers (using pushdown automata); however we can find other approaches rather than going for the most generic one. “Generic“ is probably the best word. In fact the Chomsky-Schützenberger hierarchy considers pushdown automata as the computational structure able to parse a CFG, and so it is! But if our grammars have some restrictions, more efficient and fine-tuned algorithms can be considered. Why? Because pushdown automata can be difficult to implement and might require time. Today, top-down parsers can be divided into the following groups: Algorithm Efficiency Description Recursive descent They are not efficient; backtracking features are needed. The grammar also needs to have a particular form. The algorithm considers the input string and tries to descend the derivations tree following possible branches. A lot of attempts might be required depending on the grammar size; backtracking is necessary upon errors, in order to recover and try a different derivation path. Predictive They are efficient. Here as well, grammars need to have a particular form. The algorithm proceeds on derivations having a particular structure, as not all grammars can be handled by this methodology. Recursive predictive descent They are efficient. Not all grammars can be handled by this approach either. It is a special form of recursive descent parsing, but no backtracking is needed, thus making the process faster and more efficient. Left recursion Recursive descent algorithms (with backtracking or not) have a problem: they operate in a way for which left-recursion can make them enter an infinite loop. What is left-recursion by the way? [Def] Left-recursive grammars: A grammar is said to be left-recursive when it contains
pag 81 Andrea Tino - 2013 at least one left-recirsive rule. [Def] Left-recursive rule: Given a CFG, a left-recursive rule is a rule in the form: A ⇒+ Aα , thus the non-terminal in LHS appears as the left-most symbol in the LHS of the same rule. [Def] Immediate left-recursive rule: A left-recursive rule is said to be immediate when the left-recursion shows at one-step distance: A ⇒1 Aα , thus the left-recursion is evident. [Def] Indirect left-recursive rule: Left recursion might not show in rules of a CFG. New rules can be created when processing derivations; if starting from a non-terminal the grammar allows derivation A ⇒∗ Aα to occur, then that grammar is affected by indirect left-recursion. Left-recursion is definitely a problem for top-down parsers. When designing a language and willing to parse it using a top-down parser, the grammar must be designed to avoid left-recursion. If the grammar is affected by left-recursion, methodologies exist to remove it and fix the grammar. Handling immediate left recursion Recognizing immediate left-recursions is very simple as they show themselves in the rules of the grammar. When a non-terminal appear both in LHS and as the left-most symbol in RHS if a rule, then that rule is left-recursive. Simple case How to fix immediate left-recursion? Very easy. Consider left-recursive rule: A -> A α; | A -> β; /* where β != A γ */ Let us replace it with: A -> β B; /* non-terminal B added in the grammar */ B -> α B; B -> ε; /* from left-recursion to right-recursion */ As it is possible to see, the recursion is moved from left to right, a new equivalent grammar is created. The procedure does not remove a recursion as such a process is impossible. A recursive rule will remain recursive. However right recursion is ok as it can be handled by top-down parsers. Also please not how a non-terminal is inserted in the grammar together with the empty string terminal (if it was not part of the grammar originally). General case In the most general case, a left-recursive rule in the form: A ⇒ Aα1 | Aα2 || Aαm | β1 | β2 || βn βi ≠ Aγ ,γ ∈ V ∪T( )∗
pag 82 Andrea Tino - 2013 Can be changed into rules: A ⇒ β1B | β2B || βnB B ⇒α1B |α2B ||αm B | ⎧ ⎨ ⎩ B ∈V Getting an equivalent grammar. Example Let us consider the following grammar: T = {*,+,(,),id}; V = {S,T,F}; S -> S + T; S -> T; T -> T * F; T -> F; F -> id; F -> ( S ); And try to remove left-recursions as follows: S -> T S’; S’ -> + T S’; S’ -> ε; T -> F T’; T’ -> * F T’; T’ -> ε; F -> id; F -> ( S ); The new grammar is not affected by left-recursion anymore. Handling indirect left recursion A grammar can be left-recursive in a indirect way. It means that left-recursion is not shown in rules themselves, but it is hidden among derivations. Let us consider the following grammar: T = {a,b,c,d}; V = {S,A}; S -> A a; S -> b; A -> S c; A -> d; Each rule doesn’t show immediate left recursion, but let us write down few derivations like S -> A a -> S c a and A -> S c -> A a c. We have the following derivations: S -> S c a and A -> A a c which clearly are left-recursive. How to remove indirect left-recursion? Algorithm for indirect left-recursion removal Let us consider a grammar, to remove all indirect left-recursions, the following procedure can be applied: 1. Sort all non-terminals to reach pattern: A1,A2 An . 2. For all non-terminals from A1 to An , consider generic non-terminal Ai . 3. For all non-terminals from A1 to Ai−1 , consider generic non-terminal Aj . 4. Remove every rule in the form Ai ⇒ Ajγ with rules Ai ⇒α1γ |α2γ ||αkγ , where Aj ⇒α1 |α2 ||αk . 5. After substitution, if some rules having Ai as LHS are affected by immediate left-recursion, remove them. Go to the next step from point 2 until the end of the cycle. The algorithm can be formalized as follows:
pag 83 Andrea Tino - 2013 procedure rem_ilrec(V,T,P) /* grammar as input */ precond: V = {A1,A2...An}; for i = 1..n do for j = 1..i-1 do P = P - {Ai -> Aj γ}; P = P ∪ {Ai -> α γ} ∀α : Aj -> α; end rem_lrec(); /* handles immediate left-rec */ end end Example Let us consider the following grammar affected by indirect left-recursion: T = {a,b,c,d,f}; V = {S,A}; S -> A a; S -> b; A -> A c; A -> S d; A -> f; 1. We first sort non-terminals: S, A. 2. We consider non-terminal S. In the sequence it is the first, no one preceding it. Nothing to do aside from checking for immediate left-recursions. No immediate left-recursion. 3. We now consider non terminal A and locate rules having A as LHS and one non-terminal precending A in the sorting (S only) as the left-most symbol in RHS. We have only one rule matching: A -> S d. We replace this rule with A -> b d, A -> A a d rules. So final rules for A will be: A -> A c, A -> A a d, A -> b d and A -> f. Some of them are affected by immediate left-recursion, so we fix them into: A -> b d B, A -> f B, B -> c B, B -> a d B and B -> ε. Left factorization If we want to use a top-down parser to handle a grammar, we must make that grammar a non-left- recursive one. However if we also wish to use a predictive algorithm (the most efficient one), we also need to be sure that grammar is left-factored as well. Left-factorization is deeply connected to the concept of predictive grammar. A grammar must be predictive to be handled by a predictive parsing algorithm. [Theo] Predictive gramars: A top-down predictive parsing algorithm cannot handle non-predictive grammars. [Theo] Predictive and left-factorized grammars: A left-factorized grammar is also a predictive grammar. What is left-factorization? Consider the following production rule for a dummy grammar (YACC notation):
pag 84 Andrea Tino - 2013 instr -> IF_KWORD PAR_O expr PAR_C instr ELSE_KWORD instr | IF_KWORD PAR_O expr PAR_C instr; And let’s change shoes with the parser. A predictive grammar hasn’t been defined yet, but the concept is quite clear. The problem, in the rule above, is that two rules start with the same sequence of symbols and differentiate later. In particular they both start with the if keyword. Simple case Let us consider grammar G V,T,P,S( ) and production rule A ⇒αβ1 |αβ2 where α ∈ V ∪T( )∗ − { }, β1,β2 ∈ V ∪T( )∗ and β1 ≠ β2 . When such a rule exists, the grammar is not left-factorized. However to left-factorize the rule we can transform it into these rules: A ⇒αB B ⇒ β1 | β2 ⎧ ⎨ ⎩ Where B is a non-terminal inserted in the grammar as a new symbol. General case Let us consider CFG G V,T,P,S( ) and rules A ⇒αβ1 |αβ2 ||αβm |γ 1 |γ 2 ||γ n where α1αn ∈ V ∪T( )∗ − { }, β1βm ∈ V ∪T( )∗ and βi ≠ βj ,∀i, j = 1…m ∧ i ≠ j . To get a left- factorized grammar, all above rules are transformed into the following: A ⇒αB |γ 1 |γ 2 ||γ n B ⇒ β1 | β2 || βm ⎧ ⎨ ⎩ B ∈V Obtaining a left-factorized equivalent grammar with new non-terminal symbols. Example Let us consider the following grammar: T = {a,b,c,d,e,f,g}; V = {S,A}; S -> a b A; S -> a A; S -> c d g; S -> c d e A; S -> c d f A; We can start the factorization process more times as the left-most parts in common involve more than one sequence of symbols. 1. The first set of rules we consider is: S -> a b A | a b. We apply the factorization process and get: S -> a B, B -> b A | A and S -> c d g | c d e A | c d f A. 2. The grammar has changed but there are rules to left-factorize. In particular we now concentrate on rules: S -> c d g | c d e A | c d f A. We can repeat the process again getting: S -> a B, B -> b A | A, S -> c d C and C -> g | e A | f A. Final grammar will be: T = {a,b,c,d,e,f,g}; V = {S,A,B,C}; S -> a B; S -> c d C; B -> b A; C -> g; C -> e A; C -> fA;
pag 85 Andrea Tino - 2013 The grammar is predictive (left-factorized). Also note how new non-terminals are added to the original grammar. This approach makes the grammar getting bigger. Non-left-recursive grammars A result is obvious: [Lem] Predictive and non-left-recursive grammars: All predictive grammars are not affected by left recursion. And is very simple to prove as the left-factorization process considers the left-recursion removal as part of its steps. LL(1) grammars LL parsers are used as today’s mean to handle (particular types of) CFGs. However we could see that some algorithms can be used to develop a LL parser. Two are the most important: • Predictive: They take advantage of parsing tables. • Predictive recursive descent: They are a subclass of recursive descent algorithms, but no backtracking is used. As well as recursive descent, predictive recursive descent take advantage of recursive functions. Simple recursive descent algorithms involve backtracking and recursive functions and are not used as real implementations. That’s why we will not cover them here. [Def] LL grammars: LL grammars are a particular subset of CFG grammars that can be parsed by LL(k) parsing algorithms. LL grammars and predictiveness To handle a grammar using a top-down generic parser, that grammar must be non-left-recursive. When we want to use predictive approaches (more andvanced and efficient) we need to have predictive grammars. Detailing LL(1) grammars As specified before, LL(1) parsers are LL parsers using 1 lookahead token only. [Def] LL(1) grammars: LL(1) grammars are grammars that can be parsed by LL(1) parsing algorithms.
pag 86 Andrea Tino - 2013 These types of grammars can be easily parsed. Having one lookahead symbol can sometimes make the whole process more efficient than using more lookahead tokens. Also, LL(1) parsers are very easy to implement. The concept of predict When handling LL(1) grammars, there is an important quantity that can be really helpful to solve the decision problem: “Is a certain grammar an LL(1) grammar?“. This entity is called the predict of a production rule. [Def] Predict-set: Let G = (V, T, P, S) be a CFG and p ∈ P a production rule of the form p : A → α. The predict-set of p, written as Predict(p), is the set containing all lookahead tokens (terminals), usable by an LL(1) parser, indicating that production rule p is to be applied. Calculating the predict-set The predict-set for a generic rule A → α can easily be evaluated using the following rule: Predict(A → α) = First(α) if α cannot derive ε; Predict(A → α) = (First(α) - {ε}) ∪ Follow(A) if α ⇒* ε. Example Let us consider the following grammar: T = {a,b,c,ε}; V = {S,A,B}; S -> A B c; A -> a; A -> ε; B -> b; B -> ε; Let us calculate the predict-sets: Predict(S->ABc) = First(ABc) = {a,b,c}; Predict(A->a) = First(a) = {a}; Predict(A->ε) = (First(ε) - {ε}) ∪ Follow(A) = Follow(A) = {b,c}; Predict(B->b) = First(b) = {b}; Predict(B->ε) = (First(ε) - {ε}) ∪ Follow(B) = Follow(B) = {c}; Conditions for a grammar to be LL(1) Predicts are really helpful when some facts on LL(1) grammars are to be considered. A particularly handy result is the following: [Theo] Predict-set: A CFG is LL(1) if and only if the predict-sets of all production rules having the same LHS are disjoint; this must hold for all non-terminals. The theorem is a necessary and sufficient condition, so a very useful tool.
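To make the theorem operational, one can compute the predict-set of every rule and check that rules sharing the same LHS never overlap. A hedged sketch follows (not from the book), reusing the string-based Rule representation and the "EPS" convention of the first-set sketch, and assuming First and Follow have already been computed into maps of sets:

#include <map>
#include <set>
#include <string>
#include <vector>

struct Rule { std::string lhs; std::vector<std::string> rhs; };
using SymSet = std::set<std::string>;

// Predict(A -> alpha) = First(alpha), plus Follow(A) when alpha can derive the empty string.
SymSet predict(const Rule& r, std::map<std::string, SymSet>& first,
               std::map<std::string, SymSet>& follow) {
    SymSet p;
    bool nullable = true;
    for (const std::string& s : r.rhs) {
        for (const std::string& t : first[s]) if (t != "EPS") p.insert(t);
        if (!first[s].count("EPS")) { nullable = false; break; }
    }
    if (nullable)                                  // alpha =>* epsilon: add Follow(A)
        for (const std::string& t : follow[r.lhs]) p.insert(t);
    return p;
}

// LL(1) test: predict-sets of rules sharing the same LHS must be pairwise disjoint.
bool is_ll1(const std::vector<Rule>& rules, std::map<std::string, SymSet>& first,
            std::map<std::string, SymSet>& follow) {
    std::map<std::string, SymSet> seen;            // union of predict-sets per LHS
    for (const Rule& r : rules)
        for (const std::string& t : predict(r, first, follow))
            if (!seen[r.lhs].insert(t).second)     // token already claimed by another rule
                return false;
    return true;
}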
pag 87 Andrea Tino - 2013 Example Let us consider the following grammar: T = {a,b,c,d,e}; V = {S,A,B}; S -> a S e; S -> A; A -> b A e; A -> B; B -> c B e; B -> d; Let us check whether this grammar is LL(1) or not. Predict(S->aSe) = First(aSe) = {a}; Predict(S->A) = First(A) = {b,c,d}; Predict(A->bAe) = First(bAe) = {b}; Predict(A->B) = First(B) = {c,d}; Predict(B->cBe) = First(cBe) = {c}; Predict(B->d) = First(d) = {d}; The grammar is LL(1) as non-terminal S have disjoint sets, the same goes for A and B. Parsing LL(1) grammars using predictive algorithms Predictive LL(1) parsers make use of some elements to process a grammar. Remember that a parser has one objective only: answering the question “Is the string a valid word in the grammar?“. On positive conditions the parser also needs to build the AST representing that word in the grammar for that language. • Input buffer: The buffer containing all tokens returned by the lexer and requested by the parser one by one. • Stack: A stack containing symbols in the grammar. A special symbol is always considered: last-token π which is returned by the lexer when the string is over (no more tokens to return). • Parsing table: A bidimensional array mapped onto application M :V ×T  P . The table returns a production rule when a non-terminal and a terminal are provided as input. • Output buffer: The buffer contains the sequence of production rules to apply in order to generate the derivation sequence starting from the start-symbol until the input string. Parser actions A predictive LL(1) parser works with specific actions depending on the lookahead token and the stack state. Initialization Let W be the stack, at the beginning the stack contains the axiom S and the last-
pag 88 Andrea Tino - 2013 token π: W = {S, π}. The top element in the stack is the left-most one. The symbol on top of the stack will be denoted as α. The stack can contain terminals and non-terminals. We will also use symbol a to denote the current working symbol (thus, the lookahead symbol). The first action is putting the first input character of the string and considering it as the current symbol. Parsing algorithm Given current symbol a and stack top symbol α, we can have these possible actions undertaken by the parser: Top stack symbol Current symbol Actions Descrirption α = π a = π exit(0); The parser terminates successfully. α = a a = α stack.pop(); a = next_token(); Pop the top symbol from the stack and make a the next token from the input buffer. α ∊ V a ∊ T if (M(α,a) != null) { for ∀β ∊ M(α,a).RHS.rev() do stack.push(β); out.insert(M(α,a)); } else { error(“no rule“); exit(1); } If a valid production is returned, each symbol appearing in the RHS is pushed into the stack from the right-most to the left-most symbol; after that the production is inserted into the output buffer. If no rule is returned, syntax error! The parsing table The parsing table is a structure containing entry for a non-terminal and a terminal. Every entry of the table tells the parser how to behave when a non-terminal is at the top of the stack and the current symbol (a terminal) is considered. The table is built from the grammar. Example Let us consider the following grammar: T = {a,b}; V = {S,A}; S -> a A a; A -> b A; A -> ε; First note as the grammar is non-left-recursive and left-factorized. Consider now the following parsing table (we will see how to build it later here): a b π S S -> a A a null null A A -> ε A -> b A null
pag 89 Andrea Tino - 2013 We consider input string abba. Now let us parse it using the predictive approach described so far. Stack Input buffer LA symbol Output buffer Descrirption {S,π} {a,b,b,a,π} null { } Initialization. {a,A,a,π} {b,b,a,π} a { S->aAa } The first character in the string is fetched and considered as the looka- head symbol. In the table an entry can be considered, the rule is applied. {A,a,π} {b,a,π} b { S->aAa } Stack’s top-symbol is equal to the cur- rent symbol. Pop and fetch next token. {b,A,a,π} {b,a,π} b { S->aAa, A ->bA } An entry in the table is found, non- terminal expansion in the stack. {A,a,π} {a,π} b { S->aAa, A ->bA } Stack’s top-symbol is equal to the cur- rent symbol. Pop and fetch next token. {b,A,a,π} {a,π} b { S->aAa, A->bA, A->bA } An entry in the table is found, non- terminal expansion in the stack. {A,a,π} {π} a { S->aAa, A->bA, A->bA } Stack’s top-symbol is equal to the cur- rent symbol. Pop and fetch next token. {a,π} {π} a { S->aAa, A->bA, A->bA, A->ε } An entry in the table is found, non- terminal expansion in the stack. The empty string is a symbol that is treated as a null, the stack actually shrinks. {π} {} π { S->aAa, A->bA, A->bA, A->ε } Stack’s top-symbol is equal to the cur- rent symbol. Pop and fetch next token. The algorithm then terminates success- fully. Remember that stacks are visualized with the top element as the left-most symbol in the lists. How to build the parsing table To build the parsing table an algorithm is used. Actually the process is very simple and is based on the notion of first-set of a symbol and follow-set of a non-terminal. So let G V,T,P,S( ) be a predictive CFG (left-factorized), and let A ⇒α be a production rule. Then the parsing table M :V ×T  P is built using the following rules: 1. Add production rule A ⇒α to M A,a( ) for every terminal a ∈T which belongs to the first- set of the rule’s RHS: M A,a( ) ⊇ A ⇒α{ },∀a ∈First α( ).
pag 90 Andrea Tino - 2013 2. If the empty-string symbol belongs to the first-set of the RHS of rule A → α, then add that production to M(A,a) for every terminal a which belongs to the follow-set of the rule's LHS: ε ∈ First(α) ⇒ M(A,a) ⊇ {A → α}, ∀a ∈ Follow(A). 3. If the empty-string symbol belongs to the first-set of the RHS of rule A → α and the last-token also belongs to the follow-set of the LHS of rule A → α, then add that production to M(A,π): ε ∈ First(α) ∧ π ∈ Follow(A) ⇒ M(A,π) ⊇ {A → α}. The algorithm can be concisely described as follows: procedure build_ptab(V,T,P) /* grammar as input */ set M = {}; for ∀(A -> α) ∊ P do for ∀a ∊ First(α) do M = M ∪ {(A,a,A->α)}; end if ε ∊ First(α) then for ∀a ∊ Follow(A) do M = M ∪ {(A,a,A->α)}; end if π ∊ Follow(A) then M = M ∪ {(A,π,A->α)}; end end end end Please note that the empty string symbol does not figure as an entry of the table and is to be discarded when encountered. Example Consider the grammar in the previous example. Let us try to build the parsing table. 1. We consider rule S->aAa. We have that First(aAa)={a}. So we have entry M(S,a)={S->aAa}. Empty string is not part of the set, this rule is ok so far. 2. We consider rule A->bA. We have that First(bA)={b}. So we have entry M(A,b)={A->bA}. Empty string is not part of the set, this rule is ok so far. 3. We consider rule A->ε. We have that First(ε)={ε}. So we must consider Follow(A)={a}. So we have entry: M(A,a)={A->ε}. The table is built and it is the same as shown before. Parse tables and LL(1) grammars An important result is to be considered. [Theo] Parsing tables and LL(1) grammars: If the parsing table for a given grammar contains, for each entry, one production rule at most, then that grammar is LL(1). This is a necessary and sufficient condition. In fact the parsing table was defined to host sets of production rules. If all entries have at most one production rule, the grammar is an LL(1) grammar.
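Putting the table construction and the parser actions of the previous pages together, the main loop of a predictive LL(1) parser is quite compact. The following is a rough sketch (not from the book), assuming the table has already been built as a map from (non-terminal, terminal) pairs to rule indices, that symbols are strings, and that "PI" plays the role of the last-token π:

#include <map>
#include <set>
#include <stack>
#include <string>
#include <utility>
#include <vector>

struct Rule { std::string lhs; std::vector<std::string> rhs; };

// Table-driven LL(1) parsing: returns true if the token stream belongs to the language.
bool ll1_parse(const std::vector<Rule>& rules,
               const std::map<std::pair<std::string, std::string>, int>& table,
               const std::set<std::string>& nonterms,
               const std::vector<std::string>& tokens,   // input, without the final PI
               const std::string& start) {
    std::stack<std::string> st;
    st.push("PI");                                        // last-token at the bottom
    st.push(start);                                       // axiom on top
    std::size_t i = 0;
    auto lookahead = [&]() { return i < tokens.size() ? tokens[i] : std::string("PI"); };
    while (!st.empty()) {
        const std::string top = st.top();
        if (top == "PI" && lookahead() == "PI") return true;   // both exhausted: accept
        if (!nonterms.count(top)) {                            // a terminal is on top
            if (top != lookahead()) return false;              // mismatch: syntax error
            st.pop(); ++i;                                     // consume it and advance
            continue;
        }
        auto it = table.find({top, lookahead()});
        if (it == table.end()) return false;                   // empty table entry: error
        st.pop();                                              // expand the non-terminal
        const Rule& r = rules[it->second];
        for (auto s = r.rhs.rbegin(); s != r.rhs.rend(); ++s)
            st.push(*s);                                       // push RHS right to left
    }
    return false;
}

An epsilon-rule is simply a rule with an empty RHS: nothing is pushed, so the stack shrinks exactly as in the abba trace shown earlier.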
pag 91 Andrea Tino - 2013 LL(1) grammars and ambiguities Recalling the concept of ambiguities for grammars, after introducing parsing tables, the following result is obvious: [Cor] Non-ambiguos LL(1) grammars: A LL(1) grammar is not ambiguos. The proof is very simple. From the lemma introduce before, we know that LL(1) grammars have parsing tables with no multiple entries. This means that for each couple of non-terminal and terminal, one production rule only is considered. This leads to the fact that no ambiguities can exist under such circumstances. The following result as well is worth mentioning. [Cor] Ambiguos LL(1) grammars: An ambiguos grammar is not a LL(1) grammar. Easy to prove as the ambiguos grammars have multiple entries in the parsing table. Conditions for a grammar not to be LL(1) Necessary and sufficient conditions are good to check whether a grammar is LL(1) or not; however sometimes one would like to answer the question: “Is this a non LL(1) grammar?“. These questions typically involve the use of necessary conditions only, which are simpler to handle and easy to prove by means of the parsing table. Non LL(1) grammars The following theorem relates left-recirsive grammars to LL(1) grammars. [Theo] Left-recursive and LL(1) grammars: A left-recursive grammar is not LL(1). Proof: Provided the grammar is left-recursive, there must exist a rule in the form A ⇒ Aα | β . We can derive from this rule derivation Aα ⇒ βα . Considering the first- sets of both parts, we have that First Aα( ) ⊇ First βα( ) which we can transform into First Aα( ) ⊇ First β( ) considering that /∃β ⇒  . This means that a terminal x must exist such that x ∈First β( )∧ x ∈First Aα( ); thus table entry M A,x( ) will have entries A ⇒ Aα and A ⇒ β . Now we are left with checking the case for which ∃β ⇒  . In this condition we have that Follow A( ) ⊇ First α( ), this happens when trying to evaluate A’s follow- set. Common elements in the two sets can be considered; so for one of these x ∈First α( )∧ x ∈Follow A( ) we have that table entry M A,x( ) will contain again rules A ⇒ Aα and A ⇒ β . Non-left-factored grammars Another theorem can be really handy. [Theo] Impossibility of left-factorization: A grammar cannot be left-factored if there
pag 92 Andrea Tino - 2013 exists a terminal which belongs to the first-set of two factorizable rules’ RHS: A ⇒αβ1 |αβ2 ∧ x ∈T : x ∈First αβ1( )∧ x ∈First αβ2( ){ } > 0 ⇒ no left-factorization. This result is important as it allows to find a way to tell whether a grammar can be left-factored. Parsing LL(1) grammars using (predictive) recursive-descent algorithms We have described how to create a parser for a predictive grammar using predictive algorithms. Now we are going to parse the same grammars with a different approach. At the beginning of this section we introduced recursive-descent approaches as inefficient implementations of parsing algorithms for a certain subset of LL(1) grammars. The approach we are going to see is a mixture as it does not involve the use of backtracking. This is possible thanks to the recycling of the parsing table (built in the same exact way); however since no stacks are considered, the algorithm relies on recursive functions. The presence of the parsing table is very important, if it weren’t the algorithm would require backtracking. The idea The approach is based on recursive functions. Each non-terminal in the grammar is assigned a procedure responsible for the analysis of a certain sequence of tokens. When a non- terminal is encountered, the corresponding function is called. The methodology is very similar to the one used when we introduced lexical analyzers and how to implement a scanner for a regular grammar (regular grammar to program). Parsing a predictive grammar The parsing table is necessary for this class of algorithms, which is the reason why predictive grammars only can be handled here. So, given a predictive grammar, the parser is built by building functions for each non-terminal as follows: 1. Write a function for each non-terminal. 2. Write a test for each alternative in the production rule for each production rule. 3. Call corresponding functions when non-terminals are encountered. 4. Place return points where the rule is matched. Everytime a new token is needed, it is requested from the input stream. Example Let us consider grammar production rule S->aBb;S->bAB. The corresponding handling
pag 93 Andrea Tino - 2013 routine would be as follows (C++): class Input {...}; /* handling input */ enum Terminal { /* terminals only */ T_A = 0x00, T_B = 0x01 }; bool _rule_S(Input& input) { /* passing the object handling input */ switch (input.current()) { /* returns a Terminal */ case T_A: input.next(); /* accept */ if (_rule_B(input)) { /* if success, next token requested by routine */ if (input.current() == T_B) { input.next(); /* accept */ return true; } else return false; } else return false; case T_B: input.next(); /* accept */ if (_rule_A(input)) { if (_rule_B(input)) { return true; } else return false; } else return false; default: return false; } } The approach is actually very simple. Table-less algorithm This approach handles predictive grammars without using parsing tables. The approach is less efficient than the one using tables, but can be a possible choice. The example shown before provides a good overview of the methodology; however what to do when handling empty-strings? Well, consider the same grammar rule of before but a little modified: S->aBb;S->bAB;S->ε. When handling the routine for token S, we might be in a situation for which the current token is neither a nor b, in that case an epsilon-transition is to be considered. Actually, we must apply the epsilon-production when the current token is one of the terminals that can follow S. This is something very familiar for us as it involves first-sets and the follow-sets. Example Consider the following grammar: T = {a,b,c,d,e,f}; V = {S,A,B}; S -> a A e; S -> c A d; S -> B; A -> b A; A -> ε; B -> f; We proceed as before, but handle more things because of the empty string. class Input {...}; /* handling input */ enum Terminal { /* terminals only */ T_A = 0x00, T_B = 0x01, T_C = 0x02, T_D = 0x03, T_E = 0x04, T_F = 0x05
pag 94 Andrea Tino - 2013 }; bool _rule_S(Input& input) { switch (input.current()) { case T_A: input.next(); /* accept */ if (_rule_A(input)) { /* if success, next token requested by routine */ if (input.current() == T_E) { input.next(); /* accept */ return true; } else return false; } else return false; case T_C: input.next(); /* accept */ if (_rule_A(input)) { if (input.current() == T_D) { input.next(); /* accept */ return true; } else return false; } else return false; break; case T_F: return _rule_B(input); /* do not consume symbol here, the procedure will */ default: return false; } } bool _rule_A(Input& input) { switch (input.current()) { case T_B: input.next(); /* accept */ return _rule_A(input); case T_D: case T_E: input.next(); /* accept */ return true; break; default: return false; } } bool _rule_B(Input& input) { if (input.current() == T_F) return true; else return false; } } As it is possible to see, the problem is locating terminals that can follow a certain non-terminal or that can appear as a result of applying a certain rule. This is something done by the parsing table. A table-less approach is, actually, one making implicit use of tables (sort of). That’s why a more sistematic approach exists. Using predict-sets The previous approach can be made more sistematic by using predict-sets. Well we actually used
pag 95 Andrea Tino - 2013 predict-sets somehow there, but we made them not explicit. Now we are going to see how to implement the recursive-descent parser when predicts are considered. Creating functions using predict-sets Using predicts, for each non-terminal, we can write functions as before. The rules now are the following: 1. Write a function for each non-terminal. 2. For each production rule (all alternatives as RHS for the same LHS are considered part of the same rule), write a test based on the predict of that particular alternative. 3. Call corresponding functions when non-terminals are encountered. 4. Place return points where the rule is matched. Example Consider the following grammar first: T = {a,b,c}; V = {S,A,B,C}; S -> A a; A -> B C; B -> b; B -> ε; C -> c; C -> ε; Let us calculate predict-sets for all production rules: Predict(S->Aa) = First(Aa) = First(A) ∪ First(a) = First(B) ∪ First(C) ∪ {a} = {b,ε} ∪ {c,ε} ∪ {a} = {a,b,c,ε}; Predict(A->BC) = First(BC) ∪ Follow(A) = First(B) ∪ First(C) ∪ {a} = {a,b,c,ε}; Predict(B->b) = First(b) = {b}; Predict(B->ε) = First(ε) ∪ Follow(B) = {ε} ∪ (First(C)-{ε}) ∪ Follow(A) = {ε} ∪ {c} ∪ {a} = {a,c,ε}; Predict(C->c) = First(c) = {c}; Predict(C->ε) = First(ε) ∪ Follow(C) = {ε} ∪ Follow(A) = {ε} ∪ {a} = {a,ε}; We can now create the code as we did before, but now we use predict-sets (the empty string can be discarded in the process): class Input {...}; /* handling input */ enum Terminal { /* terminals only */ T_A = 0x00, T_B = 0x01, T_C = 0x02 }; bool _rule_S(Input& input) { switch (input.current()) { case T_A: case T_B: case T_C: /* predict(S->Aa) */ input.next(); /* accept */ if (_rule_A(input)) { if (input.current() == T_A) { input.next(); /* accept */ return true; } else return false; } else return false; default: return false;
pag 96 Andrea Tino - 2013 } } bool _rule_A(Input& input) { switch (input.current()) { case T_A: case T_B: case T_C: /* predict(A->BC) */ input.next(); /* accept */ if (_rule_B(input)) { if (_rule_C(input)) { return true; } else return false; } else return false; default: return false; } } bool _rule_B(Input& input) { switch (input.current()) { case T_B: /* predict(B->b) */ input.next(); /* accept */ if (input.current() == T_B) return true; else return false; case T_A: case T_C: /* predict(B->epsilon) */ input.next(); /* accept */ return true; default: return false; } } bool _rule_C(Input& input) { switch (input.current()) { case T_C: /* predict(C->c) */ input.next(); /* accept */ if (input.current() == T_C) return true; else return false; case T_A: /* predict(C->epsilon) */ input.next(); /* accept */ return true; default: return false; } } Differently from before, epsilon-productions can be easily handled. Parsing with tables When using the parsing table for the language, everything gets easier and the code can be written in a more systematic approach. The main idea is to write mutually recursive functions that, driven by the table, will parse a certain input string without the need of a stack. Recursion can have a certain depth and the probability to experience an overflow-fault is very high when grammars get very complex with many rules.
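As a rough illustration of this idea (again a sketch under the same assumptions as the previous snippets, not the book's implementation: string symbols, a Rule struct, a table mapping (non-terminal, lookahead) pairs to rule indices, and an Input object whose current() is assumed here to return the lookahead as a string), a single generic recursive function can replace the explicit stack:

// The call stack itself plays the role of the parsing stack.
bool parse_symbol(const std::string& sym, Input& input,
                  const std::vector<Rule>& rules,
                  const std::map<std::pair<std::string, std::string>, int>& table,
                  const std::set<std::string>& nonterms) {
    if (!nonterms.count(sym)) {                   // terminal: it must match the lookahead
        if (input.current() != sym) return false;
        input.next();                             // accept and advance
        return true;
    }
    auto it = table.find({sym, input.current()}); // non-terminal: the table picks the rule
    if (it == table.end()) return false;          // no entry: syntax error
    for (const std::string& s : rules[it->second].rhs)
        if (!parse_symbol(s, input, rules, table, nonterms))
            return false;                         // expand the RHS from left to right
    return true;                                  // epsilon-rules succeed immediately
}

The recursion depth mirrors the height of the AST, which is why the overflow risk mentioned above grows with the complexity of the grammar.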
pag 97 Andrea Tino - 2013 Predictive vs. recursive-descent algorithms Predictive approaches are more efficient in terms of implementation and program speed. When having a predictive algorithm the grammar is mapped into the parsing table. On the other hand a recursive-descent approach makes the grammar mapped onto the code itself. If the grammar changes the whole code needs to be written again; when using predictive algorithms the table only needs to be changed.
Andrea Tino - 2013 Bottom-up parsers A categorical imperative would be one which represented an action as objectively necessary in itself, without reference to any other purpose. “ “ Immanuel Kant http://www.brainyquote.com/quotes/quotes/i/ immanuelka393400.html
pag 99 Andrea Tino - 2013 Overview Bottom-up parsers treat the parsing problem from the opposite perspective compared to top-down ones. The AST is built starting from the leaves and proceeding up to the root. The approach might look strange, however we will discover that many algorithms here generalize top-down ones. In particular, the class of algorithms used today to parse strings in a bottom-up flavor is represented by LR(k) algorithms: Left-to-right scan of the input, Rightmost derivation (reconstructed in reverse), with k symbols of look-ahead. So, as in top-down parsers, the input is processed from left to right; but instead of expanding non-terminals, the parser repeatedly recognizes the right-hand side of some rule in the current sequence and replaces it with the rule's non-terminal, thus undoing a rightmost derivation step by step. We will start from generic LR parsers and then move to LR(0) parsers in this section. What makes LR parsers special today LR parsers are very famous today, much more than LL. The point is that they are powerful enough to parse a very large class of type-2 (context-free) grammars. In fact today almost all languages and grammars are handled using LR algorithms. Implementation issues One more point that makes LR parsers so widely used is implementation: LR parsers are a class of shift/reduce parsers not using backtracking. There are a lot of ways, today, to implement these algorithms and they can be really fast and efficient. Error handling Errors can be easily managed when implementing LR parsers. It is also possible to report the locations where errors are encountered without struggling too much; error management for LR parsers has become a standard topic, so well-known methodologies are available out there. Typical problems with LR parsers LR parsers are not a field of daisies anyway, some problems are to be considered. The biggest issue is development: it is not possible to easily develop a LR parser without using some tools to help with the process. Manual implementation is almost impossible given the complexity of the parsing procedure. Simple grammars can be handled by hand, but when it comes to something more serious, building the parsing table and all the structures required by the parser can be a hard task. LR grammars and parsers We are now going to analyze the LR(1) parsing algorithm. But first we focus a little on grammars.
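Before the definitions, a tiny worked example may help fix the idea of a rightmost derivation traversed in reverse. Consider the grammar fragment (used here only for illustration) E -> E + T; E -> T; T -> id; and the input id + id. A bottom-up parser performs the following reductions:
id + id   --(T -> id)-->     T + id
T + id    --(E -> T)-->      E + id
E + id    --(T -> id)-->     E + T
E + T     --(E -> E + T)-->  E
Read from bottom to top, these steps are exactly the rightmost derivation E ⇒ E + T ⇒ E + id ⇒ T + id ⇒ id + id: at every step the parser undoes the expansion of the right-most non-terminal. With this picture in mind, we can now characterize the grammars and the parsers more precisely.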
pag 100 Andrea Tino - 2013 [Def] LR grammars: LR grammars are a particular set of grammars ranging from type- 2 to type-3, that can be parsed by LR(k) parsing algorithms. In particular we have: [Def] LR(1) grammars: LR(1) grammars are grammars that can be parsed by LR(1) parsing algorithms. The LR parser Focusing on LR parsers (thus LR(1) parsers), a common LR parsing algorithm is characterized by the following elements: • Stack: Containing states and symbols. The stack will always be in a configuration like: sm,xm,sm−1,xm−1s1,x1,s0{ } (the top is the left-most element). We can have states si ∈S or symbols xi ∈V ∪T in the stack. • Input buffer: The string a1a2 anπ of terminals ai ∈T to parse. • Output buffer: Contains information about all actions performed by the parser to perform the parsing for a given input. • Action-table: A table of actions to perform depending on the state. It can be seen as an application Action :S ×T  Χ accepting a state and a terminal (part of the input) and returning an action. • Goto-table: A table of states. It can be seen as an application GoTo :S × V ∪T( ) S that accepts a state and a symbol of the grammar, and returning another state. The table The parsing table for LR parsers is represented by the action-table and the goto-table. As for LL parsers, these tables can be seen as applications accepting two entries and returning something. Parser state The parser has an internal state which is a concept different from the set of states Χ . In every moment the state of the parser is represented by the current configuration of its stack sm,xm,sm−1,xm−1s1,x1,s0{ } and the remaining string left to parse aiai+1anπ . It can be represented, in a concise way, as: sm xmsm−1xm−1s1x1s0,aiai+1anπ{ }. It is to be interpreted as: when the parser is processing current symbol ai , the stack is in the reported configuration and at the top we find state sm . Actions The parser can perform 4 different actions depending on a state and a terminal symbol. Preconditions are always the same: parser is in state sm xmsm−1xm−1s1x1s0,aiai+1anπ{ }. • Shift: If the action is shift, thus Action sm,ai( )= χS , then the parser will push current symbol ai into the stack. Later the parser will calculate the next state as: sm+1 = GoTo sm,ai( ); the new
pag 101 Andrea Tino - 2013 state is then pushed into the stack and will become the new top-state. All this will cause the parser to make a transition to state sm+1aism xmsm−1xm−1s1x1s0,ai+1anπ{ }. Please note how the look-ahead symbol is consumed. • Reduce: If the action is reduce, thus Action sm,ai( )= χR , then the action table will also return one more information: the production rule to use for reduction: A ⇒ β . So let r = β be the number of terminals and non-terminals in the rule’s RHS, then the stack will be shrunk by 2r elements by popping them. In particular the following must hold: β = xm−r+1xm−r+2 xm , thus symbols xm−r+1,xm−r+2 xm will be popped from the stack together with their corresponding states sm−r+1,sm−r+2 sm . Rule’s LHS A will be pushed into the stack together with the next state calculated as s = GoTo sm−r ,A( ) and becoming the new top-state. The parser will move to configuration sAsm−r xm−rsm−r−1xm−r−1s1x1s0,aiai+1anπ{ }. Please note how the loookahead symbol is not consumed. • Accept: If the action is accept, thus Action sm,ai( )= χA , then the parser terminates successfully. • Error: If the action is error, thus Action sm,ai( )= χE , then the parser terminates reporting the problem occurred. As it is possible to see, tables can return more information than those introduced so far in their formal definition. Later in this chapter we will detail them. The LR(1) parsing algorithm When considering an input sequence, a LR(1) parser follows these steps: 1. The input buffer is initialized with the input sequence. The stack is initialized by pushing the initial state. At the end of the initialization process, the parser’s state will be: s0,a1a2 anπ{ }. 2. Evaluate Action sm,ai( ) where sm is always the top-symbol in the stack. Accordingly to the action, the parser will act as described before. 3. Repeat point 2 until an error is found or until an accept action is performed. The algorithm needs certain types of grammars, ambiguities can be considered here as well. Right sentential forms A concept is very important in the context of LR parsing. [Def] Right sentential form: Given a grammar G V,T,P,S( ) and a sequence of symbols α ∈ V ∪T( )∗ , we call it a Right Sentential Form (RSF) for the grammar when the sequence can be written in the form: α = β1β2 βma1a2 an having βi ∈ V ∪T( ),∀i = 1…m and ai ∈T,∀i = 1…n , thus the right side is always filled with terminals. Recalling the concept of state for a LR parser introduced before, the state of a parser
pag 102 Andrea Tino - 2013 sm xmsm−1xm−1s1x1s0,aiai+1anπ{ } always corresponds to RSF x1x2 xm−1xmaiai+1an . Deriving a RSF A RSF can be involved into a derivation process using production rule A ⇒ β . however from the LR parser’s perspective, the process is conducted by inverting the usual flow. The rule’s RHS matches a sequence of symbols into the RSF, later the rule’s LHS is replaced on that sequence into the original RSF returning a new one. More formally, when having the following situation in the grammar: S ⇒⇒αAω ⇒αβω , having α,β ∈ V ∪T( )∗ and ω ∈T ∗ , and production rule A ⇒ β , it is possible to have αβω ⇒αAω . As it is possible to see, here we are following the opposite strategy when comparing to LL parsers. LR parsers start from productions and try to move up into the derivations tree in order to reach its root (the start symbol). On the other hand, LL parsers made the opposite, instead of reducing expressions, non-terminals were expanded, thus starting from the root of the derivations tree to its leaves. The Action/Goto table As noted before, the action/goto table (in particular the action table) contains more information than those reported in their formal definition when we first introduced it. The point is that the parsing table is formed by the action-table and the goto-table. The goto-table share a portion of the entry space with the action-table. When one needs to visualize the parsing table, one matrix only is shown. Consider the following table: ID PLUS STAR RO RC π exp term fin 0 S-5 null null S-4 null null 1 2 3 1 null S-6 null null null A null null null 2 null R-2 S-7 null R-2 R-2 null null null 3 null R-4 R-4 null R-4 R-4 null null null 4 S-5 null null S-4 null null 8 2 3 5 null R-6 R-6 null R-6 R-6 null null null 6 S-5 null null S-4 null null null 9 3 7 S-5 null null S-4 null null null null 10 8 null S-6 null null S-11 null null null null
pag 103 Andrea Tino - 2013 9 null R-1 S-7 null R-1 R-1 null null null 10 null R-3 R-3 null R-3 R-3 null null null 11 null R-5 R-5 null R-5 R-5 null null null The table maps actions and gotos for the following grammar (expressed in YACC notation): T = {ID,PLUS,STAR,RO,RC}; V = {exp,term,fin}; 1) exp -> exp PLUS term; 2) exp -> term; 3) term -> term STAR fin; 4) term -> fin; 5) fin -> RO exp RC; 6) fin -> ID; As it is possible to see, grammar rules have been assigned a number, an index. It is mandatory to assign to each rule in the grammar an index i = 1…|P|. The table makes use of these indices. The table is structured like this: all rows refer to states si ∈ S. The table has |T| + |V| columns: the first |T| refer to all terminals in the grammar while the remaining |V| refer to non-terminals. The whole table is actually the union of the action-table and the goto-table. Indices showing in the action part of the table are indices actually belonging to the goto-table which “overlaps“ the action-table. Think of the parsing table as the union between the action and the goto tables. Action-table The action-table's entries are all in the form x-i: an action and an index when available (the index is part of the goto-table overlapping action-table entries). • Shift actions: Letter S is used to refer to shift actions, followed by the index of the state which will be pushed on top of the stack. • Reduce actions: Letter R is used to refer to reduce actions, followed by the index of the grammar rule to use when reducing the stack. • Accept actions: Letter A is used to refer to accept actions. No index is needed here. • Error actions: Literal null is used to refer to error actions. No index is needed here. When the table has no entry for a given position, the null value is returned. Goto-table The goto-table's entries are all indices referring to states si ∈ S. A demonstrative example on LR parsers We consider the same grammar as before and the table. We still do not know how to build the table, but we are just going to focus on the parsing algorithm for now. Consider input string ID STAR ID PLUS ID. As always, the stack's top element is the leftmost one.
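Before walking through the trace below, here is a minimal sketch of how entries like S-5, R-2, A and null could be encoded in code. All type and field names are hypothetical; they only serve to make the action/goto structure just described concrete.
#include <vector>

enum class ActionKind { Shift, Reduce, Accept, Error };

struct ActionEntry {
  ActionKind kind;  /* what to do */
  int index;        /* Shift: state to push; Reduce: rule to apply; unused otherwise */
};

struct LRTable {
  /* action[state][terminal] and go_to[state][nonTerminal];
     missing entries are { ActionKind::Error, -1 } and -1 respectively */
  std::vector<std::vector<ActionEntry>> action;
  std::vector<std::vector<int>> go_to;
};

/* For instance, assuming ID and exp are the numeric ids of that terminal and
   non-terminal, the entry "S-5" at (state 0, ID) would be stored as
   table.action[0][ID] = { ActionKind::Shift, 5 }, while the goto entry "1"
   at (state 0, exp) would be table.go_to[0][exp] = 1. */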
pag 104 Andrea Tino - 2013 Each row reports the configuration of the parser (stack and remaining input) and the action it performs from that configuration.
Stack | Remaining input | Action | Description
{0} | ID STAR ID PLUS ID π | shift-5 | Action(0,ID) = S-5: ID and state 5 are pushed.
{5,ID,0} | STAR ID PLUS ID π | reduce-6 | Action(5,STAR) = R-6: reduce with fin -> ID; the rule's LHS is used to get the next state, GoTo(0,fin) = 3.
{3,fin,0} | STAR ID PLUS ID π | reduce-4 | Reduce with term -> fin; GoTo(0,term) = 2.
{2,term,0} | STAR ID PLUS ID π | shift-7 | Action(2,STAR) = S-7: shifting and pushing elements onto the stack.
{7,STAR,2,term,0} | ID PLUS ID π | shift-5 | Action(7,ID) = S-5.
{5,ID,7,STAR,2,term,0} | PLUS ID π | reduce-6 | Reduce with fin -> ID; GoTo(7,fin) = 10.
{10,fin,7,STAR,2,term,0} | PLUS ID π | reduce-3 | Reduce with term -> term STAR fin: the stack shrinks by 6 elements; GoTo(0,term) = 2.
{2,term,0} | PLUS ID π | reduce-2 | Reduce with exp -> term; GoTo(0,exp) = 1.
{1,exp,0} | PLUS ID π | shift-6 | Action(1,PLUS) = S-6.
{6,PLUS,1,exp,0} | ID π | shift-5 | Action(6,ID) = S-5.
{5,ID,6,PLUS,1,exp,0} | π | reduce-6 | Reduce with fin -> ID; GoTo(6,fin) = 3.
{3,fin,6,PLUS,1,exp,0} | π | reduce-4 | Reduce with term -> fin; GoTo(6,term) = 9.
{9,term,6,PLUS,1,exp,0} | π | reduce-1 | Reduce with exp -> exp PLUS term; GoTo(0,exp) = 1.
{1,exp,0} | π | accept | Action(1,π) = A: success!
The whole process is very simple. The core part of the algorithm resides in the table.
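The trace above is produced by a driver loop that does nothing more than consult the table. Below is a minimal, hypothetical sketch of such a loop, reusing the ActionKind/ActionEntry/LRTable encoding sketched before the trace; the Rule type (LHS non-terminal id and RHS length) is also assumed only for this example, so this is an illustration rather than a complete parser.
#include <vector>

/* ActionKind, ActionEntry and LRTable as sketched before the trace. */

struct Rule { int lhs; int rhsLength; };  /* hypothetical: LHS non-terminal id and |RHS| */

bool lr_parse(const LRTable& table,
              const std::vector<Rule>& rules,
              const std::vector<int>& input) {  /* terminal ids, end-marker last */
  std::vector<int> states;   /* state part of the stack */
  std::vector<int> symbols;  /* symbol part of the stack (kept only for clarity) */
  states.push_back(0);       /* initial state */
  std::size_t pos = 0;
  while (true) {
    int s = states.back();
    int la = input[pos];
    ActionEntry a = table.action[s][la];
    switch (a.kind) {
    case ActionKind::Shift:
      symbols.push_back(la);
      states.push_back(a.index);          /* the shift entry already stores the next state */
      ++pos;                              /* the look-ahead is consumed */
      break;
    case ActionKind::Reduce: {
      const Rule& r = rules[a.index];
      for (int k = 0; k < r.rhsLength; ++k) {  /* pop 2*|RHS| elements overall */
        symbols.pop_back();
        states.pop_back();
      }
      int next = table.go_to[states.back()][r.lhs];  /* next state from the goto-table */
      symbols.push_back(r.lhs);
      states.push_back(next);             /* the look-ahead is not consumed */
      break;
    }
    case ActionKind::Accept:
      return true;
    case ActionKind::Error:
      return false;
    }
  }
}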
pag 105 Andrea Tino - 2013 How to build parsing tables When we introduced LL grammars we also introduced parsing tables for them, and a theorem could be used to decide whether a grammar is LL by inspecting the parsing table it generates. The same happens here: a grammar generates a parsing table which is supposed to have a certain structure. [Theo] LR grammars: A grammar is LR when the Action/Goto table can be built. There are several algorithms to build the parsing table for a LR grammar; which one applies mostly depends on the specific case (in particular, on the number of look-ahead symbols). The parsing algorithm is deeply related to the parsing table. Some important considerations The stack plays a key role; in fact it contains all the information regarding the parsing itself. But what is important to underline here is that the whole stack is not necessary to drive the parsing: the top-state summarizes all the information the parser needs to decide the next action. So a LR parser just needs to inspect the top-state instead of the whole stack. This means that a FSA can be built out of this system: states are pushed onto the stack and the stack is reduced depending only on the top-state; the FSA makes the algorithm proceed until an accepting state is reached, while the stack provides the extra memory the FSA needs. So here it is: a pushdown automaton! We discovered nothing new actually, as Chomsky and Schützenberger already told us that type-2 (context-free) grammars are to be handled using such structures. Relating LR and LL grammars We understood that LL grammars are a subset of CFG grammars. What about LR grammars? [Lem] LR and CFG grammars: There exist non-LR CFG grammars. So, again, LR grammars are a subset of CFG grammars. But what can we say about LL and LR grammars when comparing them? From what we saw before, LR grammars are parsed using pushdown automata, while LL parsers were described without explicitly building any FSA. Furthermore we know that type-2 grammars are handled using pushdown automata (recall the Chomsky-Schützenberger hierarchy). So it looks like LR grammars tend to be a little more generic than LL. [Theo] LR and LL grammars: LR grammars can describe LL grammars. LR(k) grammars can describe LL(k) grammars. So we have that LL grammars are a subset of LR grammars, which are a subset of CFG grammars (inclusion diagram: LL(k) ⊂ LR(k) ⊂ CFG). This happens for one reason only: LL grammars are more restrictive than LR grammars. Recalling LL grammars, we need them to be non-left-recursive, which is a strong requirement, and
pag 106 Andrea Tino - 2013 left-factorization is also required whenever predictive parsing is to be used. On the other hand, LR grammars, to be parsed, have only one simple requirement: [Theo] Parsing LR grammars: LR(k) parsers can parse context-free grammars having one production rule only for the axiom. [Cor] Ambiguous LR grammars: Ambiguous grammars cannot be LR or LR(k). If the grammar has more than one production rule for the axiom S, we can create an equivalent LR grammar by simply adding a new non-terminal S', making it the new axiom and adding production rule S' ⇒ S π. Also please note that the theorem provides a necessary condition only. LR(0) grammars and parsers The parsing algorithm is always the same; what changes is the parsing table, thus the action-table and the goto-table. How to build them? In this section we are going to see how to build these tables when handling LR(0) grammars. LR(0) grammars are LR grammars using no look-ahead symbols. These particular grammars are a little more restrictive than LR grammars; their parsing table can be built by analyzing the set of production rules. LR(0) items Item-sets A LR grammar must have some characteristics to be LR(0). By following certain methodologies it is possible to check whether a given LR grammar is also LR(0). The first concept to understand is the LR(0) (dotted) item of a production rule. Given a LR grammar G(V,T,P,S) and a production rule A ⇒ X1 X2 … Xn, where Xi ∈ V ∪ T for all i = 1…n, the LR(0) item-set for that production rule is a set of production rules, all having A as LHS, with the dot placed in every possible position of the RHS:
A ⇒ • X1 X2 … Xn
A ⇒ X1 … Xi • Xi+1 … Xn   (for all i = 1…n−1)
A ⇒ X1 X2 … Xn •
The dot • symbol is a new symbol added to the terminal set of the original grammar and becomes part of the new rules introduced by the item. When considering items for all production rules of
pag 107 Andrea Tino - 2013 a grammar, a new grammar is generated. The new grammar will host many more rules than the original one, and a new symbol becomes part of it. One last thing Typically another operation is performed when creating an augmented grammar in this context: let S be the axiom of the grammar, then the new augmented grammar will be added with non-terminal ′S and with production rule ′S ⇒ S . This is necessary to guarantee that the LR(0) parser doesn’t fault in the last part of the parsing, but actually this trick is the same we saw before and ensures that the grammar will be a proper LR grammar at least. Special items In the item-set of a production rule we can find items. The number of elements in the item-set of a rule is n +1, where n is the number of symbols in the rule’s RHS. There are two items that are special: • Predict-item: The dot • symbol appears as the left-most symbol in the item’s RHS like: A ⇒ iX1X2 …Xn . • Reduce-item: The dot • symbol appears as the right-most symbol in the item’s RHS like: A ⇒ X1X2 …Xn i. When handling the empty string, a rule like A ⇒  will generate one item only: A ⇒ i . What does it mean? An item for a production is the representation of the recognized portion of that rule’s RHS. Given an item for a certain rule, the RHS’ left part of the dot • symbol is the original rule’s portion recognized by the parser so far; everything appearing on the left of the dot • symbol is expected to be encountered later. Augmented grammar So, as a resume, when having a certain LR grammar G V,T,P,S( ), we can create an augmented grammar ′G = ′V , ′T , ′P , ′S( ) having: • The augmented set of non-terminals can be generated from the original one by adding the new axiom symbol: ′V = V ∪ ′S{ }. • The augmented set of terminals is created by simply adding the dot • symbol: ′T = T ∪ i{ }. • The augmented set of production rules is created by adding all items for all original production rules: ′P = ′p : ′p ∈Item p( ){ } ∀p∈P  . Furthermore, production rule ′S ⇒ S is added to the augmented grammar. • The new axiom is simply defined as a new symbol. LR(0) closure of an item Given an item ′p ∈ ′P , we can consider a set called the closure of that rule. Our purpose is creating a FSM, to do so, the concept of closure is necessary (here we consider the closure for LR(0) grammars). the closure of an item is the set of items/rules that could be used to extend the
pag 108 Andrea Tino - 2013 current structure handled by the parser. It can be calculated as follows. 1. Initialize the closure to the empty set and add the item itself to it: Closure ′p( )= ′p{ }. 2. For every item A ⇒α i Bγ 1 in the closure, add all items in the form B ⇒ iγ 2 (predict-items). Thus consider all rules in the original grammar whose LHS is the non-terminal appearing right after the dot in the current item. Mark the item. 3. Repeat step 2 until all items in the set are marked. Example Let us consider the following grammar: T = {a,b,c,p,π}; V = {S,E,T}; S -> E π; E -> T; E -> E p T; T -> a; T -> b E c; Let us calculate the closure of item S->•Eπ. 1. At the beginning the closure is: closure = {S->•Eπ}. 2. From the item in the closure, we inspect the first non-terminal after the dot: E. We can add items E->•T and E->•EpT. We have: closure = {S->•Eπ, E->•T, E->•EpT}. 3. The third item redirects to the same items. The only one item left unmarked is E->•T. We inspect rules having T as LHS, and add items T->•a and T->•bEc. The closure is: closure = {S->•Eπ, E->•T, E->•EpT, T->•a, T->•bEc}. No more items are to be added as unmarked items have terminals after the dot. Closures for items in a grammar will be used in the next step in order to build the parsing table. Building a FSA out of the grammar In order to build the parsing table, a FSA is needed. To build the FSA an algorithm can be considered and it takes advantage of closures. The procedure will create a DFA. FSA generation algorithm Recalling the formal definition of DFAs, we consider an empty DFA N = Σ,S,δ,s0,F( ). An augmented grammar is considered as well: ′G = ′V , ′T , ′P , ′S( ). The DFA will have the following characteristics: • The alphabet contains terminals and non-terminals: Σ = ′V ∪ ′T . • States are closures of items in the grammar: S ⊆ t :Closure ′p( ),∀ ′p ∈ ′P{ }. The following rules are to be applied. 1. Consider item ′S ⇒ iS , this item must be in the grammar due to its construction. Build the closure for this item and mark it as the initial state of the DFA. 2. For every not-marked state in the DFA, consider its value (the closure). Create a new state in the DFA for each symbol (terminal or non-terminal) following the dot in the items of the
pag 109 Andrea Tino - 2013 closure. Create a connection from the current state to new ones and label them with the symbol following the dot of the items considered previously. 3. For each new state created, put in the closure (representing that state’s value) the items from the source linked state where the symbol after the dot is the one placed on the connection between the source state and the new state. When adding items from the source state move the dot one position to the right. For all items where the dot is followed by non-terminals, add to the closure all items in the closure of those items. 4. Every state containing at least one reduce-item is to be marked as final state. 5. Mark every new state which has undergone the procedure. 6. Repeat from point 2 until no unmarked states are left. The DFA built is called characteristic FSM: CFSM for the grammar. Example Consider the following grammar: T = {id,(,),+}; V = {E,T}; start = E; E -> T; E -> E + T; T -> id; T -> ( E ); The augmented grammar will be: T’ = {id,(,),+,π,•}; V’ = {S,E,T}; start = S; Note how we also added the end-symbol to the augmented grammar, this is a typical methodology. Following all rules introduced before, the final DFA will be the one reported below. Todo Building the parsing table We have everything needed to build the parsing table. The parsing table is initialized with states as rows and, respectively, terminals and non-terminals as columns. Attention The table refers the original grammar not the augmented one; so no augmented symbols will be shown in the table. All entries are set to null values. The table should be visualized in compact form: action-table + goto-table. Note that the action-table won’t need characters to get the action. For this reason a separate column will host actions for each state. To build the table, the following rules can be applied: 1. For each transition arc si → sj marked with symbol X ∈V ∪T , enter index j ∈ in table entry i, X( ). So: GoTo i, X( ) ⊇ j{ }. 2. For all transition states (states that are not final states), enter shift in the action-column for that state in the table.
pag 110 Andrea Tino - 2013 3. For each final state, enter reduce in the action-column for that state in the table. Also specify the production rule by placing the rule index corresponding to the item in the state. If more items are present, put more indices (this is a reduce/reduce ambiguity). 4. The DFA must have a final state containing the item corresponding to rule S' ⇒ S • (the fully recognized axiom rule). Replace the reduce action for that state with the action accept. The table looks a little different when compared to LR tables, but both tables rely on the same structures. Example Continuing the example of before, we can build the table following the algorithm. The table is: Todo Action ID PLUS STAR RO RC π exp term 0 shift S-5 null null S-4 null null 1 2 1 reduce null S-6 null null null A null null Please note how the action-table does not overlap the goto-table. The parsing algorithm Compared to the LR parsing described so far, LR(0) parsing is different because no look-ahead symbols are needed. So the parsing algorithm described before needs to be modified just a little. Actually the procedure is always the same, but the action-table needs no character as input; this is why the action-table does not overlap the goto-table. Also, to get the action, no character is needed, so the application becomes: Action: S → X. The parsing algorithm remains the same. The next state is still calculated using the goto-table as before; in this case the symbol on the stack is needed. Conditions for LR(0) grammars Some important considerations can be made; in particular a very important result is reported below: [Theo] LR(0) grammars: A LR grammar is LR(0) if and only if each state in the CFSM is: a reduction state (final state) containing one reduction-item only, or a normal state (shift state) with one or more items in the closure (no reduction-items). [Cor] LR(0) grammars: A LR(0) grammar has one element only in each entry of the parsing table. (Inclusion diagram: LR(0) ⊂ LR(1) ⊂ CFG.)
pag 111 Andrea Tino - 2013 This clearly makes LR(0) grammars a little more specific than LR grammars. So we can make our hierarchy a bit more precise by adding these grammars and placing them in the right position. SLR(1) grammars and parsers LR(0) grammars are very compact, but the lack of look-ahead symbols sometimes makes them insufficient. They can be improved with little effort. SLR(1) parsers (Simple LR) work on a wider class of grammars as they relax the restrictions a little. Inserting look-ahead symbols SLR(1) parsers work with the LR(0) parsing CFSM together with look-ahead symbols. Look-aheads are computed for the items in the states/closures of the DFA, but how to get them? For each item in the states/closures of the DFA we have: • When having shift-items (non-reduce-items) in the form A ⇒ α • β, look-aheads are all terminals in First(β). • When having reduce-items in the form A ⇒ α •, look-aheads are all terminals in Follow(A). This enriches the original CFSM with look-ahead information. Solving ambiguities Sometimes SLR(1) parsing can help resolve the conflicts shown by the LR(0) parser for the corresponding grammar. The approach is using look-ahead symbols to bypass the conflicts; however such an approach is not always successful. Not an exact methodology Using look-aheads taken from the follow-set of reduction-items is a good strategy, but definitely not always the best one, especially when handling shift/reduce conflicts. This is because parsers that carry look-ahead symbols inside their items (as we will see for LR(1) parsers) are far more precise than parsers that estimate them using follow-sets. Characterizing SLR(1) grammars We have the following important result: [Theo] SLR(1) grammars: A LR grammar is SLR(1) if and only if each state in the CFSM (augmented with look-aheads) is: a reduction state (final state) containing only one reduction-item per look-ahead symbol, or a normal state (shift state) with one or more
pag 112 Andrea Tino - 2013 items in the closure (no reduction-items). We know that ambiguous grammars cannot be LR, the same goes here: [Theo] Ambiguous SLR(1) grammars: Ambiguous grammars cannot be SLR(1). Also, the following result helps us place SLR grammars: [Theo] SLR and LR grammars: There exist LR grammars that are not SLR. On the other hand, SLR(1) grammars are a superset of LR(0) grammars; the proof is simple, as SLR(1) parsing only adds look-ahead information to LR(0) parsing (inclusion diagram: LR(0) ⊂ SLR(1) ⊂ LR(1)). Simplifying more complex grammars SLR grammars can be very efficient and able to catch several constructs of languages. LR(1) grammars generate complex DFAs; however very often designers oversize the problem, as many LR(1) grammars are actually SLR(1)! The DFA shrinks a lot when using SLR(1) parsing compared to LR(1): a SLR(1) grammar parsed with the LR(1) algorithm generates a DFA with many more states than the one generated by the SLR(1) algorithm. LR(1) grammars and parsers LR(1) grammars are parsed by including look-ahead symbols into the process. Differently from SLR(1) parsers and grammars, here look-ahead symbols are part of the process from the beginning; they are not considered later as an appendix or similar. LR(1) items LR(1) items are a little bit different from LR(0) items. Basically LR(1) items are like LR(0) items, but they contain information about look-ahead symbols. An item appears in the following form: [A ⇒ α • β, a]. The dot symbol assumes the same meaning, but a look-ahead symbol (terminal) appears as well. What does it mean? An item now is to be intended like: “The part of the RHS on the left of the dot has been recognized
pag 113 Andrea Tino - 2013 so far; the part on the right of the dot is expected, and the reduction will be performed only when the specified look-ahead is encountered.“ An item can be represented as: [A ⇒ x1 x2 … xi • xi+1 … xn, a], where xi ∈ V ∪ T for all i = 1…n and a ∈ T ∪ {π}. Basically an item like [A ⇒ α • β, a] is not so different from a LR(0) item, but an item like [A ⇒ α •, a] behaves differently, as it calls for a reduction only when the specified terminal a appears in input. The end-terminal π appears as look-ahead symbol for the axiom rule. LR(1) closure of an item The LR(1) closure for an item p' ∈ P' is to be calculated differently from LR(0) closures. 1. Initialize the closure to the empty set and add the item itself to it: Closure(p') = {p'}. 2. For every item [A ⇒ α • B γ1, a] in the closure, add all items in the form [B ⇒ • γ2, b] (predict-items) for all terminals b appearing in the first-set of the expression γ1 a: b ∈ First(γ1 a). Mark the item. 3. Repeat step 2 until all items in the set are marked. The procedure is different because it keeps track of look-aheads. Building the CFSM All considerations made before for LR(0) grammars are still valid here. Algorithm The algorithm to build the DFA is still the same with very little modifications. When calculating closures, the LR(1) closure is to be used. Furthermore, at the beginning of the algorithm, the first item to consider is [S' ⇒ • S, π]: its closure is the initial state of the DFA. Final states Another important aspect is the following: as before, the final states of the DFA are those containing at least one reduce-item in their closure. But this is something we already knew. Now we have one more condition: acceptance is triggered by the state containing the reduce-item for the axiom rule, [S' ⇒ S •, π], whose look-ahead is the end-terminal π. Building the parsing table Compared to LR(0) grammars, the table now will be a little different as the action-table will overlap the goto-table. It means that the action depends both on the current state and on the look-ahead symbol. Here too, all considerations made for the LR(0) parsing table building algorithm are valid. 1. For each transition arc si → sj marked with symbol X ∈ V ∪ T, enter index j in table entry (i, X). So: GoTo(i, X) = j. 2. For each shift state's outgoing transition arc si → sj marked with a terminal X ∈ T, enter shift in action-table entry (si, X) and, in the goto-table, place value j at the same entry. 3. For each final state si, for each reduce-item [A ⇒ α •, a], enter reduce in action-table entry
pag 114 Andrea Tino - 2013 (si, a). Also specify the production rule by placing the rule index corresponding to the item in the state, thus A ⇒ α. If more items are present, put more indices (this is a reduce/reduce ambiguity). 4. The DFA must have a final state containing item [S' ⇒ S •, π], corresponding to rule S' ⇒ S. Replace the reduce action of that state (for look-ahead π) with the action accept. The procedure is very similar as it is possible to see. The parsing algorithm Parsing is performed as explained at the beginning of this chapter during the example. Characterizing LR(1) grammars We have the following important result: [Theo] LR(1) grammars: A LR grammar is LR(1) if and only if each state in the CFSM is: a reduction state (final state) containing only one reduction-item per look-ahead symbol, or a normal state (shift state) with one or more items in the closure (no reduction-items). We know that ambiguous grammars cannot be LR, the same goes here: [Theo] Ambiguous LR(1) grammars: Ambiguous grammars cannot be LR(1). LALR(1) grammars and parsers Let us focus on the last method to build parsing tables for LR grammars: LALR (Look-Ahead LR) grammars. Here as well, these grammars can be more compact and generate smaller tables compared to canonical LR. The same goes for the CFSM, which shrinks and needs fewer states. LR, SLR and LALR SLR grammars were introduced almost the same way as we are doing for LALR. However SLR grammars cannot catch many constructs that LALR can, yet both grammars have the same important characteristic: they generate smaller DFAs and more compact tables compared to canonical LR. However one result, which will be detailed later, is very important: [Theo] LALR and SLR grammars' CFSM sizes: LALR tables have the same number of states as SLR tables, although LALR grammars cover more constructs than SLR grammars.
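The LALR construction described in the following pages works on LR(1) items and on their cores (the core is defined formally in the next paragraphs). A minimal, hypothetical sketch of how such items and the merge test could be represented may help; all names here are assumptions made only for this example.
#include <set>
#include <tuple>
#include <utility>

struct LR1Item {
  int rule;        /* index of the production A -> X1 ... Xn */
  int dot;         /* dot position: 0 .. |RHS| */
  int lookAhead;   /* terminal id (π for the axiom item) */
  bool operator<(const LR1Item& o) const {   /* needed to store items in std::set */
    return std::tie(rule, dot, lookAhead) < std::tie(o.rule, o.dot, o.lookAhead);
  }
};

using Core = std::pair<int, int>;   /* (rule, dot): the item without its look-ahead */
using State = std::set<LR1Item>;    /* a state of the CFSM is a closure, i.e. a set of items */

inline std::set<Core> coresOf(const State& s) {
  std::set<Core> cores;
  for (const LR1Item& it : s) cores.insert({ it.rule, it.dot });
  return cores;
}

/* Two LR(1) states are mergeable (LALR) exactly when they expose the same set of cores. */
inline bool mergeable(const State& a, const State& b) {
  return coresOf(a) == coresOf(b);
}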
pag 115 Andrea Tino - 2013 Please remember that table size is related to the number of states in the CFSM and vice-versa. Saying that the CFSM shrinks is the same as saying that the table gets more compact and smaller. The idea behind LALR parsing The idea behind is the same which drove SLR parsers: simplifying LR canonical parsers. The concept of core We need to introduce a quantity: the core of an item. Given a generic LR item A ⇒α i β,a[ ], we say that its core is the augmented grammar rule A ⇒α i β inside it, thus descarding any lookahead symbol (remember that when handling items, a new grammar is considered). Which means that LR(0) items’ core is the item itself. An interesting fact If we considered some generic LR grammar and the set of items built for the CFSM (closures), we would probably find closures where items have the same cores like items A ⇒ ia,a[ ], A ⇒ ia,b[ ] (as part of one state) and A ⇒ ia,π[ ] (into another different state) for example. When such states are encountered the action to perform is based on the lookahead symbol (the only one element making items different in this context). These two states could be merged into one state as they all imply a shift operation. Actually LALR(1) parsers work exactly this way: they reduce the number of states by merging LR(1) states generating smaller tables. Merging LR(1) states As anticipated before, the process basically consists in merging LR(1) states, but when can we perform this? The operation is possible only when states/closures contain items having the same set of cores. Attention, to be merged, two states si and sj must fulfill these conditions: 1. Evaluate the first state/closure and locate all cores. Put all cores into set Ri . 2. Evaluate the second state/closure and locate all cores. Put all cores into set Rj . 3. If both sets are equivalent: Ri = Rj , then the two states can be merged. The new state will contain all items contained in the orginal states. However now a problem arises: the CFSM has shrunk, the number of states has decreased but the table needs to be synched! What to do? What about the action of the new state? What about the goto of the new state? Setting actions and goto of merged LR(1) states There is a mapping between the parsing table (action-table + goto-table) and the CFSM. When a merge occurs, the tables must be re-shaped to keep synchronization. Synching the goto-table When two states are merged into the DFA, we need to handle transitions, which is equivalent to saying that the goto-table must synched! Actually the merging does not cause any problem in goto-table synching, the process is always successful.
pag 116 Andrea Tino - 2013 Rows of the goto-table for the two states before merging:
      X1     X2     ...  Xm
s_i   v_i_1  v_i_2  ...  v_i_m
s_j   v_j_1  v_j_2  ...  v_j_m
Row of the goto-table for the merged state:
      X1   X2   ...  Xm
s_ij  w_1  w_2  ...  w_m
Let us consider a generic entry for the same symbol in the original table for both states (to be merged): v_i_k and v_j_k, which point to the next state. We can have two possibilities: • Both values are the same: v_i_k = v_j_k. In this case there is no problem: both states lead to the same state on the same symbol, so we set w_k = v_i_k and that's all. • Values are different. However this condition cannot create a conflict: the items causing the two transitions share the same cores, so the two target states share the same set of cores as well, which means they get merged too; the entry w_k simply points to the merged target state. The goto-table can therefore always be managed without problems. Synching the action-table The goto-table is never a problem. Problems can occur when trying to re-shape the action-table. In fact the action to set for a table entry depends on the value of each state, thus on the items inside each closure. For example we might merge two states having different actions: what to do in that case? Conflicts might occur. We will discover that shift/reduce conflicts will never occur (if the original LR(1) parser had no conflicts); however reduce/reduce conflicts might occur. No shift/reduce conflicts Let us consider a LR(1) parser having no conflicts, so a proper parser for a proper LR(1) grammar. If two states are merged into one state, it means they share the same set of cores. Let us consider two hypothetical states: s1 = {[core1, a], [core2, b]} and s2 = {[core1, c], [core2, d]}. They share the same cores, as they are mergeable, and inside each state the items generate no conflicts. The only way to get a shift/reduce conflict in the merged state is between a reduce-item coming from one state, say [core1, a], and a shift-item coming from the other, say [core2, d]. But whether an item is a shift-item, and which terminal it shifts on, depends only on its core, so [core2, b] conflicts with [core1, a] in exactly the same way; this would imply that the conflict was already present in the first state, which is absurd, as we assumed the initial LR(1) parser to be conflict-free! This proves we cannot introduce shift/reduce conflicts by merging. Reduce/reduce conflicts However, as anticipated, reduce/reduce conflicts can be experienced! Cores are the same but look-aheads are different! In that case we have conflicts and the grammar
pag 117 Andrea Tino - 2013 is not LALR(1)! Building the LALR(1) table We have already seen how to build the parsing table starting from the LR(1) table of the original LR(1) grammar: this is a valid approach, probably the simplest one but also the most time-expensive, and definitely not the only one. A more efficient and advanced algorithm allows the construction of the LALR(1) CFSM without building the LR(1) DFA first; however, the advanced approach will not be covered here. To build the parsing table we start from the LR(1) parser, we shrink its CFSM and edit the original table. From a systematic point of view, the following actions must be performed: 1. Consider all states in the LR(1) CFSM: {s0, s1, …, sn}, and locate all those having common sets of cores. 2. Merge states having the same set of cores, obtaining a new set of states: {r0, r1, …, rm}. 3. For each new state, the corresponding value in the action-table is calculated with the same exact procedure used for LR(1) DFAs. 4. The goto-table is filled as seen before; no conflicting values will appear, since all items in merged states have the same set of cores. This procedure is quite expensive and not efficient, but it is a valid approach. Characterizing LALR(1) grammars We have the following important result: [Theo] LALR(1) grammars: A LR grammar is LALR(1) if and only if each state in the reduced CFSM is: a reduction state (final state) containing only one reduction-item per look-ahead symbol, or a normal state (shift state) with one or more items in the closure (no reduction-items). [Cor] LALR(1) grammars: A LALR(1) grammar has one element only in each entry of the parsing table. We know that ambiguous grammars cannot be LR, the same goes here: [Theo] Ambiguous LALR(1) grammars: Ambiguous grammars cannot be LALR(1). LALR vs. LR LALR(1) grammars are more generic grammars than SLR(1) (inclusion diagram: LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1)), however not so powerful as canonical
pag 118 Andrea Tino - 2013 LR(1)! Another interesting fact about LALR and LR parsers is the difference in their behavior. Consider the same LALR grammar parsed by a LALR parer and a LR parser. When providing a valid input string to both parsers, they will behave the same exact way; however when an invalid input string is passed, the LR parser will signal the error before the LALR parser! Errors are detected, but detection is postponed in LALR parsers. LL vs LR grammars So far we have covered all grammars. They all have different characteristics. For the sake of recalling, let us resume them here. LL(1) grammars LR(0) grammars SLR(1) grammars LALR(1) grammars LR(1) grammars Testing quantities Predict-sets Parsing table / CFSM Parsing table / CFSM Parsing table / CFSM Parsing table / CFSM Parts to test All rules in the grammar Table entries / CFSM states Table entries / CFSM states Table entries / CFSM states Table entries / CFSM states Complexity Predict-sets must be calculated for all production rules in the grammar CFSM must be built CFSM must be built CFSM must be built CFSM must be built Conditions Predict sets for all production rules having the same LHS must be disjoint, for all rules in the grammar Each state must be a shift state or a reduction state with one reduction-item only. Each state must be a shift state or a reduction state with one reduction- item only per terminal. Each state must be a shift state or a reduction state with one reduction- item only per terminal. Each state must be a shift state or a reduction state with one reduction- item only per terminal. LR grammars mostly rely on the table, LL grammars use predicts. However LL grammars as well can use tables for testing (one rule for each entry), but this condition is a consequence of predicts.
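As a small practical illustration of the LL(1) condition in the first column of the comparison, here is a hedged sketch of a routine checking that the predict-sets of all alternatives with the same LHS are pairwise disjoint. The Terminal, NonTerminal and PredictTable types are hypothetical placeholders for whatever representation the front-end actually uses.
#include <map>
#include <set>
#include <vector>

using Terminal = int;
using NonTerminal = int;
using PredictSet = std::set<Terminal>;

/* Hypothetical view of a grammar: for each LHS, the predict-set of each of its alternatives. */
using PredictTable = std::map<NonTerminal, std::vector<PredictSet>>;

/* A grammar satisfies the LL(1) condition when, for every non-terminal,
   the predict-sets of its alternatives are pairwise disjoint. */
bool isLL1(const PredictTable& predicts) {
  for (const auto& entry : predicts) {
    const std::vector<PredictSet>& alternatives = entry.second;
    for (std::size_t i = 0; i < alternatives.size(); ++i)
      for (std::size_t j = i + 1; j < alternatives.size(); ++j)
        for (Terminal t : alternatives[i])
          if (alternatives[j].count(t) > 0)
            return false;  /* two alternatives would be selected by the same look-ahead */
  }
  return true;
}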
Andrea Tino - 2013 Errors management Aship in port is safe, but that’s not what ships are built for. “ “Grace M. Hopper http://www.brainyquote.com/quotes/quotes/g/ gracehoppe125849.html
pag 120 Andrea Tino - 2013 Overview What happens when a parser encounters a problem in the input string? One thing is for sure: parsing cannot continue normally! A parser should treat errors in efficient ways. Depending on the type of parsing/grammar, it is possible to detect errors at different distances from the input string start. Errors management phases Errors handling typically happens in three stages: 1. Error detection: The error occurs in the input and if we look at the input we will detect it immediately. But the parser cannot behave like this, the error will be discovered on a certain token from the input start. 2. Error analysis: Once detected, the error must be analyzed in order to understand what type of error it is. A mispelled keyword, a syntax error and so on. 3. Parse recovery: Errors should not interrupt the parsing. Parsers should be able to continue parsing the input regardless of errors. Recovery is not mandatory. Many parsers give up on recovering and focus their attention on providing a good error classification in order to let the programmer understand what went wrong. When an error is encountered the parser quits and reports the error in detail. Error reporting without recovery might look like a simple task but it is not; some complex languages can be very difficult to analyze when an error occurs and sometimes the error reporting will be filled with so many information that make impossible a good understanding of what to do, what to fix. LL and LR grammars A very good condition in error handling occurs thanks to LL and LR grammars: they are viable prefix grammars. This means that the error can be detected on the first token that does not match eny rule in the grammar. It happens because LL and LR parsers can recognize a valid string starting from the beginning of it: valid prefixes in the language can be detected by these grammars. Important concepts Some relevant quantities are to be considered when talking about errors management. Error distance The error distance is defined as the number of tokens, starting from the latest parser recovery, after which the error is detected. Errors management strategies A strategy is a comprehensive approach about how to analyze and recover from an error once it is detected. We have some options depending on the type of parser (here we consider type-3 and
pag 121 Andrea Tino - 2013 type-2 grammars). • Panic mode: Once detected, the parser tries to locate a token in the sequence from which recovering the parsing. These tokens are special delimiters that depend on the syntax construct. Very common tokens can be begin and end tokens, another example can be clang: curly brackets. It is a quite common approach but it can cause the parser to discard much of the input. • Phrase level: The parser tries to modify the input using local modifications, thus small changes performed very close to the token that originated the error. This approach can involve insertion, deletion or swapping of tokens. Disadvantages can be serious as possible loops may occur with very high probability to encounter non-recoverable conditions as the error distance increases. • Error productions: It is very common to augment the grammar with invalid production rules mapping typical errors. This is a very common and good approach as the parser knows possible errors in detail and, when these particular errors are encountered, recovery is always possible without any demage. The only disadvantage is that the grammar must be modified. • Global correction: The best correction is calculated, or at least attempts are performed. The point is trying to get the best correction for specific types of errors when a derivation is not observed. this approach is not very common, but sometimes it is used together with phrase level to enhance it. Most common algorithms Algorithms to handle errors can be divided into two groups depending on the type of parser: • Top-down parsing: We are going to analyze two common approaches to treat errors in LL parsing algorithms. • Bottom-up parsing: Other two common methodologies will be considered for LR parsers. The distinction is based on the fact that LR and LL parsers use different derivation policies. Handling errors in top-down parsers We have two algorithms here. Let us examine them both. LL panic mode method When having a predictive LL(1) parser an error can occur for two reasons:
pag 122 Andrea Tino - 2013 • The terminal symbol on top of the stack is different from the input look-ahead character. • The couple formed by the non-terminal on top of the stack and the look-ahead terminal points to a null entry in the parsing table. What should the parser do when an error occurs? It should notify the error and try to recover so that parsing can go on. The idea This method basically relies on one hypothesis: maybe the error occurred because the first part of the stack contains bad symbols, so good ones are to be encountered if the stack is reduced one symbol at a time. If this does not happen, then it means that the current look-ahead is not good: it is discarded and the next token is considered after recovering the original stack. The algorithm The algorithm can be summarized as follows: 1. An error is detected and the parser is in a certain configuration. The stack is duplicated and stored in a temporary variable. 2. One by one each symbol in the stack is popped. For each discarded symbol the stack shrinks. The parser tries to continue parsing with this new configuration. If it fails, again the top symbol is popped and this point is repeated until the parser succeeds or the stack becomes empty. Here the look-ahead symbol remains the same. 3. If the stack becomes empty, the saved stack is recovered and the current look-ahead symbol is discarded. The next token is fetched and considered as the new look-ahead symbol. If another error occurs, the algorithm starts again from point 1. In pseudo-code we can write the algorithm as:
procedure err_llpanic(Stack,Input)
  set StackCopy = Stack.clone();
  do
    do
      if parse(Stack.top(),Input.current()) return;
      Stack.pop();
    while not Stack.empty();
    Stack = StackCopy;
    Input.next();
  while not Input.empty();
  report_error(“Could not recover!“); /* giving up :( */
end
This method can be efficient and a valid option for LL parsers. It can be considered a panic mode approach to errors management. Synchronization tokens method With this method every non-terminal in the grammar is assigned a triple (O, A, C) having the
pag 123 Andrea Tino - 2013 following characteristics: • The first component is the opening mark: a set of terminals. • The second component is the non-terminal the triple is assigned to. • The third component is the closing mark: a set of terminals. This structure is called synchro-triple and for each non-terminal A ∈V in the grammar its triple has the following value: First A( ),A,Follow A( )( ). The idea The method is based on a panic mode approach where symbols are discarded rationally using the grammar itself. The key strategy is discarding symbols that are not needed for the current top-symbol in stack. In this case input symbols will be discarded until a symbol can be used for recovery basing on the symbol at the top of the stack. The algorithm The algorithm proceed as follows: 1. An error is detected when the parser is in a certain configuration. The top-symbol X ∈V ∪T in the stack is considered. 2. The synchro-triple for the top-symbol is considered and the closing mark C = Follow A( ) as well. One by one, characters in the input are discarded until the look-ahead character is one inside the stack’s top-symbol closing mark. When found, symbol X is popped from the stack. This part guarantees the possibility to discard the part of the input which is to be discarded as it was connected to the symbol at the top of the stack. After this the input is cleaned from wrong characters that were connected to symbol X for which an error occurred. After the input is discarded, the stack is also cleaned from the symbol which generated the error. At the end of this the parser is supposed to be in a configuration from which it can continue parsing. Advanced version Point 2 of the algorithm can be further improved, the algorithm becomes: 1. Same as point 1 of before. 2. The same procedure as point 2 is followed but with a little modification. While discarding input symbols looking for terminals in the stack’s top-symbol’s closing mark, input terminals (look-ahead tokens) inside the stack’s top-symbol’s opening mark might be encountered. In that case, the parser keeps symbol X without popping it from the stack and tries to recorver parsing from there. By doing so, it is possible to recover the parsing without discarding all the input connected to the stack’s top-symbol. This means that the parser does not give up on the language construct that generated the error, hoping to reconver that construct instead of discarding it completely. Drawbacks However there is a little drawback: only errors on non-terminals can be achieved. In
pag 124 Andrea Tino - 2013 fact the algorithm works when an error is detected upon a non-terminal at the top of the stack. If a terminal is at the top of the stack and the look-ahead symbol is not the same, the parser can choose to discard the symbol and keep going or to stop and report the error. Mapping the algorithm onto the parsing table To develop this algorithm, actually not much effort is required as everything can be mapped onto the parsing table. This comes from the fact that this algorithm can handle only errors for non-terminals. So if an error is encountered it is because the couple stack’s top-symbol and look-ahead terminal points to an empty entry in the table. In that case the algorithm places a call to the error management routine. Handling errors in bottom-up parsers When handling LR parsers we can consider 3 approaches. LR panic mode method This technique is very similar to the first one we saw for LL parsers. The algorithm is the same in many aspects. The idea An error in a LR parser can occur for one reason only: an attempt to access an emtpy table entry is being performed. However is it really true? The point is that the table is built according to the CFSM, so if a problem occurs it is in the goto-table. In this case the approach is the same, the stack is popped until a state is found for which the goto- table returns a valid entry. If the stack is traversed until its very end, the input character (look- ahead token) is discarded and a new token is considered recovering the stack to its original state. The algorithm The algorithm can be summarized as follows: 1. An error is detected and the parser is in a certain configuration. The stack is duplicated and stored in a temporary variable. 2. One by one each state/symbol couple in the stack is popped. For each discarded couple the stack shrinks. The parser tries to continue parsing with this new configuration. If it fails (thus, the goto-table is empty for that state and loo-ahead), again the top couple is popped and this point is repeated until the parser succeeds or the stack becomes empty (the last pop will involve the statrt state only). Here the look-ahead symbol remains the same. 3. If the stack becomes empty, the saved stack is recovered and the current look-ahead symbol is discarded. The next token is fetched and considered as the new look-ahead symbol. If another
pag 125 Andrea Tino - 2013 error occurs, the algorithm starts again from point 1. In pseudo-code we can write the algorithm as: procedure err_lrpanic(Stack,Table,Input) set StackCopy = Stack.clone(); do do if (not Table[Stack.top().state(),Input.current] = null) return; Stack.pop(); while not Stack.empty(); Stack = StackCopy; Input.next(); while not Input.empty(); report_error(“Could not recover!“); /* giving up :( */ end The algorithm is very similar to the one described before for LL parsers. LR phrase level method We introduced the phrase level approach during the introduction to this chapter. The methodology consists in modifying the input token which generated the error in order to make an attempt for a correct parsing. The idea When an error occurs the parser makes some attempts on the current look-ahead terminal in order to recover the parsing. The parser chooses some possible replacements to the current look-ahead. The basic idea is choosing replacements basing on the look-ahead which caused the error, since the error was due to the empty entry in the table for that specific terminal, replacements should make the parser point to a valid table entry (considering the current state at the top of the stack). 1. An error is detected because the table entry is null for the current state and the current look- ahead, the stack is saved into a temporary variable. 2. The parser considers the current look-ahead (which generated the error) and evaluates a set of terminals R = r1…rn{ }⊆ T that can possibly fix the input and that make the parser point to a valid entry in the goto-table given the current state. 3. Each replacement ri ∈R is used as new look-ahead (thus, discarding the original one). The parser tries to continue, if another error occurs, the replacement is discarded, the stack is recovered and the next replacement ri+1 ∈R is used until no more errors occur or the set of replacements becomes empty. In pseudo-code we can write the algorithm as: procedure err_phrasel(Stack,Table,Input)
pag 126 Andrea Tino - 2013
  set StackCopy = Stack.clone();
  for each r in get_replacements(Input.current()) do
    if (parse(Stack.top(),r)) return;
    Stack = StackCopy;
  end
  report_error(“Could not recover!“); /* giving up :( */
end
The parser can choose different recovery policies when all replacements fail. The look-ahead can simply be discarded, for example. Mapping the algorithm onto the parsing table Again, the algorithm can be mapped onto the parsing table. All empty entries in the goto-table can be filled with error-recovery procedures like the one seen before. Error productions method This method is one of the most common and well known. The idea Basically the grammar for a certain language is augmented with wrong rules. They are rules that are not meant to generate words in the language, but words that are not part of the language. Many times programmers make very common mistakes: for example in clang they can miss a round bracket in an if statement. The designer decides to collect all common errors and put them into the grammar as a special set of rules. When these rules are encountered, the parser knows exactly what type of error happened and can also automatically fix the problem. However, today's approach is using this technique just to report the exact error to the user; no automatic fixing is performed, the user will fix the problem and try to parse again. The algorithm The grammar is augmented with new rules mapping errors. New (error) production rules are inserted and new (error) symbols are inserted. The parser will have states with items where these error symbols appear. Error rules appear in the form: A ⇒ α E β, where E is an error symbol (it can be seen as a non-terminal). So the parser's CFSM will surely have states where items involving error symbols are present: A ⇒ α • E β. In this case, all items causing the current state to move to a new state because of an error symbol will lead the parser to an error state. 1. After the grammar has been augmented with error rules and symbols, the CFSM is created. 2. When creating the CFSM, each state containing error items in the form A ⇒ α • E β will generate error states to which the parser is led upon error symbol E. Make these states error (final) states. 3. In the CFSM, for each transition si → ej to an error state caused by error symbol E, insert in table entry (si, E) an error-management routine.
pag 127 Andrea Tino - 2013 The approach requires the implementation of some error-management routines. Drawbacks The only drawback of this approach is that the grammar needs to be edited.
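To make the idea of error productions concrete, here is a small illustrative sketch of mine, not taken from the text above. Suppose a C-like grammar contains the rule IfStmt ⇒ if ( Expr ) Stmt, and the designer knows that a missing closing round bracket is a frequent mistake. A hypothetical error symbol E_rp can be introduced together with the error rule IfStmt ⇒ if ( Expr E_rp Stmt. When the parser reaches a state built from the item IfStmt ⇒ if ( Expr • E_rp Stmt, the error-management routine attached to the corresponding table entry can report a precise message such as "missing ')' in if condition" instead of a generic syntax error.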
Andrea Tino - 2013 Semantic analysis and code translation Personally, I look forward to better tools for analyzing C++ source code. “ “ Bjarne Stroustrup http://www.brainyquote.com/quotes/quotes/b/ bjarnestro269155.html
pag 129 Andrea Tino - 2013 Overview A language has three important aspects: 1. Lexicon: The collection of all lexemes of the language, thus the most basic constituents into which the language can be decomposed. 2. Syntax: All the rules used to create the valid phrases of the language. If the language is finite, the syntax can be a simple list of words; however, if the language is not finite, grammars are needed to describe the syntax. 3. Semantics: Additional values attached to each lexeme of the language make it possible to give a value to the whole phrase; values for phrases, in turn, make it possible to give values to language constructs. Values are additional information carried by each element of the language. Here we are going to describe semantics for a language. As anticipated, semantics is all about giving values to all elements in the language. Since elements in the language can be combined, semantic values for those entities can be derived from the semantic values associated with each single component. Types of semantic rules Because of this, semantics proceeds by assigning values to language elements. But how is this carried out? Semantics, like syntax, uses rules: we can talk about semantic rules. Among the different types of semantics, we are going to deal with attribute semantics: today's most common approach to handle semantics for a language. Approaching semantic analysis Semantics can be handled using several approaches. As we said before, we have semantic rules. It happens that, in order to define semantics for a language, we need rewriting rules more powerful than type-2 rules. Type-0 rules The ideal case would be unrestricted rules (type-0), but they are so generic that it is practically impossible to build a compiler by using them. The point is that context-free syntaxes (type-2 rules) are so good at describing language constructs that we prefer using them instead of more generic rules (which would lead to more expensive solutions). A better approach Today, we prefer to stick with type-2 rules: together with the type-2 grammar of the language, we build a semantic layer based on that grammar. This approach makes semantics interact quite tightly with syntax; from one point of view this is not ideal, as modularization and decoupling do not apply; however, the approach remains powerful and easier to implement.
pag 130 Andrea Tino - 2013 Semantic functions The approach consists of assigning a value to elements of the grammar. Since grammar rules create relations between sequences of symbols and a single symbol, the approach is to calculate the value of the group using the values of the single elements. Functions are used for this: semantic functions. Semantic values When handling semantics, values cannot always be strings: we need more types to act on. That's why semantic values are not just strings; they can be integral values or composite values. Syntax driven semantic analysis The type of semantic analysis we are going to describe is called syntax driven semantics. We studied in previous chapters how syntax analysis performs its tasks and what objectives are to be targeted: the AST for the input must be built. After building the AST out of the input, it is possible to have a clear vision of the structure of all grammar elements. Semantic values are assigned to the leaves of the AST. Starting from these values, the value of every internal node of the AST is evaluated from its children's values by means of semantic functions. This procedure, as stated before, creates semantic values directly upon the syntax tree! That is the reason why this method is called syntax driven. Attribute grammars The approach strictly relates semantics to syntax. One technique which operates in this way is called attribute grammar. The AST is decorated with attributes, which are semantic values. AST leaves have semantic values which typically originate from static associations; internal nodes use semantic functions. Attribute grammars are the most common syntax driven approach known today! However, the syntax must be designed to capture all constructs of the language in a proper way. Types of syntax driven semantics Today's most common approaches fall into two different categories of semantics: • Operational semantics: Functions are defined to convert subtrees of the AST into subtrees of a different language which maps the semantics of the provided input language. • Denotational semantics: The AST is augmented with semantic values. Attribute grammars are a type of denotational semantics. Types of code translation In this chapter we are going to cover not only semantics. Considering that it is the last step before code translation, we will discover that the semantic level can sometimes be bypassed to go straight to code generation. So we are going to introduce some code generation techniques as well. In this regard, we typically find a very common pattern when handling code output: syntax-directed translation.
pag 131 Andrea Tino - 2013 Just as semantics can be driven by syntax, today's most modern approaches to code translation rely on syntax too. This means that the translation process is deeply tied to the syntax of the language. However, not all syntaxes can support such an approach. In order to map language constructs into output fragments, the syntax must be such that subtrees of the AST correspond to specific parts of the output language. Almost all compilers today have a translation approach oriented to syntax. Pure syntax-directed translations We are going to have a look at a common method regarding operational semantics. This particular technique is called pure syntax-directed translation. It is a way to perform code translation directly, without using an evident semantic approach. How it works The approach works only on simple grammars, thus it is not used in real-case scenarios. However, for small languages or specific cases, it is a valid methodology. The basic idea is to provide a formal system to couple two context-free syntaxes: a source syntax and a sink syntax. Functions are used here to have one subtree in the source syntax correspond to another subtree in the sink syntax. The automatic consequence is that a phrase in the source language will correspond to another one in the sink language. Formal definition Let us consider two alphabets: a source alphabet Σ and a sink alphabet Δ. Let us consider the source language L ⊆ Σ∗ as the language on the source alphabet, and the sink language L′ ⊆ Δ∗ as the language on the sink alphabet. Introducing elements Let us consider a word in the source language w ∈ L. We can consider a word in the sink language w′ ∈ L′ and say that w′ is the translation image of w when, given a translation application τ : L → L′, the pair of the two words is part of that application: (w, w′) ∈ τ. Also, if the translation application makes one word of the source language correspond to one word only of the sink language, the application can be considered a function and we can write: w′ = τ(w). Relating grammars A syntax translation scheme consists of two grammars: a source grammar G = (V, Σ, P, S) and a sink grammar G′ = (V, Δ, P′, S) having the following characteristics: • The source grammar acts on the source alphabet defining all words in the source language.
pag 132 Andrea Tino - 2013 • The sink grammar acts on the sink alphabet defining all words in the sink language. • Both grammars use the same set of non-terminal symbols. • They also share the same start symbol. • They have different production rules. Relating rules Since both languages are generated by their respective grammars, the translation function, which maps words of the two languages, acts on the grammars as well. It is not possible to describe the translation function by enumerating all associations between words of both languages (since languages are likely to be non-finite). Thus we need a way to define the translation function's behavior in a concise way (like grammars do for languages). For this reason, the sets of production rules of the two grammars are related through a bijective association based on the translation function τ. Given associated production rules (A ⇒ α) ∈ P and (A ⇒ β) ∈ P′, the non-terminals in α and β must have the following properties: • All non-terminals in both expressions must be the same. • All non-terminals in both expressions must appear the same number of times. • All non-terminals in both expressions must appear in the same order. Please note that this applies to non-terminals only, for each non-terminal appearing as the LHS of rules in both grammars. A way to generate the output code directly The properties introduced before make it clear how syntax translation works: only terminals can be moved, deleted or replaced. With this technique it is possible to generate the output language from an input string by simply applying the rewriting rules defined by the sink grammar: 1. The compiler must record in a buffer the rules it uses to perform derivations. In LL parsers, for example, the output buffer is such a structure. 2. After ordinary parsing, the buffer is scanned and, for every rule, the parser executes the corresponding rule in the sink grammar. 3. For every rule in the sink grammar, it is necessary to define a function to execute when that rule is evaluated. These routines are responsible for generating the output code. The method is said to be pure because no intermediate step is considered between the syntax analyzer and the code generator. A well-known example: reverse polish notation We want to build a compiler to transform a mathematical input expression into its equivalent reverse
pag 133 Andrea Tino - 2013 Polish form.
Σ = {x, (, ), +, *}; Δ = {x, add, mult}; V = {E, T, F}
Source grammar      Sink grammar
E -> E + T;         E -> E T add;
E -> T;             E -> T;
T -> T * F;         T -> T F mult;
T -> F;             T -> F;
F -> ( E );         F -> E;
F -> x;             F -> x;
Thanks to the source grammar (on the left) and the sink grammar (on the right), a string like x*x*(x+x) is converted into x x mult x x add mult. In this example, the functions associated to rules in the sink grammar simply print terminals. Advantages and drawbacks This technique is very powerful but, as is probably evident, it can be applied only to very simple scenarios. Only terminal reordering is allowed, together with terminal replacement or deletion. This makes the approach quite limited. A very basic example is number conversion: with this technique, returning the decimal representation of a binary number is not possible.
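As an illustration, here is a minimal sketch of how the reverse Polish notation translation above could be implemented directly. This is my own example code, not taken from the text: it assumes the left-recursive rules E -> E + T and T -> T * F are rewritten as loops, as a recursive-descent implementation requires, and the routines attached to the sink rules simply print terminals in postfix order.

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Pure syntax-directed translation: infix expressions over {x, (, ), +, *}
// are rewritten into the sink language {x, add, mult}.
class RpnTranslator {
public:
    explicit RpnTranslator(std::string input) : src(std::move(input)) {}
    void translate() { parseE(); std::cout << '\n'; }

private:
    std::string src;
    std::size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }
    void expect(char c) {
        if (peek() != c) throw std::runtime_error("syntax error");
        ++pos;
    }

    void parseE() {                      // E -> T { + T  (emit "add") }
        parseT();
        while (peek() == '+') { ++pos; parseT(); std::cout << "add "; }
    }
    void parseT() {                      // T -> F { * F  (emit "mult") }
        parseF();
        while (peek() == '*') { ++pos; parseF(); std::cout << "mult "; }
    }
    void parseF() {                      // F -> ( E ) | x  (emit "x")
        if (peek() == '(') { ++pos; parseE(); expect(')'); }
        else { expect('x'); std::cout << "x "; }
    }
};

int main() {
    RpnTranslator("x*x*(x+x)").translate();   // prints: x x mult x x add mult
    return 0;
}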
pag 134 Andrea Tino - 2013 Ordinary syntax-directed translation Instead of pure syntax-directed translation, today's most common techniques rely on a well-defined semantic layer. However, the semantic layer is deeply connected to the syntax layer. Here we find two options for compilers today: • Classic translation (two steps): The AST is generated first, then it is traversed as the output code is being generated. This approach is the most common. • Direct translation (one step): No AST is generated, the input code is translated directly. This approach is not very common as it works only for very simple languages (maybe a little more complex than those that can be translated using pure syntax-directed techniques). Please note that when translating the input code without building the AST, the compiler must generate the output code during syntax analysis. Also note that so far no examples have been considered about AST generation. Introducing attribute grammars In syntax-directed translation, attribute grammars play a key role in helping define the semantics for a language. Please note how we are talking about an extension of grammars, which are entities related to the syntax level. This happens because we are talking about syntax-directed approaches: semantics and syntax can sometimes mix together in such a context, so keep your focus during this section. Rough definition Attribute grammars are a formal extension of the classic generative context-free grammars we have seen so far. They introduce the following concepts: • Attributes: They are values associated to symbols (terminals and non-terminals) of the grammar. One symbol can be associated with more than one attribute. • Semantic actions and semantic rules: They are functions and procedural entities associated to production rules of the grammar. Semantic actions, in particular, play a very central role in attribute grammars. They are executed every time a production rule is used to process the input. An action has access, during parsing, to all attributes of the symbols defined in the production rule it is associated to. An action can also assign attributes to symbols. Attributes represent the most basic semantic layer because they are the meaning of the symbols they are associated to. A very simple example is arithmetic expressions: when they are parsed, each expression is assigned a number. Formal definition An attribute grammar is a perfect extension of a context-free grammar. We refer to it as usual: G = (V, Σ, P, S), providing a set of terminals, non-terminals and production rules. The start symbol is sometimes such that it never appears in the RHS of any rule. This grammar is augmented with a set of attributes Ω. This set has no formal definition as its elements can be of any possible type; we will use the symbol Ω as a mere formalism here. Attributes have the following characteristics: • Attributes are associated to symbols of the grammar. Given a symbol X ∈ Σ ∪ V and an attribute ω ∈ Ω, the association can be written as the couple (ω, X). • One symbol can be associated with more than one attribute. One attribute is associated to one symbol only. • Attributes can be of any possible type (strings, integral numbers, composite entities, etc.). • Every attribute ω ∈ Ω has a domain Dom(ω), which is the set of all possible values that attribute can have. • Attributes associated to non-terminals are partitioned into two different groups: synthesized attributes and inherited attributes. Attributes associated to terminals are called lexical attributes. • All attributes ωi ∈ Ω associated to a symbol X ∈ Σ ∪ V are collected in the set Attr(X) ⊆ Ω.
pag 135 Andrea Tino - 2013 About semantic actions we have the following: • One semantic action is associated to one production rule only. • One production rule can be associated with more than one semantic action. • Given a production rule p ∈ P, the set of all semantic actions associated to it is Rules(p). Semantic values are assigned to non-terminals by semantic actions. Lexical attributes are not computed at the syntax/semantic level; they are evaluated at the lexical level (where a little semantics is handled). The generated AST can show attributes; in that case the tree becomes an annotated parse tree (AAST). Taxonomy of attributes How are attributes related to each other in the AAST? Attributes associated to non-terminals are divided into two disjoint sets: • Synthesized attributes: Their values depend only on attributes in the subtree of the node they are associated to. They generate an ascending information flow from the leaves up to the root of the AAST. Leaves cannot have synthesized attributes, thus their semantic values are provided by the lexer. This is the most common approach used today. • Inherited attributes: Their values depend only on attributes associated to parent or sibling nodes of the node they are associated to. They generate a descending and sideways (left to right and vice versa) information flow from the root down to the leaves of the AAST. In this case the initial symbol has no inherited attributes; a typical approach is to assign it a static value before parsing starts. Attributes in an attribute grammar are often all of the same kind, inherited or synthesized; mixing is allowed, but needs to be carried out with extreme care. Attribute grammars and syntax-directed translation Production rules in an attribute grammar are associated with semantic actions. So, considering a rule A ⇒ α, there will be semantic functions fi : Ω × ⋯ × Ω → Ω (one for each attribute to be computed in the context of that rule) responsible for calculating its attributes ωi ∈ Ω as ωi = fi(µ1, µ2, …, µm), for every i = 1…n. We can have two possibilities: • ωi ∈ Ω is a synthesized attribute of the LHS symbol A, thus (ωi, A) and ωi ∈ Attr(A); then the attributes µ1, µ2, …, µm ∈ Ω are attributes associated to the RHS: µ1, µ2, …, µm ∈ Attr(α) and m = |Attr(α)|. • ωi ∈ Ω is an inherited/synthesized attribute of an RHS symbol of α, thus (ωi, α) and ωi ∈ Attr(α); then the attributes µ1, µ2, …, µm ∈ Ω are attributes associated to the LHS: µ1, µ2, …, µm ∈ Attr(A) and m = |Attr(A)|. It is evident how semantic functions are evaluated in parallel to syntax rules. When a syntax rule is evaluated, the corresponding actions are evaluated too.
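As a quick illustration of this notation (my example, not part of the original text), consider the usual expression rule with a synthesized attribute val, assuming val ∈ Attr(E) and val ∈ Attr(T). For the rule E ⇒ E1 + T, the semantic function computes the LHS attribute from RHS attributes: val(E) = val(E1) + val(T). Conversely, for a hypothetical declaration rule D ⇒ T L (a type name followed by a list of identifiers), an inherited attribute of the RHS symbol L would be computed from another RHS attribute, for example in(L) = type(T), so the type computed for T flows down and sideways into the identifier list.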
pag 136 Andrea Tino - 2013 The problem of assigning attributes This is where a problem occurs: semantics is driven by syntax, of course, so we can have problems assigning attributes if a semantic action expects certain values that have not been evaluated yet because production rules followed a different flow. This is the main drawback of approaching a syntax-driven semantics: semantics depends on syntax, and depends on it heavily! This is a well-known problem in attribute grammars and is referred to as the attribute evaluation problem. The problem of side-effects Side effects can occur as well. Semantic functions can act on global values throughout the parser, and those values may be shared among different semantic routines. If things are not handled carefully, very bad conditions might occur. An example of inherited attributes: decimal value of a binary rational number (D. Knuth 1968) Todo Evaluating attributes An AAST is a normal AST augmented with attributes for each node (typically corresponding to a symbol in the grammar). The process of defining attributes for each node is the attribute evaluation process. The order in which attributes are evaluated depends on the dependency graph between attributes, which is generated based on the semantic rules. Functional dependencies Every attribute is evaluated based on the values of other attributes. Who defines these dependencies? Semantic functions! Consider a generic production rule p ∈ P: X0 ⇒ X1 X2 … Xr (with r ≥ 0) and consider a generic attribute associated to a symbol, (ωi,j, Xi) (where i = 1…r; remember that one symbol can be associated with more than one attribute). Consider a generic semantic function fp,ω, associated to the production rule, whose arguments {ω1, …, ωm} ⊆ {ωs : ωs ≠ ω, ωs ∈ Attr(Xi), Xi ∈ p}, with m ≤ r, are attributes, other than ω, of symbols appearing in the same rule (we made explicit the association between the function, the production rule and the attribute of a symbol in that production rule). We can formalize the dependencies of an attribute as the set Dep(p, ω) = {(ωs, Xi) : ωs ∈ Args(fp,ω), Xi ∈ p}: thus the set of all attributes appearing as arguments of the corresponding semantic function for that attribute, in the context of the specified production rule. Important An attribute can be evaluated by one semantic function only. If more semantic functions evaluated the same attribute, how would we choose the right one? If more policies are needed, they can be put into the same function as separate cases. Dependencies for a rule It is clear how production rules can have many functional dependencies, one for each attribute associated to each symbol appearing in the rule itself. Considering that a functional dependency depends on a production rule and on an attribute of a symbol inside it, we can collect all functional dependencies inside a rule and define the production functional dependencies set as:
pag 137 Andrea Tino - 2013 Dep(p) = ⋃ Dep(p, ω), the union being taken over all symbols X ∈ p and all attributes ω ∈ Attr(X). Graph of dependencies When an attribute is in the dependency set of another attribute, a binary directed relation can be considered between them. A graph can be built for the dependencies in each production rule. A global graph for the grammar can be built by considering the graphs generated for all production rules. Nodes in the graph are attributes. An arc ω2 → ω1 connects two attributes in the graph when there exists a semantic function calculating ω1 that has ω2 as one of its arguments in the context of a production rule p ∈ P, where the symbols X1 ∈ Σ ∪ V and X2 ∈ Σ ∪ V, with (ω1, X1) and (ω2, X2), are part of that production rule: X1, X2 ∈ p. About the dependency graph The dependency graph collects all dependencies among all attributes of the symbols in all production rules of the grammar. The graph is often represented on top of the AST; this approach is useful only when an input string is available. However, the dependencies can be visualized independently of derivations. To build the graph, the procedure is very simple: 1. List all semantic rules in the grammar for all production rules. 2. For each function ωi = fk(µ1, µ2, …, µm), create a node for each attribute. Also create an arc ωi ← µj for every attribute appearing as an argument of the function: ∀j = 1…m. Mark the rule. 3. Repeat point 2 until all rules are marked. Example of a dependency graph Todo Checking the presence of cycle-dependencies The dependency graph must be an acyclic graph for obvious reasons. If cycles existed, it would be impossible to calculate the attributes involved in them. [Def] Acyclic attribute grammars: An attribute grammar is acyclic when its dependency graph is acyclic. The following result is obvious: [Lem] Parsing attribute grammars: A cyclic attribute grammar cannot be correctly parsed. Looking for cycles When investigating the dependency graph for a production rule it is easy to
pag 138 Andrea Tino - 2013 locate cycles, if any. However, when merging all graphs for all rules, the final graph is much bigger and much more complex. The problem of cycles must be handled at the global level! It is not possible to check for cycles by investigating single rules' graphs. [Theo] Acyclic rule graphs: If the dependency graph of a rule in an attribute grammar is acyclic, it does not follow that the grammar is acyclic too. Please also consider the following results: [Theo] Acyclic graphs for more rules: Given a subset of rules in an attribute grammar, if their dependency graph (built as the union of each rule's graph) is acyclic, it still does not follow that the grammar is acyclic too. The whole graph must be considered to check the absence of cycles: [Theo] Acyclic grammars: Given the set of all rules in an attribute grammar, if their dependency graph (built as the union of each rule's graph) is acyclic, then the grammar is acyclic too. The vice-versa can be written as: [Theo] Acyclic subgraphs: If the dependency graph of an attribute grammar is acyclic, then all dependency graphs of single rules in the grammar, or dependency graphs of combinations of rules in the grammar, are acyclic too. Equations We now know that the biggest graph must be inspected! Again we ask ourselves why an acyclic graph is needed; the answer was provided before: to be able to calculate all attributes. So, let us consider a generic dependency graph (it does not need to be the graph of the whole grammar, a single rule can do as well), and let us try to evaluate the values of all attributes. Let us first introduce the concept of semantic equation. [Def] Semantic equation: A semantic equation is a semantic function ωi = fk(µ1, …, µm). The term equation is used to underline the fact that calculations are needed to compute the values of attributes. [Theo] Existence of a solution for a system of semantic equations: If the dependency graph for a rule or a set of rules is acyclic, then the system of semantic equations derived from the graph has one solution.
pag 139 Andrea Tino - 2013 Proof: If the dependency graph is acyclic, then it is possible to topologically sort all attributes appearing in the LHS of equations by following the order defined by the graph itself. So let ωi → ωj be an arc in the graph between two attributes appearing as the LHS of two different equations. Then attribute ωi is to be listed before attribute ωj. After building the ordered list, equations are ordered according to the order in which the corresponding LHS attributes appear. Starting from the first element of the list, every attribute will receive a value by executing the corresponding function. The proof provides the algorithm to calculate all attributes in a graph following dependencies. We must take the parser's point of view: it handles an AAST and needs to evaluate its attributes. This algorithm is not very efficient because it requires a lot of work from the parser: • The algorithm to generate the ordered list has complexity O(n) in the number of nodes/attributes. • Every node in the AAST must be traversed as many times as the number of its attributes. After building the list, when each function is evaluated the values of its argument attributes are needed; this means that the parser needs to jump to non-contiguous locations in the tree. This continuous hopping among nodes of the tree is the main cause of performance decay. Improving performances Considering that both operations are performed sequentially, attribute evaluation can be a very intensive operation. A possible way to improve performance is to act on special subsets of grammars which enable the parser to complete the tree decoration using fewer traversals of syntax nodes. Methods to evaluate attributes The problem is always the same: evaluating all attributes in the AAST. One approach is the one we introduced before: topological sorting. We also described some computational problems this approach suffers from. In this section we are going to describe a way to improve this methodology. The key concept is the fact that, given the dependency graph, more than one topological sorting is valid to accomplish the attribute evaluation process. Consider the dependency graph shown in the figure. [Figure: dependency graph over the attributes a, b, c, d, e] Functional relations are shown for each attribute. For simplicity, consider that the AST has the same structure; thus the diagram is the AAST with no syntax symbols shown. We can have these valid topological sortings:
pag 140 Andrea Tino - 2013 S1 = {d,e,c,b,a}; S2 = {e,d,c,b,a}; S3 = {c,e,d,b,a}; S4 = {e,c,d,b,a}; S5 = {d,e,b,c,a}; They are all valid because all functional dependencies are observed! There is more than one valid sorting because attributes are not totally ordered: there exist pairs of attributes for which no particular order is defined. So which one should we choose? Keeping in mind what we said before, sorting S5 is the best among those listed! In fact, when evaluating semantic actions following the S5 sorting, the evaluator makes few traversals of the tree. The one described here is a possible improvement: its objective is to minimize the number of node traversals. Scheduling The evaluator is the component of a syntax-directed parser responsible for running the attribute evaluation process. We can have two different scheduling approaches: • Static scheduling: Scheduling is performed when the evaluator is built. Thus the scheduling depends on the grammar but not on the AST. A particular case is fixed scheduling: the sorting of attributes is the same for each production rule. • Dynamic scheduling: Scheduling is performed after the AST is built. Sorting of attributes is performed for every subtree generated by a production rule. Sweeps number Parsers can work using different approaches. Direct parsers build no AST and perform translation immediately. 2-step or classic parsers need to build the AST first and then traverse it. But how many times does the tree need to be traversed? 1-sweep parsers traverse the tree only once and translate the input code; however, more complex grammars need more sweeps. The problem concerning the number of sweeps is purely related to semantics. The question is: “How many times does the semantic analyzer need to evaluate all attributes?“. In some cases more sweeps are needed to evaluate the value of all attributes in the tree. In syntax-directed semantics, two approaches are very common today: • S-attributed grammars: Only synthesized attributes are allowed. • L-attributed grammars: Synthesized and inherited attributes are allowed, with restrictions on dependencies. L-attributed grammars In order to make topological sorting more efficient, we can act on the grammar so as to be able to choose the best sorting. L-attributed grammars are a particular subset of attribute grammars with some properties. [Def] L-attributed grammar: An attribute grammar is L-attributed if all its production rules are L-attributed. [Def] L-attributed rule: A production rule A ⇒ X1 X2 … Xn of an attribute grammar is
pag 141 Andrea Tino - 2013 L-attributed if each attribute ωi,j of each RHS symbol Xi ∈ Σ ∪ V (with i = 1…n) depends only on attributes ωk,s ∈ Ω of symbols Xk ∈ Σ ∪ V with k = 1…i−1 and on A's inherited attributes. Thus an L-attributed grammar is such that the attributes of a symbol depend only on attributes of symbols to its left in each production rule's RHS. This is a very strong restriction. Acyclic grammars Because of the structure of these grammars, we have a very important result here: [Lem] L-attributed acyclic grammars: L-attributed grammars are acyclic. Evaluating attributes How can attributes be calculated in these grammars in a better way than topological sorting? A recursive-descent left-to-right routine can work; actually, this type of grammar can be correctly parsed using one sweep! This is the procedure: 1. Write a function for each production rule (whose LHS is a symbol in a node of the AAST). 2. Each routine takes two arguments: the subtree having the rule's LHS symbol as its root, and all its inherited attributes. 3. When the value of an attribute is needed, a function calls that node's function. The grammar is acyclic, thus the recursive functions will never loop. Using pseudo-code a possible implementation can be: Todo S-attributed top-down grammars Let us examine another class of grammars: top-down grammars where all attributes are synthesized attributes. How to handle attribute evaluation? The problem is that top-down parsing creates the AST from the root to the leaves, while synthesized attributes describe an ascending flow. The directions of syntax and semantics are in contrast! We cannot parse the grammar without sweeping the tree at least once, so this approach is not direct! Evaluating attributes using recursive-descent parsers Although these conditions make top-down parsing and S-attributed grammars look incompatible, attributes can be evaluated without too many problems. Considering that we want to treat generic LL(1) grammars, we are going to analyze how to handle recursive-descent parsing: 1. In each function associated to a production rule (its LHS), semantic functions for the attributes of the RHS symbols are called after calling the parsing functions for those symbols (non-terminals). 2. After gathering the RHS attributes, the attributes of the LHS symbol can be evaluated (a sketch follows right after this list).
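Here is a minimal sketch of steps 1 and 2, written by me under the assumption of a toy LL(1) grammar E -> T E', E' -> + T E' | ε, T -> num, with a single synthesized attribute val. The parsing functions build the AST nodes, while a separate semantic function computes val bottom-up, which results in the two distinct traversals mentioned right below.

#include <cctype>
#include <cstddef>
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// AST node: leaves hold numbers (their lexical attribute), '+' nodes hold two children.
struct Node {
    char op = 0;                               // 0 marks a leaf
    int leaf = 0;                              // semantic value provided by the lexer
    std::vector<std::unique_ptr<Node>> kids;
};

// Parsing functions: one per non-terminal of the LL(1) grammar.
struct Parser {
    std::string s;
    std::size_t i = 0;
    explicit Parser(std::string in) : s(std::move(in)) {}

    std::unique_ptr<Node> parseE() {           // E  -> T E'
        auto left = parseT();
        while (i < s.size() && s[i] == '+') {  // E' -> + T E' | eps
            ++i;
            auto sum = std::make_unique<Node>();
            sum->op = '+';
            sum->kids.push_back(std::move(left));
            sum->kids.push_back(parseT());
            left = std::move(sum);
        }
        return left;
    }
    std::unique_ptr<Node> parseT() {           // T  -> num
        auto n = std::make_unique<Node>();
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
            n->leaf = n->leaf * 10 + (s[i++] - '0');
        return n;
    }
};

// Semantic function: computes the synthesized attribute val of a node
// from the val of its children.
int val(const Node& n) {
    if (n.op == 0) return n.leaf;              // lexical attribute from the scanner
    return val(*n.kids[0]) + val(*n.kids[1]);  // val(E) = val(E1) + val(T)
}

int main() {
    Parser p("12+30+5");
    std::unique_ptr<Node> ast = p.parseE();
    std::cout << val(*ast) << '\n';            // prints 47
    return 0;
}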
pag 142 Andrea Tino - 2013 The approach generates two distinct recursive tree traversals from two different classes of functions: the parsing functions (one for each non-terminal) and the semantic functions. S-attributed bottom-up grammars In this case the directions of syntax and semantics are the same. A bottom-up parser builds the AST starting from its leaves, while synthesized attributes generate an ascending information flow. Thanks to these optimal conditions, parsing is very efficient and requires only small modifications to LR(1) algorithms. Evaluating attributes using LR(1) canonic parsers Attributes are synthesized; this means that, while the parser is building the tree, every new node's attributes can be evaluated at the moment of its creation, as the attributes they depend on have already been evaluated! But a new node is created when a reduction is performed, thus attribute evaluation for a non-terminal is performed on reductions! LR(1) parsers take advantage of stacks. Actually, the algorithm requires one more stack to handle semantic values. As symbols are pushed onto the stack, their attributes are pushed onto the semantic stack. Upon a reduction, a non-terminal is placed on top of the stack after popping symbols from it. The same happens for the semantic stack, where the attributes of the popped symbols are used to evaluate the pushed non-terminal's attributes (pushed on top of the semantic stack as a group). [Figure: symbol stack and semantic stack before and after a reduction. Before: symbols X_i, X_{i+1}, …, X_n with attributes ω_i, ω_{i+1}, …, ω_n. After: the reduced symbols X_i … X_{i+r} are replaced by the non-terminal Y with its synthesized attribute μ, while the remaining entries X_{i+r+1} … X_n and ω_{i+r+1} … ω_n are untouched.] For shift operations, remember that they operate on terminals, so their semantic values have already been provided by the scanner. Handling translation in syntax-directed grammars In this section we have analyzed syntax-directed grammars. In the previous section we had a look at how translation was possible with pure syntax-directed grammars. However, we haven't seen translation in this section so far! How is translation performed? The answer is semantic actions!
pag 143 Andrea Tino - 2013 Semantic actions for translation We can do this using semantic actions. By placing calls to output-writing routines into semantic actions, we can write the output buffer as parsing is being performed. Translation schemes Here we examine how to perform code translation. So far we have analyzed how to conduct parsing and semantic analysis together in one solution. Syntax-directed grammars gave us the possibility to perform translation; however, we had one limitation: we had no way to choose when a semantic action had to take place or, in case of more semantic actions for the same rule, in which order those actions had to take place. When a more fine-tuned syntax-directed translation is needed, a subset of attribute grammars is considered: they are called translation schemes. [Def] Translation schemes: Translation schemes are attribute grammars where it is possible to explicitly define in which order semantic actions take place for every production rule. Important In the previous section we said that one attribute had to be handled by one semantic function only. That still holds! We also treated semantic actions and semantic functions as the same thing; here, this is no longer the case. A semantic action is a routine which does not compute an attribute's value, although it is associated to a symbol. On the other hand, semantic functions do evaluate the values of a symbol's attributes. Please consider the following definitions. [Def] Semantic function: A semantic function is a semantic routine used to calculate the value of one attribute of one symbol. One attribute can have only one semantic function; one semantic function computes the value of one attribute only. [Def] Semantic action: A semantic action is a translation routine used to write the output buffer. One semantic action is associated to one symbol. One symbol can be associated with more semantic actions. In a translation scheme, this difference is important as it separates translation from semantics.
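As a small illustration (mine, not from the original text), the reverse Polish notation translation seen earlier can be expressed as a translation scheme in which semantic actions, here just calls to an assumed print routine, are embedded at the exact positions where output must be emitted:
E -> T R
R -> + T { print("add") } R
R -> ε
T -> F S
S -> * F { print("mult") } S
S -> ε
F -> ( E )
F -> x { print("x") }
Parsing x*x*(x+x) with this scheme executes the actions in the order x, x, mult, x, x, add, mult, which is exactly the output produced by the pure syntax-directed example.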
pag 144 Andrea Tino - 2013 Translating code Semantic functions and semantic actions are not always treated as two separate concepts. Sometimes it is much simpler to insert the code that evaluates attributes in the body of a semantic action. In this case we have a semantic action (thus a routine writing the output code) acting as a semantic function as well. How to translate Semantic actions and functions are placed to the right of symbols in production rules. When the AAST is created, semantic values are evaluated and semantic action calls are placed in the nodes of the symbols they appear to the right of. A depth-first traversal algorithm is used to handle translation, executing the actions in the order they appear in the tree. Since semantic functions and actions are associated to symbols, it is not wrong to say that a symbol owns a routine. Handling attributes However, here again we must be careful, as symbols can be associated with semantic functions and/or actions. For this reason some rules are needed when handling attributes in a translation scheme: • Given a production rule, an inherited attribute of a symbol in the RHS must be evaluated in a semantic function of a symbol preceding it (on the left). • Given a production rule, a semantic action associated to a symbol must not refer to values of synthesized attributes of that symbol (unless such values are written in the action). • Given a production rule, a synthesized attribute of the LHS non-terminal can be computed only after all the attributes it depends on have been computed. A good approach is to place the evaluation code for the LHS symbol's attributes at the last (right-most) symbol of the RHS. Note that these rules cannot always be applied; however, L-attributed grammars can always comply with them. Handling left recursion in translation schemes If we want to use a translation scheme with predictive grammars, we need to remove left-recursion. If semantic actions and rules have already been associated to symbols, the left-recursion removal process must take all associated routines into account in order to preserve the scheme. So, given a left-recursive rule with synthesized attributes like:
A ⇒ A1 α   { (ω, A) = f[(ω, A1), (µ1, α)] }
A ⇒ β      { (ω, A) = g[(µ2, β)] }
We can remove left recursion by applying the following rules:
pag 145 Andrea Tino - 2013 1. Apply the usual left-recursion removal process without considering actions and functions. 2. Make the new symbol have two different attributes. 3. Make the recursive symbol's attribute value propagate through the new symbol. The resulting rules are:
A ⇒ β { (λ1, B) = g[(µ2, β)] } B { (ω, A) = (λ2, B) }
B ⇒ α { (λ1, B1) = f[(λ1, B), (µ1, α)] } B1 { (λ2, B) = (λ2, B1) }
B ⇒ ε { (λ2, B) = (λ1, B) }
The new symbol B will be assigned two different attributes: an input (inherited) attribute λ1 and a synthesized attribute λ2. Static vs. dynamic semantics Semantics is not about semantic attributes only. Semantics covers other aspects of the code like type checking, scope control, visibility rules, element access and so on. A good definition would be that semantics handles control and validation structures. A list of the aspects covered by semantics can be found below: • Type checking: Identifiers are assigned a value and a type. Is the type of a variable valid for the operations being performed? • Scope management: Is this identifier accessible here? Handling member shadowing and overloading for OOP languages. • Accessibility: Can the value of this identifier be evaluated here? • Importing: Is the imported code valid? Where is imported code to be found? Such aspects are an important part of the compiler and must be handled at the semantic level. Types of semantics So, considering that semantics is also a matter of checking, when is this checking performed? • Static semantics: Checking is performed at compile time. The parser runs checking routines while building the AST; if something goes wrong, errors prevent the AST from being completed.
pag 146 Andrea Tino - 2013 • Dynamic semantics: Checking is performed at run time. The compiler, while translating the input code to output, inserts calls to special routines meant to check parts of the code during execution. The world is not black or white! Hybrid solutions exist as well; actually, they represent the most common approach today. Type checking A language is said to be a typed language when every identifier is assigned an attribute called type. A type-value is just one possible value of the type attribute which can be associated to identifiers. [Def] Type system: In a compiler for a typed language, the type system is a collection of rules to assign a type to an expression involving identifiers and other entities to which a type is assigned. Type error When one rule of the type system is not observed, the compiler at compile time (static semantics) or the application at run time (dynamic semantics) can raise an exception. A type error is a condition in which one or more rules of the type system are not observed. Type checker An important component of a type system is the type checker: a component whose objective is checking for type errors. Strongly typed languages A language is said to be strongly typed when its programs compile only if no type errors are found in them. Typed expressions The type system is used especially in those cases where an expression is to be assigned a type. The expression is made of many identifiers, all of which have a type. What is the type of the expression? The type system can answer this question if no errors are found. Type expressions When assigning a type to an identifier or an expression, we do not always assign a simple type value; possible values for types can involve complex structures. Basic types for variables are simple types, for example int or float. Slightly more complex type expressions are those involving new types; for example in C we can have enum myenum or struct mystruct. In modern languages like C++ we can also take advantage of templates; in that case we can refer to types by creating complex expressions like typename T::U. These are all type expressions. A possible list of type expressions is: • Arrays, sequences or matrices. • Containers, composite data structures. • Pointers to other types.
pag 147 Andrea Tino - 2013 • Function pointers. • Generic programming, templates. The “type” attribute As we stated, in a typed language all symbols are assigned a special attribute called type. This does not mean that all elements in the language have a type: those symbols for which a type is not needed have their type attribute set to null. [Def] Type attribute: In a type system, every symbol is assigned an attribute called “type“. The type attribute is a synthesized attribute and can own semantic actions and/or functions. Actions assigned to the type attributes of symbols are typically used to check the type attributes the original attribute relies on. So the type checker can be implemented in terms of semantic actions. Checking type expressions Type checking is performed by analyzing the types of two or more expressions and telling whether they are the same or not. Actually this is not the case for all possible conditions, but it surely is a common case. So the question is: how can we recognize whether two types are the same? • Simple types: If the type values are simple types, the equivalence is immediate. For example: int = int, int != float, class A != class B, struct M = struct M. This happens because simple types are saved as basic types in the type system. • Type expressions: Type expressions are more complicated to handle. Actually, the equivalence depends on the type system and its rules. More formally, the type system must include rules to check equivalence between type expressions. Type expressions equivalence We said that the type system must handle each case one by one according to its rules. Consider for example this fragment of C++ code:
template <typename T>
class TypeA {
public:
    typedef typename T::U TypeAU;
    TypeAU member1;
};

class TypeB {
public:
    typedef unsigned int U;
};
pag 148 Andrea Tino - 2013 Also consider this fragment:
TypeB::U val = 10;                        /* Creating a value */
TypeA<TypeB>* myA = new TypeA<TypeB>();   /* Creating an instance */
myA->member1 = val;                       /* !!Checking here!! */
What happens when the last line is evaluated? The compiler must go through all the definitions and verify that the final type of member1 is the same as that of val. This leads to the concept of structural equivalence. Structural equivalence This is an approach to handle type expression equivalence: 1. Consider both expressions and break each of them into its simple constituents, i.e. the simple types it is made of. 2. Create two ordered lists of basic types and fill them with the values returned from point 1 (in order, one by one). 3. For each element in one list, check that the corresponding element in the other list is the same basic type. If not, the equivalence fails: there is a type error. This approach, however, cannot be applied every time. Consider the following fragment of C code:
struct S;
typedef struct S {
    int a;
    struct S* next;   /* cyclic definition */
} S_t;
Structural equivalence cannot be used on recursive definitions! Another approach is required to handle type expression equivalence in such cases. Type conversion The type system must provide type conversion rules as well. When we introduced type expressions, we did not describe how the type of the whole expression has to be evaluated given the types of each single element in it. Type conversion plays a key role in this context. Built-in (static) conversion Many languages define internal operators and types without letting the user define overloaded versions of those operators (obvious example: C). In this case, when operations are performed on identifiers of different types, the type of the whole expression is calculated according to the rules of the type system. In C we have: int + float = float, int + int = int, float + double = double and again int * double = double. Dynamic conversion Languages like C++ allow the programmer to define how operators should behave when handling types other than the built-in ones. When the compiler encounters an
pag 149 Andrea Tino - 2013 expression where operators, functions and user-defined types are involved, the type system must evaluate all user-defined operators and calculate the final type of the expression. In these cases, type errors usually disguise themselves as different error types. In C++, for example, when an operator has not been overloaded for a user-defined type and a variable of that type is manipulated using that operator, the compiler will not report a type error, but a function-not-found error. The “scope” attribute Another important attribute typically handled by semantics is the scope attribute. In many languages identifiers are assigned a value; however, the validity of this value is related to the scope of the variable. Understanding the concept of scope From a compiler's point of view, scope is nothing more than a numerical value; from a language's point of view, scope is a portion of the code where an identifier can be accessed by other parts of the code in the same scope. [Def] Scope: In a language (thus in a grammar), a scope is the set of all symbols having the same value for the “scope“ attribute. Be careful: scope is one thing and the scope attribute is another. The former is a concept, the latter is an attribute in a grammar. Scope checking Besides type-checking routines, scope-checking routines are used by a compiler to check whether an identifier is being accessed by a part of the code in the same scope. Scope nesting Scopes are not disjoint sets. We can have nested structures. For the sake of scope checking, the scope attribute can be designed to be an array: two elements can be in different scopes, but one element can be in a scope contained in the other one; in that case that member is accessible.
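A minimal C++ illustration of mine (not the author's) of nested scopes and shadowing, which is exactly what scope-checking routines have to resolve; the comments sketch how an array-like scope attribute could be assigned to each declaration of x:

#include <iostream>

int x = 1;                 // scope attribute, conceptually: [0]

int main() {
    int x = 2;             // scope [0,1]: shadows the global x
    {
        int x = 3;         // scope [0,1,2]: shadows both enclosing declarations
        std::cout << x;    // prints 3: the innermost declaration wins
    }
    std::cout << x;        // prints 2: the inner scope has ended
    std::cout << ::x;      // prints 1: explicit access to the global scope
    return 0;
}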
Andrea Tino - 2013 Runtime support and memory management If you can’t make it good, at least make it look good. “ “Bill Gates http://www.brainyquote.com/quotes/quotes/b/ billgates382270.html
pag 151 Andrea Tino - 2013 Overview Modern compilers treat all aspects of programming. When variables are declared and assigned a value, where is that value stored? Handling memory is a serious matter and can impact program execution performance. Memory allocation A compiler cannot decide on its own where to allocate a value. It must interface with the operating system and use system calls for some low-level operations. When a process is started, the operating system reserves a memory area for it where all its resources will be stored; the application code itself is stored in this area. Resource allocation can be performed in two flavors: • Static allocation: It is managed at compile time and depends on the code only. The total amount of memory required must be known at compile time. • Dynamic allocation: It is managed at run time and depends both on the code and on input parameters. It is not possible to evaluate the total amount of memory required by the program at compile time. To understand the difference between dynamic and static allocation we can consider array allocation. With static allocation it is not possible to create an array like this (dummy code):
procedure MyProc(size)
begin
    array MyArray[size];
end
The size of the array is not known by the compiler until the program is executed and parameters are passed to it by the user. The previous code, however, can be handled by a compiler supporting dynamic allocation. Static allocation The first language which introduced this strategy was Fortran77. Static allocation introduces a lot of limitations when programming, for example: • Arrays' dimensions are to be known at compile time. • No recursive routines. • No composite structures without a fixed size. No expanding/shrinking containers. These languages can be very limiting but offer very high performance in exchange. Because of static allocation, all data is accessed in constant time. Furthermore, there are no risks of memory overflow or out-of-memory errors at runtime.
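A minimal C++ illustration of the difference (my example; note that "static" is used here in the chapter's sense of a size known at compile time, not in the sense of the C++ static keyword):

#include <cstddef>
#include <iostream>

void my_proc(std::size_t size) {
    int fixed[100];                 // size known at compile time: can be allocated statically
    int* dynamic = new int[size];   // size known only at run time: requires dynamic allocation

    fixed[0] = 1;
    dynamic[0] = 1;
    std::cout << fixed[0] + dynamic[0] << '\n';   // prints 2

    delete[] dynamic;               // dynamically reserved space must be released explicitly
}

int main() {
    my_proc(10);
    return 0;
}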
pag 152 Andrea Tino - 2013 Dynamic allocation The first language supporting dynamic allocation was Algol58. Dynamic allocation was the key milestone for the development of modern programming languages. We are going to see these algorithms in a dedicated section here. Dynamic allocation strategies Two important dynamic allocation strategies were introduced: • Stack allocation: First used by the Algol58 programming language. • Heap allocation: First used by the Lisp programming language. We are going to examine both of them. Stack allocation This particular strategy takes advantage of a stack to allocate memory. Algol is a programming language based on subprograms (the same as functions or routines). Because of this structure, some hypotheses are made: • Different invocations of the same subprogram generate different memory records; thus every variable inside a subprogram is deleted after the subprogram ends. • Dynamic data structures can be used. • Recursion is supported. • The size of a subprogram's return value must be known at compile time. Activation record Whenever a new subprogram is called, the compiler evaluates the size of its return value and creates a memory record (called the activation record of that subprogram invocation); then the record is pushed onto the stack. Local variables In a subprogram, every new variable which is declared and assigned a value is stored in the stack as part of the record. Return from subprogram When a subprogram returns, the stack is popped until that subprogram's activation record is reached; the activation record's return value is updated and the caller receives the return value. Activation tree By analyzing the code, it is possible to see all points where subprograms are
pag 153 Andrea Tino - 2013 called; there, the current subprogram is temporarily left behind and the stack hosts a new activation record. When considering a subprogram and all the subprograms it can call in its code, we can build a tree having the parent subprogram as the root (the caller) and the called subprograms as leaves (the callees). When performing this operation for all subprograms, we create a call tree: that tree is the activation tree, showing how many activation records will be created throughout the program run. Consider the following example dummy program:
subprogram MyProc() begin # subprogram0
    var a = MyProc1(2);
    var b = MyProc2(3);
    var c = MyProc3(4);
    print a + b + c;
end
subprogram MyProc1(val) begin # subprogram1
    return val + 10;
end
subprogram MyProc2(val) begin # subprogram2
    return MyProc1(val) + 15;
end
subprogram MyProc3(val) begin # subprogram3
    return MyProc1(val) + MyProc2(val) + 25;
end
This code generates the activation tree shown here. Please note that each call is a separate entity even if the same subprogram is involved: two calls to the same subprogram generate two different activation records. That's why we have an activation tree rather than an activation graph. Activation trees are traversed from the root to the leaves and from left to right. This means that each node/subprogram in the tree must terminate before its right sibling can start running. The tree also provides a way to figure out the stack configuration when a subprogram is running: once the node of the currently running subprogram has been located, we know that in the stack the activation record of each parent subprogram lies below that of each of its children. In the example, if leaf node 2 (the call made by subprogram 3) is the currently active subprogram invocation, then the stack will contain the activation records of subprograms 0, 3 and 2, from the bottom to the top of the stack. The figure below shows both the activation tree for the example above and the stack layout when subprogram 2 is being executed.
pag 154 Andrea Tino - 2013 [Diag. 1 - Stack layout: the activation records of subroutines 3 and 2, each holding parameters, return address and locals; the Stack Pointer (SP) points to the top of the stack and the Frame Pointer (FP) to the current record. Diag. 2 - Activation tree: root 0 with children 1, 2 and 3; node 2 calls 1; node 3 calls 1 and 2 (which in turn calls 1).] The stack structure When a subroutine is called, the stack hosts its activation record. Everything concerning the subroutine becomes part of the record. Two pointers are always kept by the system: • The Stack Pointer (SP): The pointer to the top of the stack, thus the address of the top element in the stack. • The Frame Pointer (FP): Also called Stack Base Pointer, it is a more convenient pointer to the core information of the currently executing subprogram. Upon calls, the pointers' values are re-defined to point to the correct locations. Activation sequence When a subprogram is called, some operations are performed. The sequence of actions is called the activation sequence and results in the creation of the activation record for that specific subprogram call. The most important actions are: 1. Parameter data to be passed from the caller to the callee is placed at the beginning of the activation record. The return value is also initialized and placed here. 2. Data whose size is known is placed in the central part of the activation record. 3. Data whose size is unknown is placed at the end of the activation record. During an invocation, after the activation record has been created, the callee performs all its operations and, when returning, updates the return value of the activation record. The caller will find the return value in the activation record; at the end of the sequence, the activation record is popped from the stack together with all the data inserted by the callee. To recover execution, the activation record's Return Address can be used.
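Purely as a conceptual sketch (real layouts are machine- and compiler-dependent, and the field sizes below are arbitrary assumptions of mine), an activation record could be pictured as a C++ struct whose fields mirror the three areas listed above:

// Conceptual layout of an activation record; the field order mirrors the
// activation sequence described above.
struct ActivationRecord {
    // 1. filled in by the caller
    int         parameters[2];    // data passed to the callee
    int         return_value;     // initialized here, updated by the callee
    const void* return_address;   // where execution resumes in the caller
    // 2. data whose size is known at compile time
    int         locals[4];
    // 3. data whose size is not known at compile time would follow here,
    //    at the end of the record, so that the offsets above stay fixed
};

int main() {
    ActivationRecord record{};    // in reality the compiler builds this on the stack
    record.return_value = 0;
    return record.return_value;
}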
pag 155 Andrea Tino - 2013 Stack allocation: advanced concepts Remember that the stack stores the activation records of the currently active subprograms. Because of this, remember that objects' lifetime is tied to subprogram invocations. Lexically nested subprograms and nested scope management It is possible to nest subprogram definitions. This is the case for Algol in particular, but nesting can happen at every level (for example classes in C++). We will consider subprogram nesting, but the following is valid for any language construct in the context of a compiler handling memory using stack allocation. Furthermore, this description provides a good overview of how scoping can be managed from a generic point of view. Consider the following code:
subprogram MyProc(var) begin # subprogram0
    subprogram MyNested1(nes_var) begin # subprogram01
        return var * nes_var * 10;
    end
    subprogram MyNested2(nes_var) begin # subprogram02
        return var + nes_var + 100;
    end
    return MyNested1(var) + MyNested2(var); # using internal resources
end
Subprogram 0 has access to two subprograms, namely subprograms 1 and 2, but cannot access their locals because they belong to an inner scope. On the other hand, it is normal for subprograms 1 and 2 to have access to subprogram 0's locals: they belong to the same scope after all. How can this be managed? When a subprogram is running, in its activation record (in particular, the area where locals are stored) we must add something pointing to the locals area of other activation records: those of the enclosing subprograms! The role of the access link This pointer is called the Access Link (AL) and it points to the locals area of the activation record of the innermost subprogram enclosing the current one. The access link is used to point to all the local data of the enclosing scope of a subprogram. For example, subprogram 1's activation record (when created) will have its access link set to subprogram 0's activation record (locals area). When having many nested subprograms, the innermost subprogram (let np ∈ ℕ be its nesting depth) can access all the locals of all enclosing scopes! In that case, when it needs a resource from one of the enclosing subprograms (say its depth is nq ∈ ℕ, thus nq ≤ np), its access link alone will not suffice: a chain of np − nq access links will have to be followed. Handling recursion Recursion can be thought of as a particular type of nesting where the enclosing routine is the same routine. In that case the access link of the nested activation record will point to the
Handling recursion
Recursion can be thought of as a particular type of nesting where the enclosing routine is the same. In that case the access link of the nested activation record will point to the same address as the enclosing activation record's access link.

Subprograms as resources
In the example we saw how a subprogram can call other subprograms; however, some subprograms are defined inside another subprogram. These nested subprograms are a sort of special private routine: the address of their code is saved among the locals of the enclosing subprogram's activation record. In the example, subprogram 0's activation record holds the code addresses of subprograms 1 and 2.

The role of the display
When the nesting depth gets higher, it takes time for the innermost subprograms to access resources of subprograms whose nesting level is low: the complexity of an access is O(n) in the number of nesting levels (the nesting depth). To solve this problem, some implementations take advantage of the display: a technique which stores, for one subprogram, the links to the locals areas of all enclosing subprograms' activation records.

[Diag. 1 - With display; Diag. 2 - Without display: the same chain of activation records (parameters, return address, locals), linked through a display in the first case and through access links in the second.]

This makes resource lookup much faster, as in the sketch below.
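Continuing the previous sketch (names and layout are, again, assumptions), with a display the same lookup becomes a single indexed access instead of a chain of hops:

#include <cstddef>
#include <vector>

struct DisplayRecord {
    std::vector<DisplayRecord*> display; // display[d]: record of the enclosing subprogram at depth d
    int                         locals[8];
};

// Reaching the locals of the enclosing subprogram at depth n_q costs O(1).
int read_non_local_fast(DisplayRecord* current, std::size_t n_q, std::size_t slot) {
    return current->display[n_q]->locals[slot];
}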
Heap allocation
This strategy is a lot more permissive than stack allocation:
• Objects' lifetime is not related to routine invocation.
• Recursion is supported.
• Dynamic data structures can be created.
• Routines can return data structures whose size is not known at compile time.
With heap allocation, the memory is seen as a contiguous interval of addressable locations. When a certain amount of space is needed, the compiler inserts special calls to reserve a contiguous interval and make it available.

Allocation and deallocation
When an object is to be stored, the compiler inserts calls in the output code to reserve a contiguous interval of locations. A pointer to the first address of the reserved interval is needed in order to reference the object. Deallocation is responsible for freeing memory locations; the pointer to the object is needed here as well. One condition is important: the amount of reserved space must be known. Local variables in the source program will act as the pointers to these memory locations.

Preserving consistency
The main problem of memory management is keeping memory consistent. All locations are pointed to by variables; however, objects can contain variables pointing to other objects, so the memory blocks in the heap form a graph of references. When one object is freed, the references it held disappear. Some questions arise:
• What happens to all objects that pointed to a freed block? Their references must be set to null or a run-time error will occur.
• What happens when a freed object pointed to another object? The latter remains in memory, but the number of objects pointing to it has decreased. If no more objects point to it, it becomes unreachable!
• What happens when an object becomes unreachable? It is no longer useful and must be deallocated. But how? No other object in the program has a link to its location!
The problem of unreachable objects is a serious one. In a purely compiled language, unreachable objects are an unsolvable situation: once an object becomes unreachable, unless the compiler uses particular techniques, that memory location is wasted.

Dummy deallocation
A very basic approach is: when deallocating an object, deallocate all objects pointed to by it as well. But what happens to shared objects? With this strategy, no shared objects are allowed: when two objects need to point to a common object, a copy of the latter is created. This is a solution, although very demanding in terms of memory.

Cells and liveness
Some important concepts can be considered in heap allocation:
• Cell: a contiguous interval of addressable memory locations in the heap, reserved for a
particular object. Every cell has an address that makes it possible to access that cell's contents.
• Roots: pointers to cells can be stored in separate memory areas (the stack, global memory, etc.) or in reserved areas in the heap itself. These locations are called roots: they are the access points to the data stored in the heap.
• Liveness: a cell is said to be live if its address is contained in a root or in another live cell.

Garbage
When a cell becomes unreachable, it is live no more: that cell becomes garbage.

Garbage Collection (GC)
GC is a (not so) modern technique to clean the heap from garbage. There are many approaches to GC; they all depend on the type of language/compiler. That being said, always remember that GC is not part of the language itself: it is carried out by a component related to the language and the compiler. We can have two possibilities:
• The compiler takes responsibility for handling memory. This means that every memory operation triggers a memory check routine (scheduled by the compiler upon object creation). This is a very demanding strategy.
• The compiler simply provides means to allocate and deallocate objects; it does not ensure memory consistency.

Garbage Collector
GC is performed by a separate component while the program runs. There is a problem: every compiler treats resources in the heap using its own rules and policies, so it is not possible to create one common GC for all possible languages. The GC must be part of the language architecture; it is not part of the compiler, but it must be able to access cells in the heap (thus the GC needs to understand the allocation policies of the compiler). The conclusion is that every language architecture has its own GC (when a GC is part of the architecture at all).

Example of garbage
Consider the following C++ code:

void create_garbage() {
    int* var1 = new int(12); /* creating var1 --> cell1 */
    int* var2 = new int(20); /* creating var2 --> cell2 */
    var1 = var2;             /* now var1 --> cell2 <-- var2; cell1 becomes unreachable */
    delete var1;             /* cell2 is freed; var2 now dangles, possible runtime error */
}

C++ is a language whose architecture does not include a GC. For this reason memory consistency is up to the programmer, a task which can become quite troublesome as a program grows, with many objects, classes and data structures (and shared objects).

Managed languages
Today many languages rely on GC; the following table provides a brief overview of GCs out there:
Language / Architecture / Access point / Description:
• C#, VB.NET, F# / Microsoft .NET Framework / System.GC / An advanced garbage collector running as a separate low-priority process while .NET applications run.
• Java / Oracle Java platform / java.gc / A separate low-priority thread.
• ActionScript / Adobe Flash Player / as.utils / The Flash Player acts like a virtual machine.

Today, OOP and functional languages are the most important categories of languages targeted by GC architectures. For procedural languages, separate components exist as libraries that a program can use as an extension to support GC.

Garbage Collection
No matter what dynamic allocation strategy a compiler uses, if a GC is included in the language architecture, each GC can act differently depending on its objectives.

The ideal GC
Since a lot of languages take advantage of GC, today we have a very wide range of garbage collectors. Which one is the best? Some principles are to be observed by a GC to be a good GC:
• The GC should be completely transparent to the running program: the user should not realize that a GC is working in the background.
• The GC must have access to all data handled by the program, but should not interfere with resources while they are being used.
• A GC must locate and work only on garbage cells.
• A GC should be fast when accessing resources in order to reduce overhead.
• A GC should keep the heap consistent.

GC or not GC? That is the question!
One of the first languages using GC was Lisp. Its first implementations were sadly known for being slow and troublesome, halting on every GC cycle. Today, however, GC has stopped being a problem: computer architectures got faster, and GC algorithms got more efficient as well. There
are still applications where a GC is unsuitable (think of real-time applications), but for normal programs the typical dilemma is:

[Stat] The GC dilemma: which is more important? Software which is free of memory leaks and allocation errors, or software which is a little faster but less reliable?

Today's GCs provide solutions that are a very good trade-off between the two sides of the afore-mentioned dilemma. We are now going to describe the most common GC strategies used today in many computer languages.

GC through Reference Counting
This was one of the most popular approaches to GC; it was adopted by Java (Sun Microsystems) and by Microsoft COM (Component Object Model) as well.

The idea
The basic idea is equipping every heap cell (thus, every object) with a counter. This number counts the number of other objects pointing to that cell. If a graph of linked objects were drawn, the counter for each object would be equal to the number of incoming edges for that node. When the counter becomes zero, the cell must be collected.

Problems
The problem is that the GC complexity is distributed all over the heap, on every single cell. When a cell is collected or a reference is updated or deleted, counters must be updated. So, given an object, how do we locate all objects linking to it? The problem is not simple.

Implementation
Every cell comes, together with its address, with a field called rc, which stores the reference counter. To correctly handle counters for all objects in the heap, the following functions are considered:
• Object creation: when a new object is created, its reference counter is initialized to 1.
• Object assignment: when an object is assigned to a variable, a new reference to that object is created; the counter must be incremented.
• Object de-assignment: when a reference to an object is removed, the reference counter of that object must be decremented and a zero-check performed.
• Object deletion: when an existing object is deleted, the reference counters of all objects it pointed to must be decremented and a zero-check performed for each of them.
• Zero-check: the checking routine verifies whether the reference counter is 0; if it is, the cell address is inserted in the freelist.
Every time an object is created, deleted or updated, the compiler emits calls to these functions in order to handle references; a minimal sketch of these routines follows.
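The following sketch shows how the five routines above could cooperate with a freelist. It is illustrative C++ with assumed names (Cell, rc, create, assign, ...); it is not the book's nor any runtime's actual API.

#include <vector>

struct Cell {
    unsigned char      rc = 0;   // the reference counter
    std::vector<Cell*> children; // references held by this object
};

std::vector<Cell*> freelist;     // cells whose counter reached zero, collected later

void zero_check(Cell* c) {       // Zero-check: schedule the cell for collection
    if (c->rc == 0) freelist.push_back(c);
}
Cell* create() {                 // Object creation: the counter starts at 1
    Cell* c = new Cell;
    c->rc = 1;
    return c;
}
void assign(Cell* c) {           // Object assignment: one more incoming reference
    ++c->rc;
}
void deassign(Cell* c) {         // Object de-assignment: one reference removed
    --c->rc;
    zero_check(c);
}
void destroy(Cell* c) {          // Object deletion: release everything it pointed to
    for (Cell* child : c->children) {
        --child->rc;
        zero_check(child);
    }
}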
The freelist
If the GC had to free memory every time an object's reference counter reached 0, the performance of the program would decrease noticeably. The actual approach is different: every time the counter reaches 0 for a cell, that cell's address is inserted into a list called the freelist. At regular intervals, the GC issues a garbage collection procedure which frees every cell referenced by an entry in the list; every collected cell causes the corresponding entry to be removed as well. An object is really gone when its entry in the freelist is removed.

Smarter object deletion: late decrement
We said before that when an object is removed, all objects it pointed to must have their reference counter decremented by one. This approach can lead to serious performance decay when the removed object references many other objects. A common solution is to postpone the decrements to the moment the object is physically removed from the freelist.

Collection frequency
The freelist keeps track of cells to be collected. But when should the garbage collection routine be issued?
• It can be called periodically, after a fixed amount of time.
• It can be called when the memory is full.
• It can be called when the freelist reaches a certain size.

Drawbacks of Reference Counting
Although quite common in early times, this approach started showing some problems. We are going to describe the most relevant ones.

Memory
First of all, every cell must contain additional memory to host the reference counter. It does not look like a big concern, but consider the number of objects typically created during the execution of a program: for each one of them, additional memory is used. The overhead becomes significant when applications get memory demanding. One byte (8 bits) is the usual size reserved for the reference counter; once the counter reaches 255, further increments leave it unchanged.

Cycles (non) tolerance
The most important problem is one: reference counting cannot handle cyclic references among cells. Cyclic references generate a memory leak upon deletion of all entry points. An entry point of a reference cycle is a cell (not part of the cycle) which references cells inside the cycle. When all entry points of a cycle are deleted, nothing can reach those cells anymore. The trouble is that, for reference counting, an object is unreachable when its counter reaches 0; in a cycle, however, a group of cells becomes unreachable while their counters are still non-zero. That's why the following should always be considered:
[Lem] Unreachability in reference counting GC: in reference counting GC, unreachable cells are cells whose reference counter becomes zero, or groups of cells, part of a reference cycle, whose reference counters have all reached the value 1.

Trying to handle cycles
Cycles can lead to memory leaks. How to deal with this? Reference counting alone cannot manage the situation; something more is necessary. Possible approaches are:
• The programmer must ensure that reference cycles are not created by the code he writes; the compiler should detect cycles and return an error.
• Combining reference counting with another garbage collection strategy.

Advantages of Reference Counting
Reference counting is not so bad after all. Some advantages can be considered.

Easy to implement
The strongest point is surely that reference counting is quite easy to implement. This also makes the whole GC lightweight and fast.

Homogeneous overhead
As stated previously, the overhead is distributed homogeneously over the whole computation. The GC does not start a collection cycle that blocks the program to check for garbage and collect it. The program itself updates references while running, and garbage is automatically put in the freelist by the program (thanks to the special routines emitted by the compiler when handling memory management instructions). The greatest part of the job is done by the running program, not by the GC!

Space and time overhead
As stated before, memory overhead comes from the space required, for each cell, by the reference counter. Time overhead is experienced as well every time reference counters must be updated. A particularly serious overhead is the one related to reference updates: when the pointer to an object is changed, the counter of one object must be decremented and the counter of the other incremented.

GC forgetfulness
It is not that rare to forget updating reference counters. There are many reasons for this, but the consequences are what matters most:
• Forgetting to increment: when a reference counter is not incremented for an object, strange bugs can be experienced. Segmentation faults can occur too, as the object might be collected while other objects still point to it!
• Forgetting to decrement: this always leads to memory leaks, as the object will persist in memory for the entire duration of the program run. If an object is removed and an object it pointed to does not have its reference counter updated, the latter will never be inserted in the freelist as a direct consequence.
To make the cycle problem concrete, consider the sketch below.
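An illustration of the cycle problem, reusing the assumed Cell/create/assign/deassign helpers from the earlier sketch (hypothetical code, not from the text):

void make_cycle() {
    Cell* a = create();                    // rc(a) = 1, reachable through the local a
    Cell* b = create();                    // rc(b) = 1
    a->children.push_back(b); assign(b);   // a --> b, rc(b) = 2
    b->children.push_back(a); assign(a);   // b --> a, rc(a) = 2
    deassign(a);                           // the locals (the entry points) disappear
    deassign(b);                           // rc(a) = rc(b) = 1, yet nothing can reach them:
}                                          // reference counting will never collect this garbage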
GC through Mark & Sweep
It seems that the first implementations of Lisp adopted this particular GC strategy.

The idea
The approach puts all GC responsibilities on the GC itself (while reference counting put some of them on the program). During every collection cycle, the GC acts on the reference graph generated by all references among cells/objects in the heap, looking for unreachable cells. Unreachable cells, when found, are inserted in the freelist.

The algorithm
Each cell in the heap is equipped with one bit, the mark-bit, initialized to false. When a collection cycle is issued, the following operations are performed:
• Marking phase: starting from the roots, the GC traverses all reachable cells and, for each reached cell, sets its mark-bit to true. At the end of the process, part of or the whole reference graph has been traversed and marked.
• Sweeping phase: the GC looks in the heap for all cells that did not receive the mark (their mark-bit is false). These cells could not be reached by the marking phase, which means they are unreachable; thus, they are inserted in the freelist.
In early implementations, every GC cycle caused the program to stop. Today, however, it is possible to have the GC run on a separate low-priority thread. A compact sketch of the two phases is given below.

Drawbacks of Mark & Sweep
The algorithm is very powerful, but it introduces some serious overheads.

Responsibilities
Compared to reference counting, the algorithm is much more complex because it handles everything.

Graph traversal
The marking phase is seriously demanding from a computational point of view, as it causes the reference graph to be traversed entirely during every GC scan. Traversing a graph is not a simple matter. Furthermore, the traversal is not performed once: it is performed as many times as there are roots.

Cost evaluation
Graph traversal can be performed with a wide range of strategies. One of the most powerful algorithms is DFS(#), which has complexity O(|E|) for a generic graph (V, E). For big graphs, the algorithm might require a lot of time. However, we can relate the number of edges to the number of nodes through a multiplicative factor n_V, so we can say that DFS executes in time O(n_V·|V|).
(#) Depth First Search. http://en.wikipedia.org/wiki/Depth-first_search
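Before refining the cost analysis, here is the sketch of the two phases promised above: illustrative C++ with assumed types (MSCell, the heap and roots vectors, the freelist), with the marking done iteratively through an explicit stack, a point the text comes back to later.

#include <vector>

struct MSCell {
    bool                 mark = false;
    std::vector<MSCell*> children;     // outgoing references
};

void collect(std::vector<MSCell*>& heap, std::vector<MSCell*>& roots,
             std::vector<MSCell*>& freelist) {
    // Marking phase: DFS from every root; reached cells get their mark-bit set.
    std::vector<MSCell*> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        MSCell* cell = stack.back();
        stack.pop_back();
        if (cell == nullptr || cell->mark) continue;
        cell->mark = true;
        for (MSCell* child : cell->children) stack.push_back(child);
    }
    // Sweeping phase: every unmarked cell is garbage and goes to the freelist.
    for (MSCell* cell : heap) {
        if (!cell->mark) freelist.push_back(cell);
        else cell->mark = false;       // clear the bit for the next cycle
    }
}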
We need to make the cost evaluation of graph traversal more precise, as not all nodes are effectively traversed: only reachable (marked) nodes are, thus a subset of the graph. So consider a set W ⊆ V of nodes: DFS executes with cost O(n_W·|W|) (a new factor is needed). Now let us consider another important component of the total cost of each GC scan: the heap size |H| (H is the set of cells), thus the number of cells in the heap. The GC needs to examine cells when looking for unmarked nodes (nodes in the graph are cells in the heap), which means the sweep phase has cost O(|H|). Again, we can fine-tune the evaluation, as only unmarked cells are actually taken care of by the GC; the cost becomes O(n_H·|H|). So what about the whole cost? A GC scan has cost O(n_W·|W| + n_H·|H|).

Advantages of Mark & Sweep
All advantages introduced by Mark & Sweep concern both the program and the compiler, when compared with reference counting.

Responsibilities
The algorithm now takes responsibility for everything. The compiler does not need to emit special memory management procedures when handling memory instructions in the source code. This makes the compiler more lightweight; the compiler can do its job, compiling the code, instead of worrying about something that should be controlled by the GC.

Faster applications
The program does not need to perform extra operations when allocating new variables or deleting objects; its work is much easier. This removes all the overheads introduced by reference counting: the program simply creates/removes cells and references without any further concern.

Tolerance to cycles
Differently from reference counting, the algorithm will always find all unreachable cells, also in the case of reference cycles.

Improving Mark & Sweep
Some improvements can be considered for this algorithm. The literature is actually quite full of articles on this matter, due to the fact that Mark & Sweep represents a very good solution today.

Sizing the heap
How to decide the total size of the heap? This is actually something decided by the operating system; however, through system calls, the GC can decide whether to shrink or expand the current heap. There is a way to dimension the amount of space needed in the heap. Everything revolves around one quantity: the cost per unit of free space, or relative cost, defined as the ratio between the GC scan cost and the amount of free space (the space that will be freed):

η = (n_W·|W| + n_H·|H|) / (|H| − |W|)

If the heap is almost full, then |H| − |W| ≈ 0 and the relative cost is very high; on the other hand, the relative cost is low when the heap is almost empty. It is possible to set thresholds on the relative cost: once the top threshold is reached, the GC can ask for more heap space; if the bottom threshold is reached, the GC can ask to shrink the heap (see the small sketch below).
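A tiny sketch of how the two thresholds on the relative cost could drive the decision; every name and threshold value is an assumption made for illustration.

enum class HeapAction { Grow, Shrink, Keep };

// η = GC scan cost per unit of space that will be freed.
double relative_cost(double nW, double W, double nH, double H) {
    return (nW * W + nH * H) / (H - W);
}

HeapAction decide(double eta, double top, double bottom) {
    if (eta > top)    return HeapAction::Grow;    // heap nearly full: collecting costs too much
    if (eta < bottom) return HeapAction::Shrink;  // heap nearly empty: give memory back
    return HeapAction::Keep;
}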
It is also possible to use the ratio |W|/|H| instead of the relative cost (which is a better heuristic).

GC killer
Considering that during every GC scan the GC must traverse the entire graph, a recursive DFS approach is to be avoided! In the worst case, recursive DFS generates as many nested calls as there are nodes in the graph, and the GC would create as many activation records to handle the recursion: the GC would occupy more space than the program being collected! The solution is to use an iterative DFS implementation and save pointers to the nodes still being processed in a data structure, keeping the memory overhead as low as possible.

Pointer reversal
This is a very powerful technique that lets the GC avoid using extra memory to remember which cells still have to be processed. Before diving into this strategy, let us first consider how Mark & Sweep works during the marking phase, by simply making cells/nodes fall into one of the following categories:
• Objects that have not been marked.
• Objects that have been marked, but may still point to unmarked objects.
• Objects that have been marked and point to marked objects only.
As marking proceeds, objects move from the first category to the second and from the second to the third. The GC needs to keep track of objects in the second category. By performing DFS on the graph and flipping pointers as nodes are traversed, there is no need to memorize those objects in a separate structure: the reversed pointers themselves record the path back and are restored as the traversal backtracks (see the sketch below).
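To close the section, here is a compact sketch of pointer-reversal marking in the Deutsch-Schorr-Waite style, assuming for simplicity that every cell has exactly two outgoing pointers (left and right) plus a mark bit and a flag bit; the names and the two-pointer layout are assumptions for illustration, not the text's own implementation.

struct Cell2 {
    bool   mark = false;   // set on first visit
    bool   flag = false;   // false: exploring the left subtree, true: the right one
    Cell2* left = nullptr;
    Cell2* right = nullptr;
};

// Marks every cell reachable from root without an explicit stack: the pointer
// we arrived through is reversed to remember the way back, and restored
// while backtracking.
void mark_pointer_reversal(Cell2* root) {
    Cell2* prev = nullptr;
    Cell2* cur  = root;
    while (true) {
        // Descend: follow left pointers, reversing them into the parent chain.
        while (cur != nullptr && !cur->mark) {
            cur->mark = true;
            Cell2* next = cur->left;
            cur->left = prev;          // left temporarily stores the parent
            prev = cur;
            cur  = next;
        }
        // Retreat: parents whose right subtree is finished get their pointer back.
        while (prev != nullptr && prev->flag) {
            Cell2* next = prev->right;
            prev->right = cur;         // restore the right child
            cur  = prev;
            prev = next;
        }
        if (prev == nullptr) return;   // back past the root: marking is complete
        // Switch: the left subtree of prev is done, explore its right subtree.
        prev->flag = true;
        Cell2* next = prev->left;
        prev->left  = cur;             // restore the left child
        cur         = prev->right;
        prev->right = next;            // right now stores the parent
    }
}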

Creating a compiler for your own language

  • 1.
    Creating a Compilerfor your own Language August 2013 University of Catania Andrea Tino How to design a programming language and develop a compiler/interpreter to make it alive. Today’s most common tools and basic theoretical background. Tech article
  • 2.
    Andrea Tino -2013 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http:// creativecommons.org/licenses/by-nc-sa/3.0/. License This work is not complete. You may find errors, mistakes and wrong information inside. Draft
  • 3.
    pag 3 Andrea Tino- 2013 About this work This work is the result of a few weeks I spent studying for my exam about compilers and interpreters; I had to struggle to find all the infromation I needed: this took time both for retrieving good studying material and then sorting and putting all pieces in the correct place and order. Well I could’ve just followed classes, you re definitely right, but I could not as I was abroad for an internship and once back I had to sustain an exam without having attended any lesson. I thought I could provide students with a good reference for their exams, something they could use both for studying and having a fast look when something does not come in mind. Definitely this is not (at the moment) something meant to 100% replace a real book about such a big and complex subject. So do not misunderstand me: do not skip classes, take your notes and use them. Maybe you can just avoid buying one of those holy bibles about compilers with 1000 pages which you will read only 200. Who is this for? This (quasi)book is for everyone but especially students in an Engineering course at university. It requires some background in languages, programming and Computer Science. Respect the rights This work is meant to be free and it is licensed accordingly as reported in the previous page. You can download and read it, you can copy it and distribute it as it is. Please do not take parts from here without attributing those to me (since I spent time on this and nothing inside these pages comes from a cut & paste). You are NOT allowed to sell this book or parts of it. Thank you. Reporting errors This work might contains errors, typos, mistakes. If you detect a problem and you want to help me fix it, please report it to me at andry.tino@gmail.com. Thank you :)
  • 4.
    Andrea Tino -2013 Introduction to compilers Human language appears to be a unique phenomenon, without significant analogue in the animal world. “ “ Noam Chomsky http://www.brainyquote.com/quotes/quotes/n/ noamchomsk398152.html
  • 5.
    pag 5 Andrea Tino- 2013 What is a compiler? A compiler(#) is a program allowing the translation from a language to another. So a compiler is not necessarly a program like GCC or G++ that translate C or C++ languages into machine code. Adobe Acrobat Reader is a compiler for example: it enables the translation from an Adobe proprietary language to a readable document. A compiler is also a program that transforms a sequence of mathematical calculations into an image; vector graphics work this way. Basically a compiler is nothing more than a translator from a format to another one. The input of a compiler is a language called Source Language. What a compiler returns is the output language: Object Language. If something goes wrong during the compilation process, the compiler might return some errors as well together with a non-working output. Theoretically speaking When building a compiler, a lot of elements get involved into this process. The theoretical basis is quite large spanning from Memory Management Algorithms, Graph Theory to many Mathematical fiends like Lattice Thory, Posets, State Machines, Markov Chains, Stochastic Processes and so on. Such a knowledge is not an option. You’re not alone But bad news is not everything. Some good news are as well. When designing and developing a compiler, some tools can be of help out there in the network. Many of them can assist the developer during the most important phases of a compiler programming. Here we will see some of them at work and we will also detail how to take advantage of them for our purposes. Classification It is not possible to speak about a programming language and its compiler by keeping them separated. When trying to create categories for compilers, we need to create a classification on languages. There are many ways to classify compilers. Number of steps (#) With the word “compiler“ we also refer to interpreters.
  • 6.
    pag 6 Andrea Tino- 2013 When a compiler runs, it processes the source code in order to produce the target laguage. This operation can be performed in one step or more. A step is the number of times that the source code is read by the compiler. One-step compilers are very fast, while multiple-step compilers usually take more time to complete the task. A multiple-step compiler may also need to jump from one part of the code to another, this includes the possibility to go back to some statement read before. Optimization Some compilers are developed with specific targets in mind. When talking about optimization, one must always think about a target function depending on some parameters; the purpose of an optimization problem is to minimaze or maximize the value of this function. Depending on the quantity expressed by the target function, the optimization is focused on a specific parameter. • No optimization: No optimization is performed by the compiler. • Memory optimization: The compiler tries to minimize the memory occupation of the output. • Time optimization: The compiler tries to be as fast as possible during the translation process. • Power optimization: The compiler tries to minimize the power usage during the translation process. In software industry, large source files take hours to compile; using compilers optimizing power allows companies to save energy (ans, so, money). Optimization is not a rare feature of compilers. Interpreted vs. compiled languages We can develop a compiler or an interpreter. They both carry the same target, but the difference between a compiler and an interpreter relies on how the language is processed before being executed and during its execution. Compiled languages For a compiled language, when the programmers writes a program, this is compiled and turned into some other language. This process occurs only once. After it, the compiler’s output is something that does not require the compiler anymore. When a language is compiled into a program, considering the special case of executable code, this can be executed without calling the compiler anymore, the machine can execute all operations since it has everything needed. Interpreted languages When a language is interpreted, on the other hand, the compiler will process one language statement at a time, the output will be provided to the machine (or to some other component) in order to execute some operation. The source code will be inspected many times one statement at a time; every statement, after compiled, is passed to the machine. Such an approach is faster than a single compilation, but the user will experience a slower output as
  • 7.
    pag 7 Andrea Tino- 2013 every new part of the source language is compiled everytime it is needed. This also makes the interpretation process much less resource-demanding. Hybrid solutions The world is not black & white. The fight is not a one on one between compiled and interpreted languages. In the last 20 years, new hybrid solutions were developed. Compiled languages Interpreted languages Hybrid solutions Compilation time They take much time to be processed. For large sources, it might require even days. Each statement process- ment is very fast and takes few seconds. Compilation to bytecode and to machine code hap- pen on different times. This makes the whole compilation faster. Execution time Very fast. For long sources, execu- tion may cause freezing and low performance. Virtual code must be in- terpreted, this slows down performance. JITs can be used to fasten the process. Resources at compilation time They may need a lot of memory to complete. Very resource demanding. Allocated space is not much and (nearly) freed after every statement is processed, Very few resources are used since the bytecode is machine independent. Output code size Outputs can be very large depending on sources. Lightweight outputs. Bytecode is very light. Design complexity Many components in- volved, many software patterns can be applied. Hich complexity. Very simple to design. Few components and loose coupling. Easy to design. The compiler to bytecode is simple do design. The virtual machine must be designed for each supported architecture. Implementation complexity Many components, many tools. More programming languages may be used. High complecity. Easy to implement. Fast algorithms and many tools. Compiler to bytecode is simpler to implement. The virtual machine has different implementations for each supported architecture. Target language type: machine-dependence Basing on the type of the target language, we can find three different types of compilers: • Pure machine code: These compilers generate machine code for a specific machine instruction set, thus able to run on a specific architecture. Such code does not make assumptions on the system it is run on, no external library, no system function is used. The code is pure, meaning that it is handled by the machine in the lowest level of the architecture. This approach ensures high performance, for this reason it is quite common when developing
  • 8.
    pag 8 Andrea Tino- 2013 operating systems. • Augmented machine code: The object code is generated basing on a specific machine istruction set, but it is enriched with calls to some system functions and libraries (system calls). They are often intended to handle very basic operations such as memory allocation, deallocation, I/O access and so on. Augmenting code makes the output not so machine- dependent; the operating system ensures that a considerably large part of the target code depends on the operating system too. All calls to the system are compiled as they are and not really translated, in this case, however, the compiler will not produce a working code, an external component is necessary in order to link all system calls to the corresponding system functions (thus, the Linker). Loose coupling between the compiler and the machine guarantees high performance and good levels of maintenance. Many general purpose applications can be developed using this approach. • Virtual machine code: These compilers generate virtual machine code. It often reffered to as Intermediate Code or Bytecode as it cannot really be handled by the machine; it needs a Virtual Machine to translate this code into real machine code. With this approach a new layer is added between the compiler and the machine, the virtual code is interpreted by a program that must be installed into the operating system. The pro of this solution is that it is possible to have real machine independent languages, the machine dependence is moved onto the virtual machine which is in charge of translating the virtual (machine independent) code into machine code. Today, virtual machine code compilers are very common. Sun Microsystems was the pioneer of this technology and after Java, many other languages started blooming: Haskell, Erlang, Python, Perl, .NET Framework and son on. They all use virtual code. Again on virtual code Although such compilers ensure good machine decoupling and maintenance, some cons are to be pointed out. First of all is execution time: virtual machines can slow down execution by 3 to 10 times. It is a considerable decay of performance but not unexpected. To solve the problem, JIT compilers are used. A Just-In-Time compiler is a virtual code compiler that runs in the virtual machine and compiles all those parts of the virtual code that might slow down the application. So when the virtual machine compiles the bytecode and encounters heavy-work parts, it will find them already compiled into machine code. Everything’s virtual today Modern compilers for general purpose applications, today, are nearly all virtual code. Target language format How does a compiler format the output language? Compilers can create target codes in nearly 3 different formats: • Assembly language: Assembly text code is generated. This code provides a very little
  • 9.
    pag 9 Andrea Tino- 2013 abstraction from the machine code. The final translation to machine code is left to the Assembler compiler. The abstraction level is not that high but provides a way to write the assembly manually. Memory addresses, jump destinations and other elements are translated into machine format after assembling. Assembly code is very good especially when targeting more machines, they only need to share the same architecture (cross-compilation). • Relocatable binary: Output code is generated in binary format but some external links are made to functions and libraries. This makes the output a non-working one since all components must be compiled separetely and then linked together to generate a final executable binary. The linkage is a very important operation and this approach allows the possibility to create modules to be handled separately, though they are tight-coupled. • Memory image: Output is a final executable binary. This approach ensures fast compilation but produces a single output without links and components. Which means that every single change in the source causes the compilation process to be run again and replace the old binary. Memory-image compilers are not very common today. The code format actually defines the level of abstraction of the output code from the machine code. Dissecting a compiler A compiler, in its most general representation, can be easily seen a sequence of different sequential components. These components can be divided into two groups: • Source code analysis: Every component is in charge of creating a different representation of the initial source code. Each new representation enables the next representation to be generated. All of them are a translation of the source code, no output code is considered yet. • Output code synthesis: The final component in the analysis chain will return a representation starting from which the synthesis components can generate the output code. We can find 6 different components inside a compiler:
  • 10.
    pag 10 Andrea Tino- 2013 # Component name Output Description Analysis phase 1 Lexical Analyzer Tokens Converts the source code into a sequence of recognized tokens. 2 Syntax Analyzer AST Converts the teken sequence into an Ab- stract Syntax Tree for parsing. 3 Semantic Analyzer AAST Analyzes the AST and looks for errors and performs type checking. Synthesis phase 4 Intermediate Code Genera- tor IR Produces an initial representation of the final target code. 5 Code Optimizer OIR Optimizes the code given a certain policy. 6 Code Generator Target code Generates the final code. Each phase transforms the input source from one representation to another until the target output. Phase 1: Lexical analysis The first thing to do is transforming the source code into a uniform and compact representation: a sequence of tokens. This process is called tokenization and is carried out in order to recognize lexems inside the source stream. In Lexycography, a branch of Linguistics, a lexem is a unit of morphological analysis. Lexical analysis recognizes valid lexems inside the source stream and converts it into a list of tokens. Each token is a lexem and each one of them is associated with a particolar meaning (namely, the type of the token). In order to generate the final list of tokens, the scanner (or tokenizer) first removes every comment or not needed elements from the source. Code inclusion directives are processed and a cleaned version of the source stream is ready for tokenization. To better understand the work of a scanner, let us consider the following line of C code and try to scan it: int number = 3 + myvariable; When this statement is evaluated by the C scanner, the returned token list is the following: int number = 3 + myvariable ; Keyword Identifier Operator Literal Operator Identifier EOS In order to create a scanner 2 things are usually needed: • Regular Expressions: They can be used to describe tokens.
  • 11.
    pag 11 Andrea Tino- 2013 • Finite State Automaton: FSAs can be used to check the lexical correctness of the source stream to process. Using them it is also possible to recognize lexical errors. Generally, a lexer (lexical analyzer) can be fully described by a FSA. Although a lexer can be developed, there are a lot of tools for automatic lexer generation. Phase 2: Syntax analysis When the token list is returned, it can be processed by the parser. The parser is responsible for grouping tokens together in order to form phrases. To create phrases it is necessary to understand what sequences of tokens can be accepted by a language. To do so, grammars are considered. A grammar is nothing more than a structure made of recursive rules that define the syntax of a language. When a grammar is defined, the parser can group tokens together (according to the grammar) and generate the Abstract Syntax Tree (AST) of the source code. For example consider the following fragment of code: a = (3 * a) / ((b - c) * (b + c)); When the token list is returned, the AST generated by the parser for the reported fragment of code is shown in the figure. Scanner vs. parser When compiling a language, it is not necessary to have both a scanner and a parser; sometimes it is possible to talk about scannerless compilers. Having a unique component for lexical and syntax analysis is not so rare, as they cover different aspects of the source code. One thing to point out is that lexers can only handle language non-recursive constructs, while recursive ones can be treated by parsers. b c = a / +- b c * * 3 a For a very simple expression like a=(3*a)/((b-c)*(b+c)) a parser will generate the AST shown in figure. Each node of the tree represents an operation and children represent operands. In this case, for operators, the AST is very simple. Expressions involv- ing binary operators are parsed into binary trees. When the code gets more generic and does not involve operators only, the AST can become more complex.
  • 12.
    pag 12 Andrea Tino- 2013 Parsing approaches Syntax analysis can be performed in 2 ways: • Top-Down: Also called descending parsers, they start building the AST from root node to the final leaves. Being an intuitive way to proceed, these parsers can be easily developed manually with efficient implementations. • Bottom-Up: Also called ascending parsers or shift-reduce parsers, they start from the leaves in order to reach the root of the AST. Efficient bottom-up parsers are not easy to implement without software tools. For each type of parser may algorithms can be considered. Phase 3: Semantic analysis The final step for source analysis is the semantic analyzer. While the parser checks that all tokens are used together to form a proper phrase in the language, the semantic analyzer verifies the sequence of statements to check whether they form a sensible set of instructions in the language. For example, in almost all procedural and OO programming languages, variables must first be declared and then used; if an attempt to access an undefined variable is made in the source code, the analyzer will report it as an error, even though the lexer and the parser did not encounter any problem on their way. Type checking One of the most important tasks carried out by the semantic analyzer is type checking. Almost all languages support explicit or implicit type definition, the time to deal with types comes at semantic analysis. When scanning the AST, type checking is perfomed for those leaves involved in type semantics (identifiers, operators, functions, etc.). If type local type check succeeds for a leaf (local check), then that leaf is decorated with attributes adding info about types. Sometimes type checking is also performed at parse time, for example when creating the AST for an operation, the parser can check whether all operands have the expected types. Finally, consider that types are something related exclusively to the source code. Scope and naming Names and scope management are handled during semantic analysis. Some languages introduce concepts similar to namespaces (packages for Java and Python), in this case name resolution is performed to look for undefined identifiers used in the code. Scope as well is analyzed in order to deal with member shadowing for example. In OO languages method overriding and overloading are handled by the semantic analyzer as well. The semantic analyzer returns the same AST provided as input by the parser but with more information added to nodes. We can call it an Augmented AST because of the attributes in each node. If errors are found, errors as well are returned. Phase 4: Intermediate code generator After analysis, it is time for output code synthesis. An Intermediate Representation is generated; this
  • 13.
    pag 13 Andrea Tino- 2013 is not the final code, but an higher abstraction which is independent from the machine. From the AST the translator extracts elements that can be converted, typically, into more instructions of the IR. If one element is semantically correct, it can be translated. For example consider the following statement in a generic intuitive language: variable1 <= variable2 * variable3 + 1; It is semantically correct, so identifiers are translated with their ids in the symbol table: _id1 <= _id2 * _id3 + 1; And then the statement is translated into the following 3 instruction using a very simple and intuitive 3-addr instructions: MULT _id2, _id3, _id4 ADD _id4, ‘1‘, _id5 MOV _id5, , _id1 The IR plays an important role for compilers. It is also important to point out that translation is a process strongly related to semantics. When translating a construct into IR, semantics drives the process; the real meaning of a construct inside the AST is not evident until it is processed by the semantic checker. Furthermore, concerning the IR, the fewer are information regarding the target language the better for the compiler. Double IR Some compilers can generate 2 IRs. The first one is a high level IR which is strictly bound to the source language. The second one is low level IR more target language oriented. This approach can be very efficient especially when the source language changes syntax or semantics. No IR compilers Some simple compilers for simple languages can omit the IR step. Direct translation is performed. This approach can be adopted for very simplistic languages as no modularization allows code optimization. Such compilers, in fact, cannot have an optimizer. Components Generally speaking a code generator is made of some common elements: • Instruction selector: Decides what instruction must be issued. • Instruction scheduler: Arranges chosen instructions in a sequence for execution. • Register allocator: Decides which registers should be allocated for variables. Each component may work also in order to (lightly) optimize a certain objective function (in case a code optimizer is not considered by the compiler). Phase 5: Code optimizer When the IR is submitted to the next component of the compiler, the optimizer tries to re-arrange
  • 14.
    pag 14 Andrea Tino- 2013 the code in order to provide, at the end of the process, another IR satisfying the same semantics of the previous code, but optimizing a certain objective function at runtime. Code optimization is not a simple task; on the contrary, it raises a wide range of mathematical problems whose solution is not always guaranteed. An optimizer usually works with the same components of a normal code generator. Considering that the way how instructions are arranged plays a very important role in the process, the instruction scheduling algorithm often represents one of the most enhanced elements in the architecture of an optimization block. However, optimization cannot always be carried out in every context. As previously stated, code optimization cannot be always achieved given the type of all mathematical problems raised. Some of them can show NP complexity for example. Among such problems we can find, for example, the Register Allocation problem and the Orphan Code Elimination problem; both of them have NP complexity and sometimes they can be undecidable. Phase 6: Code generator The final step is converting the IR (optimized or not) into the output code. The code generator needs many information regarding the machine onto which the final code will run (in the special case of programs and applications). Some noticeable compiler architectures The structure of a compiler presented so far is the most generic one. Real scenarios include many compilers built using fewer components. Some of them can be monolothic programs without components. Some of these compilers can be considered as valid examples compiler lightweight design (which does not imply that the compiler itself is a simple structure). One-step compilers A one-step compiler is one which performs lexical, syntax and semantic analysis (nearly) at the same time, and generates the output code upon source code’s statement processing. Of course not every language can be processed by such a compiler. In the specific case, only simple languages can be handled in this way. For example, a language for which every statement does not raise any anbiguity is a valid possibility. Consider the following fragment of C++ code: element1 element2(element3);
  • 15.
    pag 15 Andrea Tino- 2013 Without any knowledge regarding the 3 identifiers, can we tell whether this line represents a function definition or a variable declaration? The answer is no because we need more information regarding the 3 identifiers. A one step compiler cannot process the C++ language because of anbiguities raised during each statement evaluation. Two-step compilers Another common architecture is the two-step compiler. Designing such compilers is quite simple because only two components are involved: a Front-End and a Back-End. The source language comes as input of the front-end which translates it into an Intermediate Language (IL) which is, again, provided as input in the back-end. The back-end will finally translate the IL into the output- language. Each block can have a different structure: they can be simple one-step compilers or more complex ones. Such a design was proved being really successful in the last 20 years in order to create portable and machine-independent programming languages. The front-end generates the IL qhich is strongly bound to the source code. The complexity of this block depends on the source language. In case of simple languages, a one-step compiler can be considered, but for a programming language like Fortran or C++ this cannot be the case. The back-end, on the other hand, is strongly bound to the output language. The IL usually is a simple language (sometimes resembling the Assembly code), so the language parsing can be performed in one step, all the complexity is moved onto the code generator in this case. Typically the back-end has a much higher complexity when compared to the front-end. Machine independence & retargeting Separating the source dependance from the target dependance is a very good approach especially when considering maintenance: changes on the language will not affect the back-end. Retargeting is also possible. If a compiler is written targeting certain machines or technologies, it is possible to develop different back-ends without touching the front-end. Java and the JVM was one of the first languages supporting this philosophy. Back-end reuse If a back-end is well developed, by keeping a strong binding to the target language and making it nearly fully independent from the source code, it is possible to use it to compile many languages on the same machine/technology. Back-end reuse is a key strategy for some languages targeting the same architectures and can save a lot of time when developing compilers. It is possible to have many different front-ends, all of them compiling to the same IL and using the same back-end. Multilanguage platforms By using back-ends and front-ends in different combinations, it is (#) Although the .NET Framework is a Microsoft technology targeting Windows systems, today many projects are trying to rebuild the framework in order to make it open-source and cross-platform, like Mono: http://www.mono-project.com.
  • 16.
    pag 16 Andrea Tino- 2013 possible to create a framework of languages for many computer architectures. Today it would be possible to describe Microsoft as the company who invested the most in such an approach. The .NET Framework, in fact, is a collection of different languages all of them compiled to MSIL (Microsoft’s .NET proprietary IL), ehich is, again, compiled/interpreted targeting all Windows systems and almost all architectures. No matter the language used in the framework (C#, VB.NET or F#), the CLR (.NET virtual machine) it is possible to generate the correct code for the specific architecture(#). Just a bit more on interpreters Although we introduced interpreters at the beginning of this section, now we have a little more knowledge on compilers allowing us to understand some other details regarding this topic. Actually, when talking about interpreters, we recognize 2 different types: • Machine interpreters: They simulate the execution of a program compiled for a specific architecture. For example, the JVM is a program emulating a machine (but that actually runs on another machine). The Java bytecode is not an IR, but a language targeting the JVM. These interpreters must be running during all the time that the code needs to be executed, they act like functious machines. The .NET Framework and the CLR is another example. • Language interpreters: They simulate the effect of the execution of a program without compiling it to a specific instruction-set. An IR is used (generally an AST). Javascript in web browsers is a good example. Advantages of using interpreters can be many.
  • 17.
    Andrea Tino -2013 Theoretical basics on languages Ibelieve that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted. “ “ Alan Turing http://www.brainyquote.com/quotes/quotes/a/ alanturing269232.html
  • 18.
    pag 18 Andrea Tino- 2013 Basic definitions In this section we mainly focus on theoretical definitions regarding languages, they are important in order to formulate concise expressions to describe a language. Mathematical definitions for languages An alphabet Σ is defined as a non-empty set of symbols. A string or word ω over a certain alphabet Σ is defined as a finite concatenation (or juxtaposition) of symbols from that alphabet. Given a word Σ over an alphabet, Σ ∈ will denote its length as the number of symbols (with repetitions) inside that string.  will denote the empty or null string. Of course we have that:  = 0 . The set of all strings over an alphabet Σ is denoted as Σ∗ . We also define the set of fixed-length strings Σn over an alphabet Σ : Σn = ω ∈Σ∗ : ω = n{ } And we finally define the positive clusure of the alphabet as: Σ+ = Σn n≥1  So that it is possible to express the universal language Σ∗ on Σ as: Σ∗ = Σ+ ∪ { } Defining a language Now we are ready to define a language. A language L on an alphabet Σ is a subset L ⊆ Σ∗ of the universal language on that alphabet. Language equivalence Two languages are equal or equivalent only if the are desined onto the same alphabet and they are equal as sets: ∀L1,L2 L1 ≡ L2 ⇔ L1 ⊆ Σ∗ ∧ L2 ⊆ Σ∗ ∧ L1 ⊆ L2 ∧ L2 ⊆ L1 Language properties The language cardinality L is the number of strings contained in the language. It is possible to have languages whose cardinility is not finite: L = ∞ ; for them it is not possible to list all words. Furthermore, consider that the empty language Φ = ∅{ } (the language having no words) is a proper language. Working with strings Strings and words of a language can be manipulated in order to produce other strings or words. Please note that the term word refers to a string as part of a language.
  • 19.
    pag 19 Andrea Tino- 2013 Concatenation Two strings can be adjointed together to create a new one. x = a; y = bc; x∙y = xy = abc; Of course if the concatenation is performed with the null string, we get the same string. It is possible to say that the empty string is the identity element of the concatenation operator. s∙ε = ε∙s = sε = εs = s; Power The power of a string is simply the concatenation between that string with itself for the specified number of times. sn = sss...s; For example: s = wqa; s3 = wqawqawqa; When the exponent is zero, the convention is to have, as result, the empty string. s0 = ε; Working with languages Like strings, it is possible to combine languages together. There is nothing trascendental if we consider that a language is nothing more than a mathematical set. Concatenation For languages, concatenation is the operation of concatenating each word of the first language to each word of the second. L1 ⋅ L2 = ω1 ⋅ω2 :ω1 ∈L1,ω2 ∈L2{ } For example: L1 = {a, b}; L2 = {c, d}; L1∙L2 = {ac, ad, bc, bd}; Given its defintion, as for strings, concatenation is not commutative: L1 ⋅ L2 ≠ L2 ⋅ L1 . Union As the term suggests, union between languages is equivalent to the union of both sets. L1 ∪ L2 = ω :ω ∈L1 ∪ L2{ } For example: L1 = {a, b}; L2 = {c, d};
  • 20.
    pag 20 Andrea Tino- 2013 L1 ∪ L2 = {a, b, c, d}; Power Taking advantage of the notion of concatenation, the n-th power of a language is defined as the concatenation between the (n-1)-th power of the language and the language itself. Ln = Ln−1 ⋅ L We have that: L0 = Ε = { } (the empty string language). Furthermore, the power to zero of the empty language is, again, the empty string language: Φ0 = ∅{ }0 = Ε = { }. Consider the example: L = {a, b}; L^3 = L^2 ∙ L; L^3 = (L ∙ L) ∙ L; L^3 = {aa, ab, ba, bb} ∙ L; L^3 = {aaa, aba, baa, bba, aab, abb, bab, bbb}; The empty string language The empty string language is denoted by Ε = { } and is the language containing the empty string only. Given its definition, it is also the identity element for the language concatenation operator: LΕ = ΕL = L . The empty language The empty language Φ = ∅{ } is not such an intuitive and simple object. It has some important properties that are to be detailed. First of all, the empty language is not the same as the empty string language: Φ = ∅{ }≠ Ε = { }. By definition the emtpy language is the language containing no strings, an empty set. What happens when the null language is concatenated with another one? The empty language is the zero element of the language concatenation operator, so we have that: LΦ = ΦL = Φ . A very important equation relates the empty string language to the empty language: Φ∗ = ∅{ }∗ = Ε = { } Again on concatenation Language concatenation is a powerful operator as it allows the possibility to define the universal language and the positive closure of an alphabet as the limit of the union of the power of the language. L∗ = Ln n=0 ∞  = L0 ∪ L1 ∪ L2 ∪…∪ Ln The universal language of a language is the language having as words all possible combinations of words from the original language. While the positive closure represents the same, without the empty language: L+ = L1 ∪ L2 ∪…∪ Ln n=1 ∞  = L  Ε
pag 21 Andrea Tino - 2013
Kleene operators
A set of strings is, by definition, a language. However, without worrying too much about all the formalism introduced so far, we can consider a simple set V of strings and define two operators on such sets. We consider the following axioms on V:
V^0 = {ε}; V^1 = V
Together with the following recursive definition:
V^(i+1) = { v⋅w : v ∈ V^i ∧ w ∈ V }, ∀i > 0
Kleene star
The star operator acts on the original set of strings in order to return the (generally non-finite) set of all possible strings that can be generated by concatenating the original ones. The null string is included in the final set.
V* = ∪_{n=0}^{∞} V^n = V^0 ∪ V^1 ∪ V^2 ∪ … ∪ V^n ∪ …
Kleene plus
The plus operator acts like the star operator with one difference only: the empty string is not contained in the final set.
V+ = ∪_{n=1}^{∞} V^n = V^1 ∪ V^2 ∪ … ∪ V^n ∪ …
One important equation relates the star operator to the plus one (when acting on the same set of strings):
V* = V+ ∪ {ε}
Introducing grammars
A language is not just a collection of words. If this were the case, we would just define sets of keywords and use them with no further care. However, as we all well know, a language is not only made of some words, it is also made of rules explaining how to create sequences of words to generate valid phrases. That's why, together with the language definition, a grammar must be
pag 22 Andrea Tino - 2013
considered. A grammar is a way to define all admissible phrases in the language. A grammar alone, however, is not enough; the formal specification of allowed phrases is useless without an algorithm to determine the structure of phrases.
Generative grammars
A generative grammar for a given language L is a set of rules generating all and only the allowed phrases of that language. It is used to define the syntax of a language. From a mathematical point of view, a grammar is defined as a 4-tuple G = (V, T, P, S) including:
• A set of non-terminals: It is denoted by V ⊆ L and includes all non-terminal symbols.
• A finite set of terminals: It is denoted by T ⊆ L and includes all terminal symbols.
• A finite set of production rules: It is denoted by P ⊆ L × L and contains the transformation rules between sequences of symbols.
• A start symbol: Also called the axiom, it is denoted by S ∈ V and is one non-terminal symbol.
In this context the term symbol refers to a string of the language. However the previous definition can also be considered when having an alphabet instead of a language. The production rule set contains mappings from sequences of symbols of the language to other sequences. However such sequences do not include all possible combinations of symbols in the language; it would be more accurate to say that production rules relate sequences of terminal and non-terminal symbols of the language. That is:
P ⊆ Λ × Λ, where Λ ≡ (V ∪ T)*
As every symbol in the grammar is either a terminal or a non-terminal. Also, in normal conditions, we have that the terminal and non-terminal sets share no elements: V ∩ T = ∅. A production rule appears in the form α → β and a finite number of them generate all valid phrases in a language. Consider the following language and a grammar with the specified sets:
L = {S,A,B,a,b,ε}; V = {S,A,B}; T = {a,b,ε};
The start symbol is S and the production rules of the grammar are defined below:
S -> ABS;
S -> ε;
BA -> AB;
BS -> b;
Bb -> bb;
Ab -> ab;
pag 23 Andrea Tino - 2013
Aa -> aa;
Such a grammar defines a language whose phrases are all in the form a{n}b{n} where n is a generic non-zero natural number. Although the expression used here is a regular expression (something we'll treat later), its meaning is quite intuitive. To understand what types of phrases a grammar can generate, one simply needs to start creating derivations from each rule. A derivation is any sequence of production rule applications. Starting from the start symbol, it is possible to transform a sequence of symbols into another until terminals are reached. When terminals are reached, it is possible to have sequences to which no more rules can be applied. In that case we get a phrase of the language. Now consider another grammar for the same language with the same sets but using the following production rules:
S -> aSb;
S -> ε;
Grammar uniqueness
If we start creating and expanding derivations, we will end up getting the same productions for the same language. So this grammar is equivalent to the previous one. This is an important aspect related to grammars: several grammars can generate the same syntax for a given language.
Backus-Naur forms
The way production rules are written follows Backus-Naur forms. BNFs are transformation rules between sequences of symbols expressed using the following syntax:
<symbol> ::= __expression__;
The left part of the rule (LHS) is a symbol, while the right part (RHS) consists of a sequence of other symbols. Using recursive applications of all rules, the algorithm must converge to some terminal sequences (sequences whose symbols do not match any rule's LHS). The following is an example of BNFs for U.S. postal addresses:
<postal-address> ::= <name-part> <street-address> <zip-part>;
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL> | <personal-part> <name-part>;
<personal-part> ::= <first-name> <initial> “.”;
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>;
<zip-part> ::= <town-name> “,“ <state-code> <zip-code> <EOL>;
<opt-suffix-part> ::= “Sr.“ | “Jr.“ | <roman-numeral> | ““;
<opt-apt-num> ::= <apt-num> | ““
To get some productions, it is necessary to specify a start symbol; in our case it is, obviously,
pag 24 Andrea Tino - 2013
<postal-address>. From there, every sequence containing a symbol that can be further expanded (because it is part of the LHS of a rule) will generate other sequences and, eventually, productions. A possible production for the above rules is obtained through the following derivation:
0. <postal-address>;
1. <name-part> <street-address> <zip-part>;
2. <personal-part> <last-name> <opt-suffix-part> <EOL> <house-num> <street-name> <opt-apt-num> <EOL> <town-name> “,“ <state-code> <zip-code> <EOL>;
3. <first-name> <initial> “.” <last-name> “Sr.“ <EOL> <house-num> <street-name> <apt-num> <EOL> <town-name> “,“ <state-code> <zip-code> <EOL>;
The last sequence is a production as no more rules can be applied to any symbol, meaning that all symbols in the sequence are terminals. BNFs are a very efficient way to describe the production rules of a grammar; although production rules follow BNFs, the syntax is not exactly the same.
The Chomsky-Schützenberger hierarchy
Generative grammars define a language's syntax. So, basically, all properties and facts concerning the grammar of a language will reflect on the language itself and vice-versa. In order to group languages having the same aspects, Noam Chomsky(#) and Marcel-Paul Schützenberger(##) created a grammar hierarchy which, still today, is one of the most important means to understand the properties of a language and the techniques to build a compiler for it. The hierarchy has a containment structure and is composed of 4 grammar types, each one being assigned a number. Every grammar of type n is also a grammar of type n-1. Furthermore each grammar defines a type of language and each language type is associated with an abstract computational model used to create its parser.
Grammar name | Language type | Automaton | Production rules
Type 0 - Unrestricted | Recursively enumerable | Turing Machine - TM | α → β
(#) Considered today the living father of formal languages: https://en.wikipedia.org/wiki/Noam_Chomsky.
(##) Had a crucial role in the development of the theory of Computational Linguistics and formal languages: https://en.wikipedia.org/wiki/Marcel-Paul_Sch%C3%BCtzenberger.
pag 25 Andrea Tino - 2013
Grammar name | Language type | Automaton | Production rules
Type 1 - Context-sensitive | Context-sensitive | Linear-Bounded Automaton (linear-bounded non-deterministic Turing machine) - LBA | αAβ → αγβ
Type 2 - Context-free | Context-free | Non-deterministic Push-Down Automaton - PDA | A → γ
Type 3 - Regular | Regular | Finite State Automaton - FSA | A → a ∨ A → aB
When handling generic production rules, there is a common agreement concerning symbols.
• Roman lowercase letters denote terminals.
• Roman uppercase letters denote non-terminals.
• Greek (lowercase) letters denote strings of terminals and non-terminals (possibly empty).
• The special Greek (lowercase) letter gamma γ usually denotes a non-empty string of terminals and non-terminals.
The reason why we call it a hierarchy derives from the fact that a type extends other types. So a regular language also has the properties of a context-free language. But the most important aspect concerning the hierarchy is the possibility to have a systematic approach to create parsers for languages. Once the language type is recognized, by identifying its generative grammar type, the abstract computational model to create the parser is the one specified by the hierarchy.
Decision problems and language decidability
In Computational Theory, a decision problem is a class of formal systems to ask questions to. A question can be answered with two possible values only: yes or no. The system is made of an algorithm taking the question as input and returning the answer. Solving a decision problem means finding an algorithm which is always able to provide the answer to the given question. Decidability is a property of those computational problems whose resolution can be redirected to solving a decision problem. If the corresponding decision problem is solvable, then the problem is decidable!
An example - the primality test: The problem of recognizing prime numbers is a decision problem. In fact we can build a formal system to answer the question: “Is this number prime?“ with a binary value. Is the primality test decidable? Yes it is, since many algorithms were found to answer such a question, for example the Pocklington-Lehmer algorithm(#).
(#) The test relies on the Pocklington theorem: http://en.wikipedia.org/wiki/Pocklington_primality_test.
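To make the decision-problem idea concrete, here is a tiny sketch of a decision procedure for the primality question. For brevity it uses plain trial division rather than the Pocklington-Lehmer test mentioned above; the function name and code are mine and purely illustrative.

# A decision procedure: takes a question instance (a number) and answers yes/no.
def is_prime(n: int) -> bool:
    """Decide the question "is n prime?" by trial division (illustrative only)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False  # a divisor was found: answer "no"
        d += 1
    return True  # no divisor up to sqrt(n): answer "yes"

print([k for k in range(2, 30) if is_prime(k)])  # 2, 3, 5, 7, 11, ...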
pag 26 Andrea Tino - 2013
Decidability for languages: The problem of designing and implementing a parser for the syntax of a language can be mapped onto a decision problem. The question is: “Can a parser be developed for the given language?“. The answer is yes in almost all cases; it depends on the type of language, which depends on the grammar generating that language. Basically all grammars of type 3, 2 and 1 generate decidable languages. When it comes to languages generated by type-0 grammars, the problem is building a Turing Machine to solve the problem. However, building a generic Turing Machine with no restrictions at all is an impossible task, so for such grammars we need to evaluate case by case. If the grammar is too complicated and recursion rules cannot be normalized somehow, the language might even be undecidable.
Andrea Tino - 2013
Type-3 grammars and regular languages
“I read one or two other books which gave me a background in mathematics other than logic.”
Stephen C. Kleene
http://www.brainyquote.com/quotes/quotes/s/stephencol346123.html
pag 28 Andrea Tino - 2013
What is a regular language?
A regular language is a language that can be generated by a type-3 grammar. Recalling the definition of type-3 grammars, a grammar whose production rules are all in one of the following forms:
{ A → aB; A → a }  or  { B → Ab; B → b }
is a regular grammar.
Linear grammars
A linear grammar is a particular type of context-free grammar whose production rules have at most one non-terminal in the RHS. Linear grammars are very important for regular grammars because the latter are special cases of the former. In particular, the set of all regular grammars can be separated into two complementary subsets:
• Left-linear grammars: They are regular linear grammars whose production rules are in the form B → Ab, where the non-terminal in the RHS can be empty.
• Right-linear grammars: They are regular linear grammars whose production rules are in the form A → aB, where the non-terminal in the RHS can be empty.
Such grammars are very restrictive and cannot be used for the syntax of powerful programming languages. However parsers for these grammars are very fast and efficient.
Properties of regular languages
Regular languages L on a certain alphabet Σ have the following properties:
• The empty language ∅ is a regular language.
• A singleton language (a language with cardinality 1) {s}, generated by any symbol of the alphabet s ∈ Σ, is a regular language.
• Given two regular languages L1, L2, their union L1 ∪ L2 and their concatenation L1 ⋅ L2 are still regular languages.
• Given a regular language L, the Kleene star applied to it, L*, is still a regular language.
• No other languages over the same alphabet other than those introduced in the previous points are regular.
Regular languages have many interesting properties other than those listed so far.
Regular languages and finite languages
pag 29 Andrea Tino - 2013
A very important theorem relates finite languages with regular languages.
[Theo] Language finiteness theorem: A finite language is also a regular language.
Please note that the converse is not true: if a language is regular, it is not necessarily finite. For this reason some algorithms are used to check whether a language is regular or not.
Deciding whether a language is regular
The problem of telling whether a language is regular or not is a decision problem. Such a problem is decidable and some approaches can be used to check whether or not a language is regular. One of the most common is the Myhill-Nerode theorem.
A parser for regular languages
Regular languages can be parsed, according to the Chomsky-Schützenberger hierarchy, by FSAs. In fact the decision problem for regular languages can be solved by FSAs. We can have two types of FSAs:
• Deterministic: DFA or DFSA. They are fast but can be less compact.
• Non-deterministic: NFA or NFSA. They are not as fast, but more compact in most cases.
Both of them can be used to recognize phrases of a regular language. When developing lexers for a generic language, DFAs are used.
Finite State Automata (FSA)
They are also called Finite State Machines (FSM). Since a regular language can be parsed by an FSA, we need to study these abstract computational models. A generic FSA (deterministic or non-deterministic) can be defined as a 5-tuple A = (Σ, S, δ, s0, F) including:
• A finite set of symbols denoted by Σ = {a1, a2, …, an} and called the alphabet.
• A finite set of states denoted by S = {s0, s1, s2, …, sm}.
• A transition function denoted by δ : S × Σ → 2^S, responsible for selecting the next active state of the automaton. The function can be described as a states/symbols table whose entries are subsets of the states set.
• A special state denoted by s0 ∈ S and called the initial state.
• A subset of states denoted by F ⊆ S whose members are called final states.
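As a concrete (and purely illustrative) representation, the 5-tuple above can be mirrored one-to-one in code; the sketch below is a minimal Python encoding of my own, not a standard library type.

from dataclasses import dataclass

@dataclass
class FSA:
    """A = (Σ, S, δ, s0, F); delta maps (state, symbol) to a *set* of states,
    so the same structure covers both DFAs and NFAs."""
    alphabet: set
    states: set
    delta: dict          # (state, symbol) -> set of states
    start: str
    finals: set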
pag 30 Andrea Tino - 2013
An FSA is used to read strings of a language. If a string is provided to the automaton and the automaton is able to read it, we say that the automaton recognizes the string and accepts such a sequence of symbols; otherwise the string is rejected. Using more formalism, we say that an FSA A = (Σ, S, δ, s0, F) accepts a string ω = x1 x2 … xs ∈ Σ*, where xi ∈ Σ ∀i = 1…s, if there exists a finite sequence of states R = {r0, r1, r2, …, rs}, called a machine run, so that the following conditions hold:
1. The first state of the run corresponds to the initial state of the automaton: r0 ≡ s0.
2. Each state in the run is the result of a valid transition: ri+1 ∈ δ(ri, xi+1), ∀i = 0…s−1.
3. The run ends with a final state: rs ∈ F.
Otherwise the string is rejected and not recognized by the automaton. Formalism aside, an FSM is a very simple structure, especially when it is evaluated graphically. Starting from the initial state, one only needs to make transitions in order to reach a final state and get a word of the language.
Dummy FSA
Consider the FSA whose transition table is reported below. The table describes transitions from each state (rows) to others based on symbols (columns). States decorated with an asterisk are final states.
      a     b     c
s0    {s1}  {}    {}
s1    {}    {s2}  {}
s2    {}    {}    {s3}
s3*   {s1}  {}    {s3}
By using regular expressions (a more concise and powerful way to define a regular language's syntax instead of grammars), it recognizes all phrases in the form: (abc+)+.
Deterministic and non-deterministic FSAs
All definitions so far fit both deterministic and non-deterministic FSAs. Now it is time to separate them. Basically a DFA is an FSA in which all outgoing transitions, for each state, are labeled by a different symbol. Conversely, an NFA is an FSA in which it is possible to find, for some or all states, two or more outgoing transitions labeled with the same symbol.
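A run, as defined above, is easy to simulate directly from the transition table. The sketch below (illustrative code with my own helper names) encodes the dummy (abc+)+ automaton and checks a few strings; because δ returns sets, the same routine also works for NFAs by tracking the set of currently active states.

# Transition table of the dummy FSA above: (state, symbol) -> set of states.
delta = {
    ("s0", "a"): {"s1"},
    ("s1", "b"): {"s2"},
    ("s2", "c"): {"s3"},
    ("s3", "c"): {"s3"},
    ("s3", "a"): {"s1"},
}
start, finals = "s0", {"s3"}

def accepts(word: str) -> bool:
    """Simulate the run; missing table entries mean 'no transition'."""
    current = {start}
    for symbol in word:
        current = set().union(*(delta.get((s, symbol), set()) for s in current))
        if not current:            # no valid transition: reject early
            return False
    return bool(current & finals)  # accept iff a final state was reached

for w in ["abc", "abcc", "abcabc", "ab", "abca"]:
    print(w, accepts(w))   # abc/abcc/abcabc -> True, ab/abca -> False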
pag 31 Andrea Tino - 2013
Using formalism, we have that the transition function δ can return more than one state for some state/symbol pair. Meaning that the following holds:
∃i ∈ {1…m}, ∃x ∈ Σ : |δ(si, x)| > 1
The example of before shows a DFA; now consider this other example showing an NFA:
      0     1
s0    {s1}  {}
s1    {}    {s1,s2}
s2    {s3}  {s3}
s3*   {}    {s3}
This automaton recognizes binary strings in the form 01+(0|1)1*. It is evident that some non-determinism can be found here because some cells in the table cause the automaton to reach different states with the same symbol. In the example, state s1 can reach states s1 or s2 with the same symbol 1. Finally, be careful: the non-determinism of this automaton does not reside in the transition from s2 to s3; in that case different symbols make the automaton pass to the same state, and the row in the transition table referring to s2 has one state (the same) for each symbol. For a DFA, on the contrary, the following condition holds:
|δ(s, x)| ≤ 1, ∀s ∈ S, ∀x ∈ Σ
In fact, in the transition table of a DFA, each cell is filled with at most one state. This is also the reason why NFAs tend to be more compact and more complex than DFAs.
Comments in C/C++
We now consider a real case scenario. How to recognize C-like single-line comments? The DFA whose transition table is reported below is able to do so (EOL denotes the end-of-line character, ^EOL any other character):
      /     EOL   ^EOL
s0    {s1}  {}    {}
s1    {s2}  {}    {}
s2    {s2}  {s3}  {s2}
s3*   {}    {}    {}
This DFA can recognize all single-line comments written in C, C++, Javascript and other languages with a C-like syntax. Recognized phrases are in the form: //(^(EOL))*(EOL).
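The comment DFA above translates directly into a small scanning routine. The following sketch is a hand-coded version of that table (my own illustrative helper, not how a generated lexer would necessarily look); it classifies each character as '/', EOL or "anything else" and walks the states.

def is_line_comment(text: str) -> bool:
    """Accept strings of the form //(^EOL)*EOL, as recognized by the DFA above."""
    state = "s0"
    table = {
        ("s0", "/"): "s1",
        ("s1", "/"): "s2",
        ("s2", "/"): "s2",
        ("s2", "EOL"): "s3",
        ("s2", "OTHER"): "s2",
    }
    for ch in text:
        kind = "/" if ch == "/" else ("EOL" if ch == "\n" else "OTHER")
        state = table.get((state, kind))
        if state is None:       # dead configuration: reject
            return False
    return state == "s3"

print(is_line_comment("// a comment\n"))   # True
print(is_line_comment("/* block */"))      # False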
pag 32 Andrea Tino - 2013
Handling ε-transitions: NFA-ε
Some NFAs can also have spontaneous transitions. These types of transitions are special and, when supported, require part of the automaton's definition to be revised. An ε-transition is defined as a transition which does not consume any symbol (it actually consumes the empty string symbol). The transition function is redefined in this way: δ : S × (Σ ∪ {ε}) → 2^S. We now define the ε-closure function ψ : S → 2^S as the function returning all states reachable from a given one (included in the closure) through ε-transitions. More formally, let p and q be two states; we have:
p ∈ ψ(q) ⇔ ∃ q1, q2, …, qk ∈ S : qi+1 ∈ δ(qi, ε) ∀i = 1…k−1 ∧ q1 = q ∧ qk = p
Namely, there exists a path through ε-transitions connecting state q to state p. The way an NFA-ε works is as follows. We say that an NFA-ε A = (Σ, S, δ, s0, F) accepts a string ω = x1 x2 … xs ∈ Σ*, where xi ∈ Σ ∀i = 1…s, if there exists a finite sequence of states R = {r0, r1, r2, …, rs} so that the following conditions hold:
1. The first state of the run corresponds to one of the states in the ε-closure of the automaton's initial state: r0 ∈ ψ(s0). That is, the automaton starts with any state which can be reached from the initial state via ε-transitions.
2. Each state in the run is the result of an ordinary transition: ri+1 ∈ δ(ri, xi+1), ∀i = 0…s−1.
3. A state can be reached through ε-transitions as well: t ∈ δ(ri, xi+1) ⇒ ri+1 ∈ ψ(t), ∀i = 0…s−1. That is, after reading symbol xi+1, the machine experiences a transition from state ri to state t (a possible temporary state); later, the machine is (possibly) led to state ri+1 by ε-transitions. Remember that an ε-transition occurs on the empty string, so it cannot be enumerated in the list of symbols of the input string.
4. The run ends with a final state: rs ∈ F.
Otherwise the string is rejected and not recognized by the automaton.
Non-deterministic automata
When an FSA allows the possibility to have ε-transitions, that automaton becomes an NFA. It is not possible to have DFAs supporting ε-transitions. Why? Consider a first example (Diag. 1 - Not correct, Diag. 2 - Correct). The first diagram shows a DFA with ε-transitions. So DFAs can actually have ε-transitions? The answer is yes, but if we look closer at that diagram, we can understand that the ε-transition connecting the two
pag 33 Andrea Tino - 2013
states on the right is quite useless, as it can be safely removed (together with its destination state) to get the second diagram. Although we say that the first diagram is not correct, we actually mean that it is redundant.
Let us consider another example (Diag. 1 - Not correct, Diag. 2 - Correct). In this case ε-transitions have a reason, as the first diagram needs them to create an option and then specify a fixed symbol. However the second diagram shows how to remove the ε-transitions and create a single transition mapping the option between the same symbols as before. Once again, ε-transitions were not strictly necessary. The final conclusion is that DFAs have no reason to use ε-transitions; if they do, then there is a way to remove them (and probably get a more compact DFA). When ε-transitions are used, they are used to create more connections from one state to another. So we create non-determinism in the diagram, which is actually the reason why NFAs use ε-transitions!
Regular expressions
Before getting to the main topic (how to develop parsers using FSAs), we introduce another important topic. Regular expressions, also called regexp or regex, are a more compact and concise way to define a regular language without using regular grammars. When dealing with regular languages, regular expressions can be a very powerful tool. In fact, earlier, we reported next to each FSA the corresponding regexp. A regular expression consists of a sequence of symbols for a given alphabet.
[Diagram: regular grammars and regular expressions are used to define the syntax of regular languages; regular expressions are translated into finite state automata, which are used to recognize the allowed phrases of regular languages.]
    pag 34 Andrea Tino- 2013 Each symbol in a regexp can by a meta-symbol, thus carrying a special meaning, or a normal symbol with its literal meaning. Although regular expressions can have some variations depending on the system in use, a common specification will be described here. The list of meta-symbols is reported below: R = {., *, +, , {, }, [, ], (, ), ?, ^, $, |}; When we want a regular expression to match a meta-symbol literally, the meta-symbol gets escaped by a special meta-symbol: . What are they used for? Regexp are used to define all words of a language and/or the syntax of a language. Because of this, regexp can be mapped onto FSAs and vice-versa. Furthermore, because regular grammars define a regular language’s syntax, they can be converted into regular expressions and vice-versa. Everything can be summarized by the following property of regexp: regular expressions have the same expressive power as regular gramars. What is a regex? A regexp is a string of characters (of a certain alphabet) and meta-characters that can match a finite or not-finite set of strings on the same alphabet. Character classes Meta-characters in a regular expression are listed in the following table. Each one of them carries a special meaning. Literals are the opposite, they match themselves and carry no special meaning. Meta-symbol(s) Example Descrirption Literals a,b,... hello Every alphabet symbol that is not a meta-symbol is a literal; thus it matches itself. Ranges [ ] [abc] It matches only one single character among those inside the brackets. [^ ] [^abc] It matches one single character that is not listed inside the brackets. - [a-z]; [a-zA-Z] When used inside square brackets, this matches a known range of alphabetical characters and numbers. Classes . f.fox The dot matches all non line-break characters. Anchors ^ ^Fire The caret matches the position before the first character in the string. $ fox$ The dollar matches the position after the last character in the string.
    pag 35 Andrea Tino- 2013 Meta-symbol(s) Example Descrirption Quantifiers + a+; abc+ The Kleene plus will match the preceeding symbol one or more times. * a*; abc* The Kleene star will match the preceeding symbol zero or more times. ? a?; abc? The question mark will match the preceeding symbol zero or one times. {n} a{3}; ab{2} The preceeding symbol is matched exactly n times. {n,m} a{1,4} The precedding symbol is matched between n to m times. {n,} a{3,} The preceeding symbol is matched n times or more. Others () (abc)+ Every sequence of symbols inside is considered as a single symbol. | (a|b) Exclusive or selection. There are many regex engines in the world and many of them come embedded to larger solutions as well. Technology Programming languages Namespace Microsoft .NET Framework C#, VB.NET, F# System.Text.Regex Oracle Java Java java.util.regex Boost libraries C++ boost::regex Perl Perl Part of the language syntax Javascript Javascript RegExp As it is possible to see, some regex utilities are part of a language syntax, this is true for Unix systems in the case of Perl and Bash scripting languages. Examples of regex When using regexes two strings are necessary: a pattern and a source text. A pattern is regular expression used to match strings inside the source text. For example, let us consider the following text we are going to use in our examples here:
pag 36 Andrea Tino - 2013
text = “Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce 5 vitae fermentum enim. 12 Sed vitae consectetur libero, ac3 hendrerit augue. Cras13 auctor 456 lectus eu lacus455 fringilla, a4 auctor leo 23 euismod.“;
Now let us consider the following patterns together with the elements they match.
Pattern | (Some) Matched text
“(L|l).l” | “Lorem ipsum dol”, “lor sit amet, consectetur adipiscing el“
“[0-9]+“ | “5“, “12“, “456“, “455“, “23”
“[a-zA-Z]+“ | “Lorem“, “ipsum“, “dolor“, “sit“, “amet“, “consectetur“
“[a-zA-Z]+[0-9]“ | “ac3“, “Cras1“, “lacus4“, “a4“
“[0-9][a-zA-Zs]+[0-9]“ | “5 vitae fermentum enim. 1“, “3 auctor 4“
These examples show very simple patterns, but it is possible to generate very complex matches. Also, due to the large number of regex engines available today, not all patterns always return the same matches. The reported examples should cover the most common functionalities and behave mostly the same for all engines.
Regular definitions
Regex patterns can become very wide and extremely messy structures as they try to match more complex content. It is also common to encounter lengthy patterns with identical portions inside them, like the following for example: [a-zA-Z]+(a|b|c)[a-zA-Z]*[0-9]?[a-zA-Z]+[0-9]. We find the segment [a-zA-Z] 3 times in the expression and the segment [0-9] 2 times. Considering that they match strings that are commonly matched (sequences of alphabetical or numeric characters), it would probably be reasonable to use aliases to create references. So let us consider the following regular definitions:
char -> [a-zA-Z];
digit -> [0-9];
We can rewrite the previous pattern in a more concise way as: char+(a|b|c)char*digit?char+digit. This also enables reuse, as we can take advantage of the same set of definitions for more patterns.
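As a quick sanity check of the patterns listed above, here is how a couple of them behave in a concrete engine; the snippet below uses Python's re module on the same sample text (results may differ slightly across engines, as noted above).

import re

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Fusce 5 vitae fermentum enim. 12 Sed vitae consectetur libero, "
        "ac3 hendrerit augue. Cras13 auctor 456 lectus eu lacus455 "
        "fringilla, a4 auctor leo 23 euismod.")

print(re.findall(r"[0-9]+", text))          # ['5', '12', '3', '13', '456', '455', '4', '23']
print(re.findall(r"[a-zA-Z]+[0-9]", text))  # ['ac3', 'Cras1', 'lacus4', 'a4']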
pag 37 Andrea Tino - 2013
Basic properties
Regular expressions, no matter the engine, have important properties. Some of them can be used to avoid long patterns or to simplify existing ones.
• Operator | is commutative: a|b = b|a.
• Operator | is associative: (a|b)|c = a|(b|c).
• Concatenation is associative: (ab)c = a(bc).
• Concatenation is distributive over operator |: a(b|c) = (ab)|(ac), and: (a|b)c = (ac)|(bc).
• The empty string ε is concatenation's identity element: aε = εa = a.
• The following equation holds for the Kleene star: a* = (a|ε)*.
• The Kleene star is idempotent: a** = a*.
From regular expressions to FSAs
Regular languages are defined through regular grammars; however a regular language's grammar can also be defined through regular expressions in a more concise and simpler way. In the Chomsky-Schützenberger hierarchy, type-3 grammars can be parsed by FSAs; now we are going to see how to create an FSA to develop the parser of a regular language. In particular the transformation process will always start from regular expressions.
Thompson's rules
There are many methodologies to construct an FSA from a regular expression.
[Theo] Regex2FSA: An FSA can always be constructed from a regular expression to create the corresponding language's parser. The synthesized FSA is an NFA whose states have one or two successors only and a single final state.
Among these, Thompson's rules are a valid option.
[Theo] Thompson's Construction Algorithm (TCA): An NFA can be generated from a regular expression by using Thompson's rules.
TCA is a methodology to build an FSA from a regex, but not the only one. Some advantages and drawbacks can be considered; this methodology is not always the best.
• The NFA can be built following very simple rules.
• In certain conditions TCA is not the most efficient algorithm to generate the corresponding
    pag 38 Andrea Tino- 2013 NFA given a regex. • TCA guarantees that the final NFA has one single initial state and one single final state. TCA’s rules are listed following expressions to be translated. The initial regex is split into different subregexes; the basic rule is considering that a whatever regex maps onto a certain NFA. A generic expression A generic regex s is seen as a NFA N(s) with its own initial state and its own final state. Since TCA generates NFAs having only one initial state and only one final state, the previous assumption is valid. Empty string The empty string expression ε is converted into a NFA where the empty string rules the transition between a state and the final one. Symbol of the input alphabet A symbol a of the input alphabet is converted in a similar way for empty string expressions. The symbol rules the transition between a state and the final one. Union expressions A union expression s|t between two expressions s and t, is converted into the following NFA N(s|t). Epsilon transitions rule the optional paths to N(s) and to N(t). Please note that the union expression is also valid for square bracket expressions [st]. Concatenation expressions When concatenating two expressions s and t into expression st, the resulting NFA N(st) is constructed by simply connecting the final state of the first expression to the initial state of the second one. The final state of the first expression and the initial state of the second one are merged into a single crossing state. s_i s_f N(s) q f ɛ q f a s_i s_f N(s) q f ɛ t_i t_f N(t) ɛ ɛ ɛ
    pag 39 Andrea Tino- 2013 s_i s_f N(s) s_f t_i N(t) t_f Kleene star expressions When an expression s is decorated with a Kleene star s*, the resulting NFA N(s*) is constructed by adding 4 epsilon transitions. s_i s_f N(s) q f ɛ ɛ ɛ ɛ The final NFA will have a messy shape but it will work as desired. Parenthesized expressions Expressions appearing in parenthesis (s) are converted into the inner expression NFA. So we have that N((s))=N(s). Converting a NFA into a DFA As DFAs are faster than NFAs, very often it can be reasonable to convert a NFA into a DFA to get a more efficient parser for regular languages. This process can sometimes be a little tricky when the NFA being trnsformed has a complex structure. Closures Before diving into the algorithm, the notion of state closure is necessary. [Def] Closure: Given a FSA and one of its states s ∈S , the closure of that state is represented by the set of all states in the automaton that can be reached from it including the original state. The notion of generic closure can be useful, but not as useful as the a-closure’s one. [Def] a-Closure: Given a FSA, a symbol a ∈Σ and one of its states s ∈S , the closure of that state over symbol a is represented by the set of all states in the automaton that can be reached from it through transitions labelled with symbol a including the original state.
pag 40 Andrea Tino - 2013
A particular type of a-closure is necessary for our purposes.
[Def] ε-Closure: Given an FSA and one of its states s ∈ S, the ε-closure of that state is represented by the set of all states in the automaton that can be reached from it through ε-transitions, including the original state.
How to get the ε-closure
When having an FSA and willing to get the ε-closure of a state, it is possible to follow a simple procedure.
1. The closure set initially contains the original state only.
2. Each element in the closure set is considered and all transitions from that state are evaluated. If an ε-transition is found, the destination state is added to the set.
3. The previous point is repeated until a complete scan of all states in the set causes the closure not to grow.
In pseudo-code the algorithm is very simple to define.
procedure get_closure(i,δ)
  set S <- {i};
  for ∀s ∊ S do
    for ∀x ∊ δ(s,ε) do
      S <- S ∪ {x};
    end
  end
end
The algorithm can be really time consuming considering that its complexity is O(n^2).
Conversion
The conversion from an NFA to a DFA is performed in a 2-step process:
1. All ε-transitions are removed.
2. All non-deterministic transitions are removed.
Removing ε-transitions
Given the original NFA N = (Σ, S, δ, s0, F), we are going to build a new one N′ = (Σ, S, δ′, s0, F′) by safely removing all ε-transitions. The final states subset is:
F′ = F ∪ {s0}  if Ε[s0] ∩ F ≠ ∅
F′ = F         otherwise
The transition function is modified so that ε-transitions are removed in a safe way. In particular, for any given state s ∈ S, all transitions ruled by non-epsilon symbols are kept: δ′(s, a) ⊇ δ(s, a), ∀a ∈ Σ : a ≠ ε.
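A runnable version of the get_closure procedure above might look like the following sketch (Python used as executable pseudo-code; the worklist formulation and the dict-based encoding of δ are my own assumptions).

def epsilon_closure(state, delta):
    """delta maps (state, symbol) -> set of states; "eps" marks ε-transitions."""
    closure = {state}
    worklist = [state]
    while worklist:
        s = worklist.pop()
        for t in delta.get((s, "eps"), set()):
            if t not in closure:        # grow the closure only with new states
                closure.add(t)
                worklist.append(t)
    return closure

# Example: s1 -ε-> s2 -ε-> s3  gives  Ε[s1] = {s1, s2, s3}
delta = {("s1", "eps"): {"s2"}, ("s2", "eps"): {"s3"}}
print(epsilon_closure("s1", delta))   # {'s1', 's2', 's3'}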
pag 41 Andrea Tino - 2013
ε-transitions, instead, are replaced by additional transitions; in particular the following equation holds:
δ′(s, a) = ∪_{x ∈ Ε[s]} δ(x, a), ∀s ∈ S, ∀a ∈ Σ
The trick relies on adding more transitions to replace the empty ones. When considering a state and a symbol, new transitions to destination states (labelled with such symbol) are added whenever ε-transitions allow the passage from the original state to the destination ones. The result is allowing a direct transition to destination states previously reached through empty connections.
Example of ε-transitions removal
Let us consider the NFA accepting strings in the form a*b*c* (Diag. 1 - Original NFA). The second NFA (Diag. 2 - Empty transitions removed) is the one obtained by removing ε-transitions. To understand how such a result is obtained, let us consider each step of the process. First we consider the final states subset. In the first diagram we have only one final state; however, when considering the initial state, we have that its ε-closure contains all other states as well; that's why we need to mark all states as final. We can now consider states and symbols. For example, let us consider the couple (s1,b) in the first diagram; the ε-closure of such state is Ε[s1] = {s1, s2, s3}. By following the definition provided before, we can remove the ε-transition to state s2 and create one transition to s2 labelled with symbol b. Now we consider the couple (s2,c) in the first diagram; the ε-closure of such state is Ε[s2] = {s2, s3}. As before, we can remove the ε-transition to state s3 and create one transition to s3 labelled with symbol c. To understand how the transition from s1 to s3 is generated, we need to consider the couple (s1,c) and follow the same process. The process must be performed for all state/symbol couples; when considering a couple, some more states are to be considered. Because of this the process has a complexity equal to O(|Σ|⋅|S|^2), which makes the approach lengthy for complex NFAs.
NFA integrity
Please note that, at the end of the ε-transitions removal process, the resulting NFA is one accepting the same language as the original automaton.
The subset construction algorithm
After removing ε-transitions, we can proceed to remove non-deterministic transitions from the NFA as well. We consider an NFA without ε-transitions N = (Σ, S, δ, s0, F) and transform it into a DFA D = (Σ, S_D, δ_D, s0_D, F_D) where:
pag 42 Andrea Tino - 2013
• New states identify groups of the previous ones: S_D ⊆ 2^S.
• The final states subset contains those new states containing at least one final state of the original final states subset: F_D = { s ∈ S_D : s ∩ F ≠ ∅ }.
• The new initial state is the set containing the original initial state: s0_D = {s0}.
The new transition function can be obtained by means of the following equation:
δ_D(s, a) = ∪_{x ∈ s} δ(x, a), ∀s ∈ S_D, ∀a ∈ Σ
The definition implies the evaluation of all subsets extractable from the initial states set, which is the reason why this algorithm has an exponential complexity: O(2^|S|). However this is long and time consuming; when performing the process manually, it is better to follow a more systematic methodology.
1. The states set is initialized with a set containing the initial state only: S_D = { s0_D } = { {s0} }.
2. The state transition function δ_D is initialized as an empty table with no associations.
3. For each not-marked state s ∈ S_D, and for each symbol a ∈ Σ, we add to S_D the set ∪_{x ∈ s} δ(x, a). At the same time, we add the association (s, a) → ∪_{x ∈ s} δ(x, a) to δ_D. Then we mark the state.
4. Repeat the same process described in the previous point until all states are marked.
Using pseudo-code, the transformation process can be modeled as follows:
procedure nfa2dfa(δ)
  set S <- {{i}}, U <- {}, ψ <- {};
  for ∀s ∊ S, ∀a ∊ Σ do
    set R <- {};
    for ∀x ∊ s do
      R <- R ∪ δ(x,a);
    end
    S <- S ∪ {R};
    ψ <- ψ ∪ {(s,a,R)};
    U <- U ∪ {s};
  end
end
Example of NFA2DFA transformation
Let us consider the NFA shown below. Following the procedure, we are going to transform it into a DFA accepting the same language.
1. We start by initializing the states set by adding the initial state set to it: S_D = { d0 = {s0} }. We also make this a final state, as it contains a final state of the original NFA.
2. Looking in the states set, we find d0 = {s0}, which is unmarked. The only state inside the set is s_0, so we consider all symbols. We have δ(s0,a) = d1 = {s1,s2} and δ(s0,b) = d2 = ∅. At the same time we add transitions labelled a and b from d_0 to d_1 and from d_0 to d_2.
pag 43 Andrea Tino - 2013
We can now mark d_0 only.
3. The states set is S_D = {d0, d1, d2}. We consider state d1 = {s1, s2}. Again, we apply the same logic and get, for the two elements in the state:
δ(s1,a) = {s1,s2}; δ(s2,a) = ∅; δ(s1,a) ∪ δ(s2,a) = d1
δ(s1,b) = ∅; δ(s2,b) = {s1,s3}; δ(s1,b) ∪ δ(s2,b) = d3 = {s1,s3}
A new state is added: d_3. New transitions are created as well: d_1=(a)=>d_1, d_1=(b)=>d_3. We can now mark state d_1.
4. The states set is S_D = {d0, d1, d2, d3}. We now proceed with state d2 = ∅. We apply the same logic; however we have no elements in the state, so we cannot create new states in the DFA. This is not the same for transitions though; in fact the union returns the empty set when considering symbols a and b, so the following transitions are created: d_2=(a)=>d_2 and d_2=(b)=>d_2. We can mark state d_2.
5. The states set remains the same as the previous iteration didn't add new ones. We can now focus on the last state d3 = {s1, s3}. We follow the same process:
δ(s1,a) = {s1,s2}; δ(s3,a) = {s1,s2}; δ(s1,a) ∪ δ(s3,a) = d1
δ(s1,b) = ∅; δ(s3,b) = ∅; δ(s1,b) ∪ δ(s3,b) = d2
New transitions are created and no new states are added: d_3=(a)=>d_1 and d_3=(b)=>d_2.
(Diag. 1 - Original NFA; Diag. 2 - Resulting DFA)
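The subset construction sketched above can be written as a short worklist algorithm. The following is an illustrative Python version (the data layout, helper names and the encoding of the example NFA are my own assumptions, reconstructed from the δ values used in the steps above); frozensets are used so that DFA states, which are sets of NFA states, can be dictionary keys.

def nfa_to_dfa(delta, start, finals, alphabet):
    """Subset construction on an ε-free NFA.
    delta: dict (state, symbol) -> set of states."""
    d_start = frozenset({start})
    d_states = {d_start}
    d_delta = {}
    worklist = [d_start]
    while worklist:                       # each DFA state is processed once ("marked")
        s = worklist.pop()
        for a in alphabet:
            target = frozenset(set().union(*(delta.get((x, a), set()) for x in s)))
            d_delta[(s, a)] = target
            if target not in d_states:    # a new DFA state was discovered
                d_states.add(target)
                worklist.append(target)
    d_finals = {s for s in d_states if s & set(finals)}
    return d_states, d_delta, d_start, d_finals

# Assumed encoding of the example NFA (only s0 is taken as final, per step 1):
delta = {("s0", "a"): {"s1", "s2"}, ("s1", "a"): {"s1", "s2"},
         ("s2", "b"): {"s1", "s3"}, ("s3", "a"): {"s1", "s2"}}
states, d_delta, d0, d_finals = nfa_to_dfa(delta, "s0", {"s0"}, "ab")
print(len(states))   # 4 states: d0={s0}, d1={s1,s2}, d2=∅, d3={s1,s3}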
    pag 44 Andrea Tino- 2013 Note that the process at every iteration focuses on a state in the DFA. When focusing on a state, new states are created and outgoing transitions might be created from the state in exam. Direct conversion from a regular expression to a DFA To build a function to directly convert a regex into a DFA, we can first remove empty transitions and then applying the subset construction algorithm. The overall complexity of this operation is O 2S ( ). But instead of a real direct approach, this is a 2-step solution. However a straightforward conversion from a regex to a DFA is possible thanks to another algorithm consisting of 3 steps. 1. AST creation: The AST of an augmented version of the regex is created. 2. AST functions generation: Functions operating on nodes of the AST are synthesized. 3. Conversion: From the AST, by means of the functions as well, the final DFA is generated. Creating the AST Given a regex, we augment it with a special symbol (not part of the regex alphabet) at the end of the same. Once we get the augmented regex, we can draw its AST where intermediate nodes are operators and leaves are alphabet characters or the empty string or the augmentation symbol. The augmenting symbol, here, will be character #, and it is to be applied as the last operator in the original regex pattern. For example, given regex ab*(c|d)+, the augmented regex will be: (ab*(c|d)+)#. The augmentation symbol acts like a normal regex character class symbol and must be concatenated to the whole original regex; that’s why sometimes it is necessary to use parentheses. However parentheses can be avoided in some cases like pattern: [a-zA-Z0-1]*, which simply becomes pattern: [a-zA-Z0-1]*#. One key feature of the tree is assigning numbers to leaves by decorating them with position markers. With this, every leaf of the tree will be assigned an index i ∊ N. The way numbers are assigned to leaves is the one defined by the pre-order depth- traverse algorithm; with the difference that intermediate nodes can be removed by the final list. A corollary of this approach is that the augmentation symbol will always be assigned with the last index. Example of creating an AST out of an augmented regex We consider regex (a|b)*abb. The first step is augmentation, so we get regex (a|b)*abb#. Following operators priority (from lowest to highest priority: |, • and *), we use postfix notation and build the AST. The final step is assigning indices to leaves, the process is simple as we only need to order nodes following the pre-order algorithm to get sequence {a,|,b,*,•,a,•,b,•,b•,#}. From the list we remove AST from augmented regex * • | a b #• • b• * a21 3 5 4 6
    pag 45 Andrea Tino- 2013 all intermediate nodes: {a,b,a,b,b,#}, so that we can finally assign numbers from left to right as shown in the diagram. Defining AST node functions After building the AST for the augmented regex, some functions are to be defined. Actually it is a matter of defining one function called follopos(i∊N). But to define such a function, 3 more functions are needed. [Def] Function followpos: Given an AST from an augmented regex, function followpos(N):2^N returns the set of all indices following the input index inside the AST. To compute this function, we also define more functions whose behavior is described by the table reported below. Input node n Function firstpos(n) Function lastpos(n) Function nullable(n) The node is a leaf labelled ε {} {} true The node is a leaf with index i {i} {i} false Option node | (S2)(S1) firstpos(s1) ∪ firstpos(s2) lastpos(s1) ∪ lastpos(s2) nullable(s1) OR nullable(s2) Concatenation node · (S2)(S1) if nullable(s1) => firstpos(s1) ∪ firstpos(s2) else => firstpos(s1) if nullable(s2) => lastpos(s1) ∪ lastpos(s2) else => lastpos(s2) nullable(s1) AND nullable(s2) Kleene star node * (S1) firstpos(s1) lastpos(s1) true As it is possible to see, all functions accept a node and return a set of leaf indices, except function nullable which returns a boolean value. Furthermore, a key difference is to be underlined
    pag 46 Andrea Tino- 2013 between follopos and the other functions: followpos does not accept a node, but an index. Also remember that only leaves are assigned with an index. At this point, evaluating follopos is possible by following 2 rules: 1. If node n is a contatenation node with left child s1 and right child s2, and i ∊ lastpos(s1) is an index, then all indices in firstpos(s2) are inside followpos(i) as well: followpos(i) ⊇ firstpos(s2). 2. If n is a Kleene star node and i ∊ lastpos(n) is an index in lastpos(n), then all indices in firstpos(n) are inside followpos(i) as well: followpos(i) ⊇ firstpos(n). Using an in-depth traverse algorithm, provided that functions nullable, firstpos and lastpos have already been applied to all nodes in the AST, it is possible to evaluate function followpos for every leaf. Using pseudo-code, it is easy to implement the function: procedure followpos(i,n) set F <- {}; for ∀x ∊ n do /* for each node in the tree whose root is n */ if n.type == CONCAT_N and i ∊ lastpos(n.s1) then for ∀y ∊ firstpos(n.s2) do F <- F ∪ {y}; end elseif n.type == KLEENE_N and i ∊ lastpos(n) for ∀y ∊ firstpos(n) do F <- F ∪ {y}; end end end return F; end The procedure requires an iterative or a recursive approach to be considered in order to browse the AST. Please note that Kleene star nodes have only one child, the AST is not an exact binary tree. Example of applying function to an AST Taking advantage of the same example of before, we can now augment the tree with attributes to nodes. To calculate function followpos for each leaf, we first need to calculate the other functions for all nodes. The best approach is starting from leaves and proceeding up the root. Function nullable should be applied first, then functions firstpos and lastpos can be evaluated for each node. When evaluating each function, leaves are to be considered first as intermediate nodes depend on function values of children, which makes the methodology a recursive one.
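To make these functions concrete, here is a compact sketch that computes nullable, firstpos, lastpos and followpos over a small AST (Python, with a minimal node encoding of my own; the leaf indices are the ones assigned earlier to the augmented regex (a|b)*abb#).

class Node:
    def __init__(self, kind, left=None, right=None, index=None):
        self.kind = kind            # "leaf", "cat", "or", "star"
        self.left, self.right, self.index = left, right, index

def nullable(n):
    if n.kind == "leaf": return False          # an ε-leaf would return True
    if n.kind == "or":   return nullable(n.left) or nullable(n.right)
    if n.kind == "cat":  return nullable(n.left) and nullable(n.right)
    if n.kind == "star": return True

def firstpos(n):
    if n.kind == "leaf": return {n.index}
    if n.kind == "or":   return firstpos(n.left) | firstpos(n.right)
    if n.kind == "cat":
        return firstpos(n.left) | firstpos(n.right) if nullable(n.left) else firstpos(n.left)
    if n.kind == "star": return firstpos(n.left)

def lastpos(n):
    if n.kind == "leaf": return {n.index}
    if n.kind == "or":   return lastpos(n.left) | lastpos(n.right)
    if n.kind == "cat":
        return lastpos(n.left) | lastpos(n.right) if nullable(n.right) else lastpos(n.right)
    if n.kind == "star": return lastpos(n.left)

def followpos(root, table):
    """table: dict index -> set of indices; filled in place by visiting every node."""
    if root is None or root.kind == "leaf":
        return
    if root.kind == "cat":
        for i in lastpos(root.left):
            table[i] |= firstpos(root.right)
    if root.kind == "star":
        for i in lastpos(root):
            table[i] |= firstpos(root)
    followpos(root.left, table)
    followpos(root.right, table)

# AST for the augmented regex (a|b)*abb#  -> leaves a:1, b:2, a:3, b:4, b:5, #:6
leaf = lambda i: Node("leaf", index=i)
ast = Node("cat", Node("cat", Node("cat", Node("cat",
      Node("star", Node("or", leaf(1), leaf(2))), leaf(3)), leaf(4)), leaf(5)), leaf(6))
table = {i: set() for i in range(1, 7)}
followpos(ast, table)
print(table)   # 1,2 -> {1,2,3}; 3 -> {4}; 4 -> {5}; 5 -> {6}; 6 -> empty set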
    pag 47 Andrea Tino- 2013 * • | a b #• • b• * a{1} {2} {3} {4} {5} {6} {1,2} {1,2} {1,2,3} {1,2,3} {1,2,3} {1,2,3} * • | a b #• • b• * afalse false false true false false false false false false false false * • | a b #• • b• * a{1} {2} {3} {4} {5} {6} {1,2} {1,2} {3} {4} {5} {6} Diag. 1 - Applying nullable Diag. 2 - Applying firstpos Diag.3 - Applying lastpos Now we can evaluate function followpos for every leaf index. However the approach is a little bit different, we start considering an index and for each one of them we must consider all nodes in the tree to evaluate the function. Starting from the root is not mandatory, but can be a good approach. So we have the following associations: 1=>{1,2,3}, 2=>{1,2,3}, 3=>{4}, 4=>{5}, 5=>{6}, 6=>{}. Conversion algorithm After all nodes have been decorated with results of function followpos being applied to them, we can synthesize the DFA. The methodology is sistematic and described by the following steps: 1. Initialize the DFA’s states set S ⊆ 2^N with the empty set: S = {}. States of the final DFA are represented by sets of indices labelling leaves of the AST. 2. The first state being considered is the DFA start state s_0. This state is s_0 = firstpos(r), where r is the AST’s root node. State s_0 is to be inserted into S as a not-marked state. 3. For each not-marked state s ∈S and for each symbol a ∈Σ , consider all indices si ∈s in state s that map onto symbol a, and evaluate function followpos for each: si F( ) = followpos si( ). Create a temporary set with all collected indices: ′s = si F( ) ∀i=1…s  . 4. If temporary set ′s is not empty, then it is to be identified as a new valid state and added to the states set S as a not-marked state. However, if ′s is not empty but already in S, then leave it as it is; if it is a not-marked state it will remain so, the same goes in case the state is marked. 5. Let δ :S × Σ  S be the transition function. Here the function does not return a set of states as we are going to create a DFA not a NFA! Create connection δ s,a( )= ′s . 6. Run again for another not-marked state jumping to point 3. If the states set S is full of marked- states, then the algorithm is over.
    pag 48 Andrea Tino- 2013 In pseudo-code, the algorithm is more straightforward: procedure synth_dfa(p) set n <- create_ast(augment_regex(p,”#”)); /* step 1 */ n <- decorate_tree(n,followpos); /* step 2, ast represented as its root */ set S <- {firstpos(n)}; U <- {} set δ <- {}; /* dfa initialization */ for ∀s ∊ S, ∀a ∊ Σ do /* for each not-marked state and symbol */ set s1 <- {}; for ∀i ∊ s do if σ(i) == a then s1 <- s1 ∪ followpos(i); end end if s1 != {} then δ <- δ ∪ {(s,a,s1)}; if not s1 ∊ s then S <- S ∪ {s1}; end end U <- U ∪ {s}; end return S,δ; end Function σ :  Σ Returns the symbol associated to the leaf index specified as input. Final states in the DFA are all those ones containing the index of the augmentation symbol. Example of converting a decorated AST into a DFA By means of the same example, we can finally create the DFA. Passages are shown in the table. Working state Working symbol Step description Current s_0 = {1,2,3} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(2); s_1 = {1,2,3,4}; δ(s_0,a) = s_1; a s0 s1 b s1 = {2}; f(2) = {1,2,3}; s_0 = f(2); δ(s_0,b) = s_0; b a s0 s1
    pag 49 Andrea Tino- 2013 Working state Working symbol Step description Current s_1 = {1,2,3,4} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(2); δ(s_1,a) = s_1; !!NEW STATE ADDED!! b a s0 s1 a b s1 = {2,4}; f(2) = {1,2,3}; f(4) = {5}; s_2 = f(2) ∪ f(4); s_2 = {1,2,3,5}; δ(s_1,b) = s_2; !!NEW STATE ADDED!! b a s0 s1 a s2 b s_2 = {1,2,3,5} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(3); δ(s_2,a) = s_1; b a s0 s1 a s2 b a b s1 = {2,5}; f(2) = {1,2,3}; f(5) = {6}; s_3 = f(2) ∪ f(5); s_3 = {1,2,3,6}; δ(s_2,b) = s_3; !!NEW STATE ADDED!! b a s0 s1 a s2 b a s3 b
    pag 50 Andrea Tino- 2013 Working state Working symbol Step description Current s_3 = {1,2,3,6} a s1 = {1,3}; f(1) = {1,2,3}; f(3) = {4}; s_1 = f(1) ∪ f(3); δ(s_3,a) = s_1; !!FINAL STATE ADDED!! b a s0 s1 a s2 b a s3 b a b s1 = {2}; f(2) = {1,2,3}; s_0 = f(2); δ(s_3,b) = s_0; b a s0 s1 a s2 b a b a b s3 In the last passage, final states are defined! As we can see, the final automaton is a deterministic one. The approach allows the direct calculation of the DFA from a regex. Minimizing FSAs A question naturally arises: “Is the automaton in minimal form?“. Minimization is a term used to refer to DFAs which have the minimum number of states to accept a certain language. More DFAs can accept the same language, but there is one (without considering states labelling) that can accept the language with the minimum number of states: that DFA is the minimal DFA. When having a DFA, we can try to make it minimal using a sistematic methodology, if final result is the same, then the original DFA was already in its minimal form. [Theo] Minimal DFA existance: For each regular language that can be accepted by a
    pag 51 Andrea Tino- 2013 DFA, there exists a minimal automaton (thus a DFA with minimum number of states) which is unique (except that states can be given different names). [Cor] Minimal DFA’s computational cost: The minimal DFA ensures minimal computational cost for a regular language parsing. Hopcroft’s DFA minimization algorithm DFA minimization can be carried out using different procedures; a very common approach is Hopcroft’s algorithm. Considering a DFA, to obtain the minimal form the following procedure is to be applied: 1. Consider the states set S ⊆  , and split it into two complementar sets: the set of final states S F( ) ⊂ S and the set of non-final states S N( ) ⊂ S . 2. Consider a generic group and create partitions inside that group. The rule to create partitions is having all states inside it having the minimal relation property. 3. Proceed creating more partitions until no more partitions can be created. 4. At the end of the partitioning process, consider all partitions to be the new states of the minimized DFA. Remove internal transitions (inside every partition) and leave all partition- crossing transitions. 5. Make the initial state of the minimal DFA the state containing the initial state of the original DFA. Make final states in the minimal DFA all states containing final states in the original DFA. Inside a partition, all states have the following property: [Theo] State in partition: In a minimal DFA, a state in a partition has at least one transition to another state of the same partition. A direct consequence of the previous result, is a useful approach to locate partitions, or better to locate states that can be removed from a partition in order to be placed into a different one. [Cor] State not in partition: During the partitioning process to get the minimal DFA from a DFA, a state is to be placed into a different partition when it has no transitions to states of that partition. Example of DFA minimization We consider the previous example and try to check whether the DFA is a minimal one or not. By carrying out the partitioning process, we note that state s_0 falls inside the non-final states partition. We also note that in that partition, no transition are directed to s_0 from other states in the group: state s_0 is to be placed into a different partition.
pag 52 Andrea Tino - 2013
After doing so, we note that no more partitioning can be carried out and we stop the process. Inside the only remaining group we remove the transitions from/to states s_1 and s_2 and draw the remaining transitions to/from the entire partition to the other nodes (without forgetting self-transitions).
(Diag. 1 - Original DFA; Diag. 2 - Minimized DFA)
The example also shows that the process of converting a regex into a DFA does not, in general, return a minimal DFA.
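For completeness, here is a small sketch of DFA minimization by iterative partition refinement (closer to Moore's classic refinement than to Hopcroft's optimized worklist algorithm described above; the encoding, names and the tiny example DFA are my own).

def minimize(states, alphabet, delta, finals):
    """DFA minimization by iterative partition refinement.
    delta: dict (state, symbol) -> state, total over states × alphabet."""
    partitions = [p for p in (set(finals), set(states) - set(finals)) if p]
    changed = True
    while changed:
        changed = False
        refined = []
        for block in partitions:
            # split the block by where each state's transitions land (per partition)
            groups = {}
            for s in block:
                key = tuple(next(i for i, p in enumerate(partitions)
                                 if delta[(s, a)] in p) for a in alphabet)
                groups.setdefault(key, set()).add(s)
            refined.extend(groups.values())
            changed |= len(groups) > 1
        partitions = refined
    return partitions

# Tiny example of mine: over {a}, accept strings with at least one 'a';
# q1 and q2 are equivalent and get merged into one block.
delta = {("q0", "a"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q2"}
print(minimize({"q0", "q1", "q2"}, "a", delta, {"q1", "q2"}))
# -> [{'q1', 'q2'}, {'q0'}]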
Andrea Tino - 2013
Lexical analysis and scanners
“Plurality should not be posited without necessity.”
William of Occam
http://www.brainyquote.com/quotes/quotes/w/williamofo283623.html
pag 54 Andrea Tino - 2013
About lexicon
In a language, the lexicon is that part of the language describing all allowed lexemes in phrases. Lexemes, or tokens, in a language can be divided into some groups or lexical classes.
Lexical class | Example (C++) | Description
Keywords | void fun(); if (var) {...} | Fixed words in the language that cannot be altered or inflected.
Delimiters | int i = 0; for (;;) {} char a,s,d; | Particular characters used to separate other tokens in a statement.
Operators | a <= b; if (a==b)... int a = 2; | They typically consist of single characters or couples of characters; used to evaluate special functions.
Composite structures | // Comment /* Comment */ | Several lexemes identified as a single structure, like comments.
Open classes | int identifier = 1; if (var==1) {...} obj = new type(); | All those tokens that cannot be enumerated. They can be altered or inflected. Identifiers for functions and variables. Literals as well.
A lexer for a language is a program whose purpose is recognizing valid tokens inside a phrase provided as input. Such a task is not so difficult; however, the main problem is represented by open lexical classes. They cannot be enumerated, but they are valid tokens to be recognized like all the others. That's why a lexer is implemented as an FSA. If a language were finite, then it would be a regular language, but more importantly all lexemes would be enumerable. The problem of recognizing a token (if the language is small) might be handled by simple algorithms checking whether an element belongs to a set or not (Bloom filters for example). But if the language has open classes in its lexicon, then it is a whole different matter. To recognize all tokens, it is necessary to generate them. So a grammar is to be used! In particular, tokens for a language are treated as regular expressions. In fact they all constitute a regular language where the order of tokens is not important (as a token is to be recognized as a single entity). That's the reason why lexers are implemented using FSAs.
What should a lexer do?
A lexer has some important tasks. Basically our compiler will receive a string as input; the first stage is the lexer, which takes the string and needs to manipulate it to return tokens. In the following we can find all the tasks a lexer needs to perform:
• A lexer must provide a system to isolate low level structures from the higher level ones representing the syntax of the language.
• A lexer must subdivide the input string into tokens. The operation is known as tokenization and it represents the most important activity carried out by a lexical analyzer. During this process, invalid sequences might be recognized and reported.
• A lexer is also responsible for cleaning the input code. For example, white spaces are to be removed, as well as line breaks or comments.
• A lexer can also perform some little activities concerning the semantic level. Although semantics is handled in the final stage of the compilation process, lexers and parsers can provide a little semantic analysis to make the whole process faster. A typical application is having lexers insert symbols and values in the symbol table when encountering expressions like variables.

Separation is not compulsory
It is important to underline a concept: lexical analysis and syntax analysis can be carried out together inside the same component. There is no need to have two components handling lexicon and syntax separately. However, if a compiler is designed to be modular, it is easier to modify it when the language changes.

Communications between lexer and parser
We will consider a compiler with a modular structure, so the lexer and the parser in our compiler are implemented as two different components. However they interact together: in particular, the parser requests one token at a time from the lexer, which provides them. The source code is not seen by the parser, as it comes as input to the lexer. A minimal sketch of this pull-style interaction is shown at the end of this section.

About tokens
There is a little naming to understand when considering a language's lexicon. The first thing to understand is the difference between a lexeme and a token in the language.

[Def] Lexeme: In a language, a lexeme is the most basic lexical constituent. A lexeme is represented by a word of the language carrying, at most, one single semantic value.

[Def] Token: In a language, a token is a lexeme or a sequence of lexemes. It is the unit of information, carrying a certain semantic value, which is returned by the lexer to the parser.

The difference is not that evident and not so simple to grasp. However, that's when examples come to the rescue. An identifier, for example, is a lexeme and at the same time a token. A keyword is another example of a lexeme being a token as well. On the other hand, a comment like /* comment */ in C++ is a token, but it is a sequence of different lexemes starting from the comment delimiters /* and */. An easy way to differentiate tokens from lexemes is placing our point of view right in between the lexer and the parser. Everything which the lexer passes to the parser is a token. If a lexeme is never passed, then it is not a token.
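To make the token-at-a-time interaction concrete, here is a minimal sketch of a pull-style lexer interface in C++. The Token type, the next_token name and the whitespace-skipping logic are assumptions made for illustration; they do not reproduce the interface of any specific tool.

#include <cctype>
#include <string>

/* Kinds of tokens; a real lexer would have many more. */
enum class TokenType { Identifier, Number, Operator, End };

struct Token {
    TokenType type;
    std::string value;   /* the matched text */
};

/* A pull-style lexer: the parser calls next_token() whenever it needs one. */
class Lexer {
public:
    explicit Lexer(const std::string& source) : src(source), pos(0) {}

    Token next_token() {
        while (pos < src.size() && std::isspace((unsigned char)src[pos])) ++pos;  /* cleaning */
        if (pos >= src.size()) return { TokenType::End, "" };

        if (std::isalpha((unsigned char)src[pos])) {                 /* identifier / keyword */
            std::string value;
            while (pos < src.size() && std::isalnum((unsigned char)src[pos])) value += src[pos++];
            return { TokenType::Identifier, value };
        }
        if (std::isdigit((unsigned char)src[pos])) {                 /* numeric literal */
            std::string value;
            while (pos < src.size() && std::isdigit((unsigned char)src[pos])) value += src[pos++];
            return { TokenType::Number, value };
        }
        return { TokenType::Operator, std::string(1, src[pos++]) };  /* single-char operator */
    }

private:
    std::string src;
    size_t pos;
};

/* The parser would drive the lexer like this: */
int main() {
    Lexer lexer("count = count + 42");
    for (Token t = lexer.next_token(); t.type != TokenType::End; t = lexer.next_token()) {
        /* ...parser logic consuming one token at a time... */
    }
    return 0;
}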
Semantics
We said that both lexemes and tokens have semantics associated with them. However, when a lexeme is not a token, it is likely to be passed to the parser as part of a token consisting of several lexemes. Semantics is something that, in the end, will be seen by the parser. The parser has no knowledge related to lexemes. So the final semantic values are those arranged by the lexer while working on lexemes and attached to tokens. A typical example of semantics handling is when working with literals. Consider the following C/C++ input code: int n = 0;. The lexer will return 5 different tokens to the parser: a keyword, an identifier, an operator, a literal and a delimiter. They all carry semantics with them except for the operator and the delimiter tokens. The keyword carries information about the type of the variable being declared in the statement; the identifier provides the name of the variable to be inserted as a key in the symbol table; and the literal provides a numerical value. When the literal is handled by the lexer as a lexeme it is seen as a string: the lexer needs to parse it as an integral value and associate that value to the literal token that will be sent to the parser.

Structure of a token
It is evident that a token is not simply passed to the parser as a string. A composite structure, representing the whole token, is passed to the parser. Typically the following fields are included in a token:
• Token type: The type of token: identifier, keyword, operator, etc.
• Token value: The string representing the sequence of lexemes fetched and recognized by the lexer and passed to the parser.
• Attributes: Semantic values (if any). The content of this field strongly depends on the type of token being considered.
• Localization: Typically a field used to locate errors, if any. This field is optional. When the compiler implements error handling routines, this field is very useful.
Depending on the token type, some fields might get ignored by the parser. For example a keyword does not carry any special meaning; in this specific case, no semantic values are added either.
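As a sketch, the token structure described above might look like the following in C++; the field names and the attribute handling are assumptions for illustration, not a prescribed layout.

#include <string>

/* The kind of token recognized by the lexer. */
enum class TokenType { Keyword, Identifier, Operator, Literal, Delimiter };

/* A token as passed from the lexer to the parser. */
struct Token {
    TokenType type;          /* token type: identifier, keyword, operator, ... */
    std::string value;       /* the recognized sequence of characters */

    /* Attributes: semantic values, meaningful only for some token types. */
    long intValue = 0;       /* e.g. the parsed value of an integer literal */
    std::string name;        /* e.g. the name of an identifier */

    /* Localization: used by error handling routines. */
    int line = 0;
    int column = 0;
};

/* Example: the literal "42" at line 3, column 9 becomes a Literal token
   whose string value has already been parsed into an integer attribute. */
Token makeIntegerLiteral(const std::string& text, int line, int column) {
    Token t;
    t.type = TokenType::Literal;
    t.value = text;
    t.intValue = std::stol(text);   /* the lexer converts the lexeme into a number */
    t.line = line;
    t.column = column;
    return t;
}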
How to implement a scanner
Implementing a scanner is not so difficult after all. But how to proceed? We learned that a scanner is based on regular expressions, and we learned that regexes are evaluated using FSAs. So we should learn how to implement an FSA. This is not necessarily true; in fact there are more options than FSAs only. Let us consider them all in a brief overview.

Methodology | Flow | Description
Procedural approach | Regular Grammar => Program, Regex => Program | This is a hard-coded solution. Starting from a regex or a regular grammar, a program is synthesized by manually coding the application.
DFA implementation approach | Regular Grammar => DFA => Program, Regex => DFA => Program | From a regular grammar or a regex, a DFA accepting the language is obtained. This solution provides systematic methodologies to hard-code DFAs.
Scanner generators | Regex => Tool => Program | This is the most common solution. Using generative tools, one only needs to specify regexes. The tool will generate the code to be compiled into the final lexer.

Today, scanner generators are very common and are usually preferred to hard-coded approaches.

Procedural approach: regex 2 program
The methodology provides a sequence of rules to hard-code a scanner accepting a certain regular expression. So the problem is converting a regular expression into a program accepting it. With this solution there is no need to convert the regex into a DFA: the approach operates directly on the regex. The program will call a function whose body is filled using some rules. Using C/C++, a function is to be created:

typedef int idx; /* index for a character in the input string */
enum Char {...}; /* symbols in alphabet */
Char cur(const std::string& str, idx i) {...} /* gets the current char */

bool scanner(const std::string& input) {
    idx i = 0; /* index for the character to consider */
    /* body here */
}

Rules
Procedural code is added according to the symbols and characters in the regex:
1. Write a sequence for each concatenation.
2. Write a test for each union.
3. Write a loop for each Kleene star.
4. Place return points for each accepting position in the regex.
In this case as well, only union, concatenation and Kleene-star operators are considered. Every other regex operator is to be converted into a more basic form using the afore-mentioned symbols. For example, regex a(a|b)+ should be converted into a(a|b)(a|b)*.

Example of scanner hard-coding from a regex
The problem of this approach is the order of operators. Although not strictly necessary, one should figure out the operators tree; after that, starting from the root, the rules are applied. Consider the following example for regex (a(a|b)*)|b.

enum Char {
    CHAR_A = 0x00, /* in alphabet: a */
    CHAR_B = 0x01  /* in alphabet: b */
};

Char cur(const std::string& str, idx i) {
    if (i < 0 || (size_t)i >= str.length()) throw std::exception();
    if (str[i] == 'a') return CHAR_A;
    if (str[i] == 'b') return CHAR_B;
    throw std::exception(); /* character not in the alphabet */
}

bool scanner(const std::string& input) { /* the scanning function */
    idx i = 0;
    try {
        if (cur(input,i) == CHAR_A) { /* test */
            i++; /* consume */
            while (i < (idx)input.length() &&
                   (cur(input,i) == CHAR_A || cur(input,i) == CHAR_B)) /* kleene */
                i++; /* consume */
            return true; /* accept */
        } else if (cur(input,i) == CHAR_B) return true; /* test */
        return false; /* reject */
    } catch (std::exception& e) {
        return false; /* reject */
    }
}

The code above implements a scanner recognizing the provided regex. Be careful: the procedure implements a scanner for one specific regex only; if another regex is to be matched, a new procedure must be written. To use the scanner, the main routine is simply written with a call to function scanner inside it.
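For instance, a minimal driver might look like the following sketch (the command-line handling and the sample input are assumptions; only the call to scanner matters):

#include <iostream>
#include <string>

bool scanner(const std::string& input); /* the hard-coded scanner shown above */

int main(int argc, char* argv[]) {
    std::string input = (argc > 1) ? argv[1] : "abba"; /* sample input */
    std::cout << (scanner(input) ? "accepted" : "rejected") << std::endl;
    return 0;
}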
Procedural approach: regular grammar 2 program
When having a grammar, it is possible to write a scanner from it. As for regexes, a methodology can be considered to create the scanning program. This time the rules focus on the production rules of the grammar to synthesize the scanner. The scanner function will have the same structure and interface as in the previous approach. Also, function cur and enum Char are recycled as well.

Rules
Procedural code is added according to the non-terminals and production rules in the grammar:
1. Write a function for each non-terminal.
2. Write a test for each alternative in each production rule.
3. Call the corresponding function for each non-terminal appearing in the RHS of a production rule.
4. Place return points in all places where a grammar rule is verified and when terminals are reached.

First, all non-terminals appearing in rules are to be processed and functions are to be created accordingly. After that, one focuses on production rules. For each production rule, the alternatives are to be evaluated; for each alternative a test is written. In the end, every alternative's RHS is evaluated and functions are called when the corresponding non-terminal is encountered.

Example of scanner hard-coding from a regular grammar
Consider the following regular grammar:

L = {S,A,0,1,2,3,#};
V = {S,A};
T = {0,1,2,3,#};
S -> 0 A | 1 A | 2 A | 3 A;
A -> 0 A | 1 A | 2 A | 3 A | #;

The code to synthesize is shown below. The enumeration hosts the terminal symbols only, while function cur acts accordingly:

enum Char { /* terminals only */
    CHAR_0 = 0x00,
    CHAR_1 = 0x01,
    CHAR_2 = 0x02,
    CHAR_3 = 0x03,
    CHAR_HASH = 0xf0 /* the # terminal */
};

Char cur(const std::string& str, idx i) {
    if (i < 0 || (size_t)i >= str.length()) throw std::exception();
    if (str[i]=='0') return CHAR_0;
    if (str[i]=='1') return CHAR_1;
    if (str[i]=='2') return CHAR_2;
    if (str[i]=='3') return CHAR_3;
    if (str[i]=='#') return CHAR_HASH;
    throw std::exception(); /* character not in the alphabet */
}

Before handling function scanner, we need to create all the functions that manage the non-terminals:

bool _handle_A(const std::string& input, idx& i); /* forward declaration: S calls A */

bool _handle_S(const std::string& input, idx& i) { /* non-terminal S */
    if (cur(input,i) == CHAR_0) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_1) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_2) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_3) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    return false; /* reject */
} /* _handle_S */

bool _handle_A(const std::string& input, idx& i) { /* non-terminal A */
    if (cur(input,i) == CHAR_0) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_1) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_2) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_3) { /* test */
        i++; /* consume */
        if (_handle_A(input,i)) return true; /* ok */
    }
    if (cur(input,i) == CHAR_HASH) { /* test */
        i++; /* consume */
        return true; /* ok */
    }
    return false; /* reject */
} /* _handle_A */

Please note how the non-terminal handling procedures accept a reference to the index.

bool scanner(const std::string& input) { /* the scanning function */
    idx i = 0;
    try {
        return _handle_S(input,i); /* start from the start symbol S */
    } catch (std::exception& e) {
        return false; /* reject */
    }
}

The approach, however, might require changing the grammar style in some cases. For example, recursive rules can be a little problematic; creating more rules can help to make the scanner work with this approach.
DFA implementation approach
It doesn't matter whether we start from a regex or from a regular grammar: we need to convert it into a DFA, and after that the following methodology can be applied. So far we can convert a regex into a DFA, but other approaches (not covered here) can be used to handle grammar-to-DFA conversions (for example by generating a regex from a grammar). To implement a DFA by hard-coding it, we need a scanning routine and a function acting as the DFA transition function.

typedef int idx; /* index for a character in the input string */
enum State {...}; /* states of the dfa */
enum Char {...}; /* symbols in alphabet */
State deltaf(const State& s, const Char c) {...} /* transition function */
bool is_final(const State& s) {...} /* is the state final? */
Char cur(const std::string& str, idx i) {...} /* gets the current char */

bool scanner(const std::string& input) { /* the scanning function */
    State current = STATE_BEGIN;
    idx i = 0;
    try {
        for (;;) {
            if ((size_t)i >= input.length()) break; /* reached the end of the string */
            current = deltaf(current, cur(input,i++)); /* transition & consume */
            if (current == STATE_NONE) return false; /* reject */
        }
        return is_final(current);
    } catch (std::exception& e) {
        return false; /* reject */
    }
}

So the problem is not writing the scanning routine, but writing the transition function for the DFA; this is actually a very simple task. The transition routine and the State enum should be built according to the following principles:
1. Enumeration State must provide two compulsory values: STATE_BEGIN, the start state of the DFA, and the value STATE_NONE.
2. Function deltaf must always return a single value of type State for each (state, character) couple. All combinations must be handled.
3. When a transition from a state for a given symbol does not exist in the original DFA, function deltaf returns the value STATE_NONE.

Mind that, in case the transition function is not properly coded, infinite loops may arise. A DFA is quite simple to hard-code because transitions are deterministic! This explains the
reason why DFAs are faster.

Example of scanner hard-coding from a DFA
Let us consider the DFA reported here (Diag. - a four-state DFA with states d0-d3 over alphabet {a,b}). We can write the transition function by following its connections and symbols.

enum State {
    STATE_BEGIN = 0x00,
    STATE_NONE = 0xff,
    STATE_D1 = 0x01,
    STATE_D2 = 0x02,
    STATE_D3 = 0x03
};
enum Char {
    CHAR_A = 0x00,
    CHAR_B = 0x01
};

State deltaf(const State& s, const Char c) {
    if (s == STATE_BEGIN && c == CHAR_A) return STATE_D1;
    if (s == STATE_BEGIN && c == CHAR_B) return STATE_D2;
    if (s == STATE_D1 && c == CHAR_A) return STATE_D1;
    if (s == STATE_D1 && c == CHAR_B) return STATE_D3;
    if (s == STATE_D2 && c == CHAR_A) return STATE_D2;
    if (s == STATE_D2 && c == CHAR_B) return STATE_D2;
    if (s == STATE_D3 && c == CHAR_A) return STATE_D1;
    if (s == STATE_D3 && c == CHAR_B) return STATE_D2;
    return STATE_NONE; /* no transition */
}

bool is_final(const State& s) {
    if (s == STATE_BEGIN || s == STATE_D1) return true;
    return false;
}

This methodology actually requires much less effort than the ones considered before.

More issues when building scanners
When developing a scanner, there is a certain set of problems which are fairly common. We are going to analyze them without detailing things too much, just for the sake of knowledge.

Identifiers vs. reserved keywords
How can a scanner recognize whether a certain sequence of characters is to be identified as a reserved keyword or as an identifier? The problem comes from the fact that reserved keywords and identifiers usually, in the most general case, fall into the same matches when processed by the various regex patterns while scanning the input. To solve the problem, some solutions can be
considered.
• When scanning a sequence, only the regexes mapping identifiers are used; no regexes are inserted to evaluate reserved words. However, before returning an identifier token, the sequence is searched in a reserved-words table. If the sequence is matched by an entry, the returned token's type is changed into a reserved word, causing the lexer to return a reserved-word token; otherwise nothing is done and an identifier token is returned (a minimal sketch of this table lookup is shown at the end of this section).
• Another approach is giving priority to identifiers. Reserved words will be chosen from the complementary set of the one generated by all matches identified by the regexes for identifiers. By doing so we prevent identifiers' regexes from matching reserved words. However, such an approach can be very limiting, as the complementary set can leave very few options for reserved words. Also, reserved words should be easy to associate with the language semantics; with this approach it could be impossible to choose appropriate names for reserved words. For example, the if keyword might not be available: what to do then?
• Lexers today follow this solution: all regexes are assigned a priority. Among all regexes matching a given sequence, the one holding the highest priority will be considered. Solving the identifiers vs. reserved words problem is easy at this point: the regexes assigned to reserved words are given the highest priorities.

Getting semantic values
Some tokens can have semantic values associated with them. The classic example is represented by numerical literals: they should carry the integral or double values they represent. This operation sometimes is not easy. The real problem is always converting a string into a numerical value. The problem is neither simple nor trivial because different machine architectures might support different encodings.

Scanner termination
The problem of terminating the scanning is a logical issue, but sometimes it can hide some troubles. A typical approach is allowing the existence of an EOF token released when a particular sequence is encountered.

Character lookahead
Some languages introduce particular sequences for which it is necessary to allow the lexer to peek at the next character without consuming the current one. This approach can be really powerful.

Error recovery
Error management in scanners is very important. The problem is not just notifying an error, but recovering from it. If a sequence is not recognized, the scanner should not be stopped. That's why
the current trend is trying to discard a non-recognized sequence and go on to the next one in order to return a token. Current approaches are the following:
• Discarding all characters read until the error occurrence and restarting scanning operations. This approach completely discards a token and proceeds further.
• Starting from the moment the previous token was returned until the error occurrence, discarding the first read character and scanning again from the next one. This approach tries to fix a misspelled keyword.
The rationale behind such a logic is explained by the empirical observation that the most common lexical errors are illegal characters or misspelled keywords. In such cases, statistics show that errors typically occur at the beginning of the token; in this case both approaches detailed before are equivalent.
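Going back to the reserved-words table mentioned earlier, here is a minimal sketch of that lookup: the lexer matches a sequence with the identifier regex and only then decides whether it is actually a keyword. The token types and the keyword list are assumptions made for the example.

#include <set>
#include <string>

enum class TokenType { Identifier, Keyword };

struct Token {
    TokenType type;
    std::string value;
};

/* The reserved-words table: every sequence matched as an identifier is
   looked up here before the token is returned to the parser. */
static const std::set<std::string> reservedWords = {
    "if", "else", "while", "for", "return", "int", "void"
};

/* Called by the lexer once the identifier regex has matched `text`. */
Token classifyIdentifier(const std::string& text) {
    if (reservedWords.count(text) > 0)
        return { TokenType::Keyword, text };      /* it was a keyword after all */
    return { TokenType::Identifier, text };       /* a plain identifier */
}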
Syntax analysis and context-free grammars
“Machines take me by surprise with great frequency.” (Alan Turing)
http://www.brainyquote.com/quotes/quotes/a/alanturing269234.html
Syntax and CFGs
A language, as described before, is made of two important components:
• A set of words over an alphabet representing all the recognized and valid lexemes.
• A syntax.
What is the syntax of a language? The syntax is a set of rules which defines how words can be arranged together in order to create valid phrases in the language. A syntax also allows the correct translation of a phrase into a computer readable structure; this structure is called the Abstract Syntax Tree. Syntax handles sequences of tokens, which must be recognized during lexical analysis. Each token is an atomic structure in the language that cannot be decomposed further.

What is a CFG? Context Free Grammars are the particular kind of grammars corresponding to type-2 in the Chomsky-Schützenberger hierarchy. Today almost all languages take advantage of context-free syntaxes. Throughout this chapter, unless differently specified, we will consider CFGs only, also when mentioning a generic grammar.

What are the pros of using CFGs? CFGs can be parsed (processed and recognized by a syntax analyzer) using simple, compact, fast and efficient algorithms. Production rules can be defined intuitively. Today even the most complex languages can be described by type-2 grammars.

What about the cons? Although very powerful, CFGs cannot really describe all possible phrases in modern languages. However, hybrid solutions can be used. For example, some languages are parsed using context-free syntax parsers; everything left unrecognized is then processed by other parsers. A context-free model, for instance, is not able to express the rule that a variable must be declared before being used. In modern solutions, CFGs are used anyway and the declaration matter is then handled at semantic analysis.

How is parsing performed? Parsing can be performed all at once given the source code or, in a better scheme, over a sequence of tokens evaluated by the scanner one step before. The latter is a better approach because it allows a better separation between lexical and syntax analysis. The parser, when running, asks the scanner for the next token of the sequence.

Derivations: concepts and definition
CFGs enable many features in a language. Differently from regular languages, a CFL (Context Free Language) can have recursive structures. Being type-2 in the Chomsky-Schützenberger hierarchy, such grammars have all rules in the form A → γ. That is, a single non-terminal is the only symbol allowed in the LHS, while the RHS can be a generic sequence of terminals and non-terminals. Although the empty string is not really allowed as a rule's RHS in the original Chomsky-
Schützenberger hierarchy, we can also allow such a condition.

Derivations
Production rules in CFGs can also be called rewriting rules. The concept of derivation is very important because it is strictly bound to the output of parsers: ASTs. Following the rules of a CFG and starting from a string, it is possible to apply many rules and get new sequences of symbols. A derivation is any sequence of rule applications in a grammar, starting from a given initial string. When it comes to CFGs, it is very simple to apply rules: each non-terminal can be expanded; only when the final result contains no non-terminals is the process over. Derivations in CFGs can also be visualized using trees.

Tokens
Tokens returned by the lexer are terminals in the grammar handled by the parser.

An example: algebraic expressions
Consider the following example showing a simple CFG consisting of the following elements:

L = {S, x, y, z, +, -, *, /, (, )};
V = {S};
T = L - V;

And having the following rules:

S -> x; S -> y; S -> z; S -> (S);
S -> S+S; S -> S-S; S -> S*S; S -> S/S;

This CFG defines a simple language to create basic algebraic expressions. For example the string (x+y)*(x-z) can be generated by the following sequence of rules: S => S*S => (S)*(S) => (S+S)*(S-S) => (x+y)*(x-z). Each step α => β is a single derivation.

The language of derivations
Derivations are represented by double arrows between two strings. Since a derivation is the result of applying one or more rules of the grammar one or more times, it is not always possible to tell how many rules led from the LHS of the derivation to the RHS. If a derivation like α => β is considered, we say that it is valid (for a given grammar) only if there exists a rule legitimating such a transition. In a more formal way, we can summarize this property as follows:

αAβ ⇒ αγβ  ⇔  ∃ (A → γ) ∈ P,  for all α, β ∈ (V ∪ T)*

for any given context-free grammar G = (V, T, P, S). This means that the expression α => β refers to a single-step derivation. The following table shows all possible symbols to use when handling derivations and their meanings.

Notation | Relation | Description
Single step | α => β | α directly derives β; β is directly derived from α. To reach the RHS from the LHS, one derivation step is needed.
Fixed steps | α =>{n} β | α derives β in n steps; β is derived from α in n steps. To reach the RHS from the LHS, a fixed number of derivation steps (shown in the relation) is needed.
One or more steps | α =>+ β | α derives β; β is derived from α. To reach the RHS from the LHS, one or more derivation steps are needed.
Zero or more steps | α =>* β | α derives β; β is derived from α. To reach the RHS from the LHS, zero or more derivation steps are needed.

In the previous example, the relation S*S =>* (x+y)*(x-z) holds.

LeftMost derivations
In CFGs production rules can be applied in a very easy way. Since rewriting rules are applied by simply substituting an expression for a non-terminal symbol, we can perform this operation using certain policies. This is particularly true when a string contains more than one non-terminal: which one should be replaced first? Given a string, the LeftMost derivation policy replaces the first non-terminal encountered when scanning the string from left to right. In the previous example, one possible multi-step LeftMost derivation can be: S => S*S => (S)*S => (S+S)*S => (x+S)*S => (x+y)*S => (x+y)*(S) => (x+y)*(S-S) => (x+y)*(x-S) => (x+y)*(x-z).

RightMost derivations
The counterpart of LeftMost derivations are RightMost ones. The principle is almost the same. Given a string, the RightMost derivation policy replaces the last non-terminal encountered when scanning the string from left to right. The same multi-step derivation as before, S =>+ (x+y)*(x-z), can be expanded into a sequence of single-step RightMost derivations as follows: S => S*S => S*(S) => S*(S-S) => S*(S-z) => S*(x-z) => (S)*(x-z) => (S+S)*(x-z) => (S+y)*(x-z) => (x+y)*(x-z).

Derivations as a tree
When starting from a string of symbols and considering a CFG, we can evaluate all possible derivations from that string using that grammar by means of trees. How can we check whether a derivation is correct or not? The derivation tree is built using the grammar: if it is possible to start from the LHS of the derivation in the tree and follow a branch in it until the RHS of the derivation, then the derivation is valid. The example shown here considers a very simple grammar (a smaller version of the algebraic expressions grammar where only one variable, multiplication, grouping and summation are allowed). Derivation S+S =>* x+(x*S) is a valid one, as both LHS and RHS can be found in the tree and they are part of a branch.
Diag. - Derivations shown for a simple grammar: the derivations tree rooted at S, expanding through strings such as x+S and x*S down to, e.g., x+(x*S).

The tree is a good approach to visualizing the grammar, but it is hardly ever considered in real scenarios, as the tree can be very complex to draw.

Productions
How far can the derivation process be carried? If the derivations tree had a fixed height, then the language would be finite, thus a regular language. In practice, the CFGs we are interested in generate infinite languages. However, the tree will always have leaves. The point is that, once a leaf is reached, it is not possible to get more strings; this happens, in a CFG, when a string is formed by terminals only. When a leaf in the tree is reached, the string contains only terminals and that string is a word of the language. Every leaf in the tree is called a production.

Productions of a grammar
When defining a CFG, production rules must have a certain set of properties which make that grammar a good grammar, able to generate the syntax of a language. When these properties are not observed, the grammar is said to be malformed, misshapen.
• The grammar should not contain redundant rules. For example, self-rules like A -> A add nothing to the grammar, so they should not be used.
• The grammar should not contain useless rules. For a CFG this means that all non-terminals (except the start symbol) should appear as the LHS and in the RHS of at least two different production rules, respectively.
• The grammar should not contain infinite loops. For example, a rule A -> aA alone is not allowed. For a CFG this means having a terminating RHS for each recursive rule.
• The grammar should define the syntax using the lowest possible number of rules.

Generative grammars
Grammars are called generative because they generate all the words in a
language. Recalling the definitions of language and grammar, we say that a language L ⊆ Σ* (over a certain alphabet) is generated by grammar G = (V, T, P, S), written L(G), when the grammar generates all words in the language. A phrase in the language is also a word inside it. We also say that a string ω ∈ Σ* is a word in L(G) only if that string can be generated as a production of the grammar:

ω ∈ L(G) ⇔ S ⇒* ω, with ω ∈ T*.

Of course a word of the language is made of terminals only.

Equivalence between grammars
Recalling the definition of language equivalence, two grammars are equivalent only if they generate the same language: G1 ≡ G2 ⇔ L(G1) ≡ L(G2).

Converting an expression into an AST
The job of a parser is to convert the sequence of tokens coming as input from the scanner into another entity that can be easily read by the code generator to create the output code. This particular structure is the Abstract Syntax Tree (AST). A question naturally comes: what is the structure of an AST for a certain grammar? The tree can have different shapes depending on the rules of the grammar. Every production rule generates a subtree with height 1 and width equal to the number of terminals and non-terminals in the RHS. When combined together starting from the axiom of the grammar, we get the final AST. The leaves of an AST are all terminals.

AST vs derivations
Please do not mix up derivations with ASTs. The derivations tree is an abstract structure ruling word generation; an AST, on the other hand, is a physical structure, strongly related to the derivations tree of that grammar, which is returned by the parser to the semantic analyzer of the compiler. The difference purely relies on formal matters, but it is important. That being said, we define an AST as the graphic representation of a production of the grammar (which is originated by sequential applications of different derivations).

How a parser works
The parser reads tokens in sequence. At a certain point a sequence of tokens must match one or more rules. The point is: how can a parser return the corresponding AST given the input code? A parser needs the following:
• Tokens: Typically the parser requests the next token from the lexer. This operation is performed by the parser at its own will.
• Parsing stack: A stack (whose size is limited by the memory allocated for the parser while running) used by the parser to put symbols inside it. The stack is a data structure containing terminals and non-terminals.
• Stack functions: Two functions used to manipulate the stack and the elements inside it, called shift and reduce.

When the parser runs, it requests tokens and puts them onto the stack. This is called shifting, as the parser uses the shift function to put something onto the stack. However, tokens are not always shifted: if, when receiving a token, the stack contains a sequence that matches the RHS of a production rule in the grammar, the parser will remove that sequence and insert the symbol in the LHS of the same rule, before shifting the incoming terminal. This operation is called reduction.

Bottom-Up parsers
What we described above is the structure of a bottom-up parser. There are other types of parsers which do not require the use of a stack and also take advantage of other functionalities. We chose to focus on bottom-up parsers because they can be very efficient; also, many parser generators today are implemented using the bottom-up approach. The objective is using sequences of shifts/reductions in order to complete the parsing process. The parsing process ends successfully only when, at the end of the parsing, the stack contains the start symbol only.

Example of parsing a sequence
Consider the algebraic grammar from before, and consider the input sequence x+x*x. What the parser does is represented by the following table:

Step | Incoming token | Stack | Operation | Description
Step 1 | x | {} | shift | The incoming token is inserted into the stack.
Step 2 | + | {x} | shift | One rule is matched; possibility to match larger rules, no reduction performed for now.
Step 3 | x | {x,+} | reduce | No rules matching. Reducing.
Step 4 | | {S,+} | shift | No rules matched, shifting.
Step 5 | * | {S,+,x} | reduce | The whole expression doesn't match, but a part of it matches. Reducing.
Step 6 | | {S,+,S} | shift | One rule matching, possibility to match larger rules. Do not reduce.
Step 7 | x | {S,+,S,*} | shift | No matches, but one operator with higher priority. Do not reduce.
Step 8 | EOF | {S,+,S,*,x} | shift | Operators found, maybe other high priority operators can show up. Do not reduce.
Step 9 | | {S,+,S,*,x,EOF} | reduce | EOF reached. Reduce starting from high priority operators.
Step 10 | | {S,+,S,*,S,EOF} | reduce | Reduce high priority operators.
Step 11 | | {S,+,S,EOF} | reduce | Reduce operators.
Step 12 | | {S,EOF} | reduce | Reduce using the matched rule.
Step 13 | | {S} | accept | Parsing over.

The shifting is performed even when the current sequence of elements in the stack matches a rule. The strategy is to keep shifting, even when a rule is matched, until no rule is matched; at that point reduction is performed. This approach allows the parser to try reducing larger rules. This is something we will understand better later in this chapter.

Ambiguities in a grammar
When defining a grammar, ambiguities are the most threatening causes of misshaping. A grammar ambiguity lies in the fact that one word can have more than one syntax tree generating it. More formally: in a grammar, starting from the axiom, when a word of the language can be reached through more than one different branch of the derivations tree, that grammar is ambiguous. Consider the one-variable algebraic expressions grammar:

L = {S, x, +, -, *, /, (, )};
V = {S};
T = L - V;
S -> x; S -> (S);
S -> S+S; S -> S-S; S -> S*S; S -> S/S;

And let us consider the following phrase: x+x*x. The derivation that leads to this result is S =>* x+x*x. However, dissecting this derivation into one-step derivations, we find two different paths leading to the production. One is: S => S+S => x+S => x+S*S => x+x*S => x+x*x. The other one is: S => S*S => S+S*S => x+S*S => x+x*S => x+x*x. Actually, looking closer, more derivations can be found leading to the same result.

Diag. 1 - AST 1 for x+x*x, corresponding to x+(x*x). Diag. 2 - AST 2 for x+x*x, corresponding to (x+x)*x.

Please note that ambiguities are not generated by mixing RightMost and LeftMost derivations: a grammar must always be
handled with one single policy!

Why threats? Ambiguities are not something good, and an ambiguous grammar cannot be parsed correctly (meaning that the corresponding generated language cannot be parsed). An ambiguity simply puts the parser into the condition of having to choose among more than one rule to proceed, but it cannot objectively choose which one is the best. From the parser's point of view, the string x+x*x can't be parsed. The parser starts reading symbols; when x is found, it is reduced to S, then S+x is found, x is reduced to S, but what next? When the parser encounters the last character it will reduce the expression into S+S*S; from here it will not be able to go on. Should it reduce S+S into S first, or S*S? This is the consequence of an ambiguity.

Ambiguities at semantic level
Well, it looks like ambiguities are just a matter of choosing which rule to apply first; however the problem is not that simple. The threats hidden by ambiguities tend to show their real faces at the semantic level. Considering the same ambiguity as before, let us take both ASTs and see what happens when they are returned to the semantic analyzer. In this example, each node carries a semantic value, which is the result of the operation specified by the operators. So imagine that symbol x is replaced with a numeric value, say 3. The first AST will cause the following semantic values to be calculated: 3+(3*3) => 3+9 => 12. The second AST will generate the following semantic values: (3+3)*3 => 6*3 => 18. The semantic values are not the same! This proves how an ambiguity at syntax level will cause anomalies at semantic level as well.

Removing ambiguities
A grammar is a good one when no ambiguities exist. In that case there exists only one AST for each phrase in the language. Ambiguities are removed during compiler design. When ambiguities are encountered, rules must be changed accordingly. Typical ambiguities arise when handling operators. The example described before shows a very common ambiguity related to the absence of operator precedence rules. If priorities are assigned to operators, a grammar will not cause the parser to experience ambiguities like those seen so far.

Taxonomy of ambiguities
Ambiguities typically fall into two categories (for bottom-up parsers):
• Shift/reduce conflicts: The parser is left undecided whether to perform a shift or a reduction operation on a given sequence.
• Reduce/reduce conflicts: The parser is left undecided about which rule, among several matching ones, to use to reduce a given sequence.
This classification is made depending on which type of conflict is generated during parsing.

Shift/reduce conflicts
They are the most common conflicts experienced by parsers handling ambiguous grammars, as they are mostly related to precedence rules. Let us consider the following
YACC grammar rule for a very dummy C-like language, leading to a very famous shift/reduce conflict first experienced in the literature by the compiler of the programming language ALGOL 60:

stmt: expr | if_stmt;
if_stmt: IF_KWORD expr THEN_KWORD stmt
       | IF_KWORD expr THEN_KWORD stmt ELSE_KWORD stmt;
expr: x | y | z | w;

Let us consider the input string if x then y else z. When the stack contains the first 4 symbols and the next token is the else symbol, the parser notices that one rule matches the input, so a reduce would be fine; however, if it waited for the next tokens, the second rule might be matched as well (and that rule is a more complete one, as it extends the first). This is an example of a shift/reduce conflict; this particular case is known as the dangling else conflict. Although shift/reduce conflicts can threaten a grammar, they do not represent a serious issue. In fact they are usually solved automatically by parsers by always giving priority to shift operations. This policy is meant to allow larger rules to be matched when they share an initial subset of elements in their RHS. However, a conflict is a conflict: when a shift/reduce occurrence is experienced, the grammar needs to be adjusted a bit in order to prevent undesired behaviors. Let us consider the dangling else problem again. Consider the following string: if x then if y then z else w and the dummy grammar shown before; the parser will give priority to shifting, so all tokens will be shifted first and then reduced. The result is the following semantics: if(x){if(y){z}else{w}}; the else is attached to the innermost condition. But if the parser had given priority to reductions, the resulting semantics would have been: if(x){if(y){z}}else{w}; the else is attached to the outermost condition. What if the language designer had preferred the latter semantics? The problem is that the grammar is ambiguous. These kinds of conflicts are solved by refining the grammar and specifying which if the else clause is to be attached to.

Reduce/reduce conflicts
They are not so common; however, if such a conflict is experienced by the parser, it means that the underlying grammar is not only ambiguous, but really malformed. A reduce/reduce conflict is a symptom of having something wrong in the rules. In a few words, the parser encounters a situation in which a string is matched by more than one rule. This means that different rules target the same sequence of symbols, which is a contradiction. In such cases, the grammar should be reviewed and edited to remove the conflict.

Generally speaking
These types of conflicts are experienced by bottom-up parsers only. The key concept, valid for all parsers, is that ambiguities always have the same origin: more than one AST describing the same word/phrase in the language. This condition can generate different types of conflicts in the parser, and it is something the language designer must handle at grammar level.
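To show what such a refinement can look like, here is one classic way to rewrite the dummy grammar above so that every else binds to the innermost if. The split into matched_stmt and open_stmt (and those names) is an assumption added for illustration, written in the same notation used above; it is not part of the original grammar.

stmt: matched_stmt | open_stmt;
matched_stmt: IF_KWORD expr THEN_KWORD matched_stmt ELSE_KWORD matched_stmt
            | expr;
open_stmt: IF_KWORD expr THEN_KWORD stmt
         | IF_KWORD expr THEN_KWORD matched_stmt ELSE_KWORD open_stmt;
expr: x | y | z | w;

With this shape the parser no longer has to guess: an else can only continue a matched_stmt, so the conflict disappears and the innermost binding is the only possible parse. If the outermost binding is the desired semantics, the grammar has to be refined the other way around.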
Different parsing methodologies
Parsers can process input using two main approaches:
• Top-down parsing: Parsing is conducted starting from the root of the derivations subtrees and proceeding towards the leaves. Rules are picked and tested; if a rule fails to match a string, backtracking ensures recovery to the latest good position in the tree. The direct consequence of this approach is that ASTs are built starting from the root down to the leaves.
• Bottom-up parsing: Parsing is performed starting from the leaves of the derivation subtree up to the root. The concepts of shifting and reduction are needed, as well as a stack to store sequences of symbols possibly matching a rule. It is possible to develop the parser using a state machine. The direct consequence of this approach is that ASTs are built starting from the leaves up to the root.

So far we put much focus on bottom-up parsers because they are very efficient and common today. However we will also see, later, how top-down parsers are structured; a deeper analysis will be conducted on bottom-up parsers anyway.

Classification today
Parsing algorithms are today divided into some categories depending on the following parameters:
• Derivation policy: Parsers can be LeftMost or RightMost. Top-down parsers reproduce LeftMost derivations, while bottom-up parsers reconstruct RightMost derivations (in reverse).
• Parsing direction: What is the direction of the parsing process? This parameter depends on how the sequence of tokens is handled: parsers reading the input sequence from left to right are LeftToRight parsers, while parsers reading it from right to left are RightToLeft parsers. Practically all parsers considered here read the input from left to right.
• Lookahead characters: Parsers can take advantage of lookahead characters. It means that, before actions take place on the sequence of tokens/symbols loaded so far, the parser can request some tokens to peek at what comes next; however these lookahead symbols will not be part of the current actions. The more lookahead symbols are used, the more predictive the parser can be; however this number must be small, otherwise the parser will not be efficient.

The bottom-up parsing algorithm described before, for example, is a parsing algorithm that took advantage of one lookahead symbol. Based on the parameters introduced so far, parsing algorithms are today classified using the following scheme: <Direction><Derivation><Lookahead>, thus: (L|R)(L|R)[k]. Very common algorithms are:
• LL parsers: They are LeftToRight LeftMost top-down parsers using one lookahead character.
They can also be referred to as LL(1).
• LR parsers: They are LeftToRight RightMost bottom-up parsers using one lookahead character. They can also be referred to as LR(1).
• LA parsers: Parsers using more lookahead symbols. Typically the type of parser is specified as well, so we can have LALL (LA + LL) and LALR (LA + LR) parsers.

If the number of lookahead characters is to be specified, it is possible to put it at the end: LL(k) (LL parser with k lookahead characters) or LR(k) (LR parser with k lookahead characters). Today's algorithms tend to prefer LR approaches. The bottom-up parsing algorithm described before is, for example, an LR(1). Some subclasses of context-free grammars can be parsed using a mixture of top-down and bottom-up parsing.

Noticeable quantities in a CFG
When working with a CFG, we can consider some important quantities that we are going to use later when analyzing top-down and bottom-up parsers in detail. The common hypothesis is always the same: a CFG G = (V, T, P, S) is considered.

First-set
Given a CFG, the first-set of a string of symbols α ∈ (V ∪ T)*, written First(α) ⊆ T ∪ {ε}, is the set of all terminals in the grammar appearing as the first character of the strings generated through derivations from that symbol:

First(α) = {a ∈ T : α ⇒* aβ} ∪ Empty(α).

The empty-set of a symbol is the set containing the empty string if the empty string can be reached from that symbol (and is included in the grammar), or the empty set otherwise:

Empty(α) = {ε} if α ⇒* ε; ∅ otherwise.

Handy rules
So, let α ∈ (V ∪ T)* be a string of symbols in the grammar:
• If the string starts with a terminal a ∈ T, then that terminal is part of the first-set: α = aβ ⇒ a ∈ First(α).
• If the string starts with a non-terminal A ∈ V, i.e. α = Aβ, then locate all productions having A as LHS:
  • If there exists a production having A as LHS, then the first-set of its RHS is a subset of
the first-set of the original string: A ⇒* γ ⇒ First(γ) ⊆ First(α).
  • If there exists a production having A as LHS and the empty string as RHS (so that A ⇒* ε), then the first-set of β is also a subset of the first-set of the original string: First(β) ⊆ First(α).
• Do not forget to evaluate the empty-set and make the union with the set found so far.

The first-set construction is recursive; however it can be simply handled by considering the derivations tree generated by the grammar, locating symbol α and following that subtree: all nodes in the subtree starting with a terminal are to be marked, and that terminal is to be inserted in the first-set. Always remember to insert the empty string in the first-set of the symbol if the grammar allows a derivation from that symbol to the empty string.

Example I
Consider the following grammar:

T = {a,b,c,d,e};
V = {S,B,C};
S -> a S e; S -> B;
B -> b B e; B -> C;
C -> c C e; C -> d;

We can evaluate first-sets for some words/phrases in the grammar:

First(aSe) = {a};
First(C) = First(cCe) ∪ First(d) = {c,d};
First(B) = First(bBe) ∪ First(C) = {b} ∪ {c,d} = {b,c,d};
First(cCe) = {c};

Example II
Consider now the following grammar:

T = {a,b,c,d,e,ε};
V = {S,B,C,D};
S -> B; S -> C e; S -> a;
B -> b C;
C -> D c;
D -> d; D -> ε;

We can evaluate first-sets for some words/phrases in the grammar:

First(B) = {b};
First(D) = {d,ε};
First(Dc) = (First(D) - {ε}) ∪ First(c) = {d} ∪ {c} = {d,c};
First(Ce) = First(C) = First(Dc) = {d,c};

The empty symbol is to be considered only for those grammars using it.
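The handy rules above translate almost directly into a fixed-point computation. The following is a minimal sketch computing First for every non-terminal of Example II; the grammar encoding (strings of one-character symbols, uppercase = non-terminal, '#' standing for the empty string ε) is an assumption made just for this sketch.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    /* Grammar of Example II. Each production maps a non-terminal (uppercase char)
       to a string of symbols; "" encodes the empty string ε. */
    std::vector<std::pair<char, std::string>> rules = {
        {'S',"B"}, {'S',"Ce"}, {'S',"a"}, {'B',"bC"}, {'C',"Dc"}, {'D',"d"}, {'D',""}
    };
    auto isNonTerminal = [](char c) { return c >= 'A' && c <= 'Z'; };

    /* first[X] holds the first-set of non-terminal X; '#' stands for ε. */
    std::map<char, std::set<char>> first;

    bool changed = true;
    while (changed) {                 /* iterate until no set grows anymore */
        changed = false;
        for (const auto& rule : rules) {
            std::set<char>& target = first[rule.first];
            size_t before = target.size();
            bool allNullable = true;
            for (char sym : rule.second) {
                if (!isNonTerminal(sym)) {            /* terminal: add it and stop */
                    target.insert(sym);
                    allNullable = false;
                    break;
                }
                for (char t : first[sym])             /* non-terminal: copy its first-set */
                    if (t != '#') target.insert(t);
                if (first[sym].count('#') == 0) {     /* not nullable: stop here */
                    allNullable = false;
                    break;
                }
            }
            if (allNullable) target.insert('#');      /* whole RHS can vanish: add ε */
            if (target.size() != before) changed = true;
        }
    }

    for (const auto& entry : first) {
        std::cout << "First(" << entry.first << ") = {";
        for (char t : entry.second) std::cout << " " << t;
        std::cout << " }\n";
    }
    return 0;
}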
Follow-set
Given a CFG, the follow-set of a non-terminal B ∈ V is the set containing all terminal symbols that can appear immediately after B:

Follow(B) = {a ∈ T : S ⇒* γ1 B a γ2}.

We use the symbol π to indicate the end of a string.

Handy rules
So, let B ∈ V be a non-terminal in the grammar:
• It is a typical convention to put the end-of-string token π in its follow-set if B is the start symbol of the grammar.
• Consider all production rules having B in their RHS:
  • Every derivation in the form A ⇒* αBβ causes the follow-set of B to contain the first-set of the right-most expression (empty string excluded): A ⇒* αBβ ⇒ Follow(B) ⊇ First(β) - {ε}.
  • For every derivation in the form A ⇒* αBβ where ε ∈ First(β), we must also put the follow-set of A into the follow-set of B: A ⇒* αBβ ∧ ε ∈ First(β) ⇒ Follow(B) ⊇ Follow(A).
  • Every derivation in the form A ⇒* αB (a non-terminal as the right-most symbol of the RHS) causes the follow-set of A to be contained in the follow-set of B: A ⇒* αB ⇒ Follow(B) ⊇ Follow(A).
• Do not forget to make the union of all the contributions found so far.

Here the approach is reversed compared with the one considered for the first-set. To get the first-set of a symbol, we focused on all rules having that symbol as LHS. Here, to compute the follow-set of a non-terminal, we must focus on the rules where it appears as part of the RHS.

Handling the empty string
Please note that the empty string can never be part of any follow-set.

Example I
Consider the following grammar:

T = {+,*,(,),id,ε};
V = {S,T,X,Y,F};
S -> T X;
X -> + T X; X -> ε;
T -> F Y;
Y -> * F Y; Y -> ε;
F -> ( S ); F -> id;

We can evaluate some follow-sets:

Follow(S) = {π} ∪ First()) = {π,)};
Follow(X) = Follow(S) = {π,)};
Follow(T) = (First(X) - {ε}) ∪ Follow(X) ∪ Follow(S) = {+} ∪ {π,)} ∪ {π,)} = {+,π,)};
Follow(Y) = Follow(T) = {+,π,)};
Follow(F) = (First(Y) - {ε}) ∪ Follow(Y) = {*} ∪ {+,π,)} = {*,+,π,)};

Please note that the end-of-string token π is never formally part of a grammar. This token is just used for analysis purposes.
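These rules can also be turned into a fixed-point computation, exactly like the first-sets. A minimal sketch for Example I follows; it assumes the first-sets have already been computed (they are hard-coded below), uses '#' for ε, '$' for the end-of-string token π, 'i' for id, and encodes every grammar symbol as a single character; all of these are assumptions made just for the example.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    /* Grammar of Example I, one character per symbol. */
    std::vector<std::pair<char, std::string>> rules = {
        {'S',"TX"}, {'X',"+TX"}, {'X',""}, {'T',"FY"}, {'Y',"*FY"}, {'Y',""}, {'F',"(S)"}, {'F',"i"}
    };
    auto isNonTerminal = [](char c) { return c=='S' || c=='X' || c=='T' || c=='Y' || c=='F'; };

    /* First-sets, assumed already computed (see the previous sketch). */
    std::map<char, std::set<char>> first = {
        {'S',{'(','i'}}, {'T',{'(','i'}}, {'F',{'(','i'}}, {'X',{'+','#'}}, {'Y',{'*','#'}}
    };
    auto firstOf = [&](char c) { return isNonTerminal(c) ? first[c] : std::set<char>{c}; };

    std::map<char, std::set<char>> follow;
    follow['S'].insert('$');                      /* rule 1: π follows the start symbol */

    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& rule : rules) {
            const std::string& rhs = rule.second;
            for (size_t i = 0; i < rhs.size(); ++i) {
                if (!isNonTerminal(rhs[i])) continue;        /* follow-sets only for non-terminals */
                std::set<char>& target = follow[rhs[i]];
                size_t before = target.size();
                bool betaNullable = true;                    /* does everything after rhs[i] vanish? */
                for (size_t j = i + 1; j < rhs.size(); ++j) {
                    std::set<char> f = firstOf(rhs[j]);
                    for (char t : f) if (t != '#') target.insert(t);  /* Follow(B) ⊇ First(β) - {ε} */
                    if (f.count('#') == 0) { betaNullable = false; break; }
                }
                if (betaNullable) {                          /* Follow(B) ⊇ Follow(A) */
                    std::set<char> fa = follow[rule.first];
                    for (char t : fa) target.insert(t);
                }
                if (target.size() != before) changed = true;
            }
        }
    }

    for (const auto& entry : follow) {
        std::cout << "Follow(" << entry.first << ") = {";
        for (char t : entry.second) std::cout << " " << t;
        std::cout << " }\n";
    }
    return 0;
}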
Top-down parsers
“Computers make me totally blank out.” (Dalai Lama)
http://www.brainyquote.com/quotes/quotes/d/dalailama446748.html
Overview
Top-down parsers proceed by creating the AST starting from the root down to the leaves. The class of parsing algorithms used today to implement such parsers is LL(k): LeftToRight LeftMost parsers with k lookahead symbols. LL parsing is not an arbitrary choice: since the AST is built from the root to the leaves, the approach is top-down and the parser necessarily reads the input from left to right while reproducing leftmost derivations.

Taxonomy of top-down parsers
When considering top-down parsers, there are several algorithms to implement them. The Chomsky-Schützenberger hierarchy tells us how to implement such parsers (using pushdown automata); however, we can find other approaches rather than going for the most generic one. "Generic" is probably the best word: in fact the Chomsky-Schützenberger hierarchy considers pushdown automata as the computational structure able to parse a CFG, and so it is! But if our grammars have some restrictions, more efficient and fine-tuned algorithms can be considered. Why? Because pushdown automata can be difficult to implement and might require time. Today, top-down parsers can be divided into the following groups:

Algorithm | Efficiency | Description
Recursive descent | They are not efficient; backtracking features are needed. The grammar also needs to have a particular form. | The algorithm considers the input string and tries to descend the derivations tree following possible branches. A lot of attempts might be required depending on the grammar size; backtracking is necessary upon errors, in order to recover and try a different derivation path.
Predictive | They are efficient. Here as well, grammars need to have a particular form. | The algorithm proceeds on derivations having a particular structure, as not all grammars can be handled by this methodology.
Recursive predictive descent | They are efficient. Not all grammars can be handled by this approach either. | It is a special form of recursive descent parsing, but no backtracking is needed, thus making the process faster and more efficient.

Left recursion
Recursive descent algorithms (with backtracking or not) have a problem: they operate in a way for which left-recursion can make them go into an infinite loop. What is left-recursion, by the way?

[Def] Left-recursive grammars: A grammar is said to be left-recursive when it contains
at least one left-recursive rule.

[Def] Left-recursive rule: Given a CFG, a left-recursive rule is a rule in the form A ⇒+ Aα: the non-terminal in the LHS appears (possibly after some derivation steps) as the left-most symbol of the RHS.

[Def] Immediate left-recursive rule: A left-recursive rule is said to be immediate when the left-recursion shows at one step's distance: A ⇒1 Aα (i.e. the rule A → Aα appears in the grammar), thus the left-recursion is evident.

[Def] Indirect left-recursive rule: Left recursion might not show directly in the rules of a CFG. New rules can be created when processing derivations; if, starting from a non-terminal, the grammar allows a derivation A ⇒* Aα to occur, then that grammar is affected by indirect left-recursion.

Left-recursion is definitely a problem for top-down parsers. When designing a language and willing to parse it with a top-down parser, the grammar must be designed to avoid left-recursion. If the grammar is affected by left-recursion, methodologies exist to remove it and fix the grammar.

Handling immediate left recursion
Recognizing immediate left-recursions is very simple as they show themselves in the rules of the grammar. When a non-terminal appears both as the LHS and as the left-most symbol of the RHS of a rule, then that rule is left-recursive.

Simple case
How to fix immediate left-recursion? Very easy. Consider the left-recursive rules:

A -> A α;
A -> β; /* where β != A γ */

Let us replace them with:

A -> β B; /* non-terminal B added to the grammar */
B -> α B;
B -> ε; /* from left-recursion to right-recursion */

As it is possible to see, the recursion is moved from left to right and a new equivalent grammar is created. The procedure does not remove the recursion, as such a process is impossible: a recursive rule will remain recursive. However, right recursion is fine, as it can be handled by top-down parsers. Also please note how a new non-terminal is inserted in the grammar, together with the empty string terminal (if it was not part of the grammar originally).

General case
In the most general case, left-recursive rules in the form:

A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn    (βi ≠ Aγ, γ ∈ (V ∪ T)*)
can be changed into the rules:

A → β1 B | β2 B | ... | βn B
B → α1 B | α2 B | ... | αm B | ε    (B ∈ V, a new non-terminal)

getting an equivalent grammar.

Example
Let us consider the following grammar:

T = {*,+,(,),id};
V = {S,T,F};
S -> S + T; S -> T;
T -> T * F; T -> F;
F -> id; F -> ( S );

And try to remove left-recursions as follows:

S -> T S'; S' -> + T S'; S' -> ε;
T -> F T'; T' -> * F T'; T' -> ε;
F -> id; F -> ( S );

The new grammar is not affected by left-recursion anymore.

Handling indirect left recursion
A grammar can be left-recursive in an indirect way. It means that left-recursion is not shown in the rules themselves, but it is hidden among derivations. Let us consider the following grammar:

T = {a,b,c,d}; V = {S,A};
S -> A a; S -> b;
A -> S c; A -> d;

No rule shows immediate left recursion, but let us write down a few derivations like S -> A a -> S c a and A -> S c -> A a c. We have the derivations S ⇒+ S c a and A ⇒+ A a c, which clearly are left-recursive. How to remove indirect left-recursion?

Algorithm for indirect left-recursion removal
Let us consider a grammar. To remove all indirect left-recursions, the following procedure can be applied:
1. Sort all non-terminals into a sequence A1, A2, ..., An.
2. For all non-terminals from A1 to An, consider the generic non-terminal Ai.
3. For all non-terminals from A1 to Ai-1, consider the generic non-terminal Aj.
4. Replace every rule in the form Ai -> Aj γ with the rules Ai -> α1 γ | α2 γ | ... | αk γ, where Aj -> α1 | α2 | ... | αk.
5. After the substitution, if some rules having Ai as LHS are affected by immediate left-recursion, remove it. Go back to point 2 for the next non-terminal until the end of the cycle.

The algorithm can be formalized as follows:
procedure rem_ilrec(V,T,P) /* grammar as input */
    precond: V = {A1,A2,...,An};
    for i = 1..n do
        for j = 1..i-1 do
            P = P - {Ai -> Aj γ};
            P = P ∪ {Ai -> α γ} ∀α : Aj -> α;
        end
        rem_lrec(); /* handles immediate left-recursion on Ai */
    end
end

Example
Let us consider the following grammar affected by indirect left-recursion:

T = {a,b,c,d,f}; V = {S,A};
S -> A a; S -> b;
A -> A c; A -> S d; A -> f;

1. We first sort the non-terminals: S, A.
2. We consider non-terminal S. It is the first in the sequence, with no one preceding it. Nothing to do aside from checking for immediate left-recursion: there is none.
3. We now consider non-terminal A and locate rules having A as LHS and one non-terminal preceding A in the sorting (S only) as the left-most symbol of the RHS. We have only one rule matching: A -> S d. We replace this rule with the rules A -> b d and A -> A a d. So the final rules for A will be: A -> A c, A -> A a d, A -> b d and A -> f. Some of them are affected by immediate left-recursion, so we fix them into: A -> b d B, A -> f B, B -> c B, B -> a d B and B -> ε.

Left factorization
If we want to use a top-down parser to handle a grammar, we must make that grammar non-left-recursive. However, if we also wish to use a predictive algorithm (the most efficient one), we also need to be sure that the grammar is left-factorized as well. Left-factorization is deeply connected to the concept of predictive grammar. A grammar must be predictive to be handled by a predictive parsing algorithm.

[Theo] Predictive grammars: A top-down predictive parsing algorithm cannot handle non-predictive grammars.

[Theo] Predictive and left-factorized grammars: A left-factorized grammar is also a predictive grammar.

What is left-factorization? Consider the following production rule for a dummy grammar (YACC notation):
    pag 84 Andrea Tino- 2013 instr -> IF_KWORD PAR_O expr PAR_C instr ELSE_KWORD instr | IF_KWORD PAR_O expr PAR_C instr; And let us put ourselves in the parser's shoes. A predictive grammar hasn't been formally defined yet, but the concept is quite clear. The problem, in the rule above, is that the two alternatives start with the same sequence of symbols and differentiate only later. In particular they both start with the if keyword. Simple case Let us consider a grammar G(V,T,P,S) and a production rule A ⇒ αβ1 | αβ2 where α ∈ (V ∪ T)* − {ε}, β1, β2 ∈ (V ∪ T)* and β1 ≠ β2. When such a rule exists, the grammar is not left-factorized. However, to left-factorize the rule we can transform it into these rules: A ⇒ αB and B ⇒ β1 | β2, where B is a non-terminal inserted into the grammar as a new symbol. General case Let us consider a CFG G(V,T,P,S) and rules A ⇒ αβ1 | αβ2 | ... | αβm | γ1 | γ2 | ... | γn where α ∈ (V ∪ T)* − {ε}, β1, ..., βm ∈ (V ∪ T)* and βi ≠ βj for all i, j = 1...m with i ≠ j. To get a left-factorized grammar, all the above rules are transformed into the following: A ⇒ αB | γ1 | γ2 | ... | γn and B ⇒ β1 | β2 | ... | βm, with B ∈ V a new non-terminal. We obtain a left-factorized equivalent grammar with new non-terminal symbols. Example Let us consider the following grammar: T = {a,b,c,d,e,f,g}; V = {S,A}; S -> a b A; S -> a A; S -> c d g; S -> c d e A; S -> c d f A; We can run the factorization process more than once, as the common left-most parts involve more than one sequence of symbols. 1. The first set of rules we consider is: S -> a b A | a A. We apply the factorization process and get: S -> a B, B -> b A | A and S -> c d g | c d e A | c d f A. 2. The grammar has changed but there are still rules to left-factorize. In particular we now concentrate on the rules: S -> c d g | c d e A | c d f A. We can repeat the process again, getting: S -> a B, B -> b A | A, S -> c d C and C -> g | e A | f A. The final grammar will be: T = {a,b,c,d,e,f,g}; V = {S,A,B,C}; S -> a B; S -> c d C; B -> b A; B -> A; C -> g; C -> e A; C -> f A;
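To make the transformation mechanical, here is a minimal C++ sketch of the simple case, under a few assumptions: grammar symbols are plain strings, only two alternatives of the same non-terminal are factored at a time, and the fresh non-terminal is named by appending a prime. All names (leftFactor, join, Alt) are illustrative and not part of any tool mentioned in this text.
#include <iostream>
#include <string>
#include <vector>
/* A rule alternative is a sequence of grammar symbols (terminals or non-terminals). */
using Alt = std::vector<std::string>;
/* Join the symbols in [from, to) with spaces; an empty range stands for epsilon. */
static std::string join(const Alt& a, size_t from, size_t to) {
    std::string s;
    for (size_t i = from; i < to; ++i) { if (!s.empty()) s += " "; s += a[i]; }
    return s.empty() ? "epsilon" : s;
}
/* Left-factor two alternatives of non-terminal A sharing a common prefix:
   A -> alpha beta1 | alpha beta2  becomes  A -> alpha B; B -> beta1; B -> beta2;
   where B is a fresh non-terminal (here simply A'). */
std::vector<std::string> leftFactor(const std::string& A, const Alt& a1, const Alt& a2) {
    size_t k = 0; /* length of the common prefix alpha */
    while (k < a1.size() && k < a2.size() && a1[k] == a2[k]) ++k;
    if (k == 0) /* nothing in common: the two alternatives are already factored */
        return { A + " -> " + join(a1, 0, a1.size()), A + " -> " + join(a2, 0, a2.size()) };
    const std::string B = A + "'";
    return { A + " -> " + join(a1, 0, k) + " " + B,
             B + " -> " + join(a1, k, a1.size()),
             B + " -> " + join(a2, k, a2.size()) };
}
int main() {
    /* S -> a b A | a A, as in step 1 of the example above */
    for (const auto& r : leftFactor("S", {"a", "b", "A"}, {"a", "A"}))
        std::cout << r << ";" << std::endl;
    /* prints: S -> a S'; S' -> b A; S' -> A; */
}
Running the sketch on the first pair of rules of the example reproduces the factoring performed in step 1, with S' playing the role of B.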
    pag 85 Andrea Tino- 2013 The final grammar of the example is predictive (left-factorized). Also note how new non-terminals are added to the original grammar: this approach makes the grammar bigger. Non-left-recursive grammars A result is obvious: [Lem] Predictive and non-left-recursive grammars: No predictive grammar is affected by left-recursion. It is very simple to prove, as the left-factorization process includes left-recursion removal among its steps. LL(1) grammars LL parsers are today's means to handle (particular types of) CFGs. As we have seen, more than one algorithm can be used to develop a LL parser. Two are the most important: • Predictive: they take advantage of parsing tables. • Predictive recursive descent: they are a subclass of recursive descent algorithms, but no backtracking is used. Like recursive descent, predictive recursive descent takes advantage of recursive functions. Simple recursive descent algorithms involve backtracking and recursive functions and are not used as real implementations. That's why we will not cover them here. [Def] LL grammars: LL grammars are a particular subset of CFG grammars that can be parsed by LL(k) parsing algorithms. LL grammars and predictiveness To handle a grammar using a generic top-down parser, that grammar must be non-left-recursive. When we want to use predictive approaches (more advanced and efficient) we need predictive grammars. Detailing LL(1) grammars As specified before, LL(1) parsers are LL parsers using 1 lookahead token only. [Def] LL(1) grammars: LL(1) grammars are grammars that can be parsed by LL(1) parsing algorithms.
    pag 86 Andrea Tino- 2013 These types of grammars can be easily parsed. Having one lookahead symbol only can sometimes make the whole process more efficient than using more lookahead tokens. Also, LL(1) parsers are very easy to implement. The concept of predict When handling LL(1) grammars, there is an important quantity that can be really helpful to solve the decision problem: “Is a certain grammar a LL(1) grammar?“. This entity is called the predict of a production rule. [Def] Predict-set: Let G(V,T,P,S) be a CFG and p ∈ P a production rule in the form p : A ⇒ α. The predict-set of p, written Predict(p), is the set containing all lookahead tokens (terminals), usable by a LL(1) parser, indicating that production rule p is to be applied. Calculating the predict-set The predict-set for a generic rule A ⇒ α can easily be evaluated using the following rule: Predict(A ⇒ α) = First(α) if α cannot derive ε; Predict(A ⇒ α) = First(α) ∪ Follow(A) if α ⇒∗ ε. Example Let us consider the following grammar: T = {a,b,c,ε}; V = {S,A,B}; S -> A B c; A -> a; A -> ε; B -> b; B -> ε; Let us calculate some predict-sets: Predict(S->ABc) = First(ABc) = {a,b,c}; Predict(A->a) = First(a) = {a}; Predict(A->ε) = First(ε) ∪ Follow(A) = {b,c,ε}; Predict(B->b) = First(b) = {b}; Predict(B->ε) = First(ε) ∪ Follow(B) = {c,ε}; Conditions for a grammar to be LL(1) Predicts are really helpful when some facts about LL(1) grammars are to be considered. A particularly handy result is the following: [Theo] Predict-set: A CFG is LL(1) if and only if, for every non-terminal, the predict-sets of all production rules having that non-terminal as LHS are pairwise disjoint. The theorem is a necessary and sufficient condition, so a very useful tool.
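As a quick illustration of how this check could be automated, here is a minimal C++ sketch, assuming the predict-sets have already been computed (for instance with the formula above). Rule labels, token spellings and the isLL1 name are assumptions for illustration only.
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>
/* One entry per production rule: the rule's LHS and its predict-set.
   Tokens are plain strings; "eps" stands for the empty string and could be discarded. */
using RulePredict = std::pair<std::string, std::set<std::string>>;
/* The theorem above: a CFG is LL(1) iff, for every non-terminal, the predict-sets
   of all of its alternatives are pairwise disjoint. */
bool isLL1(const std::vector<RulePredict>& rules) {
    std::map<std::string, std::set<std::string>> claimed; /* tokens already used per LHS */
    for (const auto& rule : rules)
        for (const auto& tok : rule.second)
            if (!claimed[rule.first].insert(tok).second) /* token shared by two alternatives */
                return false;                            /* predict-sets overlap: not LL(1) */
    return true;
}
int main() {
    /* The grammar of the example above: S -> ABc; A -> a | eps; B -> b | eps */
    std::vector<RulePredict> g = {
        {"S", {"a", "b", "c"}},     /* Predict(S->ABc) */
        {"A", {"a"}},               /* Predict(A->a)   */
        {"A", {"b", "c", "eps"}},   /* Predict(A->eps) */
        {"B", {"b"}},               /* Predict(B->b)   */
        {"B", {"c", "eps"}},        /* Predict(B->eps) */
    };
    return isLL1(g) ? 0 : 1;        /* 0: the grammar is LL(1) */
}
The check mirrors the theorem directly: the first token claimed by two different alternatives of the same non-terminal makes the answer negative.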
    pag 87 Andrea Tino- 2013 Example Let us consider the following grammar: T = {a,b,c,d,e}; V = {S,A,B}; S -> a S e; S -> A; A -> b A e; A -> B; B -> c B e; B -> d; Let us check whether this grammar is LL(1) or not. Predict(S->aSe) = First(aSe) = {a}; Predict(S->A) = First(A) = {b,c,d}; Predict(A->bAe) = First(bAe) = {b}; Predict(A->B) = First(B) = {c,d}; Predict(B->cBe) = First(cBe) = {c}; Predict(B->d) = First(d) = {d}; The grammar is LL(1) as non-terminal S has disjoint sets, and the same goes for A and B. Parsing LL(1) grammars using predictive algorithms Predictive LL(1) parsers make use of a few elements to process a grammar. Remember that a parser has one objective only: answering the question “Is the string a valid word in the grammar?“. If the answer is positive, the parser also needs to build the AST representing that word of the language. • Input buffer: the buffer containing all tokens returned by the lexer and requested by the parser one by one. • Stack: a stack containing symbols of the grammar. A special symbol is always considered: the last-token π, which is returned by the lexer when the string is over (no more tokens to return). • Parsing table: a two-dimensional array implementing the mapping M : V × T → P. The table returns a production rule when a non-terminal and a terminal are provided as input. • Output buffer: the buffer containing the sequence of production rules to apply in order to generate the derivation sequence from the start-symbol down to the input string. Parser actions A predictive LL(1) parser works with specific actions depending on the lookahead token and the stack state. Initialization Let W be the stack; at the beginning the stack contains the axiom S and the
    pag 88 Andrea Tino- 2013 last-token π: W = {S, π}. The top element of the stack is the left-most one. The symbol on top of the stack will be denoted as α. The stack can contain terminals and non-terminals. We will also use the symbol a to denote the current working symbol (thus, the lookahead symbol). The first action is fetching the first token of the input string and considering it as the current symbol. Parsing algorithm Given the current symbol a and the stack top symbol α, the parser can undertake the following actions: 1. If α = π and a = π: exit(0); the parser terminates successfully. 2. If α = a (the top symbol is a terminal equal to the current symbol): stack.pop(); a = next_token(); pop the top symbol from the stack and make a the next token from the input buffer. 3. If α ∊ V and a ∊ T: if (M(α,a) != null) { for ∀β ∊ M(α,a).RHS.rev() do stack.push(β); out.insert(M(α,a)); } else { error(“no rule“); exit(1); } If a valid production is returned, each symbol appearing in its RHS is pushed into the stack from the right-most to the left-most symbol; after that the production is inserted into the output buffer. If no rule is returned, syntax error! The parsing table The parsing table is a structure containing an entry for each pair of a non-terminal and a terminal. Every entry of the table tells the parser how to behave when that non-terminal is at the top of the stack and the current symbol (a terminal) is considered. The table is built from the grammar. Example Let us consider the following grammar: T = {a,b}; V = {S,A}; S -> a A a; A -> b A; A -> ε; First note that the grammar is non-left-recursive and left-factorized. Consider now the following parsing table (we will see how to build it later): M(S,a) = S -> a A a; M(A,a) = A -> ε; M(A,b) = A -> b A; all the other entries (M(S,b), M(S,π), M(A,π)) are null.
    pag 89 Andrea Tino- 2013 We consider the input string abba. Now let us parse it using the predictive approach described so far. Each step below reports stack, input buffer, lookahead (LA) symbol and output buffer; remember that stacks are visualized with the top element as the left-most symbol of the list. 1. Stack {S,π}; input {a,b,b,a,π}; LA null; output { }. Initialization. 2. Stack {a,A,a,π}; input {b,b,a,π}; LA a; output { S->aAa }. The first token of the string is fetched and considered as the lookahead symbol. An entry is found in the table and the rule is applied. 3. Stack {A,a,π}; input {b,a,π}; LA b; output { S->aAa }. The stack's top-symbol is equal to the current symbol: pop and fetch the next token. 4. Stack {b,A,a,π}; input {b,a,π}; LA b; output { S->aAa, A->bA }. An entry is found in the table, non-terminal expansion in the stack. 5. Stack {A,a,π}; input {a,π}; LA b; output { S->aAa, A->bA }. The stack's top-symbol is equal to the current symbol: pop and fetch the next token. 6. Stack {b,A,a,π}; input {a,π}; LA b; output { S->aAa, A->bA, A->bA }. An entry is found in the table, non-terminal expansion in the stack. 7. Stack {A,a,π}; input {π}; LA a; output { S->aAa, A->bA, A->bA }. The stack's top-symbol is equal to the current symbol: pop and fetch the next token. 8. Stack {a,π}; input {π}; LA a; output { S->aAa, A->bA, A->bA, A->ε }. An entry is found in the table, non-terminal expansion in the stack. The empty string is treated as a null symbol, so the stack actually shrinks. 9. Stack {π}; input {}; LA π; output { S->aAa, A->bA, A->bA, A->ε }. The stack's top-symbol is equal to the current symbol. The algorithm then terminates successfully. How to build the parsing table To build the parsing table an algorithm is used. The process is actually very simple and is based on the notions of first-set of a symbol and follow-set of a non-terminal. So let G(V,T,P,S) be a predictive CFG (left-factorized), and let A ⇒ α be a production rule. Then the parsing table M : V × T → P is built using the following rules: 1. Add production rule A ⇒ α to M(A,a) for every terminal a ∈ T which belongs to the first-set of the rule's RHS: M(A,a) ⊇ {A ⇒ α}, ∀a ∈ First(α).
    pag 90 Andrea Tino- 2013 2. If the empty-string symbol belongs to the first-set of the RHS of rule A ⇒ α, then add that production to M(A,a) for every terminal which belongs to the follow-set of the rule's LHS: ε ∈ First(α) ⇒ M(A,a) ⊇ {A ⇒ α}, ∀a ∈ Follow(A). 3. If the empty-string symbol belongs to the first-set of the RHS of rule A ⇒ α and the last-token also belongs to the follow-set of the LHS of rule A ⇒ α, then add that production to M(A,π): ε ∈ First(α) ∧ π ∈ Follow(A) ⇒ M(A,π) ⊇ {A ⇒ α}. The algorithm can be concisely described as follows: procedure build_ptab(V,T,P) /* grammar as input */ set M = {}; for ∀(A -> α) ∊ P do for ∀a ∊ First(α) do M = M ∪ {(A,a,A->α)}; end if ε ∊ First(α) then for ∀a ∊ Follow(A) do M = M ∪ {(A,a,A->α)}; end if π ∊ Follow(A) then M = M ∪ {(A,π,A->α)}; end end end end Please note that the empty-string symbol does not figure as an entry of the table and is to be discarded when encountered. Example Consider the grammar of the previous example. Let us try to build the parsing table. 1. We consider rule S->aAa. We have that First(aAa)={a}. So we have entry M(S,a)={S->aAa}. The empty string is not part of the set, so this rule is done. 2. We consider rule A->bA. We have that First(bA)={b}. So we have entry M(A,b)={A->bA}. The empty string is not part of the set, so this rule is done. 3. We consider rule A->ε. We have that First(ε)={ε}. So we must consider Follow(A)={a}. So we have entry M(A,a)={A->ε}. The table is built and it is the same as shown before. Parse tables and LL(1) grammars An important result is to be considered. [Theo] Parsing tables and LL(1) grammars: If the parsing table for a given grammar contains at most one production rule in each entry, then that grammar is LL(1). This is a necessary and sufficient condition. In fact the parsing table was defined to host sets of production rules: if all entries have at most one production rule, the grammar is a LL(1) grammar.
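Putting the table and the parser actions together, a minimal C++ sketch of the predictive driver loop could look as follows. The symbol encoding ("pi" for the last-token, "eps" for the empty string), the table layout and the parseLL1 name are assumptions made for illustration; the output buffer and AST construction are omitted.
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>
using Symbol = std::string;
using Rhs = std::vector<Symbol>;                          /* the RHS of a production rule */
using Table = std::map<std::pair<Symbol, Symbol>, Rhs>;   /* M(non-terminal, terminal)    */
bool parseLL1(const Table& M, const std::set<Symbol>& nonterminals,
              const Symbol& axiom, std::vector<Symbol> input) {
    input.push_back("pi");                      /* last-token appended to the input buffer */
    std::vector<Symbol> stack = {"pi", axiom};  /* the top of the stack is the back        */
    size_t i = 0;                               /* index of the lookahead token            */
    while (!stack.empty()) {
        const Symbol top = stack.back();
        const Symbol la = input[i];
        if (top == "pi" && la == "pi") return true;      /* accept */
        if (nonterminals.count(top) == 0) {              /* top is a terminal */
            if (top != la) return false;                 /* mismatch: syntax error */
            stack.pop_back(); ++i;                       /* pop and fetch the next token */
        } else {                                         /* top is a non-terminal */
            auto it = M.find({top, la});
            if (it == M.end()) return false;             /* no rule: syntax error */
            stack.pop_back();
            const Rhs& rhs = it->second;
            for (auto r = rhs.rbegin(); r != rhs.rend(); ++r)
                if (*r != "eps") stack.push_back(*r);    /* push RHS right to left; eps vanishes */
        }
    }
    return false;
}
/* With the table of the previous example:
   M[{"S","a"}] = {"a","A","a"}; M[{"A","b"}] = {"b","A"}; M[{"A","a"}] = {"eps"};
   parseLL1(M, {"S","A"}, "S", {"a","b","b","a"}) returns true, reproducing the abba trace. */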
    pag 91 Andrea Tino- 2013 LL(1) grammars and ambiguities Recalling the concept of ambiguity for grammars, after introducing parsing tables the following result is obvious: [Cor] Non-ambiguous LL(1) grammars: A LL(1) grammar is not ambiguous. The proof is very simple. From the theorem introduced before, we know that LL(1) grammars have parsing tables with no multiple entries. This means that for each pair of a non-terminal and a terminal, one production rule only is considered. This leads to the fact that no ambiguities can exist under such circumstances. The following result as well is worth mentioning. [Cor] Ambiguous LL(1) grammars: An ambiguous grammar is not a LL(1) grammar. Easy to prove, as ambiguous grammars have multiple entries in the parsing table. Conditions for a grammar not to be LL(1) Necessary and sufficient conditions are good to check whether a grammar is LL(1) or not; however, sometimes one would like to answer the question: “Is this a non-LL(1) grammar?“. These questions typically involve the use of necessary conditions only, which are simpler to handle and easy to prove by means of the parsing table. Non-LL(1) grammars The following theorem relates left-recursive grammars to LL(1) grammars. [Theo] Left-recursive and LL(1) grammars: A left-recursive grammar is not LL(1). Proof: Provided the grammar is left-recursive, there must exist a rule in the form A ⇒ Aα | β. From this rule we can obtain the derivation Aα ⇒ βα. Considering the first-sets of both parts, we have that First(Aα) ⊇ First(βα), which we can transform into First(Aα) ⊇ First(β) for the case in which β cannot derive ε. This means that a terminal x must exist such that x ∈ First(β) ∧ x ∈ First(Aα); thus the table entry M(A,x) will contain both A ⇒ Aα and A ⇒ β. We are left with checking the case in which β ⇒∗ ε. In this condition we have that Follow(A) ⊇ First(α); this emerges when evaluating A's follow-set. Common elements of the two sets can be considered; so for one of these, x ∈ First(α) ∧ x ∈ Follow(A), the table entry M(A,x) will again contain both rules A ⇒ Aα and A ⇒ β. Non-left-factored grammars Another theorem can be really handy. [Theo] Impossibility of left-factorization: A grammar cannot be left-factored if there
    pag 92 Andrea Tino- 2013 exists a terminal which belongs to the first-sets of both RHS alternatives of a factorizable rule: A ⇒ αβ1 | αβ2 ∧ |{x ∈ T : x ∈ First(αβ1) ∧ x ∈ First(αβ2)}| > 0 ⇒ no left-factorization. This result is important as it provides a way to tell whether a grammar can be left-factored. Parsing LL(1) grammars using (predictive) recursive-descent algorithms We have described how to create a parser for a predictive grammar using predictive algorithms. Now we are going to parse the same grammars with a different approach. At the beginning of this section we introduced recursive-descent approaches as inefficient implementations of parsing algorithms for a certain subset of LL(1) grammars. The approach we are going to see is a mixture, as it does not involve the use of backtracking. This is possible thanks to the reuse of the parsing table (built in the exact same way); however, since no stack is considered, the algorithm relies on recursive functions. The presence of the parsing table is very important: if it weren't there, the algorithm would require backtracking. The idea The approach is based on recursive functions. Each non-terminal in the grammar is assigned a procedure responsible for the analysis of a certain sequence of tokens. When a non-terminal is encountered, the corresponding function is called. The methodology is very similar to the one used when we introduced lexical analyzers and how to implement a scanner for a regular grammar (regular grammar to program). Parsing a predictive grammar The parsing table is necessary for this class of algorithms, which is the reason why only predictive grammars can be handled here. So, given a predictive grammar, the parser is built by writing functions for each non-terminal as follows: 1. Write a function for each non-terminal. 2. For each production rule, write a test for each alternative of the rule. 3. Call the corresponding functions when non-terminals are encountered. 4. Place return points where the rule is matched. Every time a new token is needed, it is requested from the input stream. Example Let us consider the grammar production rule S->aBb;S->bAB. The corresponding handling
    pag 93 Andrea Tino- 2013 routine would be as follows (C++): class Input {...}; /* handling input */ enum Terminal { /* terminals only */ T_A = 0x00, T_B = 0x01 }; bool _rule_S(Input& input) { /* passing the object handling input */ switch (input.current()) { /* returns a Terminal */ case T_A: input.next(); /* accept */ if (_rule_B(input)) { /* if success, next token requested by routine */ if (input.current() == T_B) { input.next(); /* accept */ return true; } else return false; } else return false; case T_B: input.next(); /* accept */ if (_rule_A(input)) { if (_rule_B(input)) { return true; } else return false; } else return false; default: return false; } } The approach is actually very simple. Table-less algorithm This approach handles predictive grammars without using parsing tables. The approach is less efficient than the one using tables, but can be a possible choice. The example shown before provides a good overview of the methodology; however what to do when handling empty-strings? Well, consider the same grammar rule of before but a little modified: S->aBb;S->bAB;S->ε. When handling the routine for token S, we might be in a situation for which the current token is neither a nor b, in that case an epsilon-transition is to be considered. Actually, we must apply the epsilon-production when the current token is one of the terminals that can follow S. This is something very familiar for us as it involves first-sets and the follow-sets. Example Consider the following grammar: T = {a,b,c,d,e,f}; V = {S,A,B}; S -> a A e; S -> c A d; S -> B; A -> b A; A -> ε; B -> f; We proceed as before, but handle more things because of the empty string. class Input {...}; /* handling input */ enum Terminal { /* terminals only */ T_A = 0x00, T_B = 0x01, T_C = 0x02, T_D = 0x03, T_E = 0x04, T_F = 0x05
    pag 94 Andrea Tino- 2013 }; bool _rule_S(Input& input) { switch (input.current()) { case T_A: input.next(); /* accept */ if (_rule_A(input)) { /* if success, next token requested by routine */ if (input.current() == T_E) { input.next(); /* accept */ return true; } else return false; } else return false; case T_C: input.next(); /* accept */ if (_rule_A(input)) { if (input.current() == T_D) { input.next(); /* accept */ return true; } else return false; } else return false; case T_F: return _rule_B(input); /* do not consume the symbol here, the procedure will */ default: return false; } } bool _rule_A(Input& input) { switch (input.current()) { case T_B: input.next(); /* accept */ return _rule_A(input); case T_D: case T_E: /* apply A -> ε: d and e can follow A, do not consume */ return true; default: return false; } } bool _rule_B(Input& input) { if (input.current() == T_F) { input.next(); /* accept */ return true; } else return false; } As it is possible to see, the problem is locating the terminals that can follow a certain non-terminal or that can appear as a result of applying a certain rule. This is exactly what the parsing table does. A table-less approach is, actually, one making implicit use of tables (sort of). That's why a more systematic approach exists. Using predict-sets The previous approach can be made more systematic by using predict-sets. Well, we actually used
    pag 95 Andrea Tino- 2013 predict-sets somehow there, but we did not make them explicit. Now we are going to see how to implement the recursive-descent parser when predicts are considered. Creating functions using predict-sets Using predicts, for each non-terminal, we can write functions as before. The rules now are the following: 1. Write a function for each non-terminal. 2. For each production rule (all alternatives as RHS for the same LHS are considered part of the same rule), write a test based on the predict of that particular alternative. 3. Call the corresponding functions when non-terminals are encountered. 4. Place return points where the rule is matched. Example Consider the following grammar first: T = {a,b,c}; V = {S,A,B,C}; S -> A a; A -> B C; B -> b; B -> ε; C -> c; C -> ε; Let us calculate the predict-sets for all production rules: Predict(S->Aa) = First(Aa) = (First(A)−{ε}) ∪ First(a) = {b,c} ∪ {a} = {a,b,c}; Predict(A->BC) = First(BC) ∪ Follow(A) = {b,c,ε} ∪ {a} = {a,b,c,ε}; Predict(B->b) = First(b) = {b}; Predict(B->ε) = First(ε) ∪ Follow(B) = {ε} ∪ (First(C)−{ε}) ∪ Follow(A) = {ε} ∪ {c} ∪ {a} = {a,c,ε}; Predict(C->c) = First(c) = {c}; Predict(C->ε) = First(ε) ∪ Follow(C) = {ε} ∪ Follow(A) = {ε} ∪ {a} = {a,ε}; We can now create the code as we did before, but now we use predict-sets (the empty string can be discarded in the process). Note that the current token is consumed only when a terminal of the rule is matched, never when entering a routine: class Input {...}; /* handling input */ enum Terminal { /* terminals only */ T_A = 0x00, T_B = 0x01, T_C = 0x02 }; bool _rule_S(Input& input) { switch (input.current()) { case T_A: case T_B: case T_C: /* predict(S->Aa) */ if (_rule_A(input)) { if (input.current() == T_A) { input.next(); /* accept terminal a */ return true; } else return false; } else return false; default: return false;
    pag 96 Andrea Tino- 2013 } } bool _rule_A(Input& input) { switch (input.current()) { case T_A: case T_B: case T_C: /* predict(A->BC) */ if (_rule_B(input)) { if (_rule_C(input)) { return true; } else return false; } else return false; default: return false; } } bool _rule_B(Input& input) { switch (input.current()) { case T_B: /* predict(B->b) */ input.next(); /* accept terminal b */ return true; case T_A: case T_C: /* predict(B->epsilon), do not consume */ return true; default: return false; } } bool _rule_C(Input& input) { switch (input.current()) { case T_C: /* predict(C->c) */ input.next(); /* accept terminal c */ return true; case T_A: /* predict(C->epsilon), do not consume */ return true; default: return false; } } Differently from before, epsilon-productions can be easily handled. Parsing with tables When using the parsing table for the language, everything gets easier and the code can be written in a more systematic way. The main idea is to write mutually recursive functions that, driven by the table, will parse a certain input string without the need of a stack. Recursion can reach a certain depth, and the probability of experiencing a stack-overflow fault is very high when grammars get very complex with many rules.
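A minimal sketch of that idea in C++, reusing the same kind of table as the predictive parser, might look like the following; the Parser structure and its member names are illustrative, and error recovery, AST building and output-buffer bookkeeping are deliberately left out.
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>
using Symbol = std::string;
using Rhs = std::vector<Symbol>;
using Table = std::map<std::pair<Symbol, Symbol>, Rhs>;   /* M(non-terminal, lookahead) */
struct Parser {
    const Table& M;                        /* the parsing table drives every expansion */
    const std::set<Symbol>& nonterminals;
    std::vector<Symbol> input;             /* token stream terminated by "pi" */
    size_t pos = 0;
    const Symbol& lookahead() const { return input[pos]; }
    /* One routine per non-terminal; here a single generic routine parameterized by A. */
    bool parseNonterminal(const Symbol& A) {
        auto it = M.find({A, lookahead()});
        if (it == M.end()) return false;            /* no applicable rule: syntax error */
        for (const Symbol& X : it->second) {        /* expand A with the selected rule */
            if (X == "eps") continue;               /* epsilon: consume nothing */
            if (nonterminals.count(X)) {            /* non-terminal: recursive call */
                if (!parseNonterminal(X)) return false;
            } else if (lookahead() == X) {
                ++pos;                              /* terminal: match and advance */
            } else return false;                    /* terminal mismatch */
        }
        return true;
    }
    bool parse(const Symbol& axiom) {
        return parseNonterminal(axiom) && lookahead() == "pi";
    }
};
Compared to the stack-based driver, the call stack of the mutually recursive invocations plays the role of the explicit symbol stack, which is exactly why very deep grammars can exhaust it.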
    pag 97 Andrea Tino- 2013 Predictive vs. recursive-descent algorithms Predictive approaches are more efficient in terms of implementation and program speed. With a predictive algorithm the grammar is mapped onto the parsing table; with a recursive-descent approach, on the other hand, the grammar is mapped onto the code itself. If the grammar changes, the whole code needs to be written again, whereas with predictive algorithms only the table needs to be changed.
    Andrea Tino - 2013 Bottom-up parsers A categorical imperative would be one which represented an action as objectively necessary in itself, without reference to any other purpose. “ “ Immanuel Kant http://www.brainyquote.com/quotes/quotes/i/immanuelka393400.html
    pag 99 Andrea Tino- 2013 Overview Bottom-up parsers treat the parsing problem from the opposite perspective compared to top-down ones. The AST is built starting from the leaves and proceeding up to the root. The approach might look strange; however, we will discover that many algorithms here generalize top-down ones. In particular, the class of algorithms used today to parse strings in a bottom-up flavor is represented by LR(k) algorithms: Left-to-right scan, Rightmost derivation. So, as with top-down parsers, the input is processed from left to right, but derivations proceed by replacing the first non-terminal encountered when scanning the sequence from the right (a rightmost derivation). We will start from LR parsers and move to LR(0) parsers in this section. What makes LR parsers special today LR parsers are very famous today, much more than LL. The point is that they are powerful enough to make it possible to parse grammars up to type-2 (context-free grammars). In fact, today almost all languages and grammars are handled using LR algorithms. Implementation issues One more point that makes LR parsers so widely used concerns implementation. LR parsers are a class of shift/reduce parsers not using backtracking. There are a lot of ways, today, to implement these algorithms, and they can be really fast and efficient. Error handling Errors can be easily managed when implementing LR parsers. It is also possible to report the locations where errors are encountered without struggling too much; implementing error management for LR parsers has become a standard approach, so well-known methodologies are available out there. Typical problems with LR parsers LR parsers are not a field of daisies anyway; some problems are to be considered. The biggest issue is development: it is not possible to easily develop a LR parser without using some tools to help with the process. Manual implementation is almost impossible given the complexity of the parsing procedure. Simple grammars are manageable, but when it comes to something more serious, building the parsing table and all the structures required by the parser can be a hard task. LR grammars and parsers We are now going to analyze the LR(1) parsing algorithm. But first we focus a little on grammars.
    pag 100 Andrea Tino- 2013 [Def] LR grammars: LR grammars are a particular set of grammars ranging from type- 2 to type-3, that can be parsed by LR(k) parsing algorithms. In particular we have: [Def] LR(1) grammars: LR(1) grammars are grammars that can be parsed by LR(1) parsing algorithms. The LR parser Focusing on LR parsers (thus LR(1) parsers), a common LR parsing algorithm is characterized by the following elements: • Stack: Containing states and symbols. The stack will always be in a configuration like: sm,xm,sm−1,xm−1s1,x1,s0{ } (the top is the left-most element). We can have states si ∈S or symbols xi ∈V ∪T in the stack. • Input buffer: The string a1a2 anπ of terminals ai ∈T to parse. • Output buffer: Contains information about all actions performed by the parser to perform the parsing for a given input. • Action-table: A table of actions to perform depending on the state. It can be seen as an application Action :S ×T  Χ accepting a state and a terminal (part of the input) and returning an action. • Goto-table: A table of states. It can be seen as an application GoTo :S × V ∪T( ) S that accepts a state and a symbol of the grammar, and returning another state. The table The parsing table for LR parsers is represented by the action-table and the goto-table. As for LL parsers, these tables can be seen as applications accepting two entries and returning something. Parser state The parser has an internal state which is a concept different from the set of states Χ . In every moment the state of the parser is represented by the current configuration of its stack sm,xm,sm−1,xm−1s1,x1,s0{ } and the remaining string left to parse aiai+1anπ . It can be represented, in a concise way, as: sm xmsm−1xm−1s1x1s0,aiai+1anπ{ }. It is to be interpreted as: when the parser is processing current symbol ai , the stack is in the reported configuration and at the top we find state sm . Actions The parser can perform 4 different actions depending on a state and a terminal symbol. Preconditions are always the same: parser is in state sm xmsm−1xm−1s1x1s0,aiai+1anπ{ }. • Shift: If the action is shift, thus Action sm,ai( )= χS , then the parser will push current symbol ai into the stack. Later the parser will calculate the next state as: sm+1 = GoTo sm,ai( ); the new
    pag 101 Andrea Tino- 2013 state is then pushed into the stack and will become the new top-state. All this will cause the parser to make a transition to state sm+1aism xmsm−1xm−1s1x1s0,ai+1anπ{ }. Please note how the look-ahead symbol is consumed. • Reduce: If the action is reduce, thus Action sm,ai( )= χR , then the action table will also return one more information: the production rule to use for reduction: A ⇒ β . So let r = β be the number of terminals and non-terminals in the rule’s RHS, then the stack will be shrunk by 2r elements by popping them. In particular the following must hold: β = xm−r+1xm−r+2 xm , thus symbols xm−r+1,xm−r+2 xm will be popped from the stack together with their corresponding states sm−r+1,sm−r+2 sm . Rule’s LHS A will be pushed into the stack together with the next state calculated as s = GoTo sm−r ,A( ) and becoming the new top-state. The parser will move to configuration sAsm−r xm−rsm−r−1xm−r−1s1x1s0,aiai+1anπ{ }. Please note how the loookahead symbol is not consumed. • Accept: If the action is accept, thus Action sm,ai( )= χA , then the parser terminates successfully. • Error: If the action is error, thus Action sm,ai( )= χE , then the parser terminates reporting the problem occurred. As it is possible to see, tables can return more information than those introduced so far in their formal definition. Later in this chapter we will detail them. The LR(1) parsing algorithm When considering an input sequence, a LR(1) parser follows these steps: 1. The input buffer is initialized with the input sequence. The stack is initialized by pushing the initial state. At the end of the initialization process, the parser’s state will be: s0,a1a2 anπ{ }. 2. Evaluate Action sm,ai( ) where sm is always the top-symbol in the stack. Accordingly to the action, the parser will act as described before. 3. Repeat point 2 until an error is found or until an accept action is performed. The algorithm needs certain types of grammars, ambiguities can be considered here as well. Right sentential forms A concept is very important in the context of LR parsing. [Def] Right sentential form: Given a grammar G V,T,P,S( ) and a sequence of symbols α ∈ V ∪T( )∗ , we call it a Right Sentential Form (RSF) for the grammar when the sequence can be written in the form: α = β1β2 βma1a2 an having βi ∈ V ∪T( ),∀i = 1…m and ai ∈T,∀i = 1…n , thus the right side is always filled with terminals. Recalling the concept of state for a LR parser introduced before, the state of a parser
    pag 102 Andrea Tino- 2013 sm xmsm−1xm−1s1x1s0,aiai+1anπ{ } always corresponds to RSF x1x2 xm−1xmaiai+1an . Deriving a RSF A RSF can be involved into a derivation process using production rule A ⇒ β . however from the LR parser’s perspective, the process is conducted by inverting the usual flow. The rule’s RHS matches a sequence of symbols into the RSF, later the rule’s LHS is replaced on that sequence into the original RSF returning a new one. More formally, when having the following situation in the grammar: S ⇒⇒αAω ⇒αβω , having α,β ∈ V ∪T( )∗ and ω ∈T ∗ , and production rule A ⇒ β , it is possible to have αβω ⇒αAω . As it is possible to see, here we are following the opposite strategy when comparing to LL parsers. LR parsers start from productions and try to move up into the derivations tree in order to reach its root (the start symbol). On the other hand, LL parsers made the opposite, instead of reducing expressions, non-terminals were expanded, thus starting from the root of the derivations tree to its leaves. The Action/Goto table As noted before, the action/goto table (in particular the action table) contains more information than those reported in their formal definition when we first introduced it. The point is that the parsing table is formed by the action-table and the goto-table. The goto-table share a portion of the entry space with the action-table. When one needs to visualize the parsing table, one matrix only is shown. Consider the following table: ID PLUS STAR RO RC π exp term fin 0 S-5 null null S-4 null null 1 2 3 1 null S-6 null null null A null null null 2 null R-2 S-7 null R-2 R-2 null null null 3 null R-4 R-4 null R-4 R-4 null null null 4 S-5 null null S-4 null null 8 2 3 5 null R-6 R-6 null R-6 R-6 null null null 6 S-5 null null S-4 null null null 9 3 7 S-5 null null S-4 null null null null 10 8 null S-6 null null S-11 null null null null
    pag 103 Andrea Tino- 2013 9 null R-1 S-7 null R-1 R-1 null null null 10 null R-3 R-3 null R-3 R-3 null null null 11 null R-5 R-5 null R-5 R-5 null null null The table maps actions and gotos for the following grammar (expressed in YACC notation): T = {ID,PLUS,STAR,RO,RC}; V = {exp,term,fin}; 1) exp -> exp PLUS term; 2) exp -> term; 3) term -> term STAR fin; 4) term -> fin; 5) fin -> RO exp RC; 6) fin -> ID; As it is possible to see, grammar rules have been assigned a number, an index. It is mandatory to assign to each rule in the grammar an index i = 1… P . The table makes use of these indices. The table is structured like this: all rows refers to states si ∈S . The table has T + V columns: the first T refer to all terminals in the grammar while the remaining V all refer to non-terminals. The whole table is actually the union of the action-table and the goto-table. Indices showing in the action part of the table are indices actually belonging to the goto-table which “overlaps“ the action- table. Think of the parsing table as the union between the action and the goto tables. Action-table The action-table’s entries are all in the form x-i: an action and an index when available (the index is part of the goto-table overlapping action-table entries). • Shift actions: Letter S is used to refer to shift actions followed by the index of the state which will be pushed on top of the stack. • Reduce actions: Letter R is used to refer to reduce actions followed by the index of the grammar rule to use when reducing the stack. • Accept actions: Letter A is used to refer to accept actions. No index is needed here. • Error actions: Literal null is used to refer to error actions. No index is needed here. When the table returns no entry for a given position, the null value is returned. Goto-table The goto-table’s entries are all indices referring to states si ∈S . A demostrative example on LR parsers We consider the same grammar of before and the table. We still do not know how to build the table, but we are just going to focus on the parsing algorithm for now. Consider input string ID STAR ID PLUS ID. As always stacks’ top element is the leftmost one.
    pag 104 Andrea Tino- 2013 Stack Input buffer LA symbol Action [index] Descrirption {0} {ID,STAR,ID,PLUS ,ID,π} null reduce-0 Initialization. {5,ID,0} {STAR,ID,PLUS,ID,π} ID shift-5 Action-table returns shift. Symbol and state pushed. {3,fin,0} {ID,PLUS,ID,π} STAR reduce-6 Reducing using rule 6. Rule’s LHS is used to get next state. {2,term,0} {ID,PLUS,ID,π} STAR reduce-4 Reducing using rule 4. {7,STAR, 2,term,0} {ID,PLUS,ID,π} STAR shift-7 Shifting and pushing ele- ments in the stack. {5,ID,7,STAR, 2,term,0} {PLUS,ID,π} ID shift-5 Shifting and pushing ele- ments in the stack. {10,fin, 7,STAR, 2,term,0} {ID,π} PLUS reduce-6 Reducing using rule 6. {2,term,0} {ID,π} PLUS reduce-3 Reducing using rule 3. Stack gets shrunk! {1,exp,0} {ID,π} PLUS reduce-2 Reducing using rule 2. {6,PLUS, 1,exp,0} {ID,π} PLUS shift-6 Shifting. {5,ID,6,PLUS, 1,exp,0} {π} ID shift-5 Shifting. {3,fin,6, PLUS,1,exp,0} {} π reduce-6 Reducing using rule 6. {9,term,6, PLUS,1,exp,0} {} π reduce-4 Reducing using rule 4. {1,exp,0} {} π reduce-1 Reducing using rule 1. {} {} null accept Success! The whole process is very simple. The core part of the algorithm resides in the table. How to build parsing tables When we introduced LL grammars we also introduced parsing tables for them. A theorem can also be considered in order to decide whether a grammar is LL by inspecting the parsing table generated by that grammar. Here is the same. A grammar generates a parsing table which is
    pag 105 Andrea Tino- 2013 supposed to have a certain structure. [Theo] LR grammars: A grammar is LR when the Action/Goto table can be built. There are several algorithms to build the parsing table for a LR grammar; which one to use mostly depends on the specific case (in particular, on the number of look-ahead symbols). The parsing algorithm is deeply related to the parsing table. Some important considerations The stack plays a key role; in fact it contains all the information regarding the parsing itself. But what is important to underline here is the fact that the whole stack is not necessary to parse a string: the top-state contains all the information we need to perform the parsing. So a LR parser just needs to inspect the top-state instead of the whole stack. This means that a FSA can be built out of this system. States are pushed into the stack and the stack is reduced only depending on the top-state. A FSA will make the algorithm proceed until an accepting state is reached; the stack will provide the FSA with all the information it needs. So here it is: a pushdown automaton! We discovered nothing new actually, as Chomsky and Schützenberger already told us that type-2 grammars are to be handled using such structures. Relating LR and LL grammars We understood that LL grammars are a subset of CFG grammars. What about LR grammars? [Lem] LR and CFG grammars: There exist non-LR CFG grammars. So, again, LR grammars are a subset of CFG grammars. But what can we say when comparing LL and LR grammars? From what we could understand before, we found that LR grammars are parsed using pushdown automata, while LL grammars are parsed without the use of any FSA. Furthermore we know that type-2 grammars are handled using pushdown automata (recall the Chomsky-Schützenberger hierarchy). So it looks like LR grammars tend to be a little more generic than LL. [Theo] LR and LL grammars: LR grammars can describe LL grammars. LR(k) grammars can describe LL(k) grammars. So we have that LL grammars are a subset of LR grammars, which are a subset of CFG grammars (figure: LL(k) ⊂ LR(k) ⊂ CFG). This happens for one reason only: LL grammars are more restrictive than LR grammars. Recalling LL grammars, we need them to be non-left-recursive, which is a strong requirement;
    pag 106 Andrea Tino- 2013 left-factorization is also required whenever predictive parsing is to be used. On the other hand, LR grammars, to be parsed, have only one simple requirement: [Theo] Parsing LR grammars: LR(k) parsers can parse context-free grammars having one production rule only for the axiom. [Cor] Ambiguous LR grammars: Ambiguous grammars cannot be LR or LR(k). If the grammar has more than one production rule for the axiom S, we can create an equivalent LR grammar by simply adding a new non-terminal S′ ∈ V, making it the new axiom and adding the production rule S′ ⇒ Sπ. Also please note that the theorem provides a necessary condition only. LR(0) grammars and parsers The parsing algorithm is always the same; what changes is the parsing table, thus the action-table and the goto-table. How to build them? In this section we are going to see how to build these tables when handling LR(0) grammars. LR(0) grammars are LR grammars using no lookahead symbols. These particular grammars are a little more restrictive than LR grammars, but their parsing table can be built by simply analyzing the set of production rules. LR(0) items Item-sets A LR grammar must have some characteristics to be LR(0). By following certain methodologies it is possible to check whether a LR(0) grammar can be generated out of a LR one. The first concept to understand is the LR(0) (dotted) item of a production rule. Given a LR grammar G(V,T,P,S) and a production rule A ⇒ X1 X2 … Xn, where Xi ∈ V ∪ T for all i = 1…n, the LR(0) item-set for that production rule is the set of production rules, all having A as LHS, of the form: A ⇒ X1 … Xi • Xi+1 … Xn for all i = 1…n−1, together with A ⇒ • X1 X2 … Xn and A ⇒ X1 X2 … Xn •. The dot • symbol is a new symbol added to the terminal set of the original grammar and becomes part of the new rules introduced by the item. When considering items for all production rules of
    pag 107 Andrea Tino- 2013 a grammar, a new grammar is generated. The new grammar will host many more rules than the original one, and a new symbol becomes part of it. One last thing Typically another operation is performed when creating an augmented grammar in this context: let S be the axiom of the grammar, then the new augmented grammar will be added with non-terminal ′S and with production rule ′S ⇒ S . This is necessary to guarantee that the LR(0) parser doesn’t fault in the last part of the parsing, but actually this trick is the same we saw before and ensures that the grammar will be a proper LR grammar at least. Special items In the item-set of a production rule we can find items. The number of elements in the item-set of a rule is n +1, where n is the number of symbols in the rule’s RHS. There are two items that are special: • Predict-item: The dot • symbol appears as the left-most symbol in the item’s RHS like: A ⇒ iX1X2 …Xn . • Reduce-item: The dot • symbol appears as the right-most symbol in the item’s RHS like: A ⇒ X1X2 …Xn i. When handling the empty string, a rule like A ⇒  will generate one item only: A ⇒ i . What does it mean? An item for a production is the representation of the recognized portion of that rule’s RHS. Given an item for a certain rule, the RHS’ left part of the dot • symbol is the original rule’s portion recognized by the parser so far; everything appearing on the left of the dot • symbol is expected to be encountered later. Augmented grammar So, as a resume, when having a certain LR grammar G V,T,P,S( ), we can create an augmented grammar ′G = ′V , ′T , ′P , ′S( ) having: • The augmented set of non-terminals can be generated from the original one by adding the new axiom symbol: ′V = V ∪ ′S{ }. • The augmented set of terminals is created by simply adding the dot • symbol: ′T = T ∪ i{ }. • The augmented set of production rules is created by adding all items for all original production rules: ′P = ′p : ′p ∈Item p( ){ } ∀p∈P  . Furthermore, production rule ′S ⇒ S is added to the augmented grammar. • The new axiom is simply defined as a new symbol. LR(0) closure of an item Given an item ′p ∈ ′P , we can consider a set called the closure of that rule. Our purpose is creating a FSM, to do so, the concept of closure is necessary (here we consider the closure for LR(0) grammars). the closure of an item is the set of items/rules that could be used to extend the
    pag 108 Andrea Tino- 2013 current structure handled by the parser. It can be calculated as follows. 1. Initialize the closure to the empty set and add the item itself to it: Closure ′p( )= ′p{ }. 2. For every item A ⇒α i Bγ 1 in the closure, add all items in the form B ⇒ iγ 2 (predict-items). Thus consider all rules in the original grammar whose LHS is the non-terminal appearing right after the dot in the current item. Mark the item. 3. Repeat step 2 until all items in the set are marked. Example Let us consider the following grammar: T = {a,b,c,p,π}; V = {S,E,T}; S -> E π; E -> T; E -> E p T; T -> a; T -> b E c; Let us calculate the closure of item S->•Eπ. 1. At the beginning the closure is: closure = {S->•Eπ}. 2. From the item in the closure, we inspect the first non-terminal after the dot: E. We can add items E->•T and E->•EpT. We have: closure = {S->•Eπ, E->•T, E->•EpT}. 3. The third item redirects to the same items. The only one item left unmarked is E->•T. We inspect rules having T as LHS, and add items T->•a and T->•bEc. The closure is: closure = {S->•Eπ, E->•T, E->•EpT, T->•a, T->•bEc}. No more items are to be added as unmarked items have terminals after the dot. Closures for items in a grammar will be used in the next step in order to build the parsing table. Building a FSA out of the grammar In order to build the parsing table, a FSA is needed. To build the FSA an algorithm can be considered and it takes advantage of closures. The procedure will create a DFA. FSA generation algorithm Recalling the formal definition of DFAs, we consider an empty DFA N = Σ,S,δ,s0,F( ). An augmented grammar is considered as well: ′G = ′V , ′T , ′P , ′S( ). The DFA will have the following characteristics: • The alphabet contains terminals and non-terminals: Σ = ′V ∪ ′T . • States are closures of items in the grammar: S ⊆ t :Closure ′p( ),∀ ′p ∈ ′P{ }. The following rules are to be applied. 1. Consider item ′S ⇒ iS , this item must be in the grammar due to its construction. Build the closure for this item and mark it as the initial state of the DFA. 2. For every not-marked state in the DFA, consider its value (the closure). Create a new state in the DFA for each symbol (terminal or non-terminal) following the dot in the items of the
    pag 109 Andrea Tino- 2013 closure. Create a connection from the current state to new ones and label them with the symbol following the dot of the items considered previously. 3. For each new state created, put in the closure (representing that state’s value) the items from the source linked state where the symbol after the dot is the one placed on the connection between the source state and the new state. When adding items from the source state move the dot one position to the right. For all items where the dot is followed by non-terminals, add to the closure all items in the closure of those items. 4. Every state containing at least one reduce-item is to be marked as final state. 5. Mark every new state which has undergone the procedure. 6. Repeat from point 2 until no unmarked states are left. The DFA built is called characteristic FSM: CFSM for the grammar. Example Consider the following grammar: T = {id,(,),+}; V = {E,T}; start = E; E -> T; E -> E + T; T -> id; T -> ( E ); The augmented grammar will be: T’ = {id,(,),+,π,•}; V’ = {S,E,T}; start = S; Note how we also added the end-symbol to the augmented grammar, this is a typical methodology. Following all rules introduced before, the final DFA will be the one reported below. Todo Building the parsing table We have everything needed to build the parsing table. The parsing table is initialized with states as rows and, respectively, terminals and non-terminals as columns. Attention The table refers the original grammar not the augmented one; so no augmented symbols will be shown in the table. All entries are set to null values. The table should be visualized in compact form: action-table + goto-table. Note that the action-table won’t need characters to get the action. For this reason a separate column will host actions for each state. To build the table, the following rules can be applied: 1. For each transition arc si → sj marked with symbol X ∈V ∪T , enter index j ∈ in table entry i, X( ). So: GoTo i, X( ) ⊇ j{ }. 2. For all transition states (states that are not final states), enter shift in the action-column for that state in the table.
    pag 110 Andrea Tino- 2013 3. For each final state, enter reduce in the action-column for that state in the table. Also specify the production rule by placing the rule index corresponding to the item in the state. If more items are present, put more indices (this is a reduce/reduce ambiguity). 4. The DFA must have a final state containing the item corresponding to rule S′ ⇒ S •. Replace the reduce action for that state with the action accept. The table looks a little different when compared to LR tables, but both tables rely on the same structures. Example Continuing the previous example, we can build the table following the algorithm. The table is (Todo; only the first rows are shown): state 0: action = shift; ID: S-5; RO: S-4; exp: 1; term: 2; all other entries null. State 1: action = reduce; PLUS: S-6; π: A (accept); all other entries null. Please note how the action-table does not overlap the goto-table. The parsing algorithm Compared to the LR parsing described so far, LR(0) parsing is different because no lookahead symbols are needed. So the parsing algorithm described before needs to be modified just a little. Actually the procedure is always the same, but the action-table needs no character as input. It means that the action-table does not overlap with the goto-table. Also, to get the action, no character is needed, so the application becomes: Action : S → X. The parsing algorithm remains the same. The next state is to be calculated using the goto-table as before; in this case the symbol in the stack is necessary. Conditions for LR(0) grammars Some important considerations can be made; in particular a very important result is reported below: [Theo] LR(0) grammars: A LR grammar is LR(0) if and only if each state in the CFSM is: a reduction state (final state) containing one reduction-item only, or a normal state (shift state) with one or more items in the closure (no reduction-items). [Cor] LR(0) grammars: A LR(0) grammar has one element only in each entry of the parsing table (figure: LR(0) ⊂ LR(1) ⊂ CFG).
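To make the LR(0) driver concrete, here is a minimal C++ sketch under the conventions used so far: the action depends on the top state only, the goto-table is consulted with a state and a symbol, and rule indices refer to the grammar's production rules. Names, the token encoding ("pi" as last-token) and the table representation are assumptions for illustration, not a prescribed implementation.
#include <map>
#include <string>
#include <utility>
#include <vector>
enum class Kind { Shift, Reduce, Accept, Error };
struct Action { Kind kind; int rule = -1; };      /* rule index used by reductions */
struct Rule { std::string lhs; int rhsLength; };  /* only what the driver needs */
bool parseLR0(const std::map<int, Action>& action,                  /* Action(s)  */
              const std::map<std::pair<int, std::string>, int>& go, /* GoTo(s, X) */
              const std::vector<Rule>& rules,
              const std::vector<std::string>& input) {              /* tokens + "pi" */
    std::vector<int> states = {0};          /* state stack; the initial state is pushed */
    std::vector<std::string> symbols;       /* symbol stack, kept in parallel */
    size_t i = 0;                           /* next input token */
    while (true) {
        auto a = action.count(states.back()) ? action.at(states.back()) : Action{Kind::Error};
        if (a.kind == Kind::Accept) return true;
        if (a.kind == Kind::Error) return false;
        if (a.kind == Kind::Shift) {        /* push the symbol, then the goto state */
            if (i >= input.size()) return false;
            auto it = go.find({states.back(), input[i]});
            if (it == go.end()) return false;
            symbols.push_back(input[i]);
            states.push_back(it->second);
            ++i;
        } else {                            /* Reduce: pop |RHS| pairs, then push the LHS */
            const Rule& r = rules[a.rule];
            for (int k = 0; k < r.rhsLength; ++k) { states.pop_back(); symbols.pop_back(); }
            auto it = go.find({states.back(), r.lhs});
            if (it == go.end()) return false;
            symbols.push_back(r.lhs);
            states.push_back(it->second);
        }
    }
}
In the LR(1) variant described earlier, the only change is that the action is consulted with the lookahead token as well as the top state.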
    pag 111 Andrea Tino- 2013 The theorem and corollary above clearly make LR(0) grammars a little more specific than LR grammars. So we can make our hierarchy a bit more precise by adding these grammars and placing them in the right position. SLR(1) grammars and parsers LR(0) grammars are very compact, but the lack of look-ahead symbols sometimes makes them inefficient. They can be improved with little effort. SLR(1) parsers (Simple LR) work on a wider class of grammars as they relax the restrictions a little. Inserting look-ahead symbols SLR(1) parsers work with the LR(0) parsing CFSM together with look-ahead symbols. Look-aheads are considered and generate modifications in the CFSM. In particular, lookaheads are considered for the items in the states/closures of the DFA, but how to get the look-aheads? For each item in the states/closures of the DFA we have: • When having shift-items (non-reduce-items) in the form A ⇒ α • β, the look-aheads are all the terminals in First(β). • When having reduce-items in the form A ⇒ α •, the look-aheads are all the terminals in Follow(A). This will cause more transitions to be generated from the original CFSM. Solving ambiguities Sometimes SLR(1) parsing can help resolve LR(0) parser conflicts generated by ambiguities in the corresponding LR(0) grammar. The approach is to use look-ahead symbols to bypass ambiguities; however, such an approach is not always successful. Not an exact methodology Using look-aheads taken from the follow-set of reduction-items is a good strategy, but definitely not always the best, especially when handling shift/reduce conflicts. This is due to the fact that parsers embodying look-ahead symbols in their implementation are a far more precise approach than estimating look-aheads using follow-sets. Characterizing SLR(1) grammars We have the following important result: [Theo] SLR(1) grammars: A LR grammar is SLR(1) if and only if each state in the CFSM (augmented with look-aheads) is: a reduction state (final state) containing only one reduction-item per look-ahead symbol, or a normal state (shift state) with one or more
    pag 112 Andrea Tino- 2013 items in the closure (no reduction-items). We know that ambiguous grammars cannot be LR; the same goes here: [Theo] Ambiguous SLR(1) grammars: Ambiguous grammars cannot be SLR(1). Also, the following result helps us place SLR grammars: [Theo] SLR and LR grammars: There exist LR grammars that are not SLR. So SLR(1) grammars are a proper subset of LR grammars. Moreover, SLR(1) grammars are a superset of LR(0) grammars; the proof is simple, as SLR(1) parsing enhances LR(0) parsing. Simplifying more complex grammars SLR grammars can be very efficient and able to catch several constructs of languages. LR(1) grammars generate complex DFAs; however, very often designers oversize the problem, as many LR(1) grammars are actually SLR(1)! DFAs shrink a lot when using SLR(1) parsing compared to LR(1). In fact, a SLR(1) grammar parsed using a LR(1) algorithm generates a DFA with many more states than when a SLR(1) algorithm is used. LR(1) grammars and parsers LR(1) grammars are parsed by including look-ahead symbols in the process. Differently from SLR(1) parsers and grammars, here look-ahead symbols are part of the process from the beginning; they are not considered later as an appendix. LR(1) items LR(1) items are a little bit different from LR(0) items. Basically LR(1) items are like LR(0) items, but they contain information about look-ahead symbols. An item appears in the following form: [A ⇒ α • β, a]. The dot-symbol keeps the same meaning, but a look-ahead symbol (a terminal) appears as well (figure: LR(0) ⊂ SLR(1) ⊂ LR(1)). What does it mean? An item is now to be interpreted as: “The RHS left part has been recognized
    pag 113 Andrea Tino- 2013 so far, the left part is expected once the look-ahead is be encountered!“. An item can be represented as: A ⇒ x1x2 xi i xi+1xn,a[ ], where xi ∈ V ∪T( )∗ ,∀i = 1…n and a ∈T ∪ π{ }. Basically an item like A ⇒α i β,a[ ] is not so different from a LR(0) item, but this item A ⇒αi,a[ ] is more different as it is a reduce-item only when the specified terminal appears in imput. The end-terminal π appears as look-ahead symbol for the axiom rule. LR(1) closure of an item The LR(1) closure for an item ′p ∈ ′P is to be calculated differently from LR(0) closures. 1. Initialize the closure to the empty set and add the item itself to it: Closure ′p( )= ′p{ }. 2. For every item A ⇒α i Bγ 1,a[ ] in the closure, add all items in the form B ⇒ iγ 2,b[ ] (predict- items) for all terminals appearing in the first-set of expression γ 1a: ∀b ∈First γ 1a( ). Mark the item. 3. Repeat step 2 until all items in the set are marked. The procedure is different because it keeps track of look-aheads. Building the CFSM All considerations made before for LR(0) grammars are still valid here. Algorithm The algorithm tu build the DFA is still the same with very little modifications. When calculating closures, the LR(1) closure is to be used. Furthermore, at the beginning of the algorithm, the first item to consider is ′S ⇒ iS,π[ ]: its closure is the initial state of the DFA. Final states Another important aspect is the following: in order to have a valid parsing, the DFA should have all final states as those states having at least one predict-item in the closure. But this is something we already knew. Now we have one more condition: predict-items should always have the last-terminal π as look-ahead symbol. Building the parsing table Compared to LR(0) grammars, the table now will be a little different as the action-table will overlap the goto-table. It means that the action depends both on the current state and on the look- ahead symbol as well. Here too, all considerations made for the LR(0) parsing table building algorithm are valid. 1. For each transition arc si → sj marked with symbol X ∈V ∪T , enter index j ∈ in table entry i, X( ). So: GoTo i, X( ) ⊇ j{ }. 2. For each shift state’s outgoing transition arc si → sj marked with symbol X ∈V ∪T , enter shift in action-table entry si , X( ), in the goto-table at the same entry place value j . 3. For each final state si , for each reduce-item A ⇒αi,a[ ], enter reduce in action-table entry
    pag 114 Andrea Tino- 2013 (si, a). Also specify the production rule by placing the rule index corresponding to the item in the state, thus A ⇒ α. If more items are present, put more indices (this is a reduce/reduce ambiguity). 4. The DFA must have a final state containing the item [S′ ⇒ S •, π], corresponding to rule S′ ⇒ S. Replace the reduce action for that state with the action accept. As it is possible to see, the procedure is very similar. The parsing algorithm Parsing is performed as explained at the beginning of this chapter in the example. Characterizing LR(1) grammars We have the following important result: [Theo] LR(1) grammars: A LR grammar is LR(1) if and only if each state in the CFSM is: a reduction state (final state) containing only one reduction-item per look-ahead symbol, or a normal state (shift state) with one or more items in the closure (no reduction-items). We know that ambiguous grammars cannot be LR; the same goes here: [Theo] Ambiguous LR(1) grammars: Ambiguous grammars cannot be LR(1). LALR(1) grammars and parsers Let us focus on the last method to build parsing tables for LR grammars: LALR (Look-Ahead LR) grammars. Here as well, these grammars can be more compact and generate smaller tables compared to canonical LR. The same goes for the CFSM, which shrinks and needs fewer states. LR, SLR and LALR SLR grammars were introduced in almost the same way as we are doing for LALR. However SLR grammars cannot catch many constructs that LALR can; yet both grammars share the same important characteristic: they generate smaller DFAs and more compact tables compared to canonical LR. However one result, which will be detailed later, is very important: [Theo] LALR and SLR grammars' CFSM sizes: LALR tables have the same number of states as SLR tables, although LALR grammars cover more constructs than SLR grammars.
pag 115 Andrea Tino - 2013
Please remember that the table size is related to the number of states in the CFSM and vice-versa: saying that the CFSM shrinks is the same as saying that the table gets more compact and smaller.
The idea behind LALR parsing
The idea is the same that drove SLR parsers: simplifying canonical LR parsers.
The concept of core
We need to introduce a quantity: the core of an item. Given a generic LR item [A ⇒ α • β, a], its core is the augmented-grammar rule A ⇒ α • β inside it, thus discarding any look-ahead symbol (remember that when handling items, a new grammar is considered). This means that the core of an LR(0) item is the item itself.
An interesting fact
If we consider some generic LR grammar and the set of item-sets built for the CFSM (closures), we will probably find closures whose items share the same cores, for example items [A ⇒ • a, a] and [A ⇒ • a, b] (as part of one state) and [A ⇒ • a, π] (in a different state). When such states are encountered, the action to perform is based on the look-ahead symbol (the only element making the items different in this context). These states could be merged into one state, as they all imply a shift operation. Actually, LALR(1) parsers work exactly this way: they reduce the number of states by merging LR(1) states, generating smaller tables.
Merging LR(1) states
As anticipated, the process basically consists in merging LR(1) states; but when can we perform this? The operation is possible only when the states/closures contain items having the same set of cores. To decide whether two states s_i and s_j can be merged:
1. Evaluate the first state/closure and locate all cores. Put all cores into set R_i.
2. Evaluate the second state/closure and locate all cores. Put all cores into set R_j.
3. If both sets are equal, R_i = R_j, then the two states can be merged.
The new state will contain all items contained in the original states.
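The merge step can be sketched as follows; it reuses the Item shape from the closure sketch earlier, and the grouping key is exactly the set of cores (items with the look-ahead stripped away). The function names and the returned old_to_new map are assumptions of this sketch, not part of a standard API.

    from collections import namedtuple

    Item = namedtuple("Item", "lhs rhs dot lookahead")  # same shape as in the closure sketch

    def core_of(state):
        # The set of cores of a state: its items with the look-ahead stripped away.
        return frozenset((item.lhs, item.rhs, item.dot) for item in state)

    def merge_lr1_states(states):
        # Group LR(1) states sharing the same set of cores and merge each group into
        # one LALR(1) state (the union of the original items). The returned map tells
        # how to re-target the goto/action tables afterwards.
        groups = {}
        for index, state in enumerate(states):
            groups.setdefault(core_of(state), []).append(index)
        merged_states, old_to_new = [], {}
        for new_index, old_indices in enumerate(groups.values()):
            merged_states.append(frozenset().union(*(states[i] for i in old_indices)))
            for old_index in old_indices:
                old_to_new[old_index] = new_index
        return merged_states, old_to_new

    if __name__ == "__main__":
        s1 = frozenset({Item("A", ("a",), 1, "a"), Item("A", ("a",), 1, "b")})
        s2 = frozenset({Item("A", ("a",), 1, "$")})  # same cores, different look-ahead
        merged, mapping = merge_lr1_states([s1, s2])
        print(len(merged), mapping)  # 1 {0: 0, 1: 0}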
pag 116 Andrea Tino - 2013
However, now a problem arises: the CFSM has shrunk and the number of states has decreased, but the table needs to be synchronized! What to do? What about the action of the new state? What about the goto of the new state?
Setting actions and goto of merged LR(1) states
There is a mapping between the parsing table (action-table + goto-table) and the CFSM. When a merge occurs, the tables must be re-shaped to keep them synchronized.
Synching the goto-table
When two states are merged in the DFA, we need to handle transitions, which is equivalent to saying that the goto-table must be synched! Actually, the merging does not cause any problem in goto-table synching: the process is always successful. The rows of the two original states are collapsed into a single row for the merged state:

              X1       X2       ...   Xm
  s_i         v_i_1    v_i_2    ...   v_i_m
  ...         ...      ...      ...   ...
  s_j         v_j_1    v_j_2    ...   v_j_m

              X1       X2       ...   Xm
  s_ij        w_1      w_2      ...   w_m
  ...         ...      ...      ...   ...

Let us consider, for the same symbol, the generic entries v_i_k and v_j_k in the original table for the two states to be merged; they point to the next state. We can have two possibilities:
• Both values are the same: v_i_k = v_j_k. In this case there is no problem: both states lead to the same state on that symbol, so w_k = v_i_k and that is all.
• The values are different. However, this condition cannot occur. Suppose, for the sake of contradiction, that two mergeable states have transitions to two different states on the same symbol X: this is impossible, because the items causing the transitions must share the same core, so they cannot lead to two different states.
The goto-table can therefore always be managed without problems.
Synching the action-table
The goto-table is never a problem. Problems can occur when trying to re-shape the action-table. In fact, the action to set for a table entry depends on the value of each state, thus on the items inside each closure. For example, we might merge two states having different actions: what to do in that case? Conflicts might occur. We will discover that shift/reduce conflicts can never occur (if the original LR(1) parser had no conflicts); however, reduce/reduce conflicts might occur.
No shift/reduce conflicts
Let us consider an LR(1) parser having no conflicts, so a proper parser for a proper LR(1) grammar. If two states are merged into one, they share the same set of cores. Let us consider two hypothetical states: s1 = {[core1, a], [core2, b]} (a reduce state, since we need to create a shift/reduce conflict) and s2 = {[core1, c], [core2, d]} (a shift state). They must share the same cores, as they are mergeable, and inside each state the items generate no conflicts. The only possibility is for item [core1, a] to be in conflict with item [core2, d]. But if they generate a shift/reduce conflict, then [core1, a] is also in shift/reduce conflict with [core2, b], because [core2, b] has the same core as [core2, d]; this would imply that the conflict already existed in the first state, which is absurd since we assumed the initial LR(1) parser to be conflict-free. This proves we cannot have shift/reduce conflicts.
Reduce/reduce conflicts
However, as anticipated, reduce/reduce conflicts can be experienced! Cores are the same but look-aheads are different! In that case we have conflicts and the grammar is not LALR(1)!
pag 117 Andrea Tino - 2013
Building the LALR(1) table
We have just seen how to build the parsing table starting from the LR(1) table of the original LR(1) grammar: this is a valid approach but, although the simplest, it is the most time-expensive, and definitely not the only one. A more efficient and advanced algorithm allows the construction of the LALR(1) CFSM without building the LR(1) DFA first; however, that advanced approach will not be covered here. To build the parsing table we start from the LR(1) parser, we shrink its CFSM and edit the original table. From a systematic point of view, the following actions must be performed:
1. Consider all states in the LR(1) CFSM, {s0, s1 … sn}, and locate all those having common sets of cores.
2. Merge the states having the same set of cores, obtaining a new set of states {r0, r1 … rm}.
3. For each new state, the corresponding value in the action-table is calculated with the same exact procedure used for LR(1) DFAs.
4. The goto-table is filled as seen before; no conflicting values will appear, as all items in merged states have the same set of cores.
This procedure is quite expensive and not efficient, but it is a valid approach.
Characterizing LALR(1) grammars
We have the following important result:
[Theo] LALR(1) grammars: An LR grammar is LALR(1) if and only if each state in the reduced CFSM is: a reduction state (final state) containing only one reduction-item per look-ahead symbol, or a normal state (shift state) with one or more items in the closure (no reduction-items).
[Cor] LALR(1) grammars: A LALR(1) grammar has only one element in each entry of the parsing table.
We know that ambiguous grammars cannot be LR; the same holds here:
[Theo] Ambiguous LALR(1) grammars: Ambiguous grammars cannot be LALR(1).
LALR vs. LR
LALR(1) grammars are more generic than SLR(1) grammars, however not as powerful as canonical LR(1).
[Figure: nested grammar classes LR(0), SLR(1), LALR(1), LR(1), each contained in the next]
pag 118 Andrea Tino - 2013
Another interesting fact about LALR and LR parsers is the difference in their behavior. Consider the same LALR grammar parsed by a LALR parser and by a LR parser. When providing a valid input string to both parsers, they behave in the same exact way; however, when an invalid input string is passed, the LR parser will signal the error before the LALR parser does! Errors are still detected, but detection is postponed in LALR parsers.
LL vs LR grammars
So far we have covered all these classes of grammars. They all have different characteristics; for the sake of recalling them, let us summarize them here.
• LL(1) grammars. Testing quantity: predict-sets. Parts to test: all rules in the grammar. Complexity: predict-sets must be calculated for all production rules in the grammar. Condition: the predict-sets of all production rules sharing the same LHS must be disjoint, for all rules in the grammar.
• LR(0) grammars. Testing quantity: parsing table / CFSM. Parts to test: table entries / CFSM states. Complexity: the CFSM must be built. Condition: each state must be a shift state, or a reduction state with only one reduction-item.
• SLR(1) grammars. Testing quantity: parsing table / CFSM. Parts to test: table entries / CFSM states. Complexity: the CFSM must be built. Condition: each state must be a shift state, or a reduction state with only one reduction-item per terminal.
• LALR(1) grammars. Testing quantity: parsing table / CFSM. Parts to test: table entries / CFSM states. Complexity: the CFSM must be built. Condition: each state must be a shift state, or a reduction state with only one reduction-item per terminal.
• LR(1) grammars. Testing quantity: parsing table / CFSM. Parts to test: table entries / CFSM states. Complexity: the CFSM must be built. Condition: each state must be a shift state, or a reduction state with only one reduction-item per terminal.
LR grammars mostly rely on the table, while LL grammars use predict-sets. However, LL grammars can use tables for testing as well (one rule for each entry), but this condition is a consequence of the predicts.
Andrea Tino - 2013
Errors management
A ship in port is safe, but that's not what ships are built for. “ “
Grace M. Hopper
http://www.brainyquote.com/quotes/quotes/g/gracehoppe125849.html
pag 120 Andrea Tino - 2013
Overview
What happens when a parser encounters a problem in the input string? One thing is for sure: parsing cannot continue normally! A parser should treat errors in efficient ways. Depending on the type of parsing/grammar, it is possible to detect errors at different distances from the start of the input string.
Errors management phases
Error handling typically happens in three stages:
1. Error detection: The error occurs in the input and, if we look at the input, we detect it immediately. But the parser cannot behave like this: the error will be discovered at a certain token from the input start.
2. Error analysis: Once detected, the error must be analyzed in order to understand what type of error it is: a misspelled keyword, a syntax error and so on.
3. Parse recovery: Errors should not interrupt the parsing. Parsers should be able to continue parsing the input regardless of errors.
Recovery is not mandatory. Many parsers give up on recovering and focus their attention on providing a good error classification, in order to let the programmer understand what went wrong: when an error is encountered, the parser quits and reports the error in detail. Error reporting without recovery might look like a simple task, but it is not; some complex languages can be very difficult to analyze when an error occurs, and sometimes the error report is filled with so much information that it becomes impossible to understand what to do and what to fix.
LL and LR grammars
A very good condition for error handling occurs thanks to LL and LR grammars: they are viable-prefix grammars. This means that the error can be detected on the first token that does not match any rule in the grammar. It happens because LL and LR parsers can recognize a valid string starting from its beginning: valid prefixes of the language can be detected by these grammars.
Important concepts
Some relevant quantities are to be considered when talking about errors management.
Error distance
The error distance is defined as the number of tokens, counted from the latest parser recovery, after which the error is detected.
Errors management strategies
A strategy is a comprehensive approach about how to analyze and recover from an error once it is detected. We have some options depending on the type of parser (here we consider type-3 and type-2 grammars).
pag 121 Andrea Tino - 2013
• Panic mode: Once the error is detected, the parser tries to locate a token in the sequence from which parsing can be recovered. These tokens are special delimiters that depend on the syntax construct: very common ones are begin and end tokens; another example, in the C language, are curly brackets. It is a quite common approach, but it can cause the parser to discard much of the input.
• Phrase level: The parser tries to modify the input using local modifications, thus small changes performed very close to the token that originated the error. This approach can involve insertion, deletion or swapping of tokens. Disadvantages can be serious, as loops may occur, with a very high probability of encountering non-recoverable conditions as the error distance increases.
• Error productions: It is very common to augment the grammar with invalid production rules mapping typical errors. This is a very common and good approach, as the parser knows the possible errors in detail and, when these particular errors are encountered, recovery is always possible without any damage. The only disadvantage is that the grammar must be modified.
• Global correction: The best correction is calculated, or at least attempts are made. The point is trying to get the best correction for specific types of errors when a derivation is not observed. This approach is not very common, but sometimes it is used together with phrase level to enhance it.
Most common algorithms
Algorithms to handle errors can be divided into two groups depending on the type of parser:
• Top-down parsing: We are going to analyze two common approaches to treat errors in LL parsing algorithms.
• Bottom-up parsing: Another two common methodologies will be considered for LR parsers.
The distinction is based on the fact that LR and LL parsers use different derivation policies.
Handling errors in top-down parsers
We have two algorithms here; let us examine them both.
LL panic mode method
When having a predictive LL(1) parser, an error can occur for two reasons:
pag 122 Andrea Tino - 2013
• The terminal symbol on top of the stack is different from the input look-ahead character.
• The couple formed by the non-terminal on top of the stack and the look-ahead terminal points to a null entry in the parsing table.
What should the parser do when an error occurs? It should notify the error and try to recover the parsing in order to go on.
The idea
This method basically relies on one hypothesis: maybe the error occurred because the first part of the stack contains bad symbols, so good ones are to be encountered if the stack is reduced one symbol at a time. If this does not happen, then it means that the current look-ahead is not good: it is discarded and the next token is considered, after recovering the original stack.
The algorithm
The algorithm can be summarized as follows:
1. An error is detected while the parser is in a certain configuration. The stack is duplicated and stored in a temporary variable.
2. One by one, each symbol in the stack is popped. For each discarded symbol the stack shrinks, and the parser tries to continue parsing with this new configuration. If it fails, the top symbol is popped again, and this point is repeated until the parser succeeds or the stack becomes empty. Here the look-ahead symbol remains the same.
3. If the stack becomes empty, the saved stack is recovered and the current look-ahead symbol is discarded. The next token is fetched and considered as the new look-ahead symbol. If another error occurs, the algorithm starts again from point 1.
In pseudo-code we can write the algorithm as:

    procedure err_llpanic(Stack, Input)
      set StackCopy = Stack.clone();
      do
        do
          if parse(Stack.top(), Input.current) return;
          Stack.pop();
        while not Stack.empty();
        Stack = StackCopy;
        Input.next();
      while not Input.empty();
      report_error("Could not recover!"); /* giving up :( */
    end

This method can be efficient and a valid option for LL parsers. It can be considered a panic-mode approach to errors management.
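A minimal runnable rendering of the err_llpanic pseudo-code above, assuming the surrounding LL(1) parser supplies a parse_step(stack_top, lookahead) predicate that says whether parsing can resume from that configuration; the predicate used in the example run is only a stub.

    def err_llpanic(stack, tokens, parse_step):
        # Panic-mode recovery for an LL(1) parser, mirroring the pseudo-code above.
        #   stack      -- list of grammar symbols, top of the stack at the end
        #   tokens     -- remaining input tokens, current look-ahead first
        #   parse_step -- callback: True if parsing can resume from (stack top, look-ahead)
        saved_stack = list(stack)
        while tokens:
            lookahead = tokens[0]
            trial_stack = list(saved_stack)
            while trial_stack:
                if parse_step(trial_stack[-1], lookahead):
                    return trial_stack, tokens      # recovered configuration
                trial_stack.pop()                   # drop one stack symbol and retry
            tokens = tokens[1:]                     # stack exhausted: discard the look-ahead
        raise SyntaxError("Could not recover!")     # giving up

    if __name__ == "__main__":
        # Stub predicate: recovery succeeds when stack symbol and look-ahead coincide.
        stack, tokens = ["S", "A", "b"], ["x", "y", "b", "rest"]
        print(err_llpanic(stack, tokens, lambda top, lookahead: top == lookahead))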
pag 123 Andrea Tino - 2013
Synchronization tokens method
With this method, every non-terminal in the grammar is assigned a triple (O, A, C) with the following characteristics:
• The first component is the opening mark: a set of terminals.
• The second component is the non-terminal the triple is assigned to.
• The third component is the closing mark: a set of terminals.
This structure is called a synchro-triple, and for each non-terminal A ∈ V in the grammar its triple has the value (First(A), A, Follow(A)).
The idea
The method is based on a panic-mode approach where symbols are discarded rationally, using the grammar itself. The key strategy is discarding the symbols that are not needed for the current top-symbol of the stack. Input symbols are discarded until a symbol usable for recovery, given the symbol at the top of the stack, is found.
The algorithm
The algorithm proceeds as follows:
1. An error is detected while the parser is in a certain configuration. The top-symbol X ∈ V ∪ T of the stack is considered.
2. The synchro-triple of the top-symbol is considered, together with its closing mark C = Follow(X). One by one, characters of the input are discarded until the look-ahead character is one inside the stack's top-symbol closing mark. When it is found, symbol X is popped from the stack. This part guarantees that the part of the input connected to the symbol at the top of the stack, which has to be discarded, is indeed discarded.
After this, the input has been cleaned of the wrong characters that were connected to the symbol X for which the error occurred, and the stack has also been cleaned of the symbol that generated the error. At the end, the parser is supposed to be in a configuration from which it can continue parsing.
Advanced version
Point 2 of the algorithm can be further improved; the algorithm becomes:
1. Same as point 1 above.
2. The same procedure as point 2 is followed, but with a little modification. While discarding input symbols looking for terminals in the stack's top-symbol's closing mark, input terminals (look-ahead tokens) inside the stack's top-symbol's opening mark might be encountered. In that case, the parser keeps symbol X, without popping it from the stack, and tries to recover parsing from there. By doing so, it is possible to recover the parsing without discarding all the input connected to the stack's top-symbol. This means that the parser does not give up on the language construct that generated the error, hoping to recover that construct instead of discarding it completely.
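A hedged sketch of the advanced variant, assuming the First and Follow sets have already been computed and are passed in as dictionaries; the toy Stmt non-terminal in the example run is made up for illustration.

    def sync_recover(stack, tokens, first, follow, nonterminals):
        # Synchro-triple recovery (advanced version). When the error occurs with a
        # non-terminal X on top of the stack, input tokens are skipped until either
        # a token in First(X) is met (keep X and retry) or one in Follow(X) is met
        # (pop X and resume after the broken construct).
        if not stack or stack[-1] not in nonterminals:
            return None                        # only errors on non-terminals are handled
        x = stack[-1]
        while tokens:
            lookahead = tokens[0]
            if lookahead in first[x]:          # a construct for X can restart here
                return stack, tokens
            if lookahead in follow[x]:         # give up on X and resume after it
                return stack[:-1], tokens
            tokens = tokens[1:]                # discard tokens tied to the broken X
        return stack[:-1], tokens              # input exhausted: just drop X

    if __name__ == "__main__":
        # Toy sets: First(Stmt) = {id, if}, Follow(Stmt) = {;}
        first, follow = {"Stmt": {"id", "if"}}, {"Stmt": {";"}}
        print(sync_recover(["S", "Stmt"], ["@", "#", ";", "id"], first, follow, {"S", "Stmt"}))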
pag 124 Andrea Tino - 2013
Drawbacks
There is a little drawback: only errors occurring on non-terminals can be handled. In fact, the algorithm works when an error is detected upon a non-terminal at the top of the stack. If a terminal is at the top of the stack and the look-ahead symbol is not the same, the parser can choose to discard the symbol and keep going, or to stop and report the error.
Mapping the algorithm onto the parsing table
Not much effort is actually required to implement this algorithm, as everything can be mapped onto the parsing table. This comes from the fact that the algorithm can only handle errors for non-terminals. So if an error is encountered, it is because the couple formed by the stack's top-symbol and the look-ahead terminal points to an empty entry in the table; in that case the algorithm places a call to the error-management routine.
Handling errors in bottom-up parsers
When handling LR parsers we can consider three approaches.
LR panic mode method
This technique is very similar to the first one we saw for LL parsers; the algorithm is the same in many aspects.
The idea
An error in a LR parser can occur for one reason only: an attempt to access an empty table entry is being performed. However, is it really true? The point is that the table is built according to the CFSM, so if a problem occurs, it is in the goto-table. In this case the approach is the same: the stack is popped until a state is found for which the goto-table returns a valid entry. If the stack is traversed until its very end, the input character (look-ahead token) is discarded and a new token is considered, recovering the stack to its original state.
The algorithm
The algorithm can be summarized as follows:
1. An error is detected while the parser is in a certain configuration. The stack is duplicated and stored in a temporary variable.
2. One by one, each state/symbol couple in the stack is popped. For each discarded couple the stack shrinks, and the parser tries to continue parsing with this new configuration. If it fails (that is, the goto-table is empty for that state and look-ahead), the top couple is popped again, and this point is repeated until the parser succeeds or the stack becomes empty (the last pop will involve the start state only). Here the look-ahead symbol remains the same.
3. If the stack becomes empty, the saved stack is recovered and the current look-ahead symbol is discarded. The next token is fetched and considered as the new look-ahead symbol. If another error occurs, the algorithm starts again from point 1.
pag 125 Andrea Tino - 2013
In pseudo-code we can write the algorithm as:

    procedure err_lrpanic(Stack, Table, Input)
      set StackCopy = Stack.clone();
      do
        do
          if (not Table[Stack.top().state(), Input.current] = null) return;
          Stack.pop();
        while not Stack.empty();
        Stack = StackCopy;
        Input.next();
      while not Input.empty();
      report_error("Could not recover!"); /* giving up :( */
    end

The algorithm is very similar to the one described before for LL parsers.
LR phrase level method
We introduced the phrase-level approach in the introduction to this chapter. The methodology consists in modifying the input token that generated the error, in order to make an attempt at a correct parsing.
The idea
When an error occurs, the parser makes some attempts on the current look-ahead terminal in order to recover the parsing: it chooses some possible replacements for the current look-ahead. The basic idea is choosing replacements based on the look-ahead that caused the error: since the error was due to the empty entry in the table for that specific terminal, replacements should make the parser point to a valid table entry (considering the current state at the top of the stack).
1. An error is detected because the table entry is null for the current state and the current look-ahead; the stack is saved into a temporary variable.
2. The parser considers the current look-ahead (which generated the error) and evaluates a set of terminals R = {r1 … rn} ⊆ T that could possibly fix the input, i.e. that make the parser point to a valid entry in the goto-table given the current state.
3. Each replacement ri ∈ R is used as the new look-ahead (thus discarding the original one). The parser tries to continue; if another error occurs, the replacement is discarded, the stack is recovered and the next replacement ri+1 ∈ R is used, until no more errors occur or the set of replacements becomes empty.
In pseudo-code we can write the algorithm as:

    procedure err_phrasel(Stack, Table, Input)
      set StackCopy = Stack.clone();
      for each r in get_replacements(Input.current()) do
        if (parse(Stack.top(), r)) return;
        Stack = StackCopy;
      end
      report_error("Could not recover!"); /* giving up :( */
    end
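The same algorithm rendered as a small runnable sketch, assuming the action/goto table is available as a dictionary keyed by (state, terminal) and that the candidate replacements have already been produced by something like get_replacements; all names are illustrative.

    def err_phrase_level(stack, tokens, table, replacements):
        # Phrase-level recovery, mirroring err_phrasel above: try to replace the
        # offending look-ahead with a terminal that points to a valid table entry.
        #   table        -- dict keyed by (state, terminal): the parser's table
        #   replacements -- candidate terminals for the bad look-ahead
        state = stack[-1]                            # state at the top of the stack
        for candidate in replacements:
            if (state, candidate) in table:          # this replacement fixes the entry
                return [candidate] + tokens[1:]      # resume with the patched input
        return None                                  # give up; the caller may drop the token

    if __name__ == "__main__":
        # In state 7 only ';' leads somewhere, and the offending token was ','.
        table = {(7, ";"): ("shift", 9)}
        print(err_phrase_level([0, 3, 7], [",", "x"], table, [";", ")"]))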
pag 126 Andrea Tino - 2013
The parser can choose different recovery policies when all replacements fail; the look-ahead can simply be discarded, for example.
Mapping the algorithm onto the parsing table
Again, the algorithm can be mapped onto the parsing table: all empty entries in the goto-table can be filled with error-recovery procedures like the one seen before.
Error productions method
This method is one of the most common and well known.
The idea
Basically, the grammar for a certain language is augmented with wrong rules. They are rules that are not meant to generate words of the language, but words that are not part of it. Programmers often make very common mistakes; for example, in the C language they can miss a round bracket in an if statement. The designer decides to collect all common errors and put them into the grammar as a special set of rules. When these rules are matched, the parser knows exactly what type of error happened and can also automatically fix the problem. However, today's approach is to use this technique just to report the exact error to the user; no automatic fixing is performed, the user fixes the problem and parses again.
The algorithm
The grammar is augmented with new rules mapping errors: new (error) production rules are inserted, and new (error) symbols are inserted. The parser will have states containing items where these error symbols appear. Error rules appear in the form A ⇒ αEβ, where E is an error symbol (it can be seen as a non-terminal). So the parser's CFSM will surely have states where items involving error symbols are present: A ⇒ α • Eβ. In this case, all items causing the current state to move to a new state on an error symbol will lead the parser to an error state.
1. After the grammar has been augmented with error rules and symbols, the CFSM is created.
2. When creating the CFSM, each state containing error items of the form A ⇒ α • Eβ will generate error states to which the parser is led upon error symbol E. Make these states error (final) states.
3. In the CFSM, for each transition s_i → e_j to an error state caused by error symbol E, insert an error-management routine in table entry (s_i, E).
The approach requires the implementation of some error-management routines.
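A minimal sketch of the idea, with the grammar kept as a plain dictionary: one hypothetical error production for a missing '(' in an if statement is paired with a diagnostic message, which the parser would report when it reduces by that rule. Rule shapes and messages are invented for illustration only.

    # Ordinary rule for a toy 'if' statement, plus one error production matching the
    # common mistake of forgetting the opening parenthesis.
    GRAMMAR = {
        "IfStmt": [("if", "(", "Expr", ")", "Stmt")],
    }
    ERROR_PRODUCTIONS = {
        ("IfStmt", ("if", "Expr", ")", "Stmt")): "missing '(' after 'if'",
    }

    def diagnose(lhs, matched_rhs):
        # Called when the parser reduces by an error production: report the associated
        # message (today's compilers usually stop here instead of auto-fixing).
        message = ERROR_PRODUCTIONS.get((lhs, matched_rhs))
        if message is not None:
            print("syntax error: " + message)
            return True
        return False

    if __name__ == "__main__":
        diagnose("IfStmt", ("if", "Expr", ")", "Stmt"))  # prints the diagnostic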
pag 127 Andrea Tino - 2013
Drawbacks
The only drawback of this approach is that the grammar needs to be edited.
Andrea Tino - 2013
Semantic analysis and code translation
Personally, I look forward to better tools for analyzing C++ source code. “ “
Bjarne Stroustrup
http://www.brainyquote.com/quotes/quotes/b/bjarnestro269155.html
pag 129 Andrea Tino - 2013
Overview
A language has three important aspects:
1. Lexicon: The collection of all lexemes of the language, thus the most basic constituents into which the language can be decomposed as individual units.
2. Syntax: All the rules used to create all valid phrases of the language. If the language is enumerable, the syntax can be a simple list of words; however, if the language is not finite, grammars are needed to describe the syntax.
3. Semantics: Values added to each lexeme of the language make it possible to give a value to a whole phrase, and values for more phrases make it possible to give language constructs a value. Values are additional information carried by each element of the language.
Here we are going to describe semantics for a language. As anticipated, semantics is all about giving values to all the elements of the language. Since elements of the language can be combined, the semantic value of a compound entity can be derived from the semantic values associated with each single component.
Types of semantic rules
Because of this, semantics proceeds by assigning values to language elements. But how is this procedure carried out? Semantics uses rules, just like syntax: we can talk about semantic rules. Among the different types of semantics, we are going to deal with attribute semantics: today's most common approach to handle semantics for a language.
Approaching semantic analysis
Semantics can be handled using several approaches. As we said before, we have semantic rules. It happens that, in order to define semantics for a language, we need rewriting rules more powerful than type-2 rules.
Type-0 rules
The ideal case would be unrestricted rules (type-0), but they are so generic that it is practically impossible to build a compiler using them. The point is that free syntaxes (type-2 rules) are so good at describing language constructs that we prefer using them instead of taking advantage of more generic rules (which would lead to more expensive solutions).
A better approach
Today we prefer using type-2 rules to define language semantics. Together with this, we also take advantage of type-2 grammars: based on such a grammar we can build a semantic layer. This approach makes semantics interact quite tightly with syntax; from one point of view this is not very good, as modularization and decoupling concepts do not apply; however, the approach remains powerful and easier to implement.
    pag 130 Andrea Tino- 2013 Semantic functions The approach is assigning elements of the grammar a value. Since grammar rules create relations between sequences of symbols and a single symbol, the approach is calculating the value of the group using values of single elements. Functions can be used: semantic functions. Semantic values When handling semantics, values cannot be always strings. We need more types to act on. That’s why semantic values are not just strings. They can be integral values, they can be composite values. Syntax driven semantic analysis The type of semantic analysis we are going to describe is called syntax driven semantics. We studied in previous chapters how syntax analysis performs its tasks and what objectives are to be targeted. The AST for the input must be built. After building the AST out of the input language, it is possible to have a clear vision of the structure of all grammar elements. Semantic values will be assigned to leaves of the AST. STarting from these values, all values for generic nodes in the AST will be evaluated using its children’s ones by means of semantic functions. This procedure, as stated before, creates semantic values directly upon the syntax tree! That is the reason why this method is called syntax driven. Attribute grammars The approach strictly relates semantics to syntax. One technique which operates in this way is called attribute grammar. The AST is decorated with attributes which are semantic values. AST leaves have semantic values which originates typically from static associations, generic nodes use semantic functions. Attribute grammars are the most common syntax driven approach known today! However syntax must be design to catch all constructs of the language in a proper way. Types of syntax driven semantics Today’s most common approaches fall into two different categories of semantics: • Operational semantics: Functions are defined to convert subtrees of the AST into subtrees of a different language mapping semantics for the provided input language. • Denotational semantics: The AST is augmented with semantic values. Attribute grammars is a type of denotational semantics. Types of code translation In this chapter we are going to cover not only semantics. Considering it is the last step before code translation, we will discover that the semantic level can sometimes be bypassed to go straight to code generation. So we are going to introduce some code generation techniques as well. So, talking about this, we typically have a very common pattern when handling code output: syntax- directed translation.
    pag 131 Andrea Tino- 2013 As well as semantics can be driven by syntax, today’s most modern approaches to code translation rely on syntax too. This means that the translation process is deeply concerned about the syntax of the language. However not all syntaxes can support such an approach. In order to map language constructs into output fragments, the syntax must be such to have subtrees of the AST deal with specific parts of the output language. Almost all compilers today have a translation approch oriented to syntax. Pure syntax-directed translations We are going to have a look to a common method regarding operational semantics. This particular technique is called pure syntax-directed translation. It is a way to perform code translation directly without using an evident semantic approach. How it works The approach can work on simple grammars, thus they are not used in real case scenarios. However for small languages or specific cases, it is a valid methodology. The basic idea is providing a formal system to couple two context-free syntaxes: a source syntax and a sink syntax. Functions are used here to have one subtree in the source syntax correspond to another subtree in the sink syntax. The automatic consequence is that a phrase in the source language will correspond to another one in the sink language. Formal definition Let us consider two alphabets: a source alphabet Σ and a sink alphabet Δ . Let us consider the source language L ⊆ Σ∗ as the language on the source alphabet, and the sink language ′L ⊆ Δ∗ as the language on the sink alphabet. Introducing elements Let us consider a word in the source language w ∈L . We can consider a word in the sink language ′w ∈ ′L and say that ′w is the translation image of w when, given a translation application τ : L  ′L , when the couple of both words is part of that application: w, ′w( )∈τ . Also, if the translation application makes one word from the source language correspond to one word only in the sink language, the application can be considered a function and we can write: ′w = τ w( ). Relating grammars A syntax translation scheme consists of two grammars. A source grammar G = V,Σ,P,S( ) and a sink grammar ′G = V,Δ, ′P ,S( ) having the following characteristics: • The source grammar acts on the source alphabet defining all words in the source language.
    pag 132 Andrea Tino- 2013 • The sink grammar acts on the sink alphabet defining all words in the sink language. • Both grammars use the same set of non-terminal symbols. • They also share the same start rule. • They have different production rules. Relating rules Since both languages are generated by their respective grammars, the translation function, mapping words of both languages, also acts on grammars as well. It is not possible to describe the translation function by enumerating all associations between words in both languages (since languages are likely to be non-finite). Thus we need a way to define the translation function’s behavior in a concise way (like grammars do for languages). For this reason we have that the set of production rules of both grammars are related together through a bijective association based on translation function τ . Given production rules A ⇒α( )∈P and A ⇒ β( )∈ ′P , all non-terminals in α and β must have the following properties: • All non-terminals in both expressions must be the same. • All non-terminals in both expressions must appear the same number of times. • All non-terminals in both expressions must appear in the same order. Please note how this applies to non-terminals only, for each non-terminal appearing as LHS of rules in both grammars. A way to generate the output code directly The properties introduced before make clear how syntax translation works: only terminals can be changed in position or deleted or replaced. With this technique it is possible to generate the output language from an input string by simply applying rewriting rules defined by the sink grammar: 1. The compiler must put in a buffer the rules he uses to perform derivations. In LL parsers, for example, the output buffer is such a structure. 2. After ordinary parsing, the buffer is browsed and for every rule, the parser will execute the corresponding rule in the sink grammar. 3. For every rule in the sink grammar, it is necessary to define a function to execute when that rule is evaluated. These routines are responsible for generating the output code. The method is said to be pure because no intermediate step is considered from the syntax analyzer to the code generator. A well-known example: reverse polish notation We want to build a compiler to transform a mathematical input expression into its equivalent polish
pag 133 Andrea Tino - 2013
reversed form.

  Source grammar               Sink grammar
  Σ = {x, (, ), +, *}          Δ = {x, add, mult}
  V = {E, T, F}                V = {E, T, F}
  E -> E + T                   E -> E T add
  E -> T                       E -> T
  T -> T * F                   T -> T F mult
  T -> F                       T -> F
  F -> ( E )                   F -> E
  F -> x                       F -> x

Thanks to the source grammar (on the left) and the sink grammar (on the right), a string like x*x*(x+x) is converted into x x mult x x add mult. In this example, the functions associated to rules of the sink grammar simply print terminals.
Advantages and drawbacks
This technique is very powerful but, as is probably evident, it can be applied only to very simple scenarios. Only terminal shifting is allowed, together with terminal replacement or deletion; this makes the approach quite limited. A very basic example is number conversion: with this technique, returning the decimal representation of a binary number is not possible.
Ordinary syntax-directed translation
Instead of pure syntax-directed translation, today's most common techniques rely on a well defined semantic layer. However, the semantic layer is deeply connected to the syntax layer. Here we find two options for compilers today:
• Classic translation (two steps): The AST is generated first, then it is traversed as the output code is being generated. This approach is the most common.
• Direct translation (one step): No AST is generated, the input code is translated directly. This approach is not very common, as it works only for very simple languages (maybe a little more complex than the ones that can be translated using pure syntax-directed techniques).
Please note that, when translating the input code without building the AST, the compiler must generate the output code during syntax analysis. Also note that so far no examples have been considered about AST generation.
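Before moving on, here is a small runnable sketch of the reverse Polish example above. It is not the buffer-based scheme described earlier but a direct recursive-descent rendering: since the source grammar is left-recursive, E and T are parsed iteratively (which preserves left associativity), and the output actions simply collect the sink-grammar terminals, as the example prescribes.

    def translate_to_rpn(text):
        # Translate expressions over {x, (, ), +, *} into reverse Polish notation,
        # emitting 'add' and 'mult' exactly where the sink grammar of the example does.
        tokens = list(text.replace(" ", ""))
        out, pos = [], 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def expect(ch):
            nonlocal pos
            if peek() != ch:
                raise SyntaxError("expected %r at position %d" % (ch, pos))
            pos += 1

        def parse_e():                 # E -> T { + T }, emitting 'add' after each pair
            parse_t()
            while peek() == "+":
                expect("+"); parse_t(); out.append("add")

        def parse_t():                 # T -> F { * F }, emitting 'mult' after each pair
            parse_f()
            while peek() == "*":
                expect("*"); parse_f(); out.append("mult")

        def parse_f():                 # F -> ( E ) | x: parentheses disappear, x is copied
            if peek() == "(":
                expect("("); parse_e(); expect(")")
            else:
                expect("x"); out.append("x")

        parse_e()
        if pos != len(tokens):
            raise SyntaxError("trailing input")
        return " ".join(out)

    if __name__ == "__main__":
        print(translate_to_rpn("x*x*(x+x)"))   # -> x x mult x x add mult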
pag 134 Andrea Tino - 2013
Introducing attribute grammars
In syntax-directed translation, attribute grammars play a key role in helping to define semantics for a language. Please note that we are talking about an extension of grammars, which are entities related to the syntax level. It happens because we are talking about syntax-directed approaches: semantics and syntax can sometimes mix together in such a context, so keep your focus during this section.
Rough definition
Attribute grammars are a formal extension of the classic generative context-free grammars we have seen so far. They introduce the following concepts:
• Attributes: They are values associated to symbols (terminals and non-terminals) of the grammar. One symbol can be associated with more than one attribute.
• Semantic actions and semantic rules: They are functions and procedural entities associated to production rules of the grammar. Semantic actions, in particular, play a very central role in attribute grammars: they are executed every time a production rule is used to process the input. An action has access to all attributes of the symbols appearing in the production rule it is associated to, and it can also assign attributes to symbols.
Attributes represent the most basic semantic layer, because they are the meaning of the symbols they are associated to. A very simple example is arithmetic expressions: when they are parsed, each expression is assigned a number.
Formal definition
An attribute grammar is a proper extension of a context-free grammar. We refer to it as usual, G = (V, Σ, P, S), providing a set of terminals, non-terminals and production rules. The start symbol is sometimes such that it never appears in the RHS of any rule. This grammar is augmented with a set of attributes Ω. This set has no formal definition, as its elements can be of any possible type; we will use the symbol Ω as a mere formalism here. Attributes have the following characteristics:
• Attributes are associated to symbols of the grammar. Given a symbol X ∈ Σ ∪ V and an attribute ω ∈ Ω, the association can be written as the couple (ω, X).
• One symbol can be associated with more than one attribute; one attribute is associated to one symbol only.
• Attributes can be of any possible type (strings, integral numbers, composite entities, etc.).
• Every attribute ω ∈ Ω has a domain Dom(ω), which is the set of all possible values that attribute can take.
• Attributes associated to non-terminals are partitioned into two different groups: synthesized attributes and inherited attributes. Attributes associated to terminals are called lexical attributes.
• All attributes ωi ∈ Ω associated to a symbol X ∈ Σ ∪ V are collected in the set Attr(X) ⊆ Ω.
    pag 135 Andrea Tino- 2013 About semantic actions we have the following: • One semantic action is associated to one production rule only. • One production rule can be associated with more than one semantic action. • Given a production rule p ∈P , the set of all semantic actions associated to it is Rules p( ). Semantic values are set to non-terminals by semantic actions. Lexical attributes are not computed at syntax/semantic level, they are evaluated at lexical level (where a little semantics is handled). The generated AST can show attributes, in that case the tree becomes an annotated parse tree (AAST). Taxonomy of attributes How are attributes related to each other in the AAST? Attributes associated to non-terminals are divided into two disjoint sets: • Synthesized attributes: Their values depend only on attributes in the subtree of the node they are associated to. They generate an ascending information flow starting from the leaves until the root of the AAST. Leaves cannot have synthesized attributes, thus their semantic values are provided by the lexer. This is the most common approach used today. • Inherited attributes: Their values depend only on attributes associated to parent or sibling nodes of the node they are associated to. They generate a descending side (from left to right and vice-versa) information flow starting from the root until the leaves of the AAST. In this case, the initial symbol has no inherited attributes, a typical approach is assigning it a static value before the parsing starts. Attributes in an attribute grammar are often of the same type: inherited or synthesized, however mixing is allowed, but needs to be carried out with extreme care. Attribute grammars and syntax-directed translation Production rules in an attribute grammar are associated with semantic actions. So considering rule A ⇒α , there will be semantic actions fA⇒α,i :Ω ×× Ω  Ω responsible for calculating its attributes ωi ∈Ω as ωi = fA⇒α,i µ1,µ2 …µm( ); having ∀i ∈…n . We can have two possibilities: • We can have ωi ∈Ω as a synthesized attribute of LHS symbol A , thus ωi ,A( ) and ωi ∈Attr A( ), then attributes µ1,µ2 …µm ∈Ω are attributes associated to RHS: µ1,µ2 …µm ∈Attr α( ) and m = Attr α( ) . • We can have ωi ∈Ω as an inherited/synthesized attribute of RHS symbol α , thus ωi ,α( ) and ωi ∈Attr α( ), then attributes µ1,µ2 …µm ∈Ω are attributes associated to LHS: µ1,µ2 …µm ∈Attr A( ) and m = Attr A( ) . It is evident how semantic functions are evaluated in parallel to syntax rules. When a syntax rule is evaluated, the corresponding actions are evaluated too.
pag 136 Andrea Tino - 2013
The problem of assigning attributes
This is where a problem occurs: semantics is driven by syntax, of course, so we can have problems assigning attributes if a semantic action expects certain values that have not been evaluated yet, because the production rules followed a different flow. This is the main drawback of a syntax-driven semantics: semantics depends on syntax, and it depends on it really badly! This is a well-known problem in attribute grammars and is referred to as the attribute evaluation problem.
The problem of side-effects
Side effects can occur as well. Semantic functions can act on global values throughout the parser, and maybe those values are shared among different semantic routines. If things are not handled carefully, very bad conditions might occur.
An example of inherited attributes: decimal value of a binary rational number (D. Knuth 1968)
Todo
Evaluating attributes
An AAST is a normal AST augmented with attributes for each node (typically corresponding to a symbol of the grammar). The process of defining attributes for each node is the attribute evaluation process. The order in which attributes are evaluated depends on the dependency graph between attributes, which is generated based on the semantic rules.
Functional dependencies
Every attribute is evaluated based on the values of other attributes. Who defines these dependencies? Semantic functions! Consider a generic production rule p ∈ P: X0 ⇒ X1 X2 … Xr (with r ≥ 0), and consider a generic attribute (ω_{i,j}, Xi) associated to one of its symbols (where i = 1…r, remembering that one symbol can be associated with more attributes). Consider a generic semantic function f_{p,ω} associated to the production rule, whose arguments {ω1 … ωm} ⊆ {ωs : ωs ≠ ω, ωs ∈ Attr(Xi), Xi = X0 … Xr ∈ p} (with m ≤ r) are attributes, other than ω itself, of symbols appearing in p (we made explicit the association between the function, the production rule and the attribute of a symbol in the same production rule). We can formalize the dependencies of an attribute as the set Dep(p, ω) = {(ωs, Xi) : ωs ∈ Args(f_{p,ω}), Xi ∈ p}: thus the set of all attributes appearing as arguments of the corresponding semantic function for that attribute, in the context of the specified production rule.
Important
An attribute can be evaluated by one semantic function only. If more semantic functions evaluated the same attribute, how would we choose the right one? If more policies are needed, they can be put into the same function, adding more cases inside that function.
Dependencies for a rule
It is clear how a production rule can have more functional dependencies, one for each attribute associated to each symbol appearing in the rule itself. Considering that a functional relation depends on a production rule and on a symbol inside it, we can collect all the functional dependencies inside a rule and obtain the production functional dependencies set, defined as:
pag 137 Andrea Tino - 2013
Dep(p) = ⋃ Dep(p, ω), where the union is taken over all symbols X ∈ p and all attributes ω ∈ Attr(X).
Graph of dependencies
When an attribute is in the dependency-set of another attribute, a binary directed relation can be considered between them. A graph can be built for the dependencies of each production rule, and a global graph for the grammar can be built by considering the graphs generated for all production rules. Nodes of the graph are attributes. An arc ω2 → ω1 connects two attributes of the graph when there exists a semantic function calculating ω1 with ω2 as one of its arguments, in the context of a production rule p ∈ P whose symbols X1, X2 ∈ Σ ∪ V, with (ω1, X1) and (ω2, X2), are part of that production rule: X1, X2 ∈ p.
About the dependency graph
The dependency graph collects all dependencies among all attributes of symbols in all production rules of the grammar. The graph is often represented on top of the AST; this approach is useful only when an input string is given, but the dependencies can also be visualized independently of derivations. To build the graph, the procedure is very simple:
1. List all semantic rules of the grammar, for all production rules.
2. For each function ωi = fk(µ1, µ2 … µm), create a node for each attribute. Also create an arc µj → ωi for every attribute appearing as an argument of the function: ∀j = 1…m. Mark the rule.
3. Repeat point 2 until all rules are marked.
Example of a dependency graph
Todo
Checking the presence of cycle-dependencies
The dependency graph must be acyclic, for obvious reasons: if there were cycles, it would be impossible to calculate the attributes involved in that relationship.
[Def] Acyclic attribute grammars: An attribute grammar is acyclic when its dependency graph is acyclic.
The following result is obvious:
[Lem] Parsing attribute grammars: A cyclic attribute grammar cannot be correctly parsed.
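A compact sketch of the procedure above: it builds the graph from a list of semantic equations (each given as the computed attribute plus the attributes it needs), and a Kahn-style pass both detects cycles and returns an evaluation order, anticipating the next subsections. The equation format and the two example equations are assumptions of this sketch.

    from collections import defaultdict

    def build_dependency_graph(equations):
        # equations: iterable of (target_attribute, [argument_attributes]).
        # Returns arcs arg -> target, i.e. "target depends on arg".
        graph = defaultdict(set)
        for target, args in equations:
            graph[target]                      # make sure every attribute is a node
            for arg in args:
                graph[arg].add(target)
        return graph

    def evaluation_order(graph):
        # Topological order of the attributes, or None if the graph has a cycle
        # (in which case the attribute grammar is not acyclic and cannot be evaluated).
        indegree = {node: 0 for node in graph}
        for node in graph:
            for succ in graph[node]:
                indegree[succ] += 1
        ready = [node for node, degree in indegree.items() if degree == 0]
        order = []
        while ready:
            node = ready.pop()
            order.append(node)
            for succ in graph[node]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
        return order if len(order) == len(graph) else None

    if __name__ == "__main__":
        # Two illustrative equations over five attributes: b = f(d, e); a = g(b, c)
        equations = [("b", ["d", "e"]), ("a", ["b", "c"])]
        print(evaluation_order(build_dependency_graph(equations)))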
pag 138 Andrea Tino - 2013
Looking for cycles
When investigating the dependency graph of a single production rule, it is easy to locate cycles, if any. However, when merging all the graphs of all rules, the final graph is much bigger and much more complex. The problem of cycles must be handled at the global level! It is not possible to check for cycles by investigating single rules' graphs.
[Theo] Acyclic rule graphs: If the dependency graph of a rule in an attribute grammar is acyclic, the grammar is not necessarily acyclic too.
Please also consider the following results:
[Theo] Acyclic graphs for more rules: Given a subset of rules in an attribute grammar, if their dependency graph (built as the union of each rule's graph) is acyclic, the grammar is not necessarily acyclic too.
The whole graph must be considered to check the absence of cycles:
[Theo] Acyclic grammars: Given the set of all rules in an attribute grammar, if their dependency graph (built as the union of each rule's graph) is acyclic, then the grammar is acyclic too.
The converse can be written as:
[Theo] Acyclic subgraphs: If the dependency graph of an attribute grammar is acyclic, then all dependency graphs of single rules of the grammar, or of combinations of rules of the grammar, are acyclic too.
Equations
We now know that the biggest graph must be inspected! Again we ask ourselves why an acyclic graph is needed; the answer was provided before: to be able to calculate all attributes. So let us consider a generic dependency graph (it does not need to be the graph of the whole grammar, a single rule will do as well) and let us try to evaluate the values of all attributes. Let us first introduce the concept of semantic equation.
[Def] Semantic equation: A semantic equation is a semantic function ωi = fk(µ1 … µm). The term equation is used to underline the fact that calculations are needed to compute the value of attributes.
[Theo] Existence of a solution for a system of semantic equations: If the dependency graph for a rule or a set of rules is acyclic, then the system of semantic equations derived from the graph has one solution.
pag 139 Andrea Tino - 2013
Proof: If the dependency graph is acyclic, then it is possible to sort (topologically) all the attributes appearing in the LHS of the equations by following the order defined by the graph itself. So let ωi → ωj be an arc of the graph between two attributes appearing as LHS of two different equations; then attribute ωi is to be listed before attribute ωj. After building the ordered list, the equations are ordered in the same way as their corresponding LHS attributes. Starting from the first element of the list, every attribute then receives a value by executing the corresponding function.
The proof provides the algorithm to calculate all attributes of a graph following the dependencies. We must take the point of view of the parser, which handles an AAST and needs to evaluate its attributes. This algorithm is not very efficient, because it requires a lot of work from the parser:
• The algorithm to generate the ordered list has complexity O(n) in the number of nodes/attributes.
• Every node of the AAST must be traversed as many times as the number of its attributes.
After building the list, when each function is evaluated the values of its argument attributes are needed; this means that the parser needs to jump to non-contiguous locations of the tree. This continuous hopping among the nodes of the tree is the main cause of the performance decay.
Improving performances
Considering that both operations are performed sequentially, attribute evaluation can be a very intensive operation. A possible way to improve performance is to act on special subsets of grammars which enable the parser to complete the tree decoration using fewer traversals of the syntax nodes.
Methods to evaluate attributes
The problem is always the same: evaluating all attributes of the AAST. One approach is the one we introduced before: topological sorting. We also described some computational problems this approach is affected by. In this section we are going to describe a way to improve this methodology. The key concept is that, given the dependency graph, more than one topological sorting is valid to accomplish the attribute evaluation process.
Consider the dependency graph shown in the figure. Functional relations are shown for each attribute. For simplicity, consider that the AST has the same structure, thus the diagram is the AAST where no syntax symbols are shown.
[Figure: dependency graph over attributes a, b, c, d, e]
We can have these valid topological sortings:
    pag 140 Andrea Tino- 2013 S1 = {d,e,c,b,a}; S2 = {e,d,c,b,a}; S3 = {c,e,d,b,a}; S4 = {e,c,d,b,a}; S5 = {d,e,b,c,a}; They are all valid because all functional dependencies are observed! There are more than one valid sorting because attributes are not completely ordered, thus there exist couples of attributes for which no particular order is defined. So which one to choose? Keeping in mind what we said before, sorting S5 is the best among listed! In fact when evaluating semantic actions following S5 sorting, the evaluator makes little traversals of the tree. The one described here is a possible improvement: its objective is minimizing the number of node traversals. Scheduling The evaluator is the component of a syntax-directed parser responsible for running the attribute evaluation process. We can have two different scheduling approaches: • Static scheduling: Scheduling is performed when the evaluator is built. Thus the scheduling depends on the grammar but not on the AST. A particular case is fixed scheduling: the sorting of attributes is the same for each production rule. • Dynamic scheduling: Scheduling is performed after the AST is built. Sorting of attributes is performed for every subtree generated by a production rule. Sweeps number Parsers can work using different approaches. Direct parsers build no AST and perform translation immediately. 2-step or classic parsers need to build the AST first and then traverse it. how many times is the tree to be traversed? 1-sweep parsers can traverse the tree one time only and translate the input code. However more complex grammars need more sweeps. The problems concenring sweeps number is purely related to semantics. The question is: “How many times does the semantic analyzer need to evaluate all attributes?“. In some cases more sweeps are needed to evaluate the value of all attributes in the tree. In syntax-directed semantics, two approaches are very common today: • S-attributed grammars: Only synthesized attributes are allowed. • L-attributed grammars: Synthesized and inherited attributes allowed, restriction on dependencies are considered. L-attributed grammars In order to make topological sorting more efficient, we can act on the grammar in order to be able to choose the best sorting. L-attributed grammars are a particular subset of attribute grammars with some properties. [Def] L-attributed grammar: An attribute grammar is L-attributed if all its production rules are L-attributed. [Def] L-attributed rule: A production rule A ⇒ X1X2 Xn of an attribute grammar is
pag 141 Andrea Tino - 2013
L-attributed if each attribute ω_{i,j} of each RHS symbol Xi ∈ Σ ∪ V (with i = 1…n) only depends on attributes ω_{k,s} ∈ Ω of symbols Xk ∈ Σ ∪ V with k = 1…i−1, and on A's inherited attributes.
Thus an L-attributed grammar is such that the attributes of a symbol depend on the attributes of the symbols on its left in each production rule's RHS. This is a very strong restriction.
Acyclic grammars
Because of the structure of these grammars, we have a very important result here:
[Lem] L-attributed acyclic grammars: L-attributed grammars are acyclic.
Evaluating attributes
How can attributes be calculated in this kind of grammar in a better way than topological sorting? A recursive-descent left-to-right routine can work; actually this type of grammar can be correctly parsed using one sweep! This is the procedure:
1. Write a function for each production rule (whose LHS is a symbol in a node of the AAST).
2. Each routine takes two arguments: the subtree having the rule's LHS symbol as root, and all its inherited attributes.
3. When the value of an attribute is needed, a function calls that node's function.
The grammar is acyclic, thus the recursive functions will never loop forever. Using pseudo-code, a possible implementation can be: Todo
S-attributed top-down grammars
Let us examine another class of grammars: top-down grammars where all attributes are synthesized. How to handle attribute evaluation? The problem is that top-down parsing creates the AST from the root to the leaves, while synthesized attributes describe an ascending flow: the directions of syntax and semantics are in contrast! We cannot parse the grammar without sweeping the tree at least one time, so this approach is not direct.
Evaluating attributes using recursive-descent parsers
Although these conditions make top-down and S-attributed grammars look incompatible, attributes can be evaluated without too many problems. Considering that we want to treat generic LL(1) grammars, we are going to analyze how to handle recursive-descent parsing:
1. For each function associated to a production rule (LHS), the semantic functions for the attributes of the RHS symbols are to be called after calling the parsing functions for those symbols (non-terminals).
2. After gathering the RHS attributes, the attributes of the LHS symbol can be evaluated (a sketch follows below).
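A minimal sketch of points 1 and 2 for a toy LL(1) expression grammar where the only attribute is the synthesized numeric value: each parsing function parses its non-terminal and returns that value, so parsing and attribute evaluation happen in the same sweep. The grammar and function names are illustrative.

    def parse_expr(tokens):
        # E -> T { + T }; T -> F { * F }; F -> ( E ) | number.
        # Each function parses its non-terminal and returns its synthesized value.
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(expected=None):
            nonlocal pos
            current = tokens[pos]
            if expected is not None and current != expected:
                raise SyntaxError("expected %r, found %r" % (expected, current))
            pos += 1
            return current

        def expr():
            value = term()                # value of E is synthesized from its T children
            while peek() == "+":
                eat("+"); value += term()
            return value

        def term():
            value = factor()
            while peek() == "*":
                eat("*"); value *= factor()
            return value

        def factor():
            if peek() == "(":
                eat("("); value = expr(); eat(")")
                return value
            return int(eat())             # lexical attribute supplied by the "scanner"

        value = expr()
        if pos != len(tokens):
            raise SyntaxError("trailing input")
        return value

    if __name__ == "__main__":
        print(parse_expr(["2", "*", "(", "3", "+", "4", ")"]))   # -> 14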
pag 142 Andrea Tino - 2013
The approach will generate two distinct recursive tree traversals from two different classes of functions: the parsing functions (one for each non-terminal) and the semantic functions.
S-attributed bottom-up grammars
In this case the directions of syntax and semantics are the same: a bottom-up parser builds the AST starting from its leaves, while synthesized attributes raise an ascending information flow. Thanks to these optimal conditions, the parsing is very efficient and requires only small modifications of the LR(1) algorithms.
Evaluating attributes using LR(1) canonical parsers
Attributes are synthesized; this means that, when the parser is building the tree, every new node's attributes can be evaluated at the moment of creation, as the attributes they depend on have already been evaluated! But a new node is created when a reduction is performed: thus attribute evaluation for a non-terminal is performed on reductions! LR(1) parsers take advantage of stacks; the algorithm simply requires one more stack to handle semantic values. As symbols are pushed onto the stack, their attributes are pushed onto the semantic stack. Upon a reduction, a non-terminal is placed on top of the stack after popping symbols from it; the same happens for the semantic stack, where all the attributes of the popped symbols are used to evaluate the pushed non-terminal's attributes (pushed on top of the semantic stack as a group).

Before the reduction (symbol stack / semantic stack):
  X_i    X_{i+1}    X_{i+2}    ...   X_{i+r}    ...   X_n
  ω_i    ω_{i+1}    ω_{i+2}    ...   ω_{i+r}    ...   ω_n

After reducing X_i … X_{i+r} to Y, with attribute μ:
  Y      X_{i+r+1}  X_{i+r+2}  ...   X_n
  μ      ω_{i+r+1}  ω_{i+r+2}  ...   ω_n

For shifting operations, remember that they operate on terminals, so their semantic values have already been provided by the scanner.
Handling translation in syntax-directed grammars
In this section we have analyzed syntax-directed grammars. In the previous section we had a look at how translation was possible in pure syntax-directed grammars; however, we have not seen translation in this section so far! How is translation performed? The answer is semantic actions!
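Before turning to semantic actions, here is a minimal sketch of the reduction-time evaluation just described, with the symbol stack and the semantic stack kept side by side; the rule encoding and the sample semantic function (summing the values of E and T for the rule E ⇒ E + T) are assumptions of the sketch, not tied to a particular parser generator.

    def reduce_rule(symbol_stack, value_stack, lhs, rhs_length, semantic_function):
        # Perform one LR reduction: pop |rhs| symbols and their attributes,
        # compute the LHS attribute from them, and push both results back.
        popped_values = value_stack[-rhs_length:]
        del symbol_stack[-rhs_length:]
        del value_stack[-rhs_length:]
        symbol_stack.append(lhs)
        value_stack.append(semantic_function(popped_values))  # synthesized attribute of lhs
        return symbol_stack, value_stack

    if __name__ == "__main__":
        # Configuration just before reducing by E -> E + T (values 7, None, 5 on the stack)
        symbols = ["$", "E", "+", "T"]
        values = [None, 7, None, 5]
        reduce_rule(symbols, values, "E", 3, lambda vals: vals[0] + vals[2])
        print(symbols, values)    # ['$', 'E'] [None, 12]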
Semantic actions for translation
We can do this using semantic actions. By placing calls to output-writing routines inside semantic functions, we can write the output buffer while parsing is being performed.

Translation schemes
Here we examine how to perform code translation. So far we have analyzed how to carry out parsing and semantic analysis together in one solution. Syntax-directed grammars gave us the possibility to translate, but with one limitation: we could not choose when a semantic action had to take place or, in case of more semantic actions for the same rule, in which order those actions had to be executed. When a more fine-tuned syntax-directed translation is needed, a subset of attribute grammars is considered: they are called translation schemes.

[Def] Translation schemes: Translation schemes are attribute grammars where it is possible to explicitly define in which order semantic actions take place for every production rule.

Important
In the previous section we said that one attribute had to be handled by one semantic function only; that still holds. We also defined semantic actions and semantic functions as the same thing: here this is valid no more. A semantic action is a routine which does not compute an attribute's value, although it is associated with a symbol. Semantic functions, on the other hand, do evaluate the values of a symbol's attributes. Please consider the following definitions.

[Def] Semantic function: A semantic function is a semantic routine used to calculate the value of one attribute of one symbol. One attribute can have only one semantic function, and one semantic function computes the value of one attribute only.

[Def] Semantic action: A semantic action is a translation routine used to write the output buffer. One semantic action is associated with one symbol; one symbol can be associated with more than one semantic action.

In a translation scheme this difference is important, as it separates translation from semantics.
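To make the idea concrete, here is a minimal sketch of a classic translation scheme, infix-to-postfix translation, implemented with a recursive-descent parser. The grammar, the restriction to single digits and the function names are assumptions of mine for the example; what matters is that the position of each { print(...) } action inside a rule decides exactly when the output buffer is written.

// Minimal sketch (illustrative, not from the text): a translation scheme
// for infix-to-postfix translation, with semantic actions embedded at
// specific positions in the rules:
//   E -> T R
//   R -> '+' T { print('+') } R  |  '-' T { print('-') } R  |  epsilon
//   T -> digit { print(digit) }
#include <iostream>
#include <string>

static std::string in;     // input, e.g. "9-5+2"
static size_t p = 0;

void T() {
    std::cout << in[p];    // action: { print(digit) }
    ++p;
}

void R() {
    if (p < in.size() && (in[p] == '+' || in[p] == '-')) {
        char op = in[p++];
        T();
        std::cout << op;   // action placed after T: { print(op) }
        R();
    }                      // else: R -> epsilon, no action
}

void E() { T(); R(); }

int main() {
    in = "9-5+2";
    E();                   // writes "95-2+"
    std::cout << "\n";
}

Running E() on the input 9-5+2 writes the postfix form 95-2+ to the output buffer.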
Translating code
Semantic functions and semantic actions are not always treated as two separate concepts. Sometimes it is much simpler to insert the code that evaluates attributes in the body of a semantic action. In this case we have a semantic action (thus a routine writing the output code) acting like a semantic function as well.

How to translate
Semantic actions and functions are placed on the right of symbols in production rules. When the AAST is created, semantic values are evaluated and semantic action calls are placed in the nodes of the symbols to whose right they appear. A depth-first traversal algorithm is then used to handle translation, executing actions in the order they appear in the tree. Since semantic functions and actions are associated with symbols, it is not wrong to say that a symbol owns a routine.

Handling attributes
Here again we must be careful, as symbols can be associated with semantic functions and/or actions. For this reason some rules are needed when handling attributes in a translation scheme:
• Given a production rule, an inherited attribute of a symbol in the RHS must be evaluated in a semantic function of a symbol preceding it (on its left).
• Given a production rule, a semantic action associated with a symbol must not refer to values of synthesized attributes of that symbol (unless such values are written in the action itself).
• Given a production rule, a synthesized attribute of the LHS non-terminal can be computed only after all the attributes it depends on have been computed. A good approach is placing the evaluation code for the LHS symbol's attributes in the last (right-most) symbol of the RHS.
Note that these rules cannot always be applied; however, L-attributed grammars can always be made compliant with them.

Handling left recursion in translation schemes
If we want to use a translation scheme with predictive grammars, we need to remove left recursion. If semantic actions and functions have already been associated with symbols, the left-recursion removal process must take all the associated routines into account in order to preserve the scheme. So, given a left-recursive rule with synthesized attributes like:

A ⇒ A1 α   { (ω,A) = f[ (ω,A1), (μ1,α) ] }
A ⇒ β      { (ω,A) = g[ (μ2,β) ] }

we can remove left recursion by applying the following rules:
1. Apply the usual left-recursion removal process without considering actions and functions.
2. Make the new symbol have two different attributes.
3. Make the recursive symbol's attribute value propagate through the new symbol.
The resulting rules are:

A ⇒ β { (λ1,B) = g[ (μ2,β) ] }  B { (ω,A) = (λ2,B) }
B ⇒ α { (λ1,B1) = f[ (λ1,B), (μ1,α) ] }  B1 { (λ2,B) = (λ2,B1) }
B ⇒ ε { (λ2,B) = (λ1,B) }

The new symbol B is assigned two different attributes: an input (inherited) attribute λ1 and a synthesized attribute λ2.

Static vs. dynamic semantics
Semantics is not about semantic attributes only. Semantics covers other aspects of the code like type checking, scope control, visibility rules, element access and so on. A good definition would be that semantics handles control and validation structures. A list of the aspects covered by semantics follows:
• Type checking: Identifiers are assigned a value and a type. Is the type of a variable valid for the operations being performed?
• Scope management: Is this identifier accessible here? Handling member shadowing and overloading for OOP languages.
• Accessibility: Can the value of this identifier be evaluated here?
• Importing: Is the imported code valid? Where is imported code to be found?
Such aspects are an important part of the compiler and must be handled at the semantic level.

Types of semantics
So, considering that semantics is also a matter of checking, when is this checking performed?
• Static semantics: Checking is performed at compile time. The parser runs checking routines while building the AST; if something goes wrong, errors prevent the AST from being completed.
• Dynamic semantics: Checking is performed at run time. The compiler, while translating the input code to output, inserts calls to special routines meant to check parts of the code during execution.
The world is not black or white: hybrid solutions exist as well, and they actually represent the most common approach today.

Type checking
A language is said to be a typed language when every identifier is assigned an attribute called type. A type-value is just one possible value of the type attribute which can be associated with identifiers.

[Def] Type system: In a compiler for a typed language, the type system is a collection of rules to assign a type to an expression involving identifiers and other entities to which a type is assigned.

Type error
When one rule of the type system is not observed, the compiler at compile time (static semantics) or the application at run time (dynamic semantics) can raise an exception. A type error is a condition in which one or more rules of the type system are not being observed.

Type checker
An important component of a type system is the type checker: a component whose objective is checking for type errors.

Strongly typed languages
A language is said to be strongly typed when it gets compiled only if no type errors are found in it.

Typed expressions
The type system is used especially in those cases where an expression is to be assigned a type. The expression is made of many identifiers, and they all have a type. Which is the type of the whole expression? The type system can answer this question if no errors are found.

Type expressions
When assigning a type to an identifier or an expression, we do not always assign a simple type value: possible values for types can involve complex structures. Basic types for variables are simple types, for example int or float. Slightly more complex type expressions are those involving new types; for example, in C we can have enum myenum or struct mystruct. However, in modern languages like C++ we can take advantage of templates; in that case we can refer to types by creating complex expressions like typename T::U. These are all type expressions. A possible list of type expressions is:
• Arrays, sequences or matrices.
• Containers, composite data structures.
• Pointers to other types.
• Function pointers.
• Generic programming, templates.

The "type" attribute
As we stated, in a typed language all symbols are assigned a special attribute called type. This does not mean that all elements in the language have a type: those symbols for which a type is not needed have their type attribute set to null.

[Def] Type attribute: In a type system, every symbol is assigned an attribute called "type". The type attribute is a synthesized attribute and can own semantic actions and/or functions.

Actions assigned to the type attribute of a symbol are typically used to check the type attributes on which that attribute relies. So the type checker can be implemented in terms of semantic actions.

Checking type expressions
Type checking is performed by analyzing the types of two or more expressions and telling whether they are the same or not. This is not the only possible check, but it is certainly the most common one. So the question is: how can we recognize whether two types are the same?
• Simple types: If the type values are simple types, the equivalence is immediate. For example: int = int, int != float, class A != class B, struct M = struct M. This happens because simple types are saved as basic types in the type system.
• Type expressions: Type expressions are more complicated to handle. The equivalence depends on the type system and its rules; more formally, the type system must include rules to check equivalence between type expressions.

Type expression equivalence
We said that the type system must handle each case one by one according to its rules. Consider for example this fragment of C++ code:

template <typename T>
class TypeA {
public:
    typedef typename T::U TypeAU;
    TypeAU member1;
};

class TypeB {
public:
    typedef unsigned int U;
};
Also consider this fragment:

TypeB::U val = 10;                        /* Creating a value */
TypeA<TypeB>* myA = new TypeA<TypeB>();   /* Creating an instance */
myA->member1 = val;                       /* !!Checking here!! */

What happens when the last line is evaluated? The compiler must walk through all the definitions and verify that the final type of member1 is the same as the type of val. This leads to the concept of structural equivalence.

Structural equivalence
This is an approach to handle type expression equivalence:
1. Consider both expressions and break each of them into its simple constituents, i.e. the simple types it is made of.
2. Create two ordered lists of basic types and fill them with the values returned by point 1 (in order, one by one).
3. For each element in one list, check that the corresponding element in the other list is the same basic type. If not, the equivalence fails: there is a type error.
This approach, however, cannot be applied every time. Consider the following fragment of C code:

struct S;
typedef struct S {
    int a;
    struct S* next; /* cyclic definition */
} S_t;

Structural equivalence cannot be used on recursive definitions! Another approach is required to handle type expression equivalence in this case.

Type conversion
The type system must provide type conversion rules as well. When we introduced type expressions, we did not describe how the type of the whole expression had to be evaluated given the types of each single element in it. Type conversion plays a key role in this context.

Built-in (static) conversion
Many languages define internal operators and types without letting the user define overloaded versions of those operators (obvious example: C). In this case, when operations are performed on identifiers of different types, the type of the whole expression is calculated according to the rules of the type system. In C we have: int + float = float, int + int = int, float + double = double and again int * double = double.

Dynamic conversion
Languages like C++ allow the programmer to define how operators should behave when handling types different from the built-in ones. When the compiler encounters an
expression where operators, functions and user-defined types are involved, the type system must evaluate all user-defined operators and calculate the final type of the expression. In these cases, type errors usually disguise themselves as errors of a different kind. In C++, for example, when an operator has not been overloaded for a user-defined type and a variable of that type is manipulated using that operator, the compiler will not generate a type error but a function-not-found error.

The "scope" attribute
Another important attribute typically handled by semantics is the scope attribute. In many languages identifiers are assigned a value; however, the validity of this value is related to the scope of the variable.

Understanding the concept of scope
From a compiler's point of view, scope is nothing more than a numerical value; from a language's point of view, a scope is a portion of the code where an identifier can be accessed by other parts of the code in the same scope.

[Def] Scope: In a language (thus in a grammar), a scope is the set of all symbols having the same value for the "scope" attribute.

Be careful: one thing is the scope and another one is the scope attribute. The former is a concept, the latter is an attribute in a grammar.

Scope checking
Besides type checking routines, scope checking routines are used by a compiler to check whether an identifier is being accessed by a part of the program in the same scope.

Scope nesting
Scopes are not disjoint sets: we can have nested structures. For the sake of scope checking, the scope attribute can be designed to be an array: two elements can be in different scopes, but one element can be in a scope contained by the other one; in that case that member is accessible.
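A minimal sketch of this last idea follows; it is my own illustration and the names (ScopePath, accessible) are assumptions. The scope attribute is modelled as the path of nested scope identifiers leading to the declaration: an identifier declared in scope S is visible from scope T when S's path is a prefix of T's path.

// Minimal sketch (illustrative): the "scope" attribute as an array of
// nested scope identifiers. An identifier declared in scope S is
// accessible from scope T when S's path is a prefix of T's path.
#include <iostream>
#include <vector>

using ScopePath = std::vector<int>;   // e.g. {0}, {0,1}, {0,1,3}

// true if 'decl' (scope of declaration) encloses or equals 'use'
bool accessible(const ScopePath& decl, const ScopePath& use) {
    if (decl.size() > use.size()) return false;
    for (size_t i = 0; i < decl.size(); ++i)
        if (decl[i] != use[i]) return false;
    return true;
}

int main() {
    ScopePath global = {0}, func = {0, 1}, block = {0, 1, 3}, other = {0, 2};
    std::cout << accessible(func, block) << "\n";  // 1: enclosing scope
    std::cout << accessible(block, func) << "\n";  // 0: inner scope not visible outside
    std::cout << accessible(other, block) << "\n"; // 0: sibling scope
}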
Runtime support and memory management

"If you can't make it good, at least make it look good."
Bill Gates
http://www.brainyquote.com/quotes/quotes/b/billgates382270.html
Overview
Modern compilers treat all aspects of programming. When variables are declared and assigned a value, where is that value stored? Handling memory is a serious matter and can impact program execution performance.

Memory allocation
A compiler cannot decide on its own where to allocate a value. It must interface with the operating system and use system calls for some low-level operations. When a process is started, the operating system reserves a memory area for it where all its resources will be stored; the application code itself is stored in this area. Resource allocation can be performed in two flavors:
• Static allocation: It is managed at compile time and depends on the code only. The total amount of memory required must be known at compile time.
• Dynamic allocation: It is managed at run time and depends both on the code and on the input parameters. It is not possible to evaluate the total amount of memory required by the program at compile time.
To understand the difference between dynamic and static allocation we can consider array allocation. With static allocation it is not possible to create an array like this (dummy code):

procedure MyProc(size)
begin
    array MyArray[size];
end

The size of the array is not known by the compiler until the program is executed and the user passes parameters to it. The previous code, however, can be handled by a compiler supporting dynamic allocation.

Static allocation
The first language which introduced this strategy was Fortran77. Static allocation introduces a lot of limitations when programming, for example:
• Arrays' dimensions are to be known at compile time.
• No recursive routines.
• No composite structures without a fixed size; no expanding/shrinking containers.
These languages can be very limiting but offer very high performance in exchange. Because of static allocation, all data is accessed in constant time. Furthermore there are no risks of memory overflow or out-of-memory errors at run time.
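As an illustration (mine, not from the text), the following C++ sketch contrasts the two situations: an array whose size is a compile-time constant, which a purely static allocator could handle, and an array whose size arrives only at run time, which requires dynamic allocation.

// Minimal sketch (illustrative): the same array declared with a size
// known at compile time (static allocation is possible) and with a size
// known only at run time (dynamic allocation is required).
#include <cstddef>
#include <iostream>

void static_case() {
    int buffer[16];                 // size fixed at compile time
    buffer[0] = 1;
    std::cout << buffer[0] << "\n";
}

void dynamic_case(std::size_t size) {
    int* buffer = new int[size];    // size known only when the caller runs
    buffer[0] = 1;
    std::cout << buffer[0] << "\n";
    delete[] buffer;                // the programmer (or a GC) must release it
}

int main() {
    static_case();
    dynamic_case(32);               // 32 could come from user input
}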
Dynamic allocation
The first language supporting dynamic allocation was Algol58. Dynamic allocation was the key milestone for the development of modern programming languages. We will examine these algorithms in a dedicated section.

Dynamic allocation strategies
Two important dynamic allocation strategies were introduced:
• Stack allocation: First used by the Algol58 programming language.
• Heap allocation: First used by the Lisp programming language.
We are going to examine both of them.

Stack allocation
This particular strategy takes advantage of a stack to allocate memory. Algol is a programming language organized in subprograms (the same as functions or routines). Because of this structure, some hypotheses are made:
• Different invocations of the same subprogram generate different memory records, thus every variable inside a subprogram is deleted after the subprogram ends.
• Dynamic data structures can be used.
• Recursion is supported.
• Subprograms' return values' size must be known at compile time.

Activation record
Whenever a new subprogram is called, the compiler evaluates the size of its return value and creates a memory record (called the activation record for that subprogram invocation); the record is then pushed onto the stack.

Local variables
In a subprogram, every new variable which is declared and assigned a value is stored as a record on the stack.

Return from subprogram
When a subprogram returns, the stack is popped until that subprogram's activation record is reached; the activation record's value is updated and the caller receives the return value.

Activation tree
By analyzing the code, it is possible to see all the points where subprograms are
called; at those points the current subprogram is temporarily left behind and the stack hosts a new activation record. When considering a subprogram and all the subprograms it can call in its code, we can build a tree having the parent subprogram (caller) as root and the called subprograms (callees) as leaves. When performing this operation for all subprograms, we obtain a call tree: the activation tree, which shows how many activation records will be created throughout the program run. Consider the following example dummy program:

subprogram MyProc()
begin # subprogram0
    var a = MyProc1(2);
    var b = MyProc2(3);
    var c = MyProc3(4);
    print a + b + c;
end

subprogram MyProc1(val)
begin # subprogram1
    return val + 10;
end

subprogram MyProc2(val)
begin # subprogram2
    return MyProc1(val) + 15;
end

subprogram MyProc3(val)
begin # subprogram3
    return MyProc1(val) + MyProc2(val) + 25;
end

This code generates the activation tree shown below. Please mind that each call is a separate entity even if the same subprogram is considered: two calls to the same subprogram generate two different activation records. That is why we have an activation tree rather than an activation graph. Activation trees are traversed from the root to the leaves and from left to right: each node/subprogram in the tree must terminate before its right sibling can start running. The tree also provides a way to figure out the stack configuration while a subprogram is running: once the node of the currently running subprogram invocation has been located, we know that, in the stack, the activation record of each parent subprogram lies below the one of its children. In the example, if leaf node 2 is the currently active subprogram invocation, then the stack contains the activation records of subprograms 0, 3 and 2, from the bottom to the top of the stack. The figure below shows both the activation tree for the example above and the stack layout when subprogram 2 is being executed.
(Diag. 1 - Stack layout: activation records for subroutines 3 and 2, each holding parameters, return address and locals; the Frame Pointer (FP) addresses the current frame and the Stack Pointer (SP) the top of the stack. Diag. 2 - Activation tree: root 0 with children 1, 2 and 3; node 3 with children 1 and 2.)

The stack structure
A subroutine being called causes the stack to host its activation record; everything concerning the subroutine becomes part of the record. Two pointers are always kept by the system:
• The Stack Pointer (SP): The pointer to the top of the stack, thus the address of the top element in the stack.
• The Frame Pointer (FP): Also called Stack Base Pointer, it is a more convenient pointer to the core information of the currently executing subprogram.
Upon calls, the pointers' values are re-defined to point to the correct locations.

Activation sequence
When a subprogram is called, some operations are performed. This sequence of actions is called the activation sequence and results in the creation of the activation record for that specific subprogram call. The most important actions are:
1. Parameter data to be passed from caller to callee is placed at the beginning of the activation record. The return value is also initialized and placed here.
2. Data whose size is known is placed in the central part of the activation record.
3. Data whose size is unknown is placed at the end of the activation record.
During an invocation, after the activation record has been created, the callee performs all its operations and, when returning, updates the return value in the activation record. The caller will find the return value in the activation record; at the end of the sequence the activation record is popped off the stack together with all the data inserted by the callee. To resume execution,
the activation record's Return Address can be used.

Stack allocation: advanced concepts
Remember that the stack stores the activation records of the currently active subprograms; because of this, objects' lifetime is tied to subprogram invocations.

Lexically nested subprograms and nested scope management
It is possible to nest subprogram definitions. This holds in the special case of Algol, but nesting can happen at every level (for example classes in C++). We will consider subprogram nesting, but what follows is valid for any language construct in the context of a compiler handling memory using stack allocation. Furthermore, this description provides a good overview of how scoping can be managed from a generic point of view. Consider the following code:

subprogram MyProc(var)
begin # subprogram0
    subprogram MyNested1(nes_var)
    begin # subprogram01
        return var * nes_var * 10;
    end
    subprogram MyNested2(nes_var)
    begin # subprogram02
        return var + nes_var + 100;
    end
    return MyNested1(var) + MyNested2(var); # using internal resources
end

Subprogram 0 has access to two subprograms, namely subprograms 1 and 2, but cannot access their locals because they belong to an inner scope. On the other hand, it is normal for subprograms 1 and 2 to have access to subprogram 0's locals: they belong to the same scope after all. How to manage this? When a subprogram is running, its activation record (in particular, the area where locals are stored) must contain something pointing to the locals area of other activation records: those of the enclosing subprograms.

The role of the access link
This pointer is called Access Link (AL) and it points to the locals area of the activation record of the innermost subprogram enclosing the current one. The access link is used to reach all local data of the enclosing scope of a subprogram. For example, subprogram 1's activation record, when created, will have its access link set to subprogram 0's activation record (locals area). With many nested subprograms, the innermost subprogram (let n_p ∈ ℕ be its nesting depth) can access all locals of all enclosing scopes. In that case, when it needs a resource from one of the enclosing subprograms (say its depth is n_q ∈ ℕ, so n_q ≤ n_p), its access link alone will not suffice: a chain of n_p − n_q access-link dereferences is needed.

Handling recursion
Recursion can be thought of as a particular type of nesting where the enclosing routine is the same. In that case the access link of the nested activation record will point to the
same address as the enclosing activation record's access link.

Subprograms as resources
In the example we saw how a subprogram can call other subprograms; however, some subprograms can be defined inside another subprogram. These nested subprograms are a sort of special private routine. The address of their code is saved among the locals in the enclosing subprogram's activation record. In the example, subprogram 0's activation record has subprogram 1's and 2's code addresses saved inside it.

The role of the display
When the nesting depth gets higher, it takes time for the innermost subprograms to access resources of subprograms whose nesting level is low: the complexity of access operations is O(n) in the number of nesting levels (the nesting depth). To solve this problem, some implementations take advantage of the display: a technique which stores, for one subprogram, the links to the locals areas of all enclosing subprograms' activation records.

(Diagrams: activation records made of parameters, return address, locals and either an access link or a display. Diag. 1 - With display; Diag. 2 - Without display.)

This makes resource lookup much faster.

Heap allocation
This strategy is a lot more permissive than stack allocation:
• Objects' lifetime is not tied to routine invocations.
• Recursion is supported.
• Dynamic data structures can be created.
• Routines can return data structures whose size is not known at compile time.
With heap allocation, the memory is seen as a contiguous interval of addressable locations. When a certain amount of space is needed, the compiler inserts special calls to reserve a contiguous amount of space in order to use it.

Allocation and deallocation
When an object is to be stored, the compiler inserts calls in the output code to reserve a contiguous interval of locations. A pointer to the first address of the reserved interval is needed in order to reference the object. Deallocation is responsible for freeing memory locations; the pointer to the object is needed as well. One condition is important: the amount of reserved space must be known! Local variables in the source program will be the pointers to these memory locations.

Preserving consistency
The problem of memory management is keeping memory consistent. All locations are pointed to by variables. However, objects can have other variables inside them pointing to other objects: all memory blocks in the heap thus generate a graph of references. When one object is freed, references to other objects disappear. Some questions:
• What happens to all objects that pointed to a freed block? Their references must be set to null or a run-time error will occur.
• What happens when one object is freed and it pointed to another object? The latter remains in memory, but the number of objects pointing to it has decreased. If no more objects point to it, that object becomes unreachable!
• What happens when one object becomes unreachable? It is no longer useful and must be deallocated! But how? No other object in the program has a link to its location!
The problem of unreachable objects is very serious. In a purely compiled language, unreachable objects represent an unsolvable situation: once an object becomes unreachable, unless the compiler uses particular techniques, that memory location is wasted.

Dummy deallocation
A very basic approach to deallocation is the following: when deallocating an object, all objects pointed to by that object are deallocated too. But what happens to shared objects? With this strategy no shared objects are allowed: when two objects need to point to a common object, a copy of the latter is created. This is a solution, although a very demanding one in terms of memory.

Cells and liveness
Some important concepts can be considered in heap allocation:
• Cell: A contiguous interval of addressable memory locations in the heap, reserved for a
particular object. Every cell has an address that makes it possible to access that cell's contents.
• Roots: Pointers to cells can be stored in separate memory areas (stacks, global memory, etc.) or in reserved areas of the heap itself. These locations are called roots. They represent the access points to the data stored in the heap.
• Liveness: A cell is said to be live if its address is contained in a root or in another live cell.

Garbage
When a cell becomes unreachable, it is live no more: that cell becomes garbage.

Garbage Collection (GC)
GC is a (not so) modern technique to clean the heap from garbage. There are many approaches to GC, all depending on the type of language/compiler. That being said, always remember that GC is not part of the language itself: it is carried out by a separate component, related to the language and the compiler. We can have two possibilities:
• The compiler takes responsibility for handling memory. This means that every memory operation triggers a memory check routine (scheduled by the compiler upon object creation). This is a very demanding strategy.
• The compiler simply provides the means to allocate and deallocate objects; it does not ensure memory consistency.

Garbage collector
GC is performed by a separate component while the program runs. There is a problem: every compiler treats resources in the heap using its own rules and policies, so it is not possible to create a common GC for all possible languages. The GC must be part of the language architecture; it is not part of the compiler, but it must be able to access cells in the heap (thus the GC needs to understand the allocation policies of the compiler). The final conclusion is that every language architecture has its own GC (when a GC is part of the architecture)!

Example of garbage
Consider the following C++ code:

void create_garbage() {
    int* var1 = new int(12); /* creating var1 --> cell1 */
    int* var2 = new int(20); /* creating var2 --> cell2 */
    var1 = var2;             /* now var1 --> cell2 <-- var2, cell1 becomes unreachable */
    delete var1;             /* cell2 freed: now var2 --> ???, possible run-time error */
}

C++ is a language whose architecture does not include a GC. For this reason memory consistency is up to the programmer, a task which can become quite troublesome as a program gets bigger and more complex, with many objects, classes and data structures (and shared objects).

Managed languages
Today many languages rely on GC; the following table is meant to provide a brief overview of the GCs out there:
Language | Architecture | Access point | Description
C#, VB.NET, F# | Microsoft .NET Framework | System.GC | An advanced garbage collector running as a separate low-priority process when .NET applications run.
Java | Oracle Java Lang. | java.gc | A separate low-priority thread.
ActionScript | Adobe Flash Player | as.utils | The Flash Player acts like a virtual machine.

Today, OOP and functional languages are the most important categories of languages targeted by GC architectures. For procedural languages, separate components exist as libraries to be used as program extensions to support GC.

Garbage Collection
No matter what dynamic allocation strategy a compiler uses, if a GC is included in the language architecture, every GC can act differently depending on its objectives.

The ideal GC
Since a lot of languages take advantage of GC, today we have a very wide range of garbage collectors. Which one is the best? Some principles are to be observed by a GC to be a good GC:
• The GC should be completely transparent to the running program. The user should not realize that a GC is working in the background.
• The GC must have access to all data handled by the program, but should not interfere with it when resources are being used.
• A GC must locate and work only on garbage cells.
• A GC should be fast when accessing resources in order to reduce overhead.
• A GC should keep the heap consistent.

GC or not GC? That is the question!
One of the first languages using GC was Lisp. Its first implementations were sadly known for being slow and troublesome, halting on every GC cycle. Today, however, GC has stopped being a problem: computer architectures got faster, and GC algorithms got more efficient as well. There
are still applications where a GC is incompatible (think of real-time applications), but for normal programs the typical dilemma is:

[Stat] The GC dilemma: Which is more important? Software which is free of memory leaks and allocation errors, or software which is a little faster but less reliable?

Today's GCs can provide solutions which are a very good trade-off between the two sides of the afore-mentioned dilemma. We are now going to describe the most common GC strategies used today in many computer languages.

GC through Reference Counting
It was one of the most popular approaches to GC; it was adopted by Java (Sun Microsystems) and Microsoft COM (Component Object Model) as well.

The idea
The basic idea is equipping every heap cell (thus, every object) with a counter. This number counts the number of other objects pointing to that cell. If a graph of the linked objects were to be drawn, the counter of each object would be equal to the number of incoming connections for that node. When the counter becomes zero, the cell must be collected.

Problems
The problem is that the GC complexity is distributed all over the heap, on every single cell. When a cell is collected or a reference is updated/deleted, counters must be updated. So, given an object, how do we locate all the objects linking to it? The problem is not simple.

Implementation
Every cell thus comes with, besides its address, a field called rc which stores a counter: the reference counter. To correctly handle counters for all objects in the heap, the following functions are considered:
• Object creation: When a new object is created, its reference counter must be initialized to 1.
• Object assignment: When an object is assigned to a variable, a new reference to that object is created. The counter must be incremented.
• Object de-assignment: When a reference to an object is removed, the reference counter of that object must be decremented and a zero-check performed.
• Object deletion: When an existing object is deleted, the reference counters of all objects which were pointed to by it must be decremented and a zero-check performed for each one of them.
• Zero-check: When issued, the checking routine checks whether the reference counter is 0 or not; if it is 0, the cell address is inserted in the freelist.
Every time a new object is created, deleted or updated, the compiler emits calls to these functions in order to handle references.
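The following is a minimal sketch of these routines; it is my own illustration, and the names Cell, assign, deassign, remove and the freelist variable are assumptions, not a real runtime's API. It shows the rc field, the zero-check and the freelist that the next paragraphs describe in more detail.

// Minimal sketch (illustrative): reference-counted cells and a freelist
// filled by the zero-check, as described above. A real compiler would
// emit calls to these routines around memory operations.
#include <iostream>
#include <vector>

struct Cell {
    int rc = 1;                       // object creation: rc starts at 1
    std::vector<Cell*> refs;          // cells pointed to by this cell
};

std::vector<Cell*> freelist;          // cells waiting to be collected

void zero_check(Cell* c) {
    if (c->rc == 0) freelist.push_back(c);
}

void assign(Cell* c)   { ++c->rc; }                 // new reference created
void deassign(Cell* c) { --c->rc; zero_check(c); }  // reference removed

void remove(Cell* c) {                // object deletion
    for (Cell* target : c->refs) deassign(target);
    c->rc = 0;
    zero_check(c);
}

int main() {
    Cell* a = new Cell;               // rc(a) = 1
    Cell* b = new Cell;               // rc(b) = 1
    a->refs.push_back(b); assign(b);  // a points to b: rc(b) = 2
    remove(a);                        // rc(b) back to 1, a goes to the freelist
    std::cout << "cells in freelist: " << freelist.size() << "\n";  // 1
    deassign(b);                      // last reference to b dropped
    std::cout << "cells in freelist: " << freelist.size() << "\n";  // 2
    for (Cell* c : freelist) delete c;  // the collection cycle frees them
}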
The freelist
If the GC had to free memory every time an object's reference counter reached 0, the performance of the program would decrease noticeably. The approach is actually another one: every time the counter reaches 0 for a cell, that cell's address is inserted into a list called the freelist. At regular intervals, the GC issues a garbage collection procedure which frees every cell referenced by the entries in the list. Every collected cell causes the corresponding entry in the list to be removed as well. An object is removed when its entry in the freelist is removed!

Smarter object deletion: late decrement
We said before that when an object is removed, all objects which were pointed to by it must have their reference counter decremented by one. However, this approach can lead to a serious performance decay when the removed object references many other objects. A common approach is to postpone the decrement to the moment when the object is physically removed from the freelist.

Collection frequency
The freelist keeps track of the cells to be collected. But when should the garbage collection routine be issued?
• The routine can be called periodically, after a fixed amount of time.
• It is possible to call the routine when the memory is full.
• It is possible to call the routine when the freelist reaches a certain size.

Drawbacks of Reference Counting
Although quite common in early times, this approach started showing some problems. We are going to describe the most relevant ones.

Memory
First of all, every cell must contain additional memory to host the reference counter. It does not look like a big concern, but try to imagine the number of objects typically created during the execution of a program: for each one of them, additional memory is used. The memory overhead becomes significant when applications get memory demanding. One byte (8 bits) is the usual size reserved for the reference counter; when the counter reaches 255, further increments leave it unchanged.

Cycles (non) tolerance
The most important problem is one: reference counting cannot handle cyclic references among cells. The problem is that reference cycles generate a memory leak condition upon deletion of all entry points. An entry point for a reference cycle is a cell (not part of the cycle) which references cells inside the cycle. When all entry points of a cycle are deleted, no one will be able to reach those cells anymore. The problem is that an object is considered unreachable when its reference counter reaches 0, but here a group of cells in a cycle becomes unreachable even though their reference counters are not equal to zero. That is why the following should always be considered:
[Lem] Unreachability in reference counting GC: In reference counting GC, unreachable cells are cells whose reference counter has become zero, or groups of cells, part of a reference cycle, whose reference counters have all reached the value 1.

Trying to handle cycles
Cycles can lead to memory leaks. How to deal with this? Reference counting alone cannot manage the situation; something more is necessary. Possible approaches are:
• The programmer must ensure that reference cycles are not created by the code he writes. The compiler should detect cycles and return an error.
• Combining reference counting with another garbage collection strategy.

Advantages of Reference Counting
Reference counting is not so bad after all. Some advantages can be considered.

Easy to implement
The strongest point is surely that reference counting is quite easy to implement. This makes the whole GC lightweight and fast as well.

Homogeneous overhead
As stated previously, the overhead is distributed homogeneously over all the computation. The GC does not start a collection cycle blocking the program in order to check for garbage and collect it. The program itself updates references while running, and garbage is automatically put in the freelist by the program itself (thanks to some special routines emitted by the compiler when handling memory management instructions). The greatest part of the job is done by the running program, not by the GC!

Space and time overhead
As stated before, a memory overhead is experienced because of the space required, for each cell, by the reference counter. A time overhead is experienced as well, every time reference counters must be updated. A particularly serious overhead is the one related to reference updates: when the pointer to an object is changed, the reference counter of one object must be decremented and the reference counter of the other one must be incremented.

GC forgetfulness
It is not that rare for the GC to forget to update reference counters. There are many reasons for this, but the consequences are the most important matter:
• Forgetting to increment: When a reference counter is not incremented for an object, strange bugs can be experienced. Segmentation faults are to be expected too, as the object might be collected while other objects still point to it!
• Forgetting to decrement: This always leads to memory leaks, as the object will persist in memory for the entire duration of the program run. If an object is removed and another object pointed to by it does not have its reference counter updated, the latter will never be inserted in the freelist as a direct consequence.
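The cycle intolerance described above can be reproduced with a reference-counted smart pointer. The sketch below is my own illustration using std::shared_ptr, which keeps a reference counter per managed object: once the two nodes point to each other and the external entry points go out of scope, both counters stay at 1 and the destructors never run.

// Minimal sketch (illustrative): a reference cycle that pure reference
// counting cannot reclaim. std::shared_ptr is reference counted, so the
// two nodes below keep each other alive forever.
#include <iostream>
#include <memory>

struct Node {
    std::shared_ptr<Node> next;
    ~Node() { std::cout << "node destroyed\n"; }   // never printed below
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b;           // rc(b) = 2
    b->next = a;           // rc(a) = 2: the cycle is closed
    // a and b (the entry points) go out of scope here: each counter drops
    // to 1, never to 0, so neither destructor runs -- the cells leak.
}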
GC through Mark & Sweep
It appears that the first implementation of Lisp adopted this particular GC strategy.

The idea
This approach puts the GC responsibilities all on the GC (while reference counting put some responsibilities on the program). During every collection cycle the GC acts on the reference graph generated by all references among cells/objects in the heap, looking for unreachable cells. Unreachable cells, when found, are inserted in the freelist.

The algorithm
Each cell in the heap is equipped with one bit, the mark-bit, whose value is initialized to false. When a collection cycle is issued, the following operations are performed:
• Marking phase: Starting from the roots, the GC traverses all cells and for each reached cell its mark-bit is set to true. At the end of the process, part or all of the reference graph has been traversed and marked.
• Sweeping phase: The GC looks in the heap for all cells that did not receive the marking (the mark-bit is false). These cells could not be reached by the marking phase, which means they are unreachable. Thus, they are inserted in the freelist.
In the first implementations, every GC cycle caused the program to stop. Today, however, it is possible to have the GC run on a different low-priority thread.

Drawbacks of Mark & Sweep
The algorithm is very powerful, but it introduces some serious overheads.

Responsibilities
Compared to reference counting, the algorithm is much more complex because it handles everything.

Graph traversal
The marking phase is seriously demanding from a computational point of view, as it causes the reference graph to be traversed entirely during every GC scan. Traversing a graph is not something simple. Furthermore, graph traversal is not performed once: it is performed as many times as the number of roots.

Cost evaluation
Graph traversal can be performed with a wide range of strategies. One of the most powerful algorithms is DFS(#), which has complexity O(|E|) for a generic graph (V, E) (graph theory). For big graphs, the algorithm might require a lot of time. However, we can relate the number of connections in the graph to the number of nodes by a multiplicative factor, so we can say that the DFS algorithm executes in time O(n_V |V|).
(#) Depth First Search. http://en.wikipedia.org/wiki/Depth-first_search
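Before refining this cost evaluation, here is a minimal sketch of one Mark & Sweep cycle on a toy heap. It is my own illustration (the Cell structure and the mark/sweep function names are assumptions): marking is an iterative DFS from the roots, sweeping scans the heap and collects every unmarked cell into the freelist.

// Minimal sketch (illustrative): one Mark & Sweep cycle. Marking
// traverses the reference graph from the roots with an iterative DFS;
// sweeping scans the whole heap and collects every unmarked cell.
#include <iostream>
#include <vector>

struct Cell {
    bool marked = false;
    std::vector<Cell*> refs;          // outgoing references
};

void mark(const std::vector<Cell*>& roots) {
    std::vector<Cell*> worklist(roots.begin(), roots.end());  // iterative DFS
    while (!worklist.empty()) {
        Cell* c = worklist.back(); worklist.pop_back();
        if (c->marked) continue;
        c->marked = true;
        for (Cell* r : c->refs) worklist.push_back(r);
    }
}

std::vector<Cell*> sweep(std::vector<Cell*>& heap) {
    std::vector<Cell*> freelist;
    for (Cell* c : heap) {
        if (!c->marked) freelist.push_back(c);   // unreachable: collect it
        else c->marked = false;                  // reset for the next cycle
    }
    return freelist;
}

int main() {
    Cell a, b, c;                     // toy "heap" of three cells
    a.refs = { &b };                  // a -> b ; c is unreachable
    std::vector<Cell*> heap = { &a, &b, &c };
    std::vector<Cell*> roots = { &a };
    mark(roots);
    std::cout << "garbage cells: " << sweep(heap).size() << "\n";  // prints 1
}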
We need to make the graph traversal cost evaluation more precise, as not all nodes are effectively traversed: only marked nodes will be traversed, thus a subset of the graph. So consider the set W ⊆ V of the nodes actually traversed; DFS then executes with cost O(n_W |W|) (a new factor is needed). Now let us consider another important component of the total cost of each GC scan: the heap size |H| (H is the set of cells), thus the number of cells in the heap. The GC needs to evaluate all cells when looking for unmarked nodes (nodes in the graph are cells in the heap), which means that the sweep phase has cost O(|H|). Again, we can fine-tune this evaluation since only the unmarked nodes will be taken care of by the GC, thus the cost becomes O(n_H |H|). So what about the whole cost? A GC scan has cost O(n_W |W| + n_H |H|).

Advantages of Mark & Sweep
All the advantages introduced by Mark & Sweep concern both the program and the compiler, when compared with reference counting.

Responsibilities
The algorithm now takes responsibility for everything. The compiler does not need to emit special memory management procedures when handling memory instructions in the source code. This makes the compiler more lightweight. Also, the compiler will do its job: compiling the code instead of worrying about something that should be controlled by the GC.

Faster applications
Here the program does not need to perform extra operations when allocating new variables or deleting objects: its work is much easier. This removes all the overheads introduced by reference counting: the program simply creates/removes cells and references without any further concern.

Tolerance to cycles
Unlike reference counting, the algorithm will always find all unreachable cells, even in the presence of reference cycles.

Improving Mark & Sweep
Some improvements can be considered for this algorithm. The literature is actually quite full of articles on this matter, due to the fact that Mark & Sweep represents a very good solution today.

Sizing the heap
How to decide the total size of the heap? This is actually something decided by the operating system; however, through system calls, the GC can decide whether to shrink or expand the current heap. There is a way to dimension the correct amount of space needed in the heap. Everything revolves around one quantity: the cost per free space unit, or relative cost, defined as the ratio between the GC scan cost and the amount of free space (the space that will be freed):
η = (n_W |W| + n_H |H|) / (|H| − |W|)

If the heap is almost full then |H| − |W| ≈ 0 and the relative cost is very high; the relative cost is low, on the other hand, when the heap is almost empty. It is possible to set thresholds for the relative cost: once the top threshold is reached, the GC can ask for more space for the heap; if the bottom threshold is reached, the GC can ask to shrink the heap. It is also possible to use the ratio |W| / |H| instead of the relative cost (which is a better heuristic).

GC killer
Considering that during every GC scan the GC must traverse the entire graph, a recursive DFS approach is to be avoided! In the worst case, recursive DFS generates as many nested calls as the number of nodes in the graph, and the GC would generate as many activation records in its memory to handle the recursion: the GC would occupy more space than the program being collected itself! The solution is using the iterative DFS implementation and saving pointers to marked nodes into a data structure, so as to have the lowest possible amount of memory overhead.

Pointer reversal
This is a very powerful technique to let the GC avoid using more memory to store unmarked cells. Before diving into this strategy, let us first consider how Mark & Sweep works during the marking phase, by simply making cells/nodes fall into one of the following categories:
• Objects that have not been marked.
• Objects that have been marked, but can point to unmarked objects.
• Objects that have been marked and point to marked objects only.
As marking proceeds, objects change state from the first category to the second and from the second to the third. The GC needs to keep track of objects in the second category. By performing DFS on the graph and flipping pointers as nodes are traversed, there will be no need of memorizing