 
 - Compiler Design - Home
- Compiler Design - Overview
- Compiler Design - Architecture
- Phases
- Compiler Design - Phases
- Compiler Design - Global Optimization
- Compiler Design - Local Optimization
- Lexical Analysis
- Compiler Design - Lexical Analysis
- Compiler Design - Regular Expressions
- Compiler Design - Finite Automata
- Compiler Design - Language Elements
- Compiler Design - Lexical Tokens
- Compiler Design - FSM
- Compiler Design - Lexical Table
- Compiler Design - Sequential Search
- Compiler Design - Binary Search Tree
- Compiler Design - Hash Table
- Syntax Analysis
- Compiler Design - Syntax Analysis
- Compiler Design - Parsing Types
- Compiler Design - Grammars
- Compiler Design - Classes Grammars
- Compiler Design - Pushdown Automata
- Compiler Design - Ambiguous Grammar
- Parsing
- Compiler Design - Top-Down Parser
- Compiler Design - Bottom-Up Parser
- Compiler Design - Simple Grammar
- Compiler Design - Quasi-Simple Grammar
- Compiler Design - LL(1) Grammar
- Error Recovery
- Compiler Design - Error Recovery
- Semantic Analysis
- Compiler Design - Semantic Analysis
- Compiler Design - Symbol Table
- Run Time
- Compiler Design - Run-time Environment
- Code Generation
- Compiler Design - Code Generation
- Converting Atoms to Instructions
- Compiler Design - Transfer of Control
- Compiler Design - Register Allocation
- Forward Transfer of Control
- Reverse Transfer of Control
- Code Optimization
- Compiler Design - Code Optimization
- Compiler Design - Intermediate Code
- Basic Blocks and DAGs
- Control Flow Graph
- Compiler Design - Peephole Optimization
- Implementing Translation Grammars
- Compiler Design - Attributed Grammars
Compiler Design - Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid string/token/lexeme that belong to the language in hand. It searches for the pattern defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar. The language defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for a set of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
-  Union of two languages L and M is written as L U M = {s | s is in L or s is in M} 
-  Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M} 
-  The Kleene Closure of a language L is written as L* = Zero or more occurrence of language L. 
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
- Union : (r)|(s) is a regular expression denoting L(r) U L(s) 
- Concatenation : (r)(s) is a regular expression denoting L(r)L(s) 
- Kleene closure : (r)* is a regular expression denoting (L(r))* 
- (r) is a regular expression denoting L(r) 
Precedence and Associativity
- *, concatenation (.), and | (pipe sign) are left associative
- * has the highest precedence
- Concatenation (.) has the second highest precedence.
- | (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
-  x* means zero or more occurrence of x. i.e., it can generate { e, x, xx, xxx, xxxx, } 
-  x+ means one or more occurrence of x. i.e., it can generate { x, xx, xxx, xxxx } or x.x* 
-  x? means at most one occurrence of x i.e., it can generate either {x} or {e}. 
[a-z] is all lower-case alphabets of English language.
[A-Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.
Representing occurrence of symbols using regular expressions
letter = [a z] or [A Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular expression used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite automata for verification.