Introduction • Regular Expression(RE) – is a language for specifying text search strings. – First developed by Kleene (1956) – Requires a: • Pattern – specification formula using a special language that specifies simple classes of strings. • Corpus – a body of text to search through.
3.
Introduction • The regularexpression is used for specifying: – text strings in situations like Web-search example, and in other – information retrieval applications, but also plays an important role in – word-processing, – computation of frequencies from corpora, and other such tasks.
4.
Introduction • Regular Expressionscan be implemented via finite-state automaton. • Finite-State Automaton (FSA) is one of the most significant tools of computational linguistics. • Its variations: – Finite-state transducers – Hidden Markov Models, and – N-gram grammars Important components of the Speech Recognition and Synthesis, spell-checking, and information-extraction applications.
5.
Regular Expressions • Formally,a regular expression is an algebraic notation for characterizing a set of strings. – Thus they can be used to specify search strings as well as to define a language in a formal way. • Regular Expression requires – A pattern that we want to search for, and – A corpus of text to search through. • Thus when we give a search pattern, we will assume that the search engine returns the line of the document is returned. This is what the UNIX grep command does. • We will underline the exact part of the pattern that matches the regular expression. • A search can be designed to return all matches to a regular expression or only the first match.
6.
Basic Regular ExpressionPatterns • The simplest kind of regular expression is a sequence of simple characters: – /woodchuck/ – /Buttercup/ – /!/ RE Example Patterns Matched /woodchucks/ /a/ /Claire says,/ /song/ /!/ “interesting links to woodchucks and lemurs” “Mary Ann stopped by Mona’s” “Dagmar, my gift please,” Claire says,” “all our pretty songs” “You’ve left the burglar behind again!” said Nori
7.
Basic Regular ExpressionPatterns • Regular Expressions are case sensitive – /s/, is not the same as – /S/ • /woodchucks/ will not match “Woodchucks” • Disjunction: “[“ and “]”. RE Match Example Pattern /[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck” /[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in soldati” /[1234567890]/ Any digit “plenty of 7 to 5”
8.
Basic Regular ExpressionPatterns • It is inconvenient to search for example: /[ABCDEFGHIJKLMONPQRSTUVWXYZ]/ • Specifying range in Regular Expressions: “-” RE Match Example Patterns Matched /[A-Z]/ An uppercase letter “we should call it ‘Drenched Blossoms’” /[a-z]/ A lower case letter “my beans were impatient to be hoed!” /[0-9]/ A single digit “Chapter 1: Down the Rabbit Hole”
9.
Basic Regular ExpressionPatterns • Negative Specification – what pattern can not be: “^” – If the first symbol after the open square brace “[” is “^” the resulting pattern is negated. – Example /[^a]/ matches any single character (including special characters) except a. RE Match (single characters) Example Patterns Matched /[^A-Z]/ Not an uppercase letter “Oyfn pripetchik” /[^Ss]/ Neither ‘S’ nor ‘s’ “I have no exquisite reason for ’t” /[^.]/ Not a period “our resident Djinn” /[e^]/ Either ‘e’ or ‘^’ “look up ^ now” /a^b/ Pattern ‘a^b’ “look up a^b now”
10.
Basic Regular ExpressionPatterns • How do we specify both woodchuck and woodchucks? – Optional character specification: /?/ – /?/ means “the preceding character or nothing”. RE Match Example Patterns Matched /woodchucks?/ woodchuck or woodchucks “woodchuck” colou?r color or colour “colour”
11.
Basic Regular ExpressionPatterns • Question-mark “?” can be though of as “zero or one instance of the previous character”. • It is a way to specify how many of something that we want. • Sometimes we need to specify regular expressions that allow repetitions of things. • For example, consider the language of (certain) sheep, which consists of strings that look like the following: – baa! – baaa? – baaaa? – baaaaa? – baaaaaa? – …
12.
Basic Regular ExpressionPatterns • Any number of repetitions is specified by “*” which means “any string of 0 or more”. • Examples: – /aa*/ - a followed by zero or more a’s – /[ab]*/ - zero or more a’s or b’s. This will match aaaa or abababa or bbbb
13.
Basic Regular ExpressionPatterns • We know enough to specify part of our regular expression for prices: multiple digits. – Regular expression for individual digit: • /[0-9]/ – Regular expression for an integer: • /[0-9][0-9]*/ – Why is not just /[0-9]*/? • Because it is annoying to specify “at least once” RE since it involves repetition of the same pattern there is a special character that is used for “at least once”: “+” – Regular expression for an integer becomes then: • /[0-9]+/ – Regular expression for sheep language: • /baa*!/, or • /ba+!/
14.
Basic Regular ExpressionPatterns • One very important special character is the period: /./, a wildcard expression that matches any single character (except carriage return). • Example: Find any line in which a particular word (for example Veton) appears twice: – /Veton.*Veton/ RE Match Example Pattern /beg.n/ Any character between beg and n begin beg’n, begun
15.
Anchors • Anchors arespecial characters that anchor regular expressions to particular places in a string. – The most common anchors are: • “^” – matches the start of a line • “$” – matches the end of the line – Examples: • /^The/ - matches the word “The” only at the start of the line. • Three uses of “^”: 1. /^xyz/ - Matches the start of the line 2. [^xyz] – Negation 3. /^/ - Just to mean a caret /⌴$/ - “⌴” Stands for space “character”; matches a space at the end of line. /^The dog.$/ - matches a line that contains only the phrase “The dog”.
16.
Anchors • /b/ -matches a word boundary • /B/ - matches a non-boundary • /btheb/ - matches the word “the” but not the word “other”. • Word is defined as a any sequence of digits, underscores or letters. • /b99/ - will match the string 99 in “There are 99 bottles of beer on the wall” but NOT “There are 299 bottles of beer on the wall” and it will match the string “$99” since 99 follows a “$” which is not a digit, underscore, or a letter.
17.
Disjunction, Grouping andPrecedence. • Suppose we need to search for texts about pets; specifically we may be interested in cats and dogs. If we want to search for either “cat” or the string “dog” we can not use any of the constructs we have introduced so far (why not “[]”?). • New operator that defines disjunction, also called the pipe symbol is “|”. • /cat|dog/ - matches either cat or the string dog.
18.
Grouping • In manyinstances it is necessary to be able to group the sequence of characters to be treated as one set. • Example: Search for guppy and guppies. – /gupp(y|ies)/ • Useful in conjunction to “*” operator. – /*/ - applies to single character and not to a whole sequence. • Example: Match “Column 1 Column 2 Column 3 …” – /Column⌴[0-9]+⌴*/ - will match “Column # …“ – /(Column⌴[0-9]+⌴*)*/ - will match “Column 1 Column 2 Column 3 …”
Simple Example • ProblemStatement: Want to write RE to find cases of the English article “the”. 1. /the/ - It will miss “The” 2. /[tT]he/ - It will match “amalthea”, “Bethesda”, “theology”, etc. 3. /b[tT]heb/ - Is the correct RE • Problem Statement: If we want to find “the” where it might also have undelines or numbers nearby (“The-” , “the_” or “the25”) one needs to specify that we want instances in which there are no alphabetic letters on either side of “the”: 1. /[^a-zA-Z][tT]he[^a-zA-Z]/ - it will not find “the” if it begins the line. 2. /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
21.
Simple Example (cont.) •Refining RE by reduction of: – false positives (false acceptance): •Strings that are incorrectly matched. – false negatives (false rejection): •Strings that are incorrectly missed.
22.
A More ComplexExample • Problem Statement: Build an application to help a user by a computer on the Web. – The user might want “any PC with more than 6 GHz and 256 GB of disk space for less than $1000 – To solve the problem must be able to match the expressions like 1000 MHz, 6 GHz and 256 GB as well as $999.99 etc.
23.
Solution – DollarAmounts • Complete regular expression for prices of full dollar amounts: – /$[0-9]+/ • Adding fractions of dollars: – /$[0-9]+.[0-9][0-9]/ or – /$[0-9]+.[0-9] {2}/ • Problem since this RE only will match “$199.99” and not “$199”. To solve this issue must make cents optional and make sure the $ amount is a word: – /b$[0-9]+(.[0-9][0-9])?b/
24.
Solution: Processor Speech •Processor speed in megahertz = MHz or gigahertz = GHz) – /b[0-9]+⌴*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)b/ – ⌴* is used to denote “zero or more spaces”. • Simple pattern to specify operating systems: – /b(Windows *(8|7|Vista|XP)?)b/ ⌴ – /b(Mac|Macintosh|Apple|OS X)b/ ⌴
25.
Solution: Disk Space •Dealing with disk space: – Gb = gigabytes • Memory size: – Mb or MB = megabytes or – Gb or GB = gigabytes • Must allow optional fractions: – /b[0-9]+⌴*(M[Bb]|[Mm]egabytes?)b/ – /b[0-9]+(.[0-9]+)?⌴*(G[Bb]|[Gg]igabytes?)b/
Advanced Operators RE ExpansionMatch Example Patterns d [0-9] Any digit “Party of 5” D [^0-9] Any non-digit “Blue moon” w [a-zA-Z0- 9⌴] Any alphanumeric or space Daiyu W [^w] A non- alphanumeric !!!! s [⌴rtnf] Whitespace (space, tab) “ ” S [^s] Non- whitespace “in Concord” Aliases for common sets of characters
28.
Repetition Metacharacters RE DescriptionExample * Matches any number of occurrences of the previous character – zero or more /ac*e/ - matches “ae”, “ace”, “acce”, “accce” as in “The aerial acceleration alerted the ace pilot” ? Matches at most one occurrence of the previous characters – zero or one. /ac?e/ - matches “ae” and “ace” as in “The aerial acceleration alerted the ace pilot” + Matches one or more occurrences of the previous characters /ac+e/ - matches “ace”, “acce”, “accce” as in “The aerial acceleration alerted the ace pilot” {n} Matches exactly n occurrences of the previous characters. /ac{2}e/ - matches “acce” as in “The aerial acceleration alerted the ace pilot” {n,} Matches n or more occurrences of the previous characters /ac{2,}e/ - matches “acce”, “accce” etc., as in “The aerial acceleration alerted the ace pilot” {n,m} Matches from n to m occurrences of the previous characters. /ac{2,4}e/ - matches “acce”, “accce” and “acccce” , as in “The aerial acceleration alerted the ace pilot” . Matches one occurrence of any characters of the alphabet except the new line character /a.e/ matches aae, aAe, abe, aBe, a1e, etc., as in ““The aerial acceleration alerted the ace pilot” .* Matches any string of characters and until it encounters a new line character
29.
Literal Matching ofSpecial Characters & “” Characters RE Match Example Patterns * An asterisk “*” “K*A*P*L*A*N” . A period “.” “Dr. Këpuska, I presume” ? A question mark “?” “Would you like to light my candle?” nA newline t A tab r A carriage return character Some characters that need to be back-slashed “”
30.
Finate State Automata •The regular expression is more than just a convenient metalangue for text searching. 1. A regular expression is one way of describing a finite- state-automaton (FSA). FSA – are the theoretical foundation of significant number of computational work. Any regular expression can be implemented as FSA (except regular expressions that use the memory feature). 2. Regular expression is one way of characterizing a particular kind of formal language called a regular language. Both FSA and RE can be used to describe regular languages.
31.
FSA, RE andRegular Languages Finite automata Regular languages Regular expressions
32.
Finite-State Automaton forRegular Expressions • Using FSA to Recognize Sheeptalk with RE: /baa+!/ a q1 q2 q3 b a a ! q0 Start State q4 Final State Transitions
33.
FSA Use a ba ! b • The FSA can be used for recognizing (we also say accepting) strings in the following way. First, think of the input as being written on a long tape broken up into cells, with one symbol written in each cell of the tape, as figure below: q0
34.
Recognition Process • Themachine starts in the start state (q0), and iterates the following process: 1. Check the next letter of the input. a. If it matches the symbol on an arc leaving the current state, then i. cross that arc ii. move to the next state, also iii. advance one symbol in the input b. If we are in the accepting state (q4) when we run out of input, the machine has successfully recognized an instance of sheeptalk. 2. If the machine never gets to the final state, a. either because it runs out of input, or b. it gets some input that doesn’t match an arc (as in Fig in previous slide), or c. if it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept an input.
35.
State Transition Table State Input ba ! 0 1 Ø Ø 1 Ø 2 Ø 2 Ø 3 Ø 3 Ø 3 4 4: Ø Ø Ø We’ve marked state 4 with a colon to indicate that it’s a final state (you can have as many final states as you want), and the Ø indicates an illegal or missing transition. We can read the first row as “if we’re in state 0 and we see the input b we must go to state 1. If we’re in state 0 and we see the input a or !, we fail”.
36.
Formal Definition ofAutomaton Q={q0,q1,…,qN} A finite set of N states a finite input alphabet of symbols q0 the start state F the set of final states, F ⊆ Q (q, i) the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ , δ(q, i) returns a new state q′ ∈ Q. δ is thus a relation from Q×S to Q;
37.
FSA Example • Q= {q0,q1,q2,q3,q4}, • = {a,b, !}, • F = {q4}, and • (q, i) q0 q1 q2 q3 b a a a ! Start State q4 Final State Transitions
38.
Deterministic Algorithm forRecognizing a String function D-RECOGNIZE(tape,machine) returns accept or reject index←Beginning of tape current-state←Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table[current-state,tape[index]] is empty then return reject else current-state←transition-table[current-state,tape[index]] index←index + 1 end
39.
Tracing Execution forSome Sheep Talk q1 q2 q3 b a a a ! q0 Start State q4 Final State Transitions b a a a ! q0 q1 q2 q3 q3 q4 State Input b a ! 0 1 Ø Ø 1 Ø 2 Ø 2 Ø 3 Ø 3 Ø 3 4 4: Ø Ø Ø
40.
Tracing Execution forSome Sheep Talk (cont.) Before examining the beginning of the tape, the machine is in state q0. Finding a b on input tape, it changes to state q1 as indicated by the contents of transition-table[q0,b] in Fig. It then finds an a and switches to state q2, another a puts it in state q3, a third a leaves it in state q3, where it reads the “!”, and switches to state q4. Since there is no more input, the End of input condition at the beginning of the loop is satisfied for the first time and the machine halts in q4. State q4 is an accepting state, and so the machine has accepted the string baaa! as a sentence in the sheep language.
41.
Fail State • Thealgorithm will fail whenever there is no legal transition for a given combination of state and input. The input abc will fail to be recognized since there is no legal transition out of state q0 on the input a, (i.e., this entry of the transition table has a Ø). • Even if the automaton had allowed an initial a it would have certainly failed on c, since c isn’t even in the sheeptalk alphabet! We can think of these “empty” elements in the table as if they all pointed at one “empty” state, which we might call the fail state or sink state. • In a sense then, we could FAIL STATE view any machine with empty transitions as if we had augmented it with a fail state, and drawn in all the extra arcs, so we always had somewhere to go from any state on any possible input. Just for completeness, next Fig. shows the FSA from previous Figure with the fail state qF filled in.
42.
Adding a FailState to FSA q1 b a a ! q0 Start State q4 Final State a a qF q2 q3 ! b ! b b ! ! b a
43.
Formal Languages • KeyConcept #1. Formal Language: – A model which can both generate and recognize all and only the strings of a formal language acts as a definition of the formal language. – A formal language is a set of strings, each string composed of symbols from a finite symbol-set called an alphabet (the same alphabet used above for defining an automaton!). • The alphabet for a “sheep” language is the set = {a,b, !}. • Given a model m (such as FSA) we can use L(m) to mean “the formal language characterized by m”. • L(m)={baa!,baaa!, baaaa!, baaaaa!,….}
44.
Formal Languages • Automatoncan express an infinite set in a closed form. • Formal Languages are not the same as natural languages. – Formal languages are used to model part of a natural language: • Phonology, • Morphology, or • Syntax • Generative Grammar: – It is used sometimes in linguistics to mean a grammar of a formal language; – The use of an automaton to define a language by generating all possible strings.
45.
Example 2 • Alphabetconsisting of words. Must build an FSA that models the sub-part of English language that deals with amounts of money: – Ten cents, – Three dollars, – One dollar thirty-five cents, … • Such a formal language would model the subset of English that consists of phrases like ten cents, three dollars, one dollar thirty-five cents, etc. 1. Solve the problem of building FSA for numbers 1-99 with which will model cents. 2. Model dollar amounts by adding cents to it.
46.
FSA for thewords for English numbers 1- 99 q0 q1 Start State Twenty Thirty Forty Fifty Sixty Seventy Eighty Ninety One Two Three Four Five Six Seven Eight Nine q2 Final State One Ten Eleven Two Twelve Twenty Three Thirteen Thirty Four Fourteen Forty Five Fifteen Fifty Six Sixteen Sixty Seven Seventeen Seventy Eight Eighteen Eighty Nine Nineteen Ninety
47.
FSA for thesimple Dollars and Cents q0 One Two Three Four Five Six Seven Eight Nine Ten Eleven Twelve Thirteen Fourteen Fifteen Sixteen Seventeen Eighteen Nineteen Twenty Thirty Forty Fifty Sixty Seventy Eighty Ninety q3 cents One Two Three Four Five Six Seven Eight Nine Ten Eleven Twelve Thirteen Fourteen Fifteen Sixteen Seventeen Eighteen Nineteen Twenty Thirty Forty Fifty Sixty Seventy Eighty Ninety q7 q6 cents Twenty q1 One q2 dollars q4 q5 Twenty One Thirty Two Thirty Two Forty Three Forty Three Fifty Four Fifty Four Sixty Five Sixty Five Seventy Six Seventy Six Eighty Seven Eighty Seven Ninety Eight Ninety Eight Nine Nine
48.
Non-Deterministic FSAs q0 q1q2 q3 b a a ! Start State q4 Final State Deterministic FSA a q0 q1 q2 q3 b a a a ! Start State q4 Final State Non-Deterministic FSA
49.
Deterministic vs Non-deterministicFSA • Deterministic FSA is one whose behavior during recognition is fully determined by the state it is in and the symbol it is looking at. • The FSA in the previous slide when FSA is at the state q2 and the input symbol is a we do not know whether to remain in state 2 (self-loop transition) or state 3 (the other transition) . • Clearly the decision dependents on the next input symbols.
50.
Another NFSA for“sheep” language: - transition q0 q1 b a a ! Start State q4 Final State q2 q3 • - transition defines the arc that causes transition without an input symbol. Thus when in state q3 transition to state q2 is allowed without looking at the input symbol or advancing input pointer. • This example is another kind of non-deterministic behavior – we “might not know” whether to follow the - transition or the ! arc.
51.
Using NFSA toAccept Strings • There is a problem of (wrong) choice in non-deterministic FSA. There are three standard solutions to the problem of non-determinism: – Backup: Whenever we come to a choice point, we could put a marker to mark where we were in the input, and what state the automaton was in. Then if it turns out that we took the wrong choice, we could back up and try another path. – Look-ahead: We could look ahead in the input to help us decide which path to take. – Parallelism: Whenever we come to a choice point, we could look at every alternative path in parallel. • We will focus here on the backup approach and defer discussion of the look-ahead and parallelism approaches to later chapters.
52.
Back-up Approach forNFSA Recognizer • The backup approach suggests that we should make choices that might lead to dead-ends, knowing that we can always return to unexplored alternative choices. • There are two key points to this approach: 1. Must know ALL alternatives for each choice point. 2. Store sufficient information about each alternative so that we can return to it when necessary.
53.
Back-up Approach forNFSA Recognizer • When a backup algorithm reaches a point in its processing where no progress can be made: – Runs out of input, or – Has no legal transitions, It returns to a previous choice point and selects one of the unexplored alternatives and continues from there. • To apply this notion to current definition of FSA we need only to store two things for each choice point: 1. The State (or node) 2. Corresponding position on the tape.
54.
Search State • Combinationof the node and the position specifies the search state of the recognition algorithm. • To avoid confusion, the state of automaton is called a node or machine-state. • Two changes are necessary in transition table: 1. To represent nodes that have - transitions we need to add - column, 2. Accommodate multiple transitions to different nodes from the same input symbol. Each cell entry consists of the list of destination nodes rather then a single node.
55.
The Transition tableof a NFSA State Input b a ! 0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø q0 q1 b a a ! Start State q4 Final State q2 q3
56.
An Algorithm forNFSA Recognition function ND-RECOGNIZE(tape,machine) returns accept or reject agenda←{(Initial state of machine, beginning of tape)} current-search-state←NEXT(agenda) loop if ACCEPT-STATE?(current-search-state) returns true then return accept else agenda← agenda ∪ GENERATE-NEW-STATES(current-search-state) if agenda is empty then return reject else current-search-state←NEXT(agenda) end
57.
An Algorithm forNFSA Recognition (cont.) function GENERATE-NEW-STATES(current-state) returns a set of search-states current-node←the node the current search-state is in index←the point on the tape the current search-state is looking at return a list of search states from transition table as follows: (transition- table[current-node, ], index) ∪ (transition-table[current-node, tape[index]], index + 1) function ACCEPT-STATE?(search-state) returns true or false current-node←the node search-state is in index←the point on the tape search-state is looking at if index is at the end of the tape and current-node is an accept state of machine then return true else return false
58.
Possible execution ofND-RECOGNIZE q1 b a a ! q0 Start State q2 q3 q4 1 Final State State Input b a ! 0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø b a a a ! q0 b a a a ! q0 q1 2 b a a a ! q1 q2 3 b a a a ! 4 q2 q3 b a a q3 X 5 2 possibilities
59.
Depth-First-Search • Depth-First-Search orLast-In-First-Out (LIFO): – Uses stack data structure to implement the function NEXT. • NEXT returns the state at the front of the agenda. • Pitfall: Under certain circumstances they can enter an infinite loop.
60.
Depth-First Search ofND-RECOGNIZE q1 b a a ! q0 Start State q4 Final State q2 q3 State Input b a ! 0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø b a a a ! b a a a ! q0 1 q0 q1 2 3 q q 1 2 b a a a ! 4 q2 q3 b a a a ! 5 q3 X b a a a ! 6 b a a a ! 7 q3 b a a a ! q4 8 A depth-first trace of FSA on some sheeptalk b a a a ! 1 possibility is evaluated first q2
61.
Breadth-First Search • Breadth-FirstSearch or First In First Out (FIFO) strategy. – All possible choices explored at once. – Uses a queue data structure to implement NEXT function • Pitfalls: – As with depth-first if the state-space is infinite, the search may never terminate. – More importantly due to growth in the size of the agenda if the state-space is even moderately large, the search may require an impractically large amount of memory.
62.
Breadth-First Search ofND-RECOGNIZE q0 q4 q1 b a a ! Start State Final State q2 q3 State Input b a ! 0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø b a a a ! b a a a ! b a a a ! 1 q0 q1 2 q1 q2 3 b a a a ! 4 q2 q3 b a a a ! 5 q3 X b a a a ! 4 q2 A breadth-first trace of FSA on some sheeptalk q0 b a a a ! 5 q3 b a a a ! q4 6 b a a a ! q2 5 2 possibilities are evaluated
63.
Advanced Search Algorithms •For larger problems, more complex search techniques such as dynamic programming or A* must be used.
64.
Equivalence of D-FSAand N-FSA Automata • For many languages N-FSA’s are easier to construct then D-FSA’s. • However, any language that can be described with N- FSA can also be described by some D-FSA. – D-FSA typically has about as many states as N-FSA, – D-FSA typically has more transitions – Worst case scenario D-FSA can have no more than 2n states describing a language specified by N-FSA with n-states.
65.
Regular Languages andFSA • The class of languages that is definable by Regular Expressions is exactly the same as the class of languages that are characterizable by finite-state automata: – Those languages are called Regular Languages.
66.
Formal Definition ofRegular Languages (important) • - alphabet = set of symbols in a language. • - empty string • Ø – empty set. • The regular languages (or regular sets) over is then formally defined as follows: 1. Ø is a regular language 2. ∀a ∈ ∪ , {a} is a regular language 3. If L1 and L2 are regular languages, then so are: • L1˙L2 = {xy|x ∈ L1, y ∈ L2}, the concatenation of L1 and L2 1 b) L1 ∪ L2, the union or disjunction of L1 and L2 c) L *, the * closure of L , as well as 2 L *, the * closure of L 1 2 • All and only the sets of languages which meet the above properties are regular languages.
67.
Regular Languages andFSAs • All regular languages can be implemented by the three operations which define regular languages: – Concatenation – Disjunction|Union (also called “|”), – * closure. • Example: – (*,+,{n,m}) are just a special case of repetition plus * closure. – All the anchors can be thought of as individual special symbols. – The square braces [] are a kind of disjunction: • [ab] means “a or b”, or • The disjunction of a and b.
68.
Regular Languages andFSAs • Regular languages are also closed under the following operations: – Intersection: if L1 and L2 are regular languages, then so is L1 ∩ L2, the language consisting of the set of strings that are in both L1 and L2. – Difference: if L1 and L2 are regular languages, then so is L1 – L2, the language consisting of the set of strings that are in L1 but not L2. – Complementation: if L1 and L2 are regular languages, then so is *-L1, *-L2 the set of all possible strings that are not in L1, L2. 1 – Reversal: if L1 is regular language, then so is L R, the language consisting of the set of reversals of the strings that are in L1.
69.
Regular Expressions andFSA • The regular expressions are equivalent to finite-state automaton (Proof: Hopcroft and Ullman 1979). • Proof is inductive. Each primitive operations of a regular expression (concatenation, union, closure) is shown as part of inductive step of the proof: q0 qf (a) r= Automata for the b any reg q0 qf (a) r=Ø q0
70.
Concatenation • FSAs nextto each other by connecting all the final states of FSA1 to the initial state of FSA2 by an -transition q0 qf q0 qf FSA1 FSA2
71.
Closure • Repetition: Allfinal states of the FSA back to the initial states by - transition • Zero occurrences case: Direct link from the initial state to final state
72.
Union • Add asingle new initial state q0, and add new -transitions from it to the former initial states of the two machines to be joined q0 qf q0 qf FSA1 qf q0 FSA2 The union (|) of two FSAs